Velior's Corporate Blog about Translation and Translation Industry



Video: Merging Segments in OmegaT

May 27th, 2013, Roman Mironov

Have you ever felt frustrated about a segment split into two in OmegaT? Do you want to be able to fix this kind of problem, but just don’t know how to go about it?

Watch this video to understand how to merge segments in OmegaT:

Script:

Many OmegaT users admit that the current approach to segmentation in the program isn’t exactly straightforward. I agree. It’s not rocket science either, however. Indeed, to build smart rules, you need to understand regular expressions, which might be a challenge. But if you just want a quick fix for your current segment, knowing regular expressions is often unnecessary.

  1. So let’s start by merging segments with a “quick-fix” rule. This sentence was incorrectly segmented on the period in an abbreviated word. Here in the note, you can see how the segment is supposed to look. I need to merge the two segments right after the period. After I open the Segmentation Setup window, I can select the English set of rules (since my source language is English) and add my “quick-fix” rule to this set. But I can also click Add and create a new set of rules that I will name “Quick-Fix Rules.” For the language pattern, I will enter “EN-US.” By doing so, I make sure that the rules in this new set take priority over all the rules both in the English set and the Default set. None of the rules in those other sets can interfere with my “quick-fix” rules.
  2. I am ready to add my new rule now. I click Add in the section below. After a blank rule appears, I enter “Int.” as the “Pattern Before” and “ 10” (a space followed by “10”) as the “Pattern After.” Make sure you don’t forget this leading space when you add rules. Since I need to merge segments at this point, I leave the checkbox unchecked. As soon as I reload the project, I get the correct segmentation.
  3. Now, what I did isn’t the most efficient approach, but that isn’t the point with any “quick-fix” rule. The point is that it’s easy and it works. Its main drawback is that if I have a lot of similar segmentation issues in a project, I’ll have to make a rule for each instance. For example, in this case, I will need to add separate rules for “Int. 20” and “Int. 30.”
  4. Okay, I’m now ready to take this to the next level. I see a pattern here, so I want to add a rule that will cover the entire pattern instead of adding rules on a case-by-case basis. I delete the old rule first, so that it doesn’t interfere with the new rule. This new rule can either go into the “Quick-Fix Rules” set again or it can go into the English set if I think it could be of value in future projects. If it could, I would save it to the English set or even the Default set. In this case, it couldn’t, so I’ll just put it into the “Quick-Fix Rules.” The pattern I want to cover with my new rule should be the most general one. I want to have each abbreviated word “Int.” merged with the following text, whether that is a number, a letter, or a symbol. The rule will look like this. “Int\.” as the “Pattern Before” is just “Int” with a period. The backslash before the period ensures that the period is treated as a literal period rather than as the regular expression metacharacter that matches any character. OmegaT will only apply this rule if “Int” is followed by a period, not by any other character. “\p{Zs}” in the “Pattern After” simply means any kind of space character; it’s just a better representation of a space than the one I used before. Again, I’m not enabling the checkbox. In other words, I am telling OmegaT not to make a segment when “Int.” occurs before a space, which means it’s an abbreviated word. The segmentation is correct now in all three instances, because the new rule covers them all.
  5. But wait, what’s happening here? The same “int.” gets a sentence segmented anyway. The reason is that the rule I added covers only the upper-case letter “I,” while in this segment the word starts with a lower-case letter, so the rule doesn’t apply. Well, no big deal. I can adjust the rule so that it matches both lower-case and upper-case letters. One way to do so is to add the “(?i)” flag to the “Pattern Before.” This flag enables case-insensitive matching. As a result, OmegaT will apply the rule to any “int.”, whether it’s written in lower-case or upper-case letters. The segmentation is correct now. I can take a break, I guess.
  6. Not so fast. This same sequence of letters also occurs as part of another word at the end of a sentence! Now, my carefully crafted rule merges two completely “innocent” sentences. I need to adjust it even further. A simple way to do this is to add a regular expression that represents a word boundary. The “Pattern Before” now looks like this: “(?i)\bint\.” The “\b” means that for this rule to work, “int.” must occur as a standalone word and not as the ending of a longer word. After tweaking this rule, I get correct segmentation.
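
To see how the four versions of the rule from the steps above behave, here is a minimal Java sketch. It is only an approximation of how a “do not segment here” rule is checked, not OmegaT’s actual code: the class `MergeRuleSketch`, the helpers `suppressesBreak` and `afterFirstPeriod`, and the sample sentences are all made up for illustration. The patterns are the ones from the script, with the period escaped in the last one as in step 4.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; NOT OmegaT's implementation.
public class MergeRuleSketch {

    // Hypothetical helper: returns true if an exception (unchecked) rule
    // matches at position pos, i.e. the text before pos ends with the
    // "Pattern Before" and the text after pos starts with the "Pattern After".
    static boolean suppressesBreak(String before, String after, String text, int pos) {
        String head = text.substring(0, pos);
        String tail = text.substring(pos);
        // \z anchors the "before" pattern to the very end of the head.
        boolean beforeOk = Pattern.compile("(?:" + before + ")\\z").matcher(head).find();
        // lookingAt() anchors the "after" pattern to the start of the tail.
        boolean afterOk = Pattern.compile(after).matcher(tail).lookingAt();
        return beforeOk && afterOk;
    }

    // Hypothetical helper: position right after the first period that is
    // followed by a space (the candidate break point in the samples).
    static int afterFirstPeriod(String text) {
        return text.indexOf(". ") + 1;
    }

    public static void main(String[] args) {
        // Made-up sample sentences, not taken from the video.
        String[] samples = {
            "See Int. 10 for details.",
            "See Int. 20 for details.",
            "see int. 30 for details.",
            "Send it to print. Then wait."
        };
        // { label, Pattern Before, Pattern After } for each step of the script.
        String[][] rules = {
            { "quick fix",        "Int.",            " 10" },
            { "general",          "Int\\.",          "\\p{Zs}" },
            { "case-insensitive", "(?i)Int\\.",      "\\p{Zs}" },
            { "word boundary",    "(?i)\\bint\\.",   "\\p{Zs}" }
        };
        for (String[] rule : rules) {
            System.out.println(rule[0] + ":");
            for (String s : samples) {
                boolean merged = suppressesBreak(rule[1], rule[2], s, afterFirstPeriod(s));
                System.out.println("  " + (merged ? "merge" : "break") + "  " + s);
            }
        }
    }
}
```

Under these assumptions, the quick fix merges only the “Int. 10” sample, the general rule also covers “Int. 20,” the case-insensitive rule picks up “int. 30” but wrongly merges the sentence ending in “print.”, and the word-boundary rule leaves that sentence alone.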

That’s about it. Stay tuned for the next video, where I’ll show how to split segments in OmegaT. In the meantime, feel free to ask any questions in the comments.

If you missed the previous video on this topic, click this link to learn about the basics of segmentation in OmegaT.
