Video: Merging Segments in OmegaT


May 27th, 2013, Roman Mironov

Have you ever felt frustrated about a segment split into two in OmegaT? Do you want to be able to fix this kind of problem, but just don’t know how to go about it?

Watch this video to understand how to merge segments in OmegaT:

Script:

Many OmegaT users admit that the current approach to segmentation in the program isn’t exactly straightforward. I agree. It’s not rocket science either, however. Indeed, to build smart rules, you need to understand regular expressions, which might be a challenge. But if you just want a quick fix for your current segment, knowing regular expressions is often unnecessary.

  1. So let’s start by merging segments with a “quick-fix” rule. This sentence was incorrectly segmented on the period in an abbreviated word. Here in the note, you can see how the segment is supposed to look. I need to merge the two segments right after the period. After I open the Segmentation Setup window, I can select the English set of rules (since my source language is English) and add my “quick-fix” rule to this set. But I can also click Add and create a new set of rules that I will name “Quick-Fix Rules.” For the language pattern, I will enter “EN-US.” By doing so, I make sure that the rules in this new set take priority over all the rules both in the English set and the Default set. None of the rules in those other sets can interfere with my “quick-fix” rules.
  2. I am ready to add my new rule now. I click Add in the section below. After a blank rule appears, I enter “Int.” as the “Pattern Before” and “ 10” (a space followed by “10”) as the “Pattern After.” Make sure you don’t forget this space when you add rules. Since I need to merge segments at this point, I leave the checkbox unchecked. As soon as I reload the project, I get the correct segmentation.
  3. Now, what I did isn’t the most efficient approach, but that isn’t the point with any “quick-fix” rule. The point is that it’s easy and it works. Its main drawback is that if I have a lot of similar segmentation issues in a project, I’ll have to make a rule for each instance. For example, in this case, I will need to add separate rules for “Int. 20” and “Int. 30.”
  4. Okay, I’m now ready to take this to the next level. I see a pattern here, so I want to add a rule that will cover the entire pattern instead of adding rules on a case-by-case basis. I delete the old rule first, so that it doesn’t interfere with the new rule. This new rule can either go into the “Quick-Fix Rules” set again or, if I think it could be of value in future projects, into the English set or even the Default set. In this case, I don’t think it will, so I’ll just put it into the “Quick-Fix Rules.” The pattern I want to cover with my new rule should be the most general one. I want to have each abbreviated word “Int.” merged with the following text, whether that is a number, a letter, or a symbol. The rule will look like this. “Int\.” as the “Pattern Before” is just “Int” with a period. The backslash before the period ensures that the period is treated as a literal period rather than as the regular-expression wildcard that matches any character. OmegaT will only apply this rule if “Int” is followed by a period, not by any other character. “\p{Zs}” in the “Pattern After” simply means any kind of space; it’s just a more robust representation of a space than the one I used before. Again, I’m not enabling the checkbox. In other words, I am telling OmegaT not to make a segment when “Int.” occurs before a space, which means it’s an abbreviated word. The segmentation is correct now in all three instances, because the new rule covers them all.
  5. But wait, what’s happening here? The same “int.” still gets a sentence split. The reason is that the rule I added covers only the upper-case letter “I,” while in this segment the word starts with a lower-case letter, so the rule doesn’t apply. Well, no big deal. I can adjust the rule so that it covers both lower-case and upper-case letters. One way to do so is to add the “(?i)” flag to the beginning of the “Pattern Before.” This flag enables case-insensitive matching. As a result, OmegaT will apply the rule to any “int.”, whether it’s written in lower-case or upper-case letters. The segmentation is correct now. I can take a break, I guess.
  6. Not so fast. The same letters “int.” also occur as the ending of another word at the end of a sentence! Now, my carefully crafted rule merges two completely “innocent” sentences. I need to adjust it even further. A simple way to do this is to add a regular expression that represents a word boundary. It looks like this: “(?i)\bint\.” It means that for this rule to work, “int.” must occur as a standalone word. After tweaking this rule, I get correct segmentation.
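
If you want to experiment with these patterns outside OmegaT, here is a minimal Python sketch of the idea. It is not OmegaT's actual implementation: the `Rule` class, the `segment` function, and the single default break rule are hypothetical stand-ins, and the sample sentences are made up. OmegaT uses Java regular expressions; Python's built-in `re` module does not support “\p{Zs}”, so the sketch uses `\s` as a rough substitute. The only behavior it tries to mimic is that rules are checked in order and the first rule that matches at a position decides whether to split there.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    before: str     # "Pattern Before"
    after: str      # "Pattern After"
    is_break: bool  # True = checkbox enabled (split); False = exception (merge)


# Simplified stand-in for a default rule: split after a period followed by whitespace.
DEFAULT_BREAK = Rule(r"\.", r"\s", True)


def segment(text, rules):
    """Split text using ordered rules; the first rule matching a position wins."""
    decisions = {}  # position in text -> True (split) or False (do not split)
    for rule in rules:
        before = re.compile(rule.before)
        after = re.compile(rule.after)
        for m in before.finditer(text):
            pos = m.end()
            # Rules earlier in the list take priority over later ones.
            if pos not in decisions and after.match(text, pos):
                decisions[pos] = rule.is_break
    segments, start = [], 0
    for pos in sorted(p for p, is_break in decisions.items() if is_break):
        segments.append(text[start:pos].strip())
        start = pos
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


quick_fix = Rule(r"Int\.", r" 10", False)            # step 2: one specific case
general = Rule(r"Int\.", r"\s", False)               # step 4: any "Int." + space
ignore_case = Rule(r"(?i)int\.", r"\s", False)       # step 5: also "int."
whole_word = Rule(r"(?i)\bint\.", r"\s", False)      # step 6: standalone word only

text = "Set Int. 10 first. Then set Int. 20."
print(segment(text, [DEFAULT_BREAK]))
# ['Set Int.', '10 first.', 'Then set Int.', '20.']
print(segment(text, [quick_fix, DEFAULT_BREAK]))
# ['Set Int. 10 first.', 'Then set Int.', '20.']
print(segment(text, [general, DEFAULT_BREAK]))
# ['Set Int. 10 first.', 'Then set Int. 20.']

lower = "See int. 30 here. Done."
print(segment(lower, [general, DEFAULT_BREAK]))
# ['See int.', '30 here.', 'Done.']
print(segment(lower, [ignore_case, DEFAULT_BREAK]))
# ['See int. 30 here.', 'Done.']

embedded = "Check the print. Then go."
print(segment(embedded, [ignore_case, DEFAULT_BREAK]))
# ['Check the print. Then go.']
print(segment(embedded, [whole_word, DEFAULT_BREAK]))
# ['Check the print.', 'Then go.']
```

The exception rules are listed before the default break rule on purpose, which mirrors why the “Quick-Fix Rules” set has to take priority over the other sets.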

That’s about it. Stay tuned for the next video, where I’ll show how to split segments in OmegaT. In the meantime, feel free to ask any questions in the comments.

If you missed the previous video on this topic, click this link to learn about the basics of segmentation in OmegaT.


Comments:

  1. Great post, as usual, Roman. It’s worth saying that you could also have performed a global search-replace on the source files, replacing Int.[SPACE] with Int.[NON-BREAKING SPACE]. Sometimes, using non-breaking spaces is the easiest way to fix segmentation issues without even touching the rules. Of course, this only works when you need to merge 2 segments, not the other way around. :)
    Bests
    Marco

  2. In non-structured text where TAB is not used as a delimiter, you can use TAB to split segments. Of course, it’s not as universal as using non-breaking space to merge segments, but still sometimes it permits you to get your segments straight without touching the rules.

    • Hi Kos,
      Another good tip, thanks!
      Did you know it was you who opened my eyes to the world of opportunity of using segmentation rules in OmegaT? :)
      Best,
      Roman

  3. S.Kumar

    Sir,

    I am in India. I have installed OmegaT (LATEST VERSION, 3.1.0). I want to
    be a translator. I want to translate ENGLISH TO TAMIL AND TAMIL TO ENGLISH
    documents. If I open a Word document in OmegaT, it shows each line
    as a segment. I want to merge (English and Tamil) three or four lines
    (segments) into a single segment wherever necessary. How do I do that, please?

    There is a SCANNED TEXT PDF file containing both English and
    Tamil. I want to import it all into OmegaT. Is it possible?
    Is it possible to open a Microsoft Word 2007 file containing both
    English and Tamil text?

    Thanks,
    S.Kumar

    • Hello,

      I am in India. I have installed OmegaT (LATEST VERSION, 3.1.0). I want to
      be a translator. I want to translate ENGLISH TO TAMIL AND TAMIL TO ENGLISH
      documents. If I open a Word document in OmegaT, it shows each line
      as a segment. I want to merge (English and Tamil) three or four lines
      (segments) into a single segment wherever necessary. How do I do that, please?

      To make text appear “merged” across lines in OmegaT, you need to merge it in the source file first. Open your source file and delete the carriage returns between the respective lines.

      To make sure translation appears under the source text in the target document, simply leave all the source text in the segment (press Ctrl+Shift+R to insert it if necessary), then put the cursor at the end of the source text, and press Ctrl+Enter to create a new line. Put the translation into that line.

      There is a SCANNED TEXT PDF file containing both English and
      Tamil. I want to import it all into OmegaT. Is it possible?
      Is it possible to open a Microsoft Word 2007 file containing both
      English and Tamil text?

      Of course. OmegaT does not care what language is in your file; it just shows whatever text there is.

      Best wishes,
      Roman


  • © 2005-2014 Velior
