OmegaT Segmentation: DIY and Have Fun

This article builds on the previous post about segmentation in OmegaT and explains how you can put segmentation rules to work, including rules that are specific to a single project. This post is more technical, though, so you’ll probably need to be more patient than usual to get the most value from it. Let’s start by looking at a current default rule so that you can get a grasp of what a rule looks like:

Default segmentation rules

Undesirable segmentation (3 sentences in one segment):

I have a solution. Do you want to validate the solution? Yes, I do!

This is the default rule that takes care of this very basic segmentation instance:

Break or Exception | Pattern before | Pattern after
Break (checkbox enabled) | [.?!]+ | \p{Zs}

Break means splitting a segment.

What the rule does: Segments after any period, question mark, or exclamation mark followed by any type of space.

Result:

I have a solution.

Do you want to validate the solution?

Yes, I do!
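
To see the default rule in action outside OmegaT, here is a minimal Java sketch. It assumes that a break rule fires wherever "Pattern before" matches immediately ahead of "Pattern after". The class and method names are made up for illustration, and the code is a simplified model rather than OmegaT's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DefaultBreakRuleSketch {
    // Splits text at every position where "before" ends and "after" begins.
    // Hypothetical helper: a simplified model, not OmegaT's own code.
    static List<String> split(String text, String before, String after) {
        // The lookahead checks "after" without consuming it.
        Pattern rule = Pattern.compile("(?:" + before + ")(?=" + after + ")");
        Matcher m = rule.matcher(text);
        List<String> segments = new ArrayList<>();
        int start = 0;
        while (m.find()) {
            // Trimming drops the space that separated the two segments.
            segments.add(text.substring(start, m.end()).trim());
            start = m.end();
        }
        segments.add(text.substring(start).trim());
        return segments;
    }

    public static void main(String[] args) {
        String text = "I have a solution. Do you want to validate the solution? Yes, I do!";
        for (String segment : split(text, "[.?!]+", "\\p{Zs}")) {
            System.out.println(segment);
        }
        // Prints:
        // I have a solution.
        // Do you want to validate the solution?
        // Yes, I do!
    }
}
```

Note that the final exclamation mark does not trigger a break, because no space follows it.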

Adding a custom segmentation rule to merge segments

Here is a simple process for merging an incorrectly split segment:

Undesirable segmentation:

Int.

10 should be replaced.

where “Int.” is an abbreviated word.

The sentence was split at the period, which here marks an abbreviation rather than the end of a sentence. I need to merge the two segments right after that period. I can do this by adding a simple exception rule to the language-specific rules, so that it applies before the default one:

Break or Exception | Pattern before | Pattern after
Exception (checkbox disabled) | Int\. | \s

Exception means merging two segments. In “Int\.”, the backslash means that the period is really a period and not the regular expression that matches any character. “\s” stands for a space.

Result:

Int. 10 should be replaced.
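
The interplay between the exception and the default break rule can be sketched in Java as follows. The sketch assumes that rules are checked from top to bottom and that the first rule matching at a given position decides whether the text breaks there. The `Rule` class and the `segment` method are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExceptionRuleSketch {
    // One row of the rules table: break (true) or exception (false).
    static final class Rule {
        final boolean isBreak;
        final Pattern pattern;

        Rule(boolean isBreak, String before, String after) {
            this.isBreak = isBreak;
            // Match "before" only where "after" follows immediately.
            this.pattern = Pattern.compile("(?:" + before + ")(?=" + after + ")");
        }
    }

    // Assumption: the first rule in the list that matches at a position wins.
    static List<String> segment(String text, List<Rule> rules) {
        Map<Integer, Boolean> decisions = new TreeMap<>();
        for (Rule rule : rules) {
            Matcher m = rule.pattern.matcher(text);
            while (m.find()) {
                decisions.putIfAbsent(m.end(), rule.isBreak);
            }
        }
        List<String> segments = new ArrayList<>();
        int start = 0;
        for (Map.Entry<Integer, Boolean> decision : decisions.entrySet()) {
            if (decision.getValue()) {
                segments.add(text.substring(start, decision.getKey()).trim());
                start = decision.getKey();
            }
        }
        segments.add(text.substring(start).trim());
        return segments;
    }

    public static void main(String[] args) {
        Rule exception = new Rule(false, "Int\\.", "\\s");
        Rule defaultBreak = new Rule(true, "[.?!]+", "\\p{Zs}");
        String text = "Int. 10 should be replaced.";
        // Default rule only: the abbreviation is split off.
        System.out.println(segment(text, Arrays.asList(defaultBreak)));
        // prints [Int., 10 should be replaced.]
        // Exception listed first: the break after "Int." is suppressed.
        System.out.println(segment(text, Arrays.asList(exception, defaultBreak)));
        // prints [Int. 10 should be replaced.]
    }
}
```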

But then, I may realize that this “Int.” also occurs as “int.” with a lower-case letter:

Undesirable segmentation:

The int.

10 should be replaced.

Since my rule has “Int.” beginning with an upper-case letter and the rules are case-sensitive, my rule fails to work in this instance. I can adjust it by making the pattern case-insensitive with the (?i) flag, so that “i” matches both the lower-case and the upper-case letter:

Break or Exception | Pattern before | Pattern after
Exception (checkbox disabled) | (?i)int\. | \s

Result:

The int. 10 should be replaced.
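
The effect of the (?i) flag can be checked with a few lines of Java. This is only a sketch: `ruleMatches` is a hypothetical helper that reports whether "Pattern before" occurs directly ahead of "Pattern after" anywhere in the text.

```java
import java.util.regex.Pattern;

class CaseInsensitiveFlagSketch {
    // Hypothetical helper: true if "before" matches right ahead of "after".
    static boolean ruleMatches(String before, String after, String text) {
        return Pattern.compile("(?:" + before + ")(?=" + after + ")")
                .matcher(text)
                .find();
    }

    public static void main(String[] args) {
        String lower = "The int. 10 should be replaced.";
        String upper = "Int. 10 should be replaced.";
        // The case-sensitive pattern misses the lower-case variant.
        System.out.println(ruleMatches("Int\\.", "\\s", lower));      // prints false
        // With (?i), the same rule covers both spellings.
        System.out.println(ruleMatches("(?i)int\\.", "\\s", lower));  // prints true
        System.out.println(ruleMatches("(?i)int\\.", "\\s", upper));  // prints true
    }
}
```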

As if that weren’t challenging enough, I also notice that this same abbreviated word may actually occur as a part of a word at the end of a sentence! As a result, my rule merges two “innocent” sentences:

Undesirable segmentation:

She might faint. Or she might get furious.

To avoid this, I need to adjust my rule even further. One way to do this is simply to indicate that “int.” should be merged with the next sentence only if it’s a standalone word like this:

Break or Exception | Pattern before | Pattern after
Exception (checkbox disabled) | \b(?i)int\. | \s

“\b” means a word boundary.

Result:

She might faint.

Or she might get furious.
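
A quick Java check shows why the word boundary matters. As before, this is a sketch with a hypothetical helper, `ruleMatches`, that reports whether "Pattern before" occurs directly ahead of "Pattern after".

```java
import java.util.regex.Pattern;

class WordBoundarySketch {
    // Hypothetical helper: true if "before" matches right ahead of "after".
    static boolean ruleMatches(String before, String after, String text) {
        return Pattern.compile("(?:" + before + ")(?=" + after + ")")
                .matcher(text)
                .find();
    }

    public static void main(String[] args) {
        String faint = "She might faint. Or she might get furious.";
        String abbreviation = "The int. 10 should be replaced.";
        // Without \b, the exception also fires inside "faint."
        System.out.println(ruleMatches("(?i)int\\.", "\\s", faint));            // prints true
        // With \b, "int." must start at a word boundary, so "faint." is left alone.
        System.out.println(ruleMatches("\\b(?i)int\\.", "\\s", faint));         // prints false
        // The standalone abbreviation is still caught.
        System.out.println(ruleMatches("\\b(?i)int\\.", "\\s", abbreviation));  // prints true
    }
}
```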

Adding a custom segmentation rule to split segments

I also want to be able to split two sentences that were merged into one. To do this, I can build my rule around whatever joins those two sentences. The other pattern can simply be an unescaped period, which matches any character.

Undesirable segmentation:

Open the Settings window.<br>Open the Files tab.

In this case, the pattern joining the segments is a plain-text tag between them. This tag is what I’m going to use as a basis for my rule:

Break or Exception | Pattern before | Pattern after
Break (checkbox enabled) | <br> | . (period)
Break (checkbox enabled) | . (period) | <br>

The first rule segments after the tag to make the second sentence a separate segment.

Open the Settings window.<br>
Open the Files tab.

The second rule moves the tag out of the first segment so that it’s a separate segment as well.

Open the Settings window.
<br>
Open the Files tab.
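
The two break rules can be tried out with a short Java sketch. It assumes that the tag is the literal text `<br>` and that every break rule contributes a break wherever "Pattern before" ends and "Pattern after" begins. The class and method names are hypothetical, and the code is a simplified model rather than OmegaT's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagBreakRulesSketch {
    // Each rule is a {before, after} pair; all of them are break rules here.
    static List<String> split(String text, String[][] breakRules) {
        TreeSet<Integer> breaks = new TreeSet<>();
        for (String[] rule : breakRules) {
            Pattern p = Pattern.compile("(?:" + rule[0] + ")(?=" + rule[1] + ")");
            Matcher m = p.matcher(text);
            while (m.find()) {
                breaks.add(m.end()); // break right after "before"
            }
        }
        List<String> segments = new ArrayList<>();
        int start = 0;
        for (int position : breaks) {
            segments.add(text.substring(start, position));
            start = position;
        }
        segments.add(text.substring(start));
        return segments;
    }

    public static void main(String[] args) {
        String text = "Open the Settings window.<br>Open the Files tab.";
        // First rule only: break after the tag.
        System.out.println(split(text, new String[][] {{"<br>", "."}}));
        // prints [Open the Settings window.<br>, Open the Files tab.]
        // Both rules: the tag becomes a segment of its own.
        System.out.println(split(text, new String[][] {{"<br>", "."}, {".", "<br>"}}));
        // prints [Open the Settings window., <br>, Open the Files tab.]
    }
}
```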

Summary

Even though these things might seem challenging, believe me, this is not rocket science. As soon as you make it a habit to optimize segmentation, you’ll even start having fun, I promise. An additional benefit is that you’ll learn how to use regular expressions. You can then also use them to search text in OmegaT or in other applications, such as your favorite text editor.

Please let me know in the comments whether these explanations make sense at all. And please get back to me if you actually start using the segmentation rules in OmegaT as a result of reading this article!

One comment

  • Kos Ivantsov says:

    Yes indeed, regular expressions truly are the most powerful magic when it comes to transforming text. Thanks for the article.


About the Author

Roman Mironov
CEO & Founder

As the founder of Velior, Roman has had the privilege of being able to turn his passion for languages into a business. He has over 15 years of experience in the translation industry. Roman has helped dozens of clients increase sales by making their products appealing for speakers of other languages.