Based on the readers’ feedback, I made a short video explaining the very basics of segmentation in OmegaT. You can find the video and the script below:
- Let’s start by looking at paragraph-level versus sentence-level segmentation. Sentence-level segmentation is more practical because you normally don’t want to have an entire paragraph in a segment. What you want is a sentence. This is an example where three sentences belong to the same paragraph in the source file, but they appear as three separate segments in OmegaT. This is the default setting.
- Let’s try to change this setting to paragraph-level segmentation and see what happens. Now, these three sentences appear as a single segment. Translating them like this is usually not a good idea.
- Let me restore the default setting. This is much better now.
- But wait. Here’s an example where paragraph-level segmentation actually comes in handy. I have a German source file where each piece of text is on a separate line and is, therefore, a paragraph. But the file also includes multiple abbreviations that cause incorrect segmentation: a segment gets split into two. Instead of correcting each instance with a custom segmentation rule, I can simply switch to paragraph-level segmentation. After switching, the segmentation is correct.
- So far, so good. Now, let’s see how the main default rule works. This is the rule. Like many other rules, this one is based on a regular expression. This rule splits the text after any period, question mark, or exclamation mark followed by a space. The selected checkbox means that the rule splits the text.
- Let’s go back to the editor. In this case, this rule splits the text after the period. But in the next instance, it fails to split the text after the question mark here because it’s not followed by a regular space. It’s followed by a non-breaking space. This means that I need to adjust my rule to make sure it covers a non-breaking space as well.
- Let’s go back to the rule. This regular expression represents a regular space. I will replace it with another one that represents any kind of space, including a non-breaking one. There you go.
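
For readers who prefer to see the idea outside the OmegaT dialog, here is a rough Python sketch of the behavior described in the script. It is not OmegaT's code, and the patterns are not copied from OmegaT's rule set: the function names, the sample sentences, and the two "after" patterns (a literal space, and a character class that also lists the non-breaking space U+00A0) are assumptions chosen for illustration. OmegaT itself uses Java regular expressions, so the exact pattern you type into the rule may look different.

```python
import re

NBSP = "\u00A0"  # non-breaking space

def split_sentences(text, after_pattern):
    """Toy model of one break rule: split after a run of . ? !
    when the text that follows matches after_pattern."""
    boundary = re.compile(r"[.?!]+(?=" + after_pattern + r")")
    segments = []
    start = 0
    for match in boundary.finditer(text):
        segments.append(text[start:match.end()])
        start = match.end()
    segments.append(text[start:])
    # Drop the whitespace left at segment edges and skip empty pieces.
    return [s.strip() for s in segments if s.strip()]

def split_paragraphs(text):
    """Toy model of paragraph-level segmentation: one segment per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# Hypothetical patterns, not taken from OmegaT.
REGULAR_SPACE = r" "
ANY_SPACE = r"[ \u00A0]"

# Made-up sample: a non-breaking space follows the question mark.
sample = "Is it ready?" + NBSP + "Yes. Send it today."

# The regular-space rule misses the break after the question mark.
print(split_sentences(sample, REGULAR_SPACE))
# The widened rule also breaks before the non-breaking space.
print(split_sentences(sample, ANY_SPACE))

# Made-up German line with an abbreviation ("z. B.").
german = "Das gilt z. B. für alle Dateien."
print(split_sentences(german, REGULAR_SPACE))  # broken up at the abbreviation
print(split_paragraphs(german))                # kept whole
```

The last two calls mirror the German file from the script: when every line is already one unit of text, treating each line as a segment avoids the false breaks without writing an exception for every abbreviation.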
That’s about it. Thank you for your time. Please let me know in the comments if you have any questions.