
How Machine Translation Adapts to Double-Byte Languages

UTF-8, alongside neural networks, boosts machine translation performance when dealing with Double-Byte and Multi-Byte languages.

Thalita Lima


Each language in the world belongs to a family that explains its origin and spread. But did you know that, from the point of view of how computers store text, languages can be separated into two groups: Single-byte and Double-byte languages?

Double-byte languages are those whose characters are each represented by 2 bytes (16 bits) in their character encoding systems.

This happens because these languages have large character sets, which require more storage space than Single-byte (8-bit) encoding systems can offer.

Double-byte languages include Chinese (simplified and traditional), Japanese, Korean, Vietnamese (in some older encodings), and many others across the globe.

Variable-width encodings such as UTF-8 (1 to 4 bytes per character) and UTF-16 (2 or 4 bytes per character) are necessary to support Double-Byte and Multi-Byte languages.

Important: Double-Byte Character Set (DBCS) languages are often confused with Multi-Byte Character Set (MBCS) languages because the two concepts are similar.

Let’s look at the details in this article, focusing on how this encoding difference affects machine translation!

1. Character Mapping Systems for Double-Byte Languages

The encoding process became easier after the invention of UTF (the Unicode Transformation Format).

1.1) The Systems That Came About Before Unicode Became Popular – DBCS (Double-Byte Character Set)

DBCS was created for languages that need many characters, mainly Chinese, Japanese, and Korean (CJK).

Examples: Shift JIS (Japanese), Big5 (Traditional Chinese), EUC-KR (Korean).

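With 2 bytes (16 bits), a DBCS can in theory represent up to 65,536 characters (2¹⁶), although real encodings use fewer because some byte values are reserved for single-byte characters such as ASCII.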
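As a rough illustration of the byte widths described above, the sketch below encodes one sample character in each of the three legacy encodings using Python's built-in codecs. The sample characters and the helper name `byte_width` are assumptions chosen for this example, not part of any standard.

```python
# Illustrative check of byte widths in three legacy double-byte encodings.
# The sample characters are arbitrary choices made for this sketch.
samples = {
    "shift_jis": "日",  # Japanese
    "big5": "中",       # Traditional Chinese
    "euc_kr": "한",     # Korean
}


def byte_width(char: str, encoding: str) -> int:
    """Return how many bytes `char` occupies in `encoding`."""
    return len(char.encode(encoding))


for encoding, char in samples.items():
    print(encoding, char, byte_width(char, encoding))

# ASCII letters remain single-byte in these encodings.
print("A in shift_jis:", byte_width("A", "shift_jis"))

# Theoretical ceiling of a 2-byte code space.
print("2 ** 16 =", 2 ** 16)
```

Each CJK sample occupies two bytes, while the ASCII letter occupies one.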

1.2) Advanced Systems: UTF-16 and UTF-8

UTF is a family of encoding schemes that convert Unicode characters into binary form, so that computers and software systems can store, display, and share text from many languages and scripts.

Unicode itself is an international standard under which each character is assigned a unique number (a code point), regardless of the language or writing system it belongs to.

UTF describes how these code point numbers are converted into a byte stream that a computer can process.
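The path from character to code point to bytes can be sketched in a few lines of Python. The sample character 語 and the helper name `describe` are assumptions made for illustration only.

```python
def describe(char: str) -> dict:
    """Show the code point of `char` and its bytes in two UTF encodings."""
    return {
        "code_point": f"U+{ord(char):04X}",          # the Unicode number
        "utf-8": char.encode("utf-8").hex(),         # byte stream as hex
        "utf-16-be": char.encode("utf-16-be").hex(),
    }


info = describe("語")
for key, value in info.items():
    print(key, value)
```

The same code point produces different byte streams depending on which UTF is chosen, which is why the encoding has to be known before the bytes can be read back.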

Examples of UTF:

UTF-16: 2 or 4 bytes for each character.

It is still in use in specific systems that deal with Asian characters.

For example, some versions of Windows use UTF-16 internally.

UTF-8: each character takes 1 to 4 bytes, depending on the symbol.

Basic Latin letters (as in English) take 1 byte; accented letters (as in Spanish or Portuguese) take 2 bytes; most Asian characters take 3 bytes, and rarer symbols take 4.

UTF-8 is also the most widely used encoding nowadays, on the web, in databases, and in modern applications.
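To make these byte counts concrete, here is a minimal sketch that measures a few sample characters in both encodings. The samples and the helper name `widths` are assumptions picked for this example.

```python
def widths(char: str) -> tuple:
    """Return (UTF-8 bytes, UTF-16 bytes) for a single character."""
    # "utf-16-le" is used so that no byte-order mark is counted.
    return len(char.encode("utf-8")), len(char.encode("utf-16-le"))


# Sample characters chosen for illustration.
samples = ["a", "ñ", "中", "한", "😀"]

for ch in samples:
    utf8_len, utf16_len = widths(ch)
    print(f"{ch}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

Note that CJK text is actually more compact in UTF-16 (2 bytes per character) than in UTF-8 (3 bytes), while English text is more compact in UTF-8.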

1.3) Why SBCS Doesn't Work for Double-Byte Languages

SBCS (Single-Byte Character Set) is a system with a maximum of 256 characters (1 byte = 8 bits = 2⁸ = 256 possibilities). It works for languages with small alphabets, such as English, Spanish, or French, which fit within this limit.

For languages that use thousands of characters, SBCS simply lacks the space!

Chinese has more than 50,000 characters, although 3,000-5,000 of these are in everyday use;

Japanese combines kanji (Chinese logograms) with hiragana and katakana and requires far more characters than SBCS can contain.

So that’s the reason Double-byte languages need appropriate systems.
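The 256-character limit can be observed directly. The sketch below uses Latin-1 as a stand-in for a typical single-byte character set; that choice, the sample strings, and the helper name `fits_in_single_byte` are assumptions made for illustration.

```python
def fits_in_single_byte(text: str, encoding: str = "latin-1") -> bool:
    """Return True if every character of `text` exists in the single-byte set."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        # At least one character has no slot among the 256 available values.
        return False


print(fits_in_single_byte("français"))  # French fits
print(fits_in_single_byte("日本語"))     # Japanese does not
print("slots available:", 2 ** 8)
```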

2. Double-byte Languages in Machine Translation

These languages have some salient features that machines have to deal with:

2.1) Support for Encoding

The vast majority of machine translation tools available today work efficiently with UTF-8 and UTF-16, as these encodings are versatile and can represent highly complex characters.

UTF-8 is more widely accepted than other encoding formats because it serves both English text (1 byte per character) and Japanese or Chinese text (multiple bytes per character).

This suits a competitive global business landscape in which English-speaking and Mandarin-speaking countries play a dominant role.

2.2) Segmentation of Text

In Spanish or Portuguese, spaces are used to segment each word, making isolating words in a sentence very easy.

In Chinese or Japanese, spaces are not used as delimiters, so machines must perform word delimiting, or text segmentation, to mark the boundaries of lexical units before any translation is done.
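One classic way to segment text that has no spaces is greedy forward maximum matching against a word list. The sketch below is a toy version under stated assumptions: the vocabulary, the sample sentence (“I like machine translation”), and the function name `segment` are invented for this example, and production translation systems generally rely on statistical or neural tokenizers instead.

```python
def segment(text: str, vocabulary: set, max_len: int = 4) -> list:
    """Greedy forward maximum matching over a toy vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical mini-dictionary for illustration only.
vocabulary = {"我", "喜欢", "机器", "翻译", "机器翻译"}
tokens = segment("我喜欢机器翻译", vocabulary)
print(tokens)
```

Because the longest match wins, 机器翻译 (“machine translation”) is kept as one unit instead of being split into 机器 and 翻译.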

2.3) Ambiguity and Context

A character in many Asian languages can have different meanings depending on the context.

Take for instance the Chinese character “行”: in “银行” (yínháng, “bank”) it is read háng, while in “行走” (xíngzǒu, “to walk”) it is read xíng and carries a different meaning.

Today, systems like DeepL, Google Translate, Microsoft Translator, and Papago (Naver) use neural networks* to predict context and then choose the best sentence among the possible options.

*Artificial neural networks are computational models loosely inspired by the human brain. They process massive amounts of data through layers of artificial neurons, looking for patterns and learning to make decisions without relying solely on predetermined rules.

In machine translation, neural networks consider context at the sentence level rather than translating word by word, which leads to more natural and accurate translations.

2.4) Word Order

The differences in grammatical structures among languages are huge…

Example:

English: “I eat an apple.”

Japanese: “I apple eat” (私はリンゴを食べる)

Machine translation has to rearrange words properly so that the meaning of the sentence is not lost.

2.5) Translation of Idiomatic Expressions

Idioms can be tricky to translate directly.

E.g.: the Japanese idiom 猿も木から落ちる literally means “Even monkeys fall from trees”, but it translates naturally as “Even experts make mistakes”.

3. Are DBCS and MBCS the Same Thing?

Double-Byte (DBCS) and Multi-Byte (MBCS) should be differentiated from each other.

Double-Byte Character Set (DBCS) → an encoding system that uses two bytes (16 bits) for each character of a script with a large character set.

Examples: Big5 (Traditional Chinese), Shift JIS (Japanese), EUC-KR (Korean).

These systems predate Unicode.

Multi-Byte Character Set (MBCS) → any encoding in which a single character may take a variable number of bytes.

Example: UTF-8, which uses 1, 2, 3, or 4 bytes per character.

Before Unicode, DBCS was typically used for languages with very large character sets, such as CJK (Chinese, Japanese, Korean), with a limit of two bytes per character.

Other languages, such as Thai, Vietnamese, Hindi, and Arabic, are now normally encoded with Unicode's multi-byte encodings.

Because of UTF-8 and UTF-16, DBCS is dying out, and many languages are, or soon will be, described as “Multi-byte” or simply referred to by their own names (e.g. Chinese, Japanese, Korean, Swahili, and others).

Conclusion: for machine translation, the two groups are closer than ever. Today, systems can handle text in most languages, whether a character takes two bytes or more.

“Double-byte languages” is still a popular term, but now you know that the range it covers is wider.

4. Double-Byte (DBCS) and Multi-Byte Languages Worldwide

We have talked about Chinese and Japanese, but there are many more Double-byte and Multi-byte languages. So let's take a tour around the world to get to know them…

4.1) Historically Double-Byte (DBCS) Languages

DBCS was, for the most part, used by the CJK languages (Chinese, Japanese, Korean) of East Asia.

Simplified Chinese (China, Singapore) – Old encodings: GB2312, GBK
Traditional Chinese (Taiwan, Hong Kong, Macau) – Old encoding: Big5
Japanese – Old encodings: Shift JIS, EUC-JP
Korean – Old encoding: EUC-KR

These languages have a vast number of characters, necessitating Double-Byte encoding in pre-Unicode systems.

Korean keyboard. Image by Wikimedia Commons

4.2) Languages Using Multi-Byte Encodings (MBCS)

Today, these languages may require two, three, or even four bytes per character. They are usually encoded in UTF-8 or UTF-16.

Southeast Asian Languages:

→ Vietnamese – uses the Latin alphabet with many diacritics; its accented letters take 2 or 3 bytes in UTF-8.

→ Thai – each character takes 3 bytes in UTF-8, and vowel and tone marks combine with base consonants.

→ Lao – similar to Thai; its characters also need more than 1 byte in Unicode encodings.

→ Khmer (Cambodia) – has a large character set that needs Multi-byte encoding.

→ Myanmar (Burmese) – contains complex characters that need Multi-byte encoding.

South Asian Languages:

→ Hindi and other languages written in Devanagari (Marathi, Nepali, Sanskrit), as well as Tamil, Telugu, Kannada, Bengali, Punjabi (Gurmukhi), Gujarati, Malayalam, and Sinhala.

These scripts are very complex, with many character combinations, and need Multi-Byte encoding to be represented.

Middle Eastern Languages:

→ Arabic, Persian – these languages have relatively small alphabets, but each letter takes 2 bytes in UTF-8. Letters also change shape depending on their position in a word, which is handled when the text is displayed rather than in the encoding.

→ Hebrew – like Arabic, it takes more than one byte per letter depending on the encoding (2 bytes in UTF-8).

Tibetan, Georgian, and Armenian: may need multiple bytes per character in certain encodings.

Most original scripts of African and Indigenous American languages: most of them need multiple bytes per character.

5. Why is it Important for Machine Translation to Handle Double-Byte and Multi-Byte Languages?

5.1) Global Accessibility

Many major Asian languages, such as Chinese, Japanese, and Korean, as well as some Southeast Asian languages, require multi-byte encoding.

If translation systems cannot properly handle Double-Byte and Multi-Byte encoding, the result is encoding errors, system failures, and wrong translations.
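A typical encoding error looks like the sketch below: bytes written in one encoding are read back with another. The sample text and the choice of Shift JIS, Windows-1252, and UTF-8 are assumptions made for this illustration.

```python
original = "日本語"

# Bytes as a legacy Japanese system might have stored them.
raw = original.encode("shift_jis")

# Reading with the correct codec restores the text.
right = raw.decode("shift_jis")

# Reading with a Western single-byte codec "succeeds" but yields garbage.
wrong = raw.decode("cp1252")

# Reading as UTF-8 fails outright for these bytes.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print("correct codec:", right)
print("wrong codec:  ", wrong)
print("UTF-8 decoding failed:", utf8_failed)
```

A translation engine fed the garbled string would have no chance of producing a correct result, because the original characters are no longer there to translate.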

As international trade and communication keep growing, providing translations for these languages is critical for market expansion and better global communication.

Shenzhen Airport, Shenzhen, China. Image by Andy Beales on Unsplash

5.2) Competitiveness in the World Market

Supporting traditionally hard-to-translate languages in high-quality machine translation services can also open the door to new users.

This not only improves the user experience but also enables many more people to receive and use content in their own languages.

Companies risk losing sales in vital markets such as Asia if their machine translation systems struggle with multi-byte languages.

5.3) Interoperability and Data Flows

Machine translation must work well with multi-byte languages; otherwise, information cannot be transferred correctly to devices and platforms that use them.

This ability allows information to be handled correctly regardless of its origin or encoding format, which helps global systems such as applications, websites, and databases use multilingual data. In short, interoperability ensures that systems with different encoding formats can exchange information effectively, especially in multilingual contexts.
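One common way to achieve this kind of exchange is to convert legacy data to UTF-8 at the system boundary. The sketch below assumes the source encoding is already known; the helper name `to_utf8` and the sample greeting are hypothetical choices for this example.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode legacy bytes into Unicode text, then re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Korean text as it might arrive from an older EUC-KR system.
legacy = "안녕하세요".encode("euc_kr")

converted = to_utf8(legacy, "euc_kr")

print("EUC-KR bytes:", len(legacy))
print("UTF-8 bytes: ", len(converted))
print("round trip:  ", converted.decode("utf-8"))
```

The text is unchanged by the conversion; only its byte representation differs, which is what lets systems with different encodings share the same content.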

6. Key Points to Conclude

There are many challenges in machine translation for Double-byte languages, such as segmentation, grammar, and contextual meanings.

With older systems, such as Shift JIS for Japanese, Big5 for Traditional Chinese, and EUC-KR for Korean, handling double-byte character sets (DBCS) was a tough task.

With neural networks, deep learning, and natural language processing (NLP) making continuous strides, translation systems are becoming more accurate, better, and faster. UTF-8 was a milestone that changed how these languages with large character sets are handled.

Double-byte and Multi-byte support is a must-have for machine translation, essential for ensuring accuracy, context, and interoperability.

The absence of this capability degrades the user experience and reduces the performance of translation systems.

So, the most efficient approach is to optimize machine translation of Double-byte languages using existing modern systems.

The good news is that we already have support for that; we just need to keep improving these models.
