Infosys delivers concept-to-market software engineering services across the engineering value chain. Our blog will discuss the latest trends in software product engineering, outsourcing, technologies, and address business challenges.

The Encoding Dilemma

Have you ever been involved in an internationalization project and faced a very basic question: “Which encoding should I use in my application?” I have come across this question several times and have realized that I still don’t have a simple, straightforward answer to it. A lot of factors need to be considered before settling on a particular encoding, and that choice can be a crucial factor in the success or failure of the project. Choosing the wrong encoding for the product can result in severe performance issues, which might ultimately cost a lot in terms of rework, delayed product launches or even loss of market share.

The evolution of character encodings started with the introduction of ASCII way back in 1963. This was a 7-bit encoding scheme and was quite convenient for representing English letters, numerals, symbols etc. Around the same time IBM came up with its own EBCDIC system, which was an 8-bit encoding scheme. Things worked for a while, but then the shortcomings of these encodings led to the introduction of a lot of ad-hoc encodings which were sometimes language or location specific. Japan came up with the ‘JIS’ encoding schemes, Korea and China came up with their own encodings, ISO-8859 was adopted by America, Europe and the Middle East etc. Big corporations like Microsoft and Apple introduced their own proprietary encodings. Exchanging data between different locales and geographies became a nightmare because of encoding issues. Unicode in a way tried to simplify things by defining a single character set that covers the characters of all these encodings, but it has also created some amount of confusion over the choice between UTF-8, UTF-16 and UTF-32. I’m sure everyone who has done sufficient reading on the various encodings available will agree that Unicode is the way to go in order to ensure that we are following the same standards across geographies. But making the choice between UTF-8, UTF-16 and UTF-32 poses a challenge in itself.

While UTF-32 is perhaps the simplest way to store Unicode data, since every code point occupies exactly four bytes, it is also the most expensive in terms of memory usage. For network intensive or storage applications, the impact on network data transfer and disk usage will be high. There is also the issue of byte-ordering while transmitting data between two machines. So even though it makes life simple for the programmer, this is not a preferred choice as far as Unicode encodings go. However, if memory is not a concern for you, go with UTF-32.
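To make the memory and byte-ordering points concrete, here is a minimal Python sketch. The helper name `utf32_size` and the sample strings are made up for illustration; only the standard `str.encode` codecs are used.

```python
def utf32_size(text: str) -> int:
    """Bytes needed to store text as UTF-32 (no byte-order mark)."""
    return len(text.encode("utf-32-le"))


sample = "hello"

# UTF-32 always spends 4 bytes per code point, even for plain ASCII text.
print(utf32_size(sample), "bytes in UTF-32 vs",
      len(sample.encode("utf-8")), "bytes in UTF-8")

# The same character serializes differently depending on byte order,
# so both machines must agree on it (or rely on a byte-order mark).
big_endian = "A".encode("utf-32-be")     # b'\x00\x00\x00A'
little_endian = "A".encode("utf-32-le")  # b'A\x00\x00\x00'
print(big_endian, little_endian)

# Python's generic "utf-32" codec prepends a 4-byte byte-order mark.
with_bom = "A".encode("utf-32")
print(len(with_bom))
```

The fixed cost of four bytes per character is what makes indexing trivial and storage expensive at the same time.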

Almost always, the choice has to be made between UTF-16 and UTF-8. UTF-16 also faces the byte-ordering issue, but it is still a very popular encoding because it is fixed width for most practical purposes: every character in the Basic Multilingual Plane takes a single 16-bit unit, and only supplementary characters need a surrogate pair of two units. The native character type in Java and C# is 16-bit, which makes UTF-16 a good candidate for applications written in those languages. C/C++ users face an issue because UTF-16 is not the same as wchar_t on every platform. Most UNIX systems define wchar_t as a 32-bit data type while Windows systems define it as 16 bits. This causes encoding issues in applications which involve communication between Windows and UNIX machines. UTF-16 is efficient when it comes to storing Asian characters since most of them fit into 2 bytes; however, there is a lot of space wastage when UTF-16 is used to store ASCII characters, which require just one byte. Hence the choice of encoding is also dictated by the locales which need to be supported.
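The trade-offs above can be checked with a small Python sketch. The helper names `utf16_code_units` and `byte_sizes` and the sample strings are hypothetical, chosen only for this illustration.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def byte_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 bytes, UTF-16 bytes) for text, without a byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


ascii_text = "hello"
japanese_text = "\u65e5\u672c\u8a9e"   # three kanji, all in the Basic Multilingual Plane
emoji = "\U0001F600"                   # one character outside the BMP

# ASCII doubles in size under UTF-16; the kanji shrink compared with UTF-8.
print(byte_sizes(ascii_text))      # (5, 10)
print(byte_sizes(japanese_text))   # (9, 6)

# A supplementary character needs a surrogate pair: two 16-bit units.
print(utf16_code_units(japanese_text))  # 3
print(utf16_code_units(emoji))          # 2
```

Code that assumes one 16-bit unit per character works until the first surrogate pair shows up, which is why the "fixed width" claim only holds for most practical purposes.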

UTF-8 is a variable width encoding ranging from 1 to 4 bytes per character and is backward compatible with ASCII. It also has wide support from the XML community. The big advantage of using UTF-8 over UTF-16 and UTF-32 is that it does not have any byte-ordering issues. Another advantage is that most string processing functions which work on bytes rather than characters work fine with UTF-8 encoded data, so the amount of code change is minimized. However, for some languages UTF-8 encoded data might take more space than the corresponding multi-byte encoding. Character indexing is also an issue with UTF-8: because characters vary in width, the nth character cannot be located by simple byte arithmetic, and special handling is required.
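As a rough illustration of the indexing problem, the Python sketch below shows the byte count and character count diverging, and one possible way to cut UTF-8 data at a byte limit without splitting a character. The function `truncate_utf8` is a hypothetical helper written for this post, not part of any library.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut UTF-8 data to at most max_bytes without splitting a character."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Continuation bytes look like 10xxxxxx; if the cut lands on one,
    # back up until we reach the first byte of that character.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


text = "na\u00efve"            # "naïve": 5 characters
data = text.encode("utf-8")    # 6 bytes, because "ï" takes two

print(len(text), len(data))

# ASCII text is byte-for-byte identical in UTF-8.
print("abc".encode("ascii") == "abc".encode("utf-8"))

# Byte-oriented search still works on UTF-8 data.
print(b"ve" in data)

# A naive byte slice can land in the middle of "ï" and fail to decode.
try:
    data[:3].decode("utf-8")
except UnicodeDecodeError:
    print("data[:3] is not valid UTF-8")

# The helper backs up to a character boundary instead.
print(truncate_utf8(data, 3).decode("utf-8"))
```

Byte-level functions such as search and copy keep working unchanged, while anything that counts or cuts by character needs this kind of boundary check.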

All three encodings have their pros and cons, and none is really better than the others. The choice of encoding will depend on the programming language used, the locales and languages to be supported, the amount and kind of text processing involved in the application, the deployment scenarios (i.e. whether there is going to be data exchange between two machines), performance baselines, disk space usage, the version of Unicode required etc. All these factors have to be considered in the analysis phase, and the choice of encoding must be one of the inputs for the design phase. Needless to say, the penalty may be huge if the wrong encoding is selected in the early phases of the project life cycle.

If you are an I18N architect, I invite you to share your views on the choice of encoding and the factors which influence you to make that choice.
