The Encoding Dilemma
Have you ever worked on an internationalization project and faced a very basic question: “Which encoding should I use in my application?” I have come across this question several times and have realized that I still don’t have a simple, straightforward answer to it. Many factors need to be considered when settling on a particular encoding, and that choice can determine the success or failure of the project. Choosing the wrong encoding for the product can result in severe performance issues, which may ultimately cost a lot in terms of rework, delayed product launches or even lost market share.
While UTF-32 is perhaps the simplest way to store Unicode data, since every code point occupies exactly four bytes, it is also the most expensive in terms of memory usage. For network-intensive or storage applications, the impact on network data transfer and disk usage will be high. There is also the issue of byte ordering when transmitting data between two machines. So even though it makes life simple for the programmer, it is not a preferred choice as far as Unicode encodings go. However, if memory is not a concern for you, go with UTF-32.
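To make the memory and byte-ordering points concrete, here is a minimal Python sketch. The sample string and the helper name encoded_sizes are illustrative choices of mine, not part of any particular project; the codec names are the ones shipped with the Python standard library.

```python
# Illustrative sample; any ASCII-only string behaves the same way.
sample = "Hello, world"  # 12 characters


def encoded_sizes(text):
    """Return how many bytes `text` occupies in each Unicode encoding form.

    The explicit "-le" codecs are used so that no byte order mark is
    prepended, which keeps the numbers easy to compare.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


sizes = encoded_sizes(sample)
print(sizes)  # {'utf-8': 12, 'utf-16': 24, 'utf-32': 48}

# Byte ordering: the same code point is serialized differently depending
# on endianness, so both machines must agree on the order (or use a BOM).
big = "A".encode("utf-32-be")     # b'\x00\x00\x00A'
little = "A".encode("utf-32-le")  # b'A\x00\x00\x00'
print(big, little)
```

For plain ASCII text, UTF-32 costs four times as much as UTF-8, which is the overhead the paragraph above warns about.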
Almost always, the choice has to be made between UTF-16 and UTF-8. UTF-16 also faces the byte-ordering issue, but it is still a very popular encoding because it is fixed width for most practical purposes: every character in the Basic Multilingual Plane takes a single 16-bit code unit, and only supplementary characters need a surrogate pair, which makes the encoding variable width in the general case. The native character type in Java and C# is 16-bit, which makes UTF-16 a good candidate for applications written in those languages. C/C++ users face an issue because UTF-16 is not the same as wchar_t on every platform. Most UNIX systems define wchar_t as a 32-bit data type, while Windows defines it as 16 bits. This causes encoding issues in applications that involve communication between Windows and UNIX machines. UTF-16 is efficient when it comes to storing Asian characters, since most of them fit into 2 bytes; however, a lot of space is wasted when UTF-16 is used to store ASCII characters, which require just one byte in UTF-8. Hence the choice of encoding is also dictated by the locales which need to be supported.
UTF-8 is a variable-width encoding that uses 1 to 4 bytes per character and is backward compatible with ASCII. It also has wide support from the XML community. The big advantage of using UTF-8 over UTF-16 and UTF-32 is that it does not have any byte-ordering issues. Another advantage is that most string processing functions which work on bytes rather than characters work fine with UTF-8 data, so the amount of code change is minimized. However, for some languages UTF-8 encoded data might take more space than the corresponding legacy multi-byte encoding or UTF-16; most CJK characters, for example, take three bytes in UTF-8. Character indexing is also an issue with UTF-8: because characters vary in width, finding the nth character means scanning the string from the start, so special handling is required.
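The trade-offs between UTF-16 and UTF-8 can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the helper names (sizes, utf16_units, utf8_char_at) are hypothetical, the sample strings are arbitrary, and utf8_char_at assumes its input is valid UTF-8.

```python
def sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for `text`, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


def utf16_units(text):
    """Number of 16-bit code units `text` needs in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def utf8_char_at(data, index):
    """Return the character at position `index` in valid UTF-8 bytes.

    There is no way to jump straight to the nth character: the lead byte
    of each character has to be inspected to learn its width.
    """
    pos = 0
    count = 0
    while pos < len(data):
        lead = data[pos]
        if lead < 0x80:             # 0xxxxxxx: 1-byte (ASCII)
            width = 1
        elif lead >> 5 == 0b110:    # 110xxxxx: 2-byte sequence
            width = 2
        elif lead >> 4 == 0b1110:   # 1110xxxx: 3-byte sequence
            width = 3
        else:                       # 11110xxx: 4-byte sequence
            width = 4
        if count == index:
            return data[pos:pos + width].decode("utf-8")
        pos += width
        count += 1
    raise IndexError("character index out of range")


ascii_text = "encoding"             # 8 ASCII characters
japanese = "\u65e5\u672c\u8a9e"     # 3 CJK characters, all in the BMP
clef = "\U0001D11E"                 # 1 supplementary character (musical G clef)

print(sizes(ascii_text))   # (8, 16): UTF-16 doubles the size of ASCII text
print(sizes(japanese))     # (9, 6):  UTF-16 is more compact for CJK text
print(utf16_units(clef))   # 2: one character, but a surrogate pair in UTF-16

mixed = ("a" + japanese[0] + clef + "b").encode("utf-8")
print(utf8_char_at(mixed, 3))  # b
```

The last call has to step over a 1-byte, a 3-byte and a 4-byte character before reaching "b", which is the kind of special handling that UTF-8 indexing requires.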
All three encodings have their pros and cons, and none is strictly better than the others. The choice of encoding will depend on the programming language used, the locales and languages to be supported, the amount and kind of text processing involved in the application, the deployment scenario (for example, whether data will be exchanged between two machines), performance baselines, disk space usage, the version of Unicode required and so on. All these factors have to be considered in the analysis phase, and the choice of encoding must be one of the inputs for the design phase. Needless to say, the penalty may be huge if the wrong encoding is selected in the early phases of the project life cycle.
If you are an I18N architect, I invite you to share your views on the choice of encoding and the factors which influence you to make that choice.

