Infosys delivers concept-to-market software engineering services across the engineering value chain. Our blog will discuss the latest trends in software product engineering, outsourcing, technologies, and address business challenges.


Internationalizing Legacy Software - The Other Dimensions

We have often noticed customers developing cold feet over suggestions to reimplement an existing legacy product completely on a new technology. There is resistance to venturing into the unknown, especially with time-tested running products that have done well in a specific geography. To do business in a wider market, the product needs to be internationalized and then localized to specific regions. This is of course much easier when the product has been developed in an internationalization-aware technology like Java or the .NET languages. But when the product is a legacy product implemented in C/C++ that targets a specific geographical market (and was developed in the pre-globalization era), the challenge is acute. The demand is often to magically transform the existing code base into an I18N-aware one with the minimum changes possible. To top it off, some customers (especially in Japan) demand source code portability across operating systems.

I18N enabler tools available in the market today certainly help in identifying the obvious I18N issues (hardcoded strings, non-Unicode-aware APIs, etc.) in existing legacy code. For a complete solution, however, there are other critical dimensions, especially with regard to encoding, that need consideration.

Architectural Analysis

It is important to understand the architecture of the current system and evaluate whether it would still satisfy the original quality requirements when dealing with pseudo-localized text data. A thorough analysis helps surface possible deviations well upfront. Feeding these back to the customer helps both sides understand which requirements must be met for the product to be globally competitive, and what trade-offs the process involves.
In a tiered architecture where the layers are physically separated, it is important to understand how localized text data passes across boundaries. When tiers are spread across operating systems running on different hardware architectures (Windows on x86, Solaris on SPARC), endian correctness matters. UTF-8 is defined as a sequence of bytes, so a UTF-8 encoded stream has the same byte order on every platform and is not affected by 'endian' problems, unlike UTF-16 and UTF-32, whose multi-byte code units are. In addition, when sending text data across boundaries, it is important to choose a neutral encoding scheme, as target systems may not have the components (code pages, for example) needed to interpret native encoding schemes.
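The endian point can be made concrete with a small sketch in C. The function names below (put_u16_le, put_u16_be, utf8_encode_bmp) are hypothetical helpers written for this illustration, not part of any library, and the encoder is deliberately limited to code points in the Basic Multilingual Plane.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Serialize one UTF-16 code unit the way a little-endian machine (x86)
   lays it out in memory: low byte first. */
void put_u16_le(uint16_t unit, unsigned char out[2])
{
    out[0] = (unsigned char)(unit & 0xFF);
    out[1] = (unsigned char)(unit >> 8);
}

/* Serialize one UTF-16 code unit the way a big-endian machine (SPARC)
   lays it out in memory: high byte first. */
void put_u16_be(uint16_t unit, unsigned char out[2])
{
    out[0] = (unsigned char)(unit >> 8);
    out[1] = (unsigned char)(unit & 0xFF);
}

/* Encode one BMP code point (surrogates excluded) as UTF-8.
   Returns the number of bytes written (1 to 3). The result is a plain
   byte sequence, so it is identical on every architecture. */
size_t utf8_encode_bmp(uint16_t cp, unsigned char out[3])
{
    assert(cp < 0xD800 || cp > 0xDFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

For the Hiragana letter U+3042, the two UTF-16 serializations are 42 30 and 30 42, so the receiver must know (or detect) the sender's byte order. The UTF-8 form is E3 81 82 regardless of which machine produced it.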
Though the customer is often keen to maintain the system 'As-Is' to the extent possible, it is necessary to be proactive in analyzing the legacy technologies to determine whether they are potential blockers to complete internationalization. For example, one could re-evaluate the choice of database if it cannot store text data in anything other than the ASCII encoding.

 

Interfacing with external entities and interoperability

It is important for the system to know the encoding of the files it reads. Such files are typically configuration or start-up files that are created in an editor and saved in that editor's default encoding, which might differ from the encoding the system expects. Again, UTF-8 is a standard choice for such files, especially when the system is being re-designed to be UTF-8 aware.
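One way to guard against a file saved in the wrong encoding is to check that its contents are well-formed UTF-8 before accepting them. The following is a minimal sketch, assuming the file has already been read into a byte buffer; is_valid_utf8 and utf8_bom_length are hypothetical helper names, not library functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns 3 if the buffer starts with the UTF-8 byte order mark
   (EF BB BF), which some editors prepend, and 0 otherwise. */
size_t utf8_bom_length(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}

/* Returns true if buf[0..len) is well-formed UTF-8: correct lead and
   continuation bytes, no overlong forms, no surrogates, and no code
   point above U+10FFFF. */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t need;            /* continuation bytes expected */
        unsigned long cp, min;  /* decoded value, smallest legal value */

        if (b < 0x80) { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return false;      /* stray continuation or invalid lead byte */

        if (len - i <= need) return false;  /* sequence is truncated */
        for (size_t k = 1; k <= need; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      /* overlong form */
        if (cp > 0x10FFFF) return false;                 /* out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* surrogate */
        i += need + 1;
    }
    return true;
}
```

A file that passes is not proven to be UTF-8 (pure ASCII content is valid in most encodings), but text saved in Shift-JIS or a Latin code page will usually fail the check as soon as it contains a non-ASCII character, which is enough to reject it or report a configuration error early.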
Interoperation across technologies also requires attention. While JNI enables Java to interact with C/C++ components, it is important to understand the encoding conversion that occurs when text data is passed across the boundary. In legacy applications (developed for a specific culture), text data is normally converted from UTF-16 (the internal encoding of strings in Java) to a native encoding (for example, Shift-JIS or EUC-JP in Japanese products). Interoperation between .NET (where text data is again UTF-16) and legacy C/C++ components is similar. To internationalize such code, analyze what it would take to enforce a neutral encoding (UTF-8, UTF-16, etc.) in the C/C++ component source code. (Of course, this becomes a problem if you are interfacing with a third-party library, in which case it is recorded as a limitation.)

 

Base encoding of text data within the process in execution

By default, text data originating within the application and stored in character arrays or strings is encoded in the default encoding of the system. On Windows, for example, text stored in ordinary char arrays or strings is encoded according to the current code page. Data stored in wide-character form, however, is UTF-16 on Windows systems (and typically UTF-32 on Unix-like systems).
Ideally, one should perform a thorough analysis and decide which encoding textual data should be in within the application. When textual data enters the system from various sources (files, inter-process communication, user input from user interfaces, etc.), storing it in character arrays or strings as received results in differently encoded strings co-existing within the system. Conversions from one encoding to another then become necessary in sporadic locations, leading to code that is hard to maintain and difficult to understand.
It is thus important to ensure that a conversion to the chosen target encoding is performed at all gateways (entry points) to the application, so that when the actual business of string processing is performed, all text data is uniformly encoded.
When the processed text data needs to be displayed or passed out of the system (exit points), convert from the base encoding back to the encoding the data had at the point of receipt (or to the encoding expected by the outside component).
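The gateway idea can be sketched as a pair of conversion functions, one for entry points and one for exit points. In the sketch below ISO-8859-1 stands in for the native encoding only because its mapping to Unicode is trivial; a real product dealing with Shift-JIS or EUC-JP would delegate the conversion to a library such as iconv or ICU. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Entry gateway: convert NUL-terminated ISO-8859-1 text to UTF-8.
   Returns the number of bytes written (excluding the NUL), or
   (size_t)-1 if the output buffer is too small. */
size_t latin1_to_utf8(const char *in, char *out, size_t out_size)
{
    assert(in != NULL && out != NULL);
    size_t o = 0;
    for (const unsigned char *p = (const unsigned char *)in; *p; p++) {
        if (*p < 0x80) {
            if (o + 1 >= out_size) return (size_t)-1;
            out[o++] = (char)*p;
        } else {
            /* U+0080..U+00FF always takes two bytes in UTF-8 */
            if (o + 2 >= out_size) return (size_t)-1;
            out[o++] = (char)(0xC0 | (*p >> 6));
            out[o++] = (char)(0x80 | (*p & 0x3F));
        }
    }
    if (o >= out_size) return (size_t)-1;
    out[o] = '\0';
    return o;
}

/* Exit gateway: convert NUL-terminated UTF-8 text back to ISO-8859-1.
   Characters that ISO-8859-1 cannot represent are replaced with '?'.
   Returns the number of bytes written (excluding the NUL), or
   (size_t)-1 if the output buffer is too small. */
size_t utf8_to_latin1(const char *in, char *out, size_t out_size)
{
    assert(in != NULL && out != NULL);
    size_t o = 0;
    const unsigned char *p = (const unsigned char *)in;
    while (*p) {
        unsigned char ch;
        if (*p < 0x80) {
            ch = *p++;
        } else if ((*p == 0xC2 || *p == 0xC3) && (p[1] & 0xC0) == 0x80) {
            ch = (unsigned char)(((*p & 0x03) << 6) | (p[1] & 0x3F));
            p += 2;
        } else {
            ch = '?';
            p++;
            while ((*p & 0xC0) == 0x80) p++;  /* skip continuation bytes */
        }
        if (o + 1 >= out_size) return (size_t)-1;
        out[o++] = (char)ch;
    }
    if (o >= out_size) return (size_t)-1;
    out[o] = '\0';
    return o;
}
```

With this arrangement only the code at the boundaries knows about the native encoding. Everything between the two gateways can assume UTF-8.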

 

The dilemma of choosing the base encoding

It is best to avoid choosing a native encoding as the base encoding of strings in the application. Universally, Unicode is the best choice, but then again one could debate the exact flavor of Unicode encoding (UTF-8, UTF-16 or UTF-32) to choose. UTF-8 is widely considered the better choice for legacy code: its single-octet encoding of ASCII characters helps keep the code as near to the original as possible, and it allows the code to work with the majority of existing APIs that take byte strings and do not deal with characters individually. There are other advantages (and disadvantages) of adopting UTF-8, which are discussed here. With UTF-8 as the base encoding, the 'char' data type suffices to store UTF-8 encoded strings.
Note that ASCII characters take 1 byte in UTF-8 but 2 bytes in UTF-16. For most modern European languages, text stored in UTF-8 therefore consumes less space than the same text in UTF-16 or UTF-32, because it consists largely of ASCII characters. The reverse holds for Chinese and Japanese: most commonly used CJK characters take 3 bytes in UTF-8 but only 2 bytes in UTF-16, so text dominated by them is larger in UTF-8. Where such text is a small share of the data the application handles, UTF-8 encoded data stored in char arrays is still considered a good choice. However, to make room for localized text (in Chinese or Japanese, for example) in UTF-8 encoded character arrays, it is advisable to enlarge legacy character buffers to 3 times their original size to be safe. Characters outside the Basic Multilingual Plane take 4 bytes each in UTF-8, so a factor of 4 is the safe bound if they must be supported.
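The byte counts above follow directly from the code point ranges of each encoding form, which a few lines of C can spell out. The helper names (utf8_bytes, utf16_bytes, utf8_buffer_size) are hypothetical, and the buffer-sizing rule assumes the legacy buffer was dimensioned as one byte per character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one Unicode code point in UTF-8. */
size_t utf8_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;  /* ASCII */
    if (cp < 0x800)   return 2;  /* Latin supplements, Greek, Cyrillic, ... */
    if (cp < 0x10000) return 3;  /* rest of the BMP, including most CJK */
    return 4;                    /* supplementary planes */
}

/* Bytes needed to encode one Unicode code point in UTF-16. */
size_t utf16_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4; /* one code unit, or a surrogate pair */
}

/* Worst-case UTF-8 buffer size, including the terminating NUL, for a
   legacy buffer that was sized to hold max_chars one-byte characters. */
size_t utf8_buffer_size(size_t max_chars, int allow_supplementary)
{
    return max_chars * (allow_supplementary ? 4u : 3u) + 1;
}
```

For example, 'A' (U+0041) takes 1 byte in UTF-8 and 2 in UTF-16, 'é' (U+00E9) takes 2 and 2, and '日' (U+65E5) takes 3 and 2. A legacy 32-character field would become a 97-byte buffer under the 3x rule, or 129 bytes if supplementary characters must fit.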

 

The choice of the data-type storing text

You might have observed that 'wchar_t' is a popular data type for storing and processing international characters. However, this type has certain limitations, especially when the same code base is expected to run cross-platform.
The problem with wchar_t is that its size is platform-dependent: on Windows it holds UTF-16 code units (2 bytes each), but on most Unix-like systems it holds UTF-32 code points (4 bytes each). Cross-platform C/C++ source code that uses wchar_t can therefore be expected to consume about twice the memory for text on Unix systems as on Windows systems. The choice of data type can, however, be decided based on the maximum amount of text data expected to be held in memory at any point in time (for example, a case where a huge file needs to be read into memory and processed).
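A short sketch shows how the platform dependence surfaces in code. The typedef and function names are hypothetical; the fixed-width alternative shown here is one possible approach under the assumption that the code base wants the same memory footprint everywhere, not a recommendation taken from the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* A fixed-width code unit type: always 2 bytes, on every platform. */
typedef uint16_t utf16_unit;

/* Bytes needed to hold n_units code units plus a terminator in a
   wchar_t buffer. The result depends on the platform: sizeof(wchar_t)
   is 2 on Windows and 4 on most Unix-like systems. */
size_t wide_buffer_bytes(size_t n_units)
{
    assert(sizeof(wchar_t) == 2 || sizeof(wchar_t) == 4);
    return (n_units + 1) * sizeof(wchar_t);
}

/* The same calculation for the fixed-width type gives an identical
   answer wherever the code is compiled. */
size_t utf16_buffer_bytes(size_t n_units)
{
    return (n_units + 1) * sizeof(utf16_unit);
}
```

For a 1,000-character string of BMP text, wide_buffer_bytes(1000) is 2,002 on a platform with a 2-byte wchar_t and 4,004 on one with a 4-byte wchar_t, while utf16_buffer_bytes(1000) is 2,002 on both.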

 

Text encoding under the hood in Operating Systems

It is important to understand the base encoding used by the operating system on which your application runs. From the Windows NT architecture onwards, text representation under the hood in the Windows OS has been based on UTF-16, while Unix-like systems are generally considered to be based on UTF-8. Any call to a Windows API with natively encoded text data results in an internal conversion to UTF-16, with the subsequent execution based on UTF-16. (For example, when the Windows API GetModuleFileName() is called from a program that uses a native encoding, it internally obtains a UTF-16 encoded string as per its base encoding, which is then converted into the encoding of the calling program.)

 

The use of third-party I18N libraries

Tools can be used to examine the string operations occurring in the code base. Assuming the text data has been converted to UTF-8 at the gateways, operations like strcpy(), strcat() etc. can be kept as is, since they deal with whole strings and byte lengths rather than character counts. One does, however, need to pay attention to string operations that deal with individual characters or character counts, like strncpy() (which can cut a multi-byte sequence in half) and strchr() (which can search only for a single byte). It is here that third-party libraries like ICU, Rogue Wave and the like can be used to perform the necessary processing on the UTF-8 byte stream.
Remember, inclusion and use of such libraries could possibly result in a binary bloat and marginally reduced performance as compared to the original, but these are trade-offs that one needs to accept considering the limitation of having to work on legacy source code.
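To illustrate why character-oriented operations need attention, here is a minimal hand-written sketch of two UTF-8 aware replacements: one that counts code points rather than bytes, and one that truncates only at a character boundary. The function names are hypothetical and the code assumes well-formed UTF-8 input. It works at the level of code points only, so user-perceived characters built from combining sequences still call for a library such as ICU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the code points in a NUL-terminated, well-formed UTF-8 string.
   Every byte that is not a continuation byte (10xxxxxx) starts one. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Copy src into dst (capacity dst_size bytes), truncating if necessary
   but never in the middle of a multi-byte sequence. Always
   NUL-terminates. Returns the number of bytes copied. */
size_t utf8_copy_bounded(char *dst, size_t dst_size, const char *src)
{
    assert(dst_size > 0);
    size_t len = strlen(src);
    if (len >= dst_size) {
        len = dst_size - 1;
        /* If the first dropped byte is a continuation byte, the cut is
           inside a character: back up to that character's lead byte. */
        while (len > 0 && ((unsigned char)src[len] & 0xC0) == 0x80)
            len--;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

For the three-character string "日本語" (9 bytes in UTF-8), strlen() reports 9 while utf8_count() reports 3. Copying it into a 6-byte buffer with utf8_copy_bounded() keeps only the first character (3 bytes), whereas a plain strncpy() of 5 bytes would leave a broken trailing sequence.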

 

This blog has attempted to highlight considerations that affect requirements related to internationalization of legacy C/C++ code. The next time you are expected to do something along these lines, do question yourself on the points discussed here and determine the way ahead with the accompanying trade-offs.


Comments

This is a great post, Suraj, on the various dimensions of internationalization. It would be great if you could share your thoughts on two specific points:
1. How do we advise customers on their internationalization roadmap?
2. What are the implications of internationalization on testing the product/application?

Your question would warrant a blog by itself :). Assuming that your question is about a customer who would like to partner with service companies in the I18N process, here are some major considerations:
Analyze existing product suites to determine which products have potential in the global market (you wouldn't want to invest in internationalizing products that are very region-specific and have no potential outside). Identify key target markets.
Analyze whether the existing technology on which the product is built meets all the requirements of the target markets, to determine a possible technology revamp, choice of technology, etc.
Document all the key quality attributes that MUST be maintained after the internationalization exercise. This serves as an important input for partners in determining the implementation approach.
Determine a production strategy which targets staggered deployment (domestic market first followed by international market) or concurrent deployment (develop targeting the international market at one go). (We have noticed that most Japanese customers opt for a staggered approach to deployment)
Trigger a gap analysis (involving a study of the architecture, design and the analysis of the code) with the partnering team to help understand what it would take to internationalize the product – to draw up effort estimates.
Determine Internationalization Checklists/Guidelines for the product to enable the in-house/partnering I18N team to work along those parameters. (One of our large customers had provided a set of expected guidelines which needed to be considered during the I18N exercise to align the work product as per their expectations)
Determine a localization road map for the product (custom functionalities for specific markets, translations etc)
Spend time and develop a test plan and test acceptance criteria which cover all facets of the internationalized product and test the finished product based on these parameters.


As regards testing, you would need to test two aspects of the product, the globalization aspect and the localization aspect. The globalization aspect would deal with checking the ability of the product to adapt itself to various cultures. For example, change the user locale to check if the date-time formatting changes, confirm bidirectional display of text in Arabic locale, etc. This would require knowledge of I18N principles in the tests. Localization testing confirms if the product works as expected in a specific locale and language. In the absence of language experts, one way of sanity checking is by executing pseudo-localization tests. But, the crux of localization testing is that you would need language experts who validate correctness of messages, strings in UI elements, cultural relevance of icons used etc.
