Infosys delivers concept-to-market software engineering services across the engineering value chain. Our blog will discuss the latest trends in software product engineering, outsourcing, technologies, and address business challenges.

Internationalization and Performance considerations

During the design discussions of almost any Internationalization project, the client asks, "So, will Internationalization have any impact on the performance of the application?" There is no denying that it does, whether the impact is big or small. Where the business benefits of Internationalization outweigh the performance criteria, it makes sense to go ahead even at the cost of some performance degradation. However, a good design can help you avoid severe performance hits.

To come up with a design that minimizes the performance impact, it is important to first understand the areas that contribute most to performance degradation. Some of these areas are listed below.

  • Character set conversions:

During Internationalization, the application is designed to support a particular encoding, such as UTF-8 or UTF-16. However, this does not guarantee that incoming data will use the same encoding. Before processing incoming data, the application has to convert it to its own encoding; otherwise string processing within the application will go haywire. In a networking application, where huge amounts of data might arrive from various kinds of devices, encoding conversions at the application's interfaces will degrade performance. The same happens if the application has to write its output to files that use a different encoding, or pass data to third-party applications that support a different encoding. In general, character set conversions contribute to performance degradation, whether they are done at the application's interfaces or within the application.

Choosing an appropriate encoding for the application can minimize the impact of character set conversions. If the majority of the incoming data is in a particular encoding, it makes sense for the application to support that encoding so that fewer conversions are needed at the interfaces. For example, if the incoming data is UTF-8 or US-ASCII encoded, the application should use UTF-8 internally for better performance. If the majority of the input data is in a native encoding like Shift-JIS or EUC, it would probably be better to support that native encoding within the application, even though Unicode might be the ideal choice. The choice of encoding also depends on many other factors, so a tradeoff is often required.
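As a minimal sketch of keeping conversions at the interfaces, the Python snippet below decodes incoming bytes once on the way in and encodes once on the way out. The function names `decode_at_boundary` and `encode_for_output` are hypothetical, and the sample assumes the sender declares its encoding (here Shift-JIS); real input may need detection or validation that is not shown.

```python
def decode_at_boundary(raw: bytes, declared_encoding: str) -> str:
    """Convert incoming bytes to the application's internal text form, once."""
    return raw.decode(declared_encoding)


def encode_for_output(text: str, target_encoding: str) -> bytes:
    """Convert internal text to the encoding expected by a file or third party."""
    return text.encode(target_encoding)


# Hypothetical input: a device sends Shift-JIS encoded bytes.
sjis_input = "日本語".encode("shift_jis")

# One conversion on the way in; all processing then works on `text`.
text = decode_at_boundary(sjis_input, "shift_jis")

# One conversion on the way out, only because the consumer expects UTF-8.
utf8_output = encode_for_output(text, "utf-8")

print(len(sjis_input), len(utf8_output))
```

If most traffic already arrives in the encoding the application uses internally, the inbound conversion step can be skipped for that traffic, which is the saving the paragraph above describes.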

  • Memory requirements:

The memory requirements of an application generally change after it is internationalized. This affects both secondary and primary memory. For example, keeping all the message strings in separate properties files uses more disk space and increases the access time for the messages. Also, since encodings differ in how many bytes they need per character, the application may use more RAM because its data structures need more memory to store the internationalized data. For text that is mostly ASCII, such as English and most Western European text, choosing UTF-16 over UTF-8 roughly doubles the memory requirement, while choosing UTF-32 uses about four times as much memory as UTF-8. While UTF-16 and UTF-32 make string processing easier to some extent, it is also important to consider the performance hit due to the added memory requirements. The tradeoff is generally between memory and ease of processing.

The choice of an appropriate encoding depends on the locales to be supported. UTF-8 generally takes about 50% less space than UTF-16 for English or Western European languages, but it can take 50% more space than UTF-16 for some Asian scripts like Chinese, where most characters need three bytes in UTF-8 and two in UTF-16. So if the majority of the data is in a Western European language, UTF-8 is the better option. Moreover, since UTF-8 is backward compatible with ASCII, it is easier to make internationalization changes. For Far Eastern locales, UTF-16 can perform better because it uses about a third less space than UTF-8 for the same text. Choosing the appropriate encoding optimizes the memory requirements, which leads to better performance.
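The space comparison can be checked directly. The sketch below, in Python, counts the bytes each encoding needs for a sample English string and a sample Chinese string; the helper `encoded_sizes` and both sample strings are illustrative choices, and the little-endian variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes `text` occupies in each Unicode encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")  # -le avoids a BOM
    return {enc: len(text.encode(enc)) for enc in encodings}


english = "Internationalization"  # 20 ASCII characters
chinese = "国际化和性能"            # 6 characters, all in the Basic Multilingual Plane

english_sizes = encoded_sizes(english)
chinese_sizes = encoded_sizes(chinese)

print(english_sizes)  # {'utf-8': 20, 'utf-16-le': 40, 'utf-32-le': 80}
print(chinese_sizes)  # {'utf-8': 18, 'utf-16-le': 12, 'utf-32-le': 24}
```

Real data is usually mixed (markup, digits, and punctuation are ASCII even in Chinese documents), so it is worth measuring a representative sample rather than relying on per-script rules of thumb.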

  • Message catalogs:

Typically, during Internationalization, all user-visible strings are moved into a message catalog or resource file. The application has to retrieve the required strings from the message catalog using the corresponding string IDs. This loading can happen at application start, which increases the startup time, or while the application is running, which increases its response time. In an application with a huge number of strings, this can contribute to performance degradation.

While creating message catalogs is an essential part of Internationalization, the application can be designed to minimize the performance impact. Instead of loading the entire message catalog at startup, the application can load only the essential strings at startup and load the rest as the need arises. Along with this, a message caching mechanism can give faster access to frequently used message strings.
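One possible shape for this design is sketched below in Python. It assumes the catalog is split into named sections; the class `MessageCatalog`, the section names, and the `fake_loader` function (a stand-in for reading a resource file from disk) are all hypothetical and only illustrate the idea.

```python
class MessageCatalog:
    """Loads essential sections eagerly and other sections on first use."""

    def __init__(self, loader, essential_sections):
        self._loader = loader   # callable: section name -> {message_id: text}
        self._sections = {}     # cache of sections already loaded
        for name in essential_sections:
            self._sections[name] = loader(name)

    def get(self, section, message_id):
        # Load the section only the first time one of its messages is needed.
        if section not in self._sections:
            self._sections[section] = self._loader(section)
        return self._sections[section][message_id]


# Stand-in for resource files; a real loader would read and parse a file.
FAKE_FILES = {
    "core": {"ok": "OK", "cancel": "Cancel"},
    "reports": {"title": "Monthly report"},
}
loaded = []  # records which sections were actually read


def fake_loader(section):
    loaded.append(section)
    return FAKE_FILES[section]


catalog = MessageCatalog(fake_loader, essential_sections=["core"])
startup_loads = list(loaded)                # only "core" so far

title = catalog.get("reports", "title")     # triggers one load of "reports"
title_again = catalog.get("reports", "title")  # served from the cache
```

The cache here is unbounded; an application with very large catalogs might prefer to evict rarely used sections, which is a separate design decision.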

  • Sorting multilingual strings:

Sorting strings is a very common feature, and it has a performance impact in a Unicode environment because Unicode sorting rules are more complex than non-Unicode sorting rules.

Though this impact is not very significant in most cases, the proper use of collations and collation keys can improve the performance of the application. This can be done at the application level (as in Java) or at the database level. The collation you choose can significantly affect the performance of database queries, including queries that use substring matching. A collation can be chosen for the quickest possible performance or for the most accurate results, and both choices have their pros and cons. If you want accuracy, you can choose the Unicode Collation Algorithm, but it carries some performance overhead.

In general, reducing the number of encoding conversions and the amount of string formatting helps minimize the performance impact of Internationalization. The choice of an appropriate encoding also plays a very important role. In the end, it is always a tradeoff between performance and ease of processing. What are your views on this subject, and how does your design team deal with it? It would be good to share some best practices for handling these issues.
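The idea behind collation keys is to transform each string once into a key that can then be compared cheaply many times during the sort. The Python sketch below illustrates that idea with a deliberately simplified key (accents stripped, case folded); the function `simple_collation_key` is hypothetical and is not the Unicode Collation Algorithm or any locale's real collation. In Java, the comparable facility is `java.text.Collator` with `CollationKey`.

```python
import unicodedata


def simple_collation_key(s: str):
    """Simplified stand-in for a collation key: ignore accents and case.

    The original string is kept as a tiebreaker so the ordering is total.
    """
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), s)


names = ["zebra", "\u00c9mile", "apple", "eclair", "Zo\u00eb"]

# Plain code point order: uppercase sorts before lowercase, accents last.
by_code_point = sorted(names)

# sorted() calls the key function once per element, then compares the keys.
by_key = sorted(names, key=simple_collation_key)

print(by_code_point)  # ['Zoë', 'apple', 'eclair', 'zebra', 'Émile']
print(by_key)         # ['apple', 'eclair', 'Émile', 'zebra', 'Zoë']
```

When the same list is sorted or searched repeatedly, storing the keys alongside the strings avoids recomputing them, which is where the performance benefit comes from.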
