Infosys delivers concept-to-market software engineering services across the engineering value chain. Our blog will discuss the latest trends in software product engineering, outsourcing, technologies, and address business challenges.


May 27, 2010

Internationalizing Legacy Software - The Other Dimensions

We have often noticed customers developing cold feet when it is suggested that an existing legacy product be reimplemented completely on a new technology. There is resistance to venturing into the unknown, especially with time-tested, running products that have done well in a specific geography. To do business in a wider market, the product needs to be internationalized and then localized to specific regions. This is of course much easier when the product has been developed in an internationalization-aware technology like Java or the .NET languages. But when the product is a legacy product implemented in C/C++ for a specific geographical market (and developed in the pre-globalization era), the challenge is acute. The demand is often to magically transform the existing code base into an I18N-aware one with the minimum possible changes. To top it all, some customers (especially in Japan) demand source code portability across operating systems.

I18N enabler tools available in the market today certainly help in identifying the obvious I18N issues (hardcoded strings, non-Unicode-aware APIs etc.) in existing legacy code. For a complete solution, however, there are other critical dimensions, especially with regard to encoding, that need consideration.

Architectural Analysis

It is important to understand the architecture of the current system and evaluate whether it would still satisfy the original quality requirements when dealing with pseudo-localized text data. A thorough analysis makes it possible to give feedback on possible deviations well upfront. Such feedback helps the customer understand which requirements must be met for the product to be globally competitive, and the trade-offs involved in the process.
In a tiered architecture where the layers are physically separated, it is important to understand how localized text data passes across boundaries. When tiers are spread across operating systems running on different architectures (Windows on x86, Solaris on SPARC), endian correctness becomes important. UTF-8 is defined as a sequence of bytes, so a UTF-8 encoded stream has the same byte order on every platform and is not affected by 'endian' problems. In addition, when sending text data across boundaries, it is important to choose a neutral encoding scheme, as target systems may not have the necessary components (code pages, for example) to interpret native encoding schemes.
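The byte-order point can be checked with a short sketch. Python is used here only for brevity, and the function name is illustrative rather than part of any product discussed in this post. The same two-character text is encoded as UTF-16 in both byte orders and as UTF-8.

```python
def encoded_forms(text):
    """Return the byte sequences a sender could put on the wire for `text`."""
    return {
        "utf-16-le": text.encode("utf-16-le"),  # little-endian, as on x86
        "utf-16-be": text.encode("utf-16-be"),  # big-endian, as on SPARC
        "utf-8": text.encode("utf-8"),          # a single, fixed byte order
    }

forms = encoded_forms("A\u65e5")          # 'A' followed by U+65E5
print(forms["utf-16-le"].hex(" "))        # 41 00 e5 65
print(forms["utf-16-be"].hex(" "))        # 00 41 65 e5
print(forms["utf-8"].hex(" "))            # 41 e6 97 a5
```

A receiver that assumes the wrong UTF-16 byte order decodes different characters, whereas the UTF-8 bytes mean the same thing on both machines.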
Though the customer is often keen to keep the system 'As-Is' to the extent possible, it is necessary to be proactive in analyzing the legacy technologies to determine whether they are potential blockers to complete internationalization. For example, one may need to re-evaluate the choice of database if it cannot store text data in anything other than ASCII.

Interfacing with external entities and interoperability

It is important for the system to know the encoding of the files that are read by the system. Such files are typically configuration or start up files which are created in certain editors and saved in the default encoding of the editor - which might be different from the encoding that the system expects. Again, UTF-8 is a standard choice for such files especially when the system is being re-designed to be UTF-8 aware.
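As a minimal sketch of this check (in Python for brevity; `decode_config` is a hypothetical helper, and the policy of rejecting anything that is not UTF-8 is an assumption of this example), the bytes of a start-up file can be decoded strictly, so that a file saved in an editor's native encoding is rejected instead of being silently misread.

```python
def decode_config(raw: bytes) -> str:
    """Decode start-up file bytes, insisting on UTF-8.

    The "utf-8-sig" codec also accepts files that some editors prefix
    with a byte order mark. Bytes that are not valid UTF-8 are rejected.
    """
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError as err:
        raise ValueError("configuration file is not UTF-8") from err

# A file saved as UTF-8 is accepted ...
print(decode_config("title=\u65e5\u672c\u8a9e".encode("utf-8")))

# ... while the same text saved as Shift-JIS is reported, not misread.
try:
    decode_config("title=\u65e5\u672c\u8a9e".encode("shift_jis"))
except ValueError as err:
    print(err)
```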
Cases of interoperation across technologies also require attention. While JNI enables Java to interact with C/C++ components, it is important to understand the encoding conversion that occurs when text data is passed across the boundary. In legacy applications (developed for a specific culture), the text data is normally converted from UTF-16 (the internal encoding of strings in Java) to a native encoding (for example, Shift-JIS or EUC-JP in Japanese products). There are similar cases of interoperation between .NET (where text data is again UTF-16) and legacy C/C++ components. To internationalize such code, analyze and understand what it would take to enforce a neutral encoding (UTF-8, UTF-16 etc.) in the C/C++ component source code. (Of course, this becomes a problem if you are interfacing with a third-party library, but then that is identified as a limitation.)

Base encoding of text data within the process in execution

By default, text data originating within the application and stored in character arrays/strings is in the default encoding of the system. On Windows, for example, text data stored in normal char arrays or strings is encoded as determined by the current code page. Data stored in the wide char versions, however, is UTF-16 on Windows systems (and typically UTF-32 on Unix-like systems).
Ideally, one should perform a thorough analysis and determine which encoding textual data should have within the application. When textual data enters the system from various sources (files, inter-process communication, user input from user interfaces etc.), storing it in character arrays/strings as received results in differently encoded strings coexisting within the system. Conversions from one encoding to another are then needed in scattered locations, leading to code that is hard to maintain and difficult to understand.
It is thus important to ensure that a conversion to the determined target encoding is performed at all gateways (entry points) to the application so that when the actual business of string processing is being performed, all text data is uniformly encoded.
When the processed text data needs to be displayed or passed out of the system (exit points), perform the conversion from the base encoding to the original encoding at the point of receipt (or to whatever the outside component expects).
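The gateway idea can be sketched as follows. This is only an illustration in Python: `at_entry`, `at_exit` and the choice of UTF-8 as `BASE_ENCODING` are assumptions of the example, not an existing API.

```python
BASE_ENCODING = "utf-8"  # assumed base encoding for this sketch

def at_entry(raw: bytes, source_encoding: str) -> bytes:
    """Gateway: convert incoming bytes to the base encoding."""
    return raw.decode(source_encoding).encode(BASE_ENCODING)

def at_exit(base: bytes, target_encoding: str) -> bytes:
    """Exit point: convert base-encoded bytes to what the receiver expects."""
    return base.decode(BASE_ENCODING).encode(target_encoding)

incoming = "\u8a2d\u5b9a".encode("shift_jis")   # e.g. text from a legacy component
internal = at_entry(incoming, "shift_jis")      # uniformly UTF-8 inside the system
outgoing = at_exit(internal, "shift_jis")       # back to the caller's encoding
```

All string processing between the two calls sees only one encoding, so no conversions are needed in the business logic itself.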

 

The dilemma of choosing the base encoding

It is best to avoid choosing native encodings as the base encoding of strings in the application. Universally, Unicode is the best choice, but one could then debate which flavor of Unicode encoding (UTF-8, UTF-16 or UTF-32) to choose. UTF-8 is widely considered the better option because its single-octet encoding of ASCII characters helps keep the code as near to the original as possible, and allows the code to work with the majority of existing APIs that take byte strings and do not deal with characters individually. There are other advantages (and disadvantages) of adopting UTF-8, which are discussed here.  With UTF-8 as the choice of base encoding, it suffices to use the 'char' data type to store UTF-8 encoded strings.
It must be noted that ASCII characters take 1 byte in UTF-8 and 2 in UTF-16. For most modern European languages, text data stored in UTF-8 consumes less space than the same data in UTF-16 or UTF-32 because of the high proportion of ASCII characters. Most Chinese and Japanese characters, on the other hand, take 3 bytes in UTF-8 against 2 bytes in UTF-16, so text that is predominantly Chinese or Japanese is larger in UTF-8. Where such text is mixed with ASCII content (markup, identifiers, numbers), the overhead is smaller, and UTF-8 encoded data stored in char arrays is still considered a good choice. However, to make room for localized text data (in Chinese or Japanese) in UTF-8 encoded character arrays, it is advisable to extend legacy character buffers to 3 times their size to be safe.
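The sizes quoted above can be verified with a small sketch (Python for brevity; both function names are illustrative). It also shows where a growth factor of 3 can arise when a buffer was originally sized for a legacy encoding such as Shift-JIS.

```python
def encoded_sizes(text):
    """Bytes needed for `text` in UTF-8, UTF-16 and UTF-32 (no byte order mark)."""
    return (len(text.encode("utf-8")),
            len(text.encode("utf-16-le")),
            len(text.encode("utf-32-le")))

def growth_factor(text, legacy_encoding):
    """How much larger `text` becomes when moved from a legacy encoding to UTF-8."""
    return len(text.encode("utf-8")) / len(text.encode(legacy_encoding))

print(encoded_sizes("A"))        # ASCII letter
print(encoded_sizes("\u00e4"))   # a with diaeresis
print(encoded_sizes("\u65e5"))   # a CJK ideograph
print(growth_factor("\u65e5", "shift_jis"))   # double-byte kanji
print(growth_factor("\uff71", "shift_jis"))   # single-byte half-width katakana
```

The half-width katakana case is the one that needs three times the space: one byte in Shift-JIS becomes three bytes in UTF-8.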

 

The choice of the data-type storing text

You might have observed that 'wchar_t' is a popular data type expected to store and process international characters. However, there are certain limitations with respect to using this type, especially when the same code base is expected to execute cross-platform.
The problem with wchar_t is that it holds UTF-16 code units on Windows (2 bytes each) but UTF-32 code units on Unix-like systems (4 bytes each). Cross-platform C/C++ source code that uses wchar_t can therefore be expected to consume about twice the memory on UNIX systems as on Windows systems. The choice of data type can be made based on the maximum amount of text data that is expected to be held in memory at any point in time (for example, when a huge file needs to be read into memory and processed).
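A rough way to see the difference, sketched in Python: `ctypes.sizeof(ctypes.c_wchar)` reports the wchar_t size of the platform the script runs on, and the hypothetical helper `wide_buffer_bytes` models the memory a wchar_t array would need for a given size.

```python
import ctypes

def wide_buffer_bytes(text, wchar_size):
    """Bytes needed to hold `text` (without terminator) in a wchar_t array.

    wchar_size 2 models Windows (UTF-16 code units),
    wchar_size 4 models typical Unix-like systems (UTF-32).
    """
    if wchar_size == 2:
        return len(text.encode("utf-16-le"))
    if wchar_size == 4:
        return len(text.encode("utf-32-le"))
    raise ValueError("unsupported wchar_t size")

print(ctypes.sizeof(ctypes.c_wchar))          # wchar_t size on this machine
print(wide_buffer_bytes("data", 2))           # 8 bytes with 2-byte wchar_t
print(wide_buffer_bytes("data", 4))           # 16 bytes with 4-byte wchar_t
print(wide_buffer_bytes("\U0001F600", 2))     # 4 bytes: a surrogate pair
```

The last line is a reminder that one wchar_t is not always one character on Windows: characters outside the Basic Multilingual Plane occupy two UTF-16 code units.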

 

Text encoding under the hood in Operating Systems

It is important to understand the base encoding used by the operating system on which your application runs. From the Windows NT architecture onwards, text representation under the hood in the Windows OS has been based on UTF-16, while Unix-like systems are generally considered to be based on UTF-8. Any call to a Windows API with natively encoded text data results in an internal conversion to UTF-16 and subsequent execution based on UTF-16. (For example, the Windows API GetModuleFileName() actually obtains an intermediate UTF-16 encoded string as per its base encoding, which is then converted into the encoding of the calling program.)

The use of third-party I18N libraries

Tools can be used to examine string operations occurring in the code base. Assuming text data has been converted to UTF-8 at the gateways, operations like strcpy(), strcat() etc. can be retained as they are, since they deal with whole strings and byte lengths and not with character counts. One does, however, need to pay attention to string operations that deal with individual characters or counts, like strncpy() (which can cut a multi-byte sequence in the middle) and strchr() (which can search only for a single byte value). It is here that third-party libraries like ICU, Rogue Wave and the like can be used to perform the necessary processing on the UTF-8 byte stream.
Remember that the inclusion and use of such libraries could result in binary bloat and marginally reduced performance compared to the original, but these are trade-offs that one needs to accept given the constraint of having to work on legacy source code.
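To illustrate why byte-oriented and character-oriented operations differ on UTF-8 data, here is a minimal sketch on raw bytes (Python for brevity; the helpers are hypothetical, assume valid UTF-8 input, and are not a substitute for a library such as ICU).

```python
def is_continuation(byte):
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80

def count_characters(utf8: bytes) -> int:
    """Count code points, not bytes, in a UTF-8 byte string."""
    return sum(1 for b in utf8 if not is_continuation(b))

def truncate(utf8: bytes, max_bytes: int) -> bytes:
    """Cut to at most max_bytes without splitting a multi-byte sequence."""
    if max_bytes < 0:
        raise ValueError("max_bytes must not be negative")
    if len(utf8) <= max_bytes:
        return utf8
    cut = max_bytes
    # Step back until the cut lands on the first byte of a character.
    while cut > 0 and is_continuation(utf8[cut]):
        cut -= 1
    return utf8[:cut]

data = "a\u65e5\u672c".encode("utf-8")   # 7 bytes, 3 characters
print(len(data), count_characters(data))
print(truncate(data, 3))                 # only b'a' fits without splitting
```

A plain byte-count copy limited to 3 bytes would have kept half of the second character, which is the kind of defect that a strncpy() call can introduce once the data is UTF-8.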

 

This blog has attempted to highlight considerations that affect requirements related to internationalization of legacy C/C++ code. The next time you are expected to do something along these lines, do question yourself on the points discussed here and determine the way ahead with the accompanying trade-offs.

May 18, 2010

Internationalization - Concepts of String Collation and approach

This post provides high-level insights into the concepts of string collation, its significance and the approach followed in various technologies/frameworks. It should help in understanding the concepts and serve as a pointer to additional details, so that designers and developers can build internationalization-ready products that handle collation more efficiently.

Background:

Don't you think it is easy to sort a bunch of 5-6 mangoes in the order of size when you are asked? How about sorting a bunch of mixed fruit varieties as per size? Sure, you shouldn't be having a problem here either. But assume that, there are a couple of Mangoes and a couple of Apples of the same size. Is it not difficult to decide their place in the order? How do we decide whether it is the Mangoes that have to be placed first or the Apples (considering 2-3 of these are of same size)?  Depending on the individual, some people may place the Apple first while some others might decide to place the Mango first. But ultimately, the correct order can be decided only if there is clarity on the expectation of the person demanding this exercise. I am sure you will agree that the best score on a sort exercise can be obtained only if the rules governing the sort are obtained from the corresponding responsible groups.
I have attempted to cite a simple example of arriving at the correct output in an everyday scenario. The idea is to lead you towards an understanding of the term 'Collation' as used in the software internationalization world! In the software I18N world, the same set of characters may have a different sorting order depending on the context of use in a particular culture.

Locale and its relevance:
Let us spend some time to understand 'Locale' as the first step.
Locale is a term that is very widely used in the I18N world. It refers to culture-specific processing of data for various types of formatting (date-time formatting, currency/number formatting, calendar) and string handling (such as string comparison, sorting etc.). To elaborate a bit on 'Culture': it is the set of specific requirements that need to be built into the software so that it reflects the unique characteristics of specific regions or locations, which makes local users more appreciative of the application (the format of a date is an example, where the order of placement of the day, month and year varies from culture to culture).
In the Windows world, the locale is based on the 'regional setting option' set in the system. The selected option can be detected programmatically using Windows system APIs, and applications can process data as required based on this locale setting.
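The date example can be made concrete with a small sketch. The pattern table below is purely illustrative and hard-coded for this example (written in Python for brevity); a real application would obtain the conventions from the platform's locale data rather than from such a table.

```python
from datetime import date

# Illustrative patterns only; not read from the operating system.
DATE_PATTERNS = {
    "en-US": "%m/%d/%Y",   # month, day, year
    "en-GB": "%d/%m/%Y",   # day, month, year
    "de-DE": "%d.%m.%Y",   # day, month, year with dots
    "ja-JP": "%Y/%m/%d",   # year, month, day
}

def format_date(value, culture):
    """Format a date using the pattern assumed for the given culture."""
    return value.strftime(DATE_PATTERNS[culture])

posted = date(2010, 5, 27)
for culture in DATE_PATTERNS:
    print(culture, format_date(posted, culture))
```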

Collation and its relevance:
Now let us look into the topic in focus- 'Collation'.
The term 'Collation' (commonly used as 'String Collation') is another very important concept in the I18N dictionary. It refers to the locale-specific handling of strings, where special attention is required because of region-specific requirements for character recognition, string processing etc. Are you wondering why this is a big deal when we follow an appropriate Unicode encoding (say UTF-8 or UTF-16) anyway? Choosing the appropriate encoding does solve the first part of the problem, as each character can be identified uniquely (considering that the application needs to support various countries and languages in a global scenario). It does not solve the other part, which is the unique processing needed for each region's specific conventions and requirements. The following simple example might help throw more light on this.
The letter 'ä' is considered an individual letter in Swedish and is hence sorted after the letter 'z' in the alphabet. But the same letter 'ä' is considered an accented 'a' (or equivalent to 'ae') in German and hence sorts along with the letter 'a'.
This becomes complex because collation varies from language to language and from region to region. Notwithstanding this, the application still has to provide generic/common customization to serve the unique preferences of users in specific regions.
Besides sorting order variations based on the language, special attention is also needed because of other requirements such as 'context of usage' and 'customization rules'.
To elaborate further using a similar example, in a German dictionary 'of' is sorted before 'öf', whereas in a German telephone directory 'öf' is sorted before 'of' (because 'ö' is treated there as 'oe'). Examples of customization rules for sorting are 'always treat upper case letters first' (such as 'Z' before 'z'), 'always treat lower case first' (such as 'z' before 'Z') or 'ignore punctuation'.
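The following sketch imitates these conventions with hand-written sort keys. It is a toy under stated assumptions: the key functions are hypothetical, handle only precomposed lower-case Latin letters, and are no replacement for a real collation library. Swedish places 'å', 'ä', 'ö' after 'z', German dictionary order treats an umlaut as its base vowel with the accent as a tie-breaker, and German telephone-directory order expands an umlaut to vowel + 'e'.

```python
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"  # å ä ö follow z

def swedish_key(word):
    """Rank each letter by its position in the Swedish alphabet."""
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]

def german_dictionary_key(word):
    """Treat umlauts as their base vowel; the accent only breaks ties."""
    lowered = word.lower()
    base = lowered.translate(str.maketrans("\u00e4\u00f6\u00fc", "aou"))
    return (base, lowered)

def german_phonebook_key(word):
    """Expand umlauts to vowel + e, as in telephone directories."""
    expanded = word.lower()
    for umlaut, pair in (("\u00e4", "ae"), ("\u00f6", "oe"), ("\u00fc", "ue")):
        expanded = expanded.replace(umlaut, pair)
    return expanded

words = ["zon", "\u00e4ra", "arm"]
print(sorted(words, key=swedish_key))             # 'ära' comes last
print(sorted(words, key=german_dictionary_key))   # 'ära' sorts with the a's
pair = ["\u00f6f", "of"]
print(sorted(pair, key=german_dictionary_key))    # 'of' first
print(sorted(pair, key=german_phonebook_key))     # 'öf' first
```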
Another consideration is numerical sorting. This matters because computers internally use character sets that assign a numeric code point to each character, so strings containing digits are compared digit by digit rather than by numeric value. Important aspects to consider in this case are 'leading zeros', which bring the numbers to a uniform number of digits, and adding a constant value to all negative numbers (internally) to turn them positive so that ascending order is meaningful during display. These are some of the considerations that help resolve issues related to numeric sorting (there are of course others, based on various problem sets, that we are not considering here).
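The 'leading zeros' idea can be sketched as a sort key (Python for brevity; `natural_key` and the padding width of 10 are assumptions of this example, and digit runs longer than the width are not handled).

```python
import re

def natural_key(text, width=10):
    """Pad every run of digits with leading zeros so that digit runs
    compare by numeric value instead of character by character."""
    return re.sub(r"\d+", lambda m: m.group().zfill(width), text)

files = ["file10", "file2", "file1"]
print(sorted(files))                    # ['file1', 'file10', 'file2']
print(sorted(files, key=natural_key))   # ['file1', 'file2', 'file10']
```

The padding is applied only to the key used for comparison, so the strings shown to the user remain unchanged.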
More complexities on Collation!
We looked into sorting order variations due to language, context of usage and preferences. Complexity increases further with additional factors like 'base characters', 'accents', 'case', 'punctuation' and 'Tie-Breaker'. As the number of parameters to be considered increases, it is important to define an optimal algorithm that considers these parameters in a particular order and at particular levels. This matters because string comparison operations are CPU intensive, and performance is directly impacted unless they are handled appropriately.
To handle this, the Unicode consortium (www.unicode.org) has defined a set of 'Comparison Levels' that help define multilevel comparison algorithms. These levels are L1, L2, L3, L4 and Ln for 'base characters', 'accents', 'case', 'punctuation' and 'Tie-Breaker' respectively. This concept helps define efficient algorithms in a standard way for string processing such as comparison, sorting and ordering. The levels are described as part of the Unicode Collation Algorithm in Unicode Technical Standard #10 (UTS #10), published by the Unicode consortium (http://unicode.org/reports/tr10/).
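A toy version of the first three levels can be sketched as below. This is not the Unicode Collation Algorithm itself: there is no weight table and no punctuation handling, and `collation_key` is a hypothetical helper written in Python for brevity. It only shows how a difference in base letters outranks a difference in accents, which in turn outranks a difference in case.

```python
import unicodedata

def collation_key(text):
    """Build a three-level key: base letters, then accents, then case."""
    primary, secondary, tertiary = [], [], []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        base = decomposed[0]
        primary.append(base.lower())       # level 1: base character
        secondary.append(decomposed[1:])   # level 2: combining marks, "" if none
        tertiary.append(base.isupper())    # level 3: lower case sorts first
    return (primary, secondary, tertiary)

words = ["roles", "r\u00f4le", "Role", "role"]
print(sorted(words, key=collation_key))
```

With this key the words come out as 'role', 'Role', 'rôle', 'roles': case is consulted only when base letters and accents are equal, and accents only when base letters are equal.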
Another complexity that we need to consider is 'canonical equivalent characters'. This means the same character can be represented in multiple ways. An example is the Latin letter 'A' with a ring above: it can be represented either as the single precomposed character 'Å' or as the letter 'A' followed by a combining ring above ('A' + '◌̊'). Canonical equivalent characters must be treated as occupying the same position when sorting a set of characters.
In East Asian languages, sorting and comparison are ordered by the stroke and radical of ideographs. Again, for this there are standards such as JIS X 4061:1996, where the collation rules are defined.
We have discussed various collation requirements in string processing, many of which clearly apply to the UI or business logic layer. However, collation also plays a crucial role in the data tier (database) when internationalizing the application/product. Questions here include how strings are processed during search operations (the efficiency of the database engine in accomplishing this under various collation rules), the efficiency and correctness of filtering and comparison operations, and the level of flexibility available (for example, whether different collations can be chosen at the server, database, table or column level). The list grows as we think deeper on this.
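Canonical equivalence can be checked with the Unicode normalization forms. The sketch below (Python for brevity; `canonically_equal` is an illustrative helper) compares the precomposed and the combining representation of the same letter.

```python
import unicodedata

precomposed = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
combining = "A\u030a"    # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == combining)                   # False: different code points
print(canonically_equal(precomposed, combining))  # True: same character
```

Normalizing text at the same gateways where the encoding is converted keeps such comparisons out of the rest of the code.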

How do we approach and what is available to help on this?
Based on the above, it is very clear that, to perform string operations (such as order, sort, compare etc), one shouldn't rely on only 'code point' values. Such operations should be performed by assigning multi-level weights to characters or sequence of characters and then comparing those weights at each level. The standard Unicode Collation Algorithm (UCA) comes with a default weight table for all assigned characters as well as a tailoring mechanism that describes how this table can be modified to conform to local conventions, where necessary.
Various platform and technology vendors have adopted the available standards (with their defined collation rules) and have come up with algorithms and libraries to make the developer's life easier. Let us look at a couple of these at a very high level. For developers, it is critical to know these and to understand their appropriate usage depending on the technology/platform being worked upon.
The Microsoft .NET framework provides the System.Globalization namespace to help developers build internationalized products more efficiently. This namespace provides the 'CultureInfo' class, which gives access to collation-related methods and attributes apart from various other locale-specific needs. This is in line with the Unicode Technical Standard defined by the Unicode consortium.
The Java platform provides the 'Collator' class (in the 'java.text' package) to perform various locale-sensitive string operations. It addresses the levels of complexity discussed earlier by providing various strength levels for comparison (the 'strength' property of the Collator class), along with methods for setting and reading the strength. The platform also provides various overloaded methods for comparison. There are a few extended and additional classes, such as 'CollationKey' and 'RuleBasedCollator', which can be leveraged as required to write more efficient code.
Similarly, for C/C++ developers, the ICU (International Components for Unicode) libraries are very handy, offering high-performance and efficient collation-related operations. ICU is an open source project containing widely portable C/C++ and Java libraries for software internationalization. Facilities for language-sensitive collation and searching are available to help in string processing operations.

Conclusion:
Collation is a very important consideration when building internationalized products and applications. There is a set of collation rules, guidelines and standards available that makes implementing it a lot easier. There are also libraries and frameworks in place to implement these in various technologies.
But the onus is on developers to know which libraries are available, how to use them, and which one to use in a given context. It is crucial to choose the right set of libraries with the right set of parameters, and this cannot be achieved unless one has a good grasp of the concepts of collation and of the available technologies.


 

May 3, 2010

Enhancing Productivity of the Internationalization Process

Staffing for i18n/L10n projects is normally done by bringing in people who have prior experience of Internationalization along with a team that is well versed in the underlying technology (Java, C++ etc.). In these cases, there is generally the overhead of training the team on i18n and L10n concepts. Unless the whole team fully understands the Internationalization process, it will not be very productive. In the real world, it is almost impossible to get a perfect team that has good Internationalization experience in addition to the required technical skills. Also, with tight deadlines looming, it is usually not possible to invest a lot of time in training the team on i18n/L10n concepts. So the best way to execute the project is to improve the productivity of the team by using Software Productivity Tools, and in turn enhance the productivity of the Internationalization process itself.

We were once working on a proposal to internationalize a few products for a leading Fortune 500 company. During the effort estimation stage, the plan was to estimate for the effort to internationalize the products by analyzing the corresponding source code. There were well over 10 million lines of C++ and Java code with only 3 days planned to complete the activity. In this kind of situation there is intense pressure to finish the work on time. In addition, one also needs to ensure that the analysis is accurate enough to enable an effort estimate which is realistic enough for the development team to finish the project. Deadlines are tight and there are unrealistic expectations from the people doing the exercise. In such scenarios, the productivity of the team can be enhanced by making use of Software Productivity Tools specifically designed for i18n/L10n projects.

These Software Productivity Tools remove most of the manual effort that developers have to put in during i18n analysis and increase their productivity many times over. A typical i18n exercise consists of line-by-line analysis of the code to check for hard-coded strings, non-Unicode APIs etc. The code size can run into millions of lines, which makes it practically impossible for a team to complete this activity manually. In fact, doing the analysis would be an entire project in itself. i18n static analysis tools automate this activity by scanning the entire code and generating reports that clearly segregate the hard-coded strings and the areas of the code that are not i18n aware. The tools can also give information on the number and usage of non-Unicode APIs and data types, date/time formatting etc. An activity that would usually have taken days to complete can be done in a few hours. In addition, some tools provide options to change the code automatically by moving all hard-coded strings to a resource file and substituting non-Unicode APIs and data types with their Unicode equivalents (depending on the target encoding). By automatically taking care of the low-level activities involved in the internationalization process, these tools enable developers to focus on improving product quality by working on more specific issues.
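To give a feel for what such a scan does, here is a deliberately small sketch in Python. It is not the tool described in this post: the rule list and the `scan` function are hypothetical, and it ignores comments, #include lines and multi-line literals, all of which a real analyzer has to handle.

```python
import re

# Hypothetical rule set for this sketch; a real tool would have far more rules.
NON_UNICODE_APIS = ("strlen", "strcpy", "strcat", "strncpy", "strchr", "printf")
STRING_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')
API_CALL = re.compile(r"\b(" + "|".join(NON_UNICODE_APIS) + r")\s*\(")

def scan(source):
    """Return (line number, kind, text) findings for C/C++ source text."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        for match in STRING_LITERAL.finditer(line):
            findings.append((number, "hard-coded string", match.group()))
        for match in API_CALL.finditer(line):
            findings.append((number, "non-Unicode API", match.group(1)))
    return findings

sample = 'char buf[16];\nstrcpy(buf, "Hello");\n'
for finding in scan(sample):
    print(finding)
```

Findings of this kind, aggregated over a code base, are what feed the effort estimate mentioned above.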

At Infosys, we use an in-house i18n tool that automates the process of analyzing source code for areas that are not i18n aware. The reports it generates help in the effort estimation exercise, product analysis, design and testing. A rule engine allows developers to set specific rules for code analysis in order to get reports based on the desired encoding to be supported, additional functions and keywords, filters etc. Multiple projects have benefited from using the tool, which has helped increase team productivity, accelerate product development and in turn ensure faster, quality deliverables to customers. Currently the tool supports C/C++ code and is being enhanced to support a variety of programming languages like Java and C#, along with better features, in order to be more competitive with some of the commercially available tools.

Defect Prevention in SDLC Phases

The purpose of Defect Prevention is to identify the cause of defects and prevent them from recurring. Defect Prevention as a concept is something that people all over surely identify with, no matter if it has to do with things in the industry or with life in general. Is it not a great blessing if basic problems are identified well upfront, to enable the change of course before things become serious? In the industry the world over, there is a conscious effort to have processes in place to identify defects early enough to enable better quality within expected and scheduled timelines. In software, this is no different. The problem is however to identify finer aspects of 'when' you should trigger a conscious defect detection and prevention effort, so as to achieve benefits in the task under progress.

Defect Prevention is one of the three key process areas (KPAs) that must be satisfied to reach Level 5 maturity under the SEI-CMM. As expressed here - "Defect Prevention involves analyzing defects that were encountered in the past and taking specific actions to prevent the occurrence of those types of defects in the future. The defects may have been identified on other projects as well as in earlier stages or tasks of the current project". Most readers working in mature software organizations will have heard this before. But what deserves special attention is the point regarding "earlier stages or tasks of the current project".

Defect Prevention and related activities of the genre are unfortunately considered to be post-project activities with the focus on providing enough inputs to prevent similar defects in subsequent projects. There are of course no complaints about this. But, the problem with relegating this as a post-project activity is that you miss on improving in each phase of your current project.

I have always believed that it is important to nurture defect prevention into each phase of an executing project. Make no mistake; this is an important investment that is going to save a lot of long term headaches for your project.

Most companies are in a race to recruit the best personnel coming out of universities and colleges around the world. The plan is to invest in this talent and train them over a period to reap benefits in actual production environments. Despite training, it is natural that these personnel need time to adapt to live working environments. When such a fresher is assigned to code a particular module, it is normal that some habits of college-style coding creep into the work product, which could in turn grow into much bigger problems at the end if not caught and controlled in advance. Typical problems include non-conformance to coding standards, non-performant code, failure to consider the failure case in a conditional statement, missing initialization of variables etc. A planned code review at the end of the coding phase would detect a large number of such problems, possibly shifting the focus of the review from actual functionality to these details. Besides, sending the code back to the developer to fix all such problems would result in a good amount of change in the baseline code, with the possibility of new defects being introduced - and hence the need for another code review. This could introduce unwanted cycles of fix and review, affecting both the schedule and confidence in the work achieved. Matters compound when customers ask why such a large number of defects are being detected only now.

Defect Prevention inserted in each phase of the project can help prevent such a situation. Adopting a stop, review, analyze and feed back approach at around 10% to 20% completion of the current phase helps. Perform a rigorous review of the code completed at 10% of the coding phase, analyze the defects, determine the root causes, plan measures and feed the results back to the developers. Most defects detected at such checkpoints relate to standards, coding style, carelessness, or inexperience in considering both the positive and negative conditions. Pointing these out early prevents repetition of such defects in the remaining 80% to 90% of the code, allowing a more meaningful end-of-phase review focused on functionality and other quality attributes. Such focus helps unearth the meaningful, complex defects whose fixing results in code that inspires confidence, within expected timelines.

Even in projects having regular experienced team members, such a phase is important to understand the course of the phase early enough and signal corrections to prevent end phase frustrations and long term project complications.

In the Requirements phase, analyzing and documenting a single detailed (important) business case, or a subset of business cases, and then discussing it with the customer to check that all their needs are well covered helps in covering the remaining business use cases with the necessary detail and with an understanding of the customer's thinking process. Such an exercise can uncover previously unexpressed thoughts and recommendations in the customer's mind, which help with the other use cases going forward.

In the Design Phase, a thorough review of the high level design/architecture would help in preventing a wrong course in the detailed design. Review on 10%-20% completion of the detailed design, will help in validating if the designers have understood the requirements correctly early enough to enable a course correction. A review here also enables correcting incorrect assumptions, incorrect documentation styles  etc. which are basic defects which cause immense heartburn if caught in the end-of-phase review resulting in preventable rework.

With regard to Testing, a review of the test cases created at 10%-20% completion helps detect whether the test case creator is being imaginative enough in creating cases for normal, boundary and error conditions, with a good understanding of the functionality. Problems in understanding functionality can be disastrous if not caught early enough, as the test cases created might not be meaningful or penetrating enough to guarantee a good quality end product. Of course, the usual problems of documentation style etc. can be corrected here as well.

Remember the old adage - "An apple a day keeps the doctor away". Detecting defects and correcting course early enough in each phase, keeps end project complexities away. Insist on incorporating defect prevention activities in each phase, and reap the benefits of a good quality product with reduced frustrating rework!
