Internationalization - Concepts of String Collation and approach
This post gives a high-level overview of string collation: what it is, why it matters, and how various technologies and frameworks approach it. It should help in understanding the concepts and serve as a pointer to further details, so that designers and developers can build internationalization-ready products that handle collation correctly and efficiently.
Background:
Don't you think it is easy to sort a bunch of 5-6 mangoes in order of size when asked? How about sorting a bunch of mixed fruit by size? Sure, you shouldn't have a problem there either. But assume there are a couple of mangoes and a couple of apples of the same size. Isn't it difficult to decide their place in the order? How do we decide whether the mangoes or the apples go first? Some people may place the apple first while others may place the mango first. Ultimately, the correct order can be decided only if the expectation of the person asking for the exercise is clear. I am sure you will agree that the best score on a sorting exercise can be obtained only if the rules governing the sort come from the people responsible for defining them.
I have used a simple everyday example to show what it takes to arrive at the correct output. The idea is to lead you towards an understanding of the term 'Collation' as used in the software internationalization (I18N) world. In the I18N world, the same set of characters may sort in a different order depending on the culture in which it is used.
Locale and its relevance:
Let us spend some time to understand 'Locale' as the first step.
Locale is a term that is very widely used in the I18N world. It refers to culture-specific processing of data for various types of formatting (date-time formatting, currency/number formatting, calendar) and string handling (such as string comparison and sorting). To elaborate a bit on 'culture': it is the set of requirements that must be built into the software so that it reflects the unique conventions of a specific country or region, which makes the application feel natural to local users. Date format is an example: the order of the day, month and year varies from culture to culture.
In the Windows world, the locale is based on the regional settings option selected in the system. The selected option can be detected programmatically using Windows system APIs, and applications can then process data as required, based on this locale setting.
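As a rough illustration of culture-specific date formatting, the sketch below formats one date using a small hand-written table of patterns. The locale tags, the `DATE_PATTERNS` table and the `format_date` helper are assumptions made up for this example; a real application would take such patterns from the platform's locale data rather than hard-coding them.

```python
from datetime import date

# Hypothetical pattern table for illustration only; real applications
# should rely on the platform's locale data instead of hard-coding this.
DATE_PATTERNS = {
    "en_US": "%m/%d/%Y",  # month, day, year
    "en_GB": "%d/%m/%Y",  # day, month, year
    "de_DE": "%d.%m.%Y",  # day, month, year separated by dots
    "ja_JP": "%Y/%m/%d",  # year, month, day
}

def format_date(value, locale_tag):
    """Format a date with the pattern registered for the given locale tag."""
    return value.strftime(DATE_PATTERNS[locale_tag])

sample = date(2024, 3, 5)
for tag in DATE_PATTERNS:
    print(tag, format_date(sample, tag))
```

The same calendar day comes out as four different strings, which is exactly why the locale has to be known before data is formatted or parsed.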
Collation and its relevance:
Now let us look at the topic in focus: 'Collation'.
The term 'Collation' (commonly 'String Collation') is another very important concept in the I18N dictionary. It refers to the locale-specific handling of strings, where special attention is required because of region-specific requirements for character recognition and string processing. Are you wondering why this is a big deal when we already use an appropriate Unicode encoding (say UTF-8 or UTF-16)? Sure, choosing the appropriate encoding solves the first part of the problem, since every character can be identified uniquely (considering that the application needs to support various countries and languages). It does not solve the other part of the problem, which is the processing that each region's conventions and requirements call for. The following simple example might help throw more light on this.
The letter 'ä' is considered a separate letter in Swedish and is therefore sorted after the letter 'z' in the alphabet. In German, the same letter 'ä' is considered an accented form of 'a' (and is sometimes expanded to 'ae'), so it is sorted alongside the letter 'a'.
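The difference can be sketched with two toy sort keys. This is only an illustration under stated assumptions: the `SWEDISH_ALPHABET` string and both key functions are hand-written for this example and handle only lower-case-able letters found in that alphabet; production code would use real collation data from a library such as ICU.

```python
import unicodedata

# Hypothetical, hand-written alphabet for illustration only.
# The last three letters are a-ring, a-umlaut and o-umlaut.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def swedish_key(word):
    """Rank each letter by its position in the Swedish alphabet."""
    composed = unicodedata.normalize("NFC", word.lower())
    return [SWEDISH_ALPHABET.index(ch) for ch in composed]

def german_key(word):
    """Compare base letters first; use the accent only to break ties."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, decomposed)

words = ["zebra", "\u00e4pple", "apple"]
print(sorted(words, key=swedish_key))  # ['apple', 'zebra', 'äpple']
print(sorted(words, key=german_key))   # ['apple', 'äpple', 'zebra']
```

The input list is identical in both calls; only the rules applied to it change.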
This becomes complex because collation varies from language to language and from region to region. Notwithstanding this, the application still has to provide a common mechanism that can be customized to serve the preferences of users in specific regions.
Besides variations in sort order based on language, special attention is also needed because of other requirements such as 'context of usage' and 'customization rules'.
To elaborate with German again: in a German dictionary, 'of' is sorted before 'öf', because 'ö' is treated as an accented 'o', whereas in a German telephone directory 'öf' is sorted before 'of', because 'ö' is treated as 'oe'. Examples of customization rules for sorting are 'always place upper case letters first' (such as 'Z' before 'z'), 'always place lower case first' (such as 'z' before 'Z') and 'ignore punctuation'.
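The customization rules mentioned above can be sketched as interchangeable sort keys. The three function names are hypothetical and the logic is deliberately simplistic (ASCII punctuation only, no locale data); it is meant to show that the rule is a parameter of the sort, not a property of the strings.

```python
import string

def upper_first_key(text):
    """Ignore case first; on a tie, place upper case before lower case."""
    return (text.lower(), [ch.islower() for ch in text])

def lower_first_key(text):
    """Ignore case first; on a tie, place lower case before upper case."""
    return (text.lower(), [ch.isupper() for ch in text])

def ignore_punctuation_key(text):
    """Drop ASCII punctuation before comparing."""
    return "".join(ch for ch in text.lower() if ch not in string.punctuation)

print(sorted(["z", "Z", "a"], key=upper_first_key))  # ['a', 'Z', 'z']
print(sorted(["Z", "z", "a"], key=lower_first_key))  # ['a', 'z', 'Z']

words = ["co-op", "coat", "coop"]
print(sorted(words))                              # ['co-op', 'coat', 'coop']
print(sorted(words, key=ignore_punctuation_key))  # ['coat', 'co-op', 'coop']
```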
Another consideration is numerical sorting. This matters because computers internally use character sets that assign a numeric code point to each character, and strings are compared character by character, so '10' sorts before '2'. Collating and sorting strings that contain numbers (numeric characters) is therefore challenging. Common techniques include padding numbers with leading zeros so that all of them have the same number of digits, and adding a constant value to the numbers (internally) so that negative values become positive and ascending order stays meaningful. These techniques help resolve some of the issues with numeric sorting (there are of course other considerations, based on various problem sets, that we are not covering here).
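A minimal sketch of both ideas follows: a "natural" sort key that compares digit runs by numeric value, and the leading-zero padding trick. The `natural_key` helper is a name invented for this example, and it does not attempt to handle signs, decimals or locale-specific digit grouping.

```python
import re

def natural_key(text):
    """Split text into digit and non-digit runs and compare digit runs by value."""
    key = []
    for part in re.findall(r"\d+|\D+", text):
        if part.isdecimal():
            key.append((0, int(part)))     # numeric run: compare as a number
        else:
            key.append((1, part.lower()))  # text run: compare as lower-cased text
    return key

files = ["file10.txt", "file2.txt", "file1.txt"]
print(sorted(files))                   # ['file1.txt', 'file10.txt', 'file2.txt']
print(sorted(files, key=natural_key))  # ['file1.txt', 'file2.txt', 'file10.txt']

# Leading zeros give every number the same width, so plain string
# comparison then agrees with numeric order.
labels = ["7", "100", "42"]
print(sorted(labels))                            # ['100', '42', '7']
print(sorted(labels, key=lambda s: s.zfill(3)))  # ['7', '42', '100']
```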
More complexities on Collation!
We looked at variations in sort order due to language, context of usage and preferences. Complexity increases further with additional factors such as 'base characters', 'accents', 'case', 'punctuation' and 'tie-breaker'. As the number of parameters increases, it is important to define an efficient algorithm that considers them in a particular order and at particular levels. This matters because string comparison operations are CPU intensive, and performance suffers directly unless they are handled appropriately.
To handle this, the Unicode Consortium (www.unicode.org) defines a set of 'Comparison Levels' that support multilevel comparison algorithms. The levels are L1, L2, L3, L4 and Ln, for 'base characters', 'accents', 'case', 'punctuation' and 'tie-breaker' respectively. A difference at a lower-numbered level outweighs any difference at a higher-numbered one, so accents are considered only when the base characters are identical, case only when the accents are identical too, and so on. This concept helps define efficient, standard algorithms for string operations such as comparison, sorting and ordering. The details are specified in Unicode Technical Standard #10 (UTS #10), the Unicode Collation Algorithm, published by the Unicode Consortium as part of its technical reports (http://unicode.org/reports/tr10/).
Another complexity to consider is 'canonically equivalent characters': the same character can be represented in more than one way. An example is the Latin letter 'A' with a ring above. It can be represented as the single precomposed character 'Å' (U+00C5) or as the letter 'A' followed by the combining ring above, 'A' + '◌̊' (U+0041 U+030A). Canonically equivalent representations must occupy the same position when a set of strings is sorted.
In East Asian languages, sorting and comparison may be ordered by the stroke count and radical of ideographs. Here too there are standards in which collation rules are defined, such as JIS X 4061:1996 for Japanese.
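Canonical equivalence can be checked with Unicode normalization. The sketch below uses Python's standard `unicodedata` module; the `canonical_key` helper is a name chosen for this example and only removes the representation difference, it does not apply any language-specific ordering.

```python
import unicodedata

precomposed = "\u00c5"  # A with ring above as a single code point
combining = "A\u030a"   # letter A followed by the combining ring above

# The two strings look the same but are different code point sequences.
print(precomposed == combining)            # False
print(len(precomposed), len(combining))    # 1 2

def canonical_key(text):
    """Normalize to NFD so that equivalent representations compare equal."""
    return unicodedata.normalize("NFD", text)

print(canonical_key(precomposed) == canonical_key(combining))  # True
```

Normalizing before comparison (or using a collator that does so internally) keeps both forms at the same position in a sorted list.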
We have discussed various collation requirements in string processing, many of which clearly apply to the UI or business logic layer. However, collation also plays a crucial role in the data tier (database) when internationalizing an application or product. Relevant questions include how strings are processed during search operations (how efficiently the database engine does this under various collation rules), the efficiency and correctness of filtering and comparison operations, and the level of flexibility available (can different collations be chosen at the server level, database level, table level, column level and so on?). The list grows as we think deeper about this.
How do we approach this, and what is available to help?
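To make the data-tier point concrete, here is a small sketch using SQLite through Python's standard `sqlite3` module. SQLite is an assumption chosen only because it runs in memory without setup; the post does not name a database, and other engines expose collation at different levels and with different syntax. The `fruit` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column-level collation: comparisons on this column ignore ASCII case.
conn.execute("CREATE TABLE fruit (name TEXT COLLATE NOCASE)")
conn.executemany(
    "INSERT INTO fruit VALUES (?)",
    [("mango",), ("Apple",), ("apple",), ("Mango",)],
)

# The filter uses the column's NOCASE collation, so both spellings match.
matches = [row[0] for row in conn.execute(
    "SELECT name FROM fruit WHERE name = 'APPLE' ORDER BY rowid")]
print(matches)  # ['Apple', 'apple']

# Query-level override: sort by raw byte values instead.
binary_order = [row[0] for row in conn.execute(
    "SELECT name FROM fruit ORDER BY name COLLATE BINARY")]
print(binary_order)  # ['Apple', 'Mango', 'apple', 'mango']

conn.close()
```

The same rows are filtered and ordered differently depending on where the collation is declared, which is the kind of flexibility worth checking for in whichever database is used.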
Based on the above, it is clear that string operations (such as ordering, sorting and comparing) should not rely on 'code point' values alone. Such operations should be performed by assigning multi-level weights to characters or sequences of characters and then comparing those weights level by level. The standard Unicode Collation Algorithm (UCA) comes with a default weight table for all assigned characters, as well as a tailoring mechanism that describes how this table can be modified to conform to local conventions where necessary.
Various platform and technology vendors have adopted these standards (with their defined collation rules) and provide algorithms and libraries that make the developer's life easier. Let us look at a couple of these at a very high level. It is critical for developers to know them and to understand their appropriate usage on the technology or platform being worked on.
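The idea of multi-level weights can be sketched as follows. This is a toy approximation and not the UCA: instead of the standard's default weight table, it derives three levels from Unicode decomposition (base letters, then accents, then case). The `collation_key` function and its `strength` parameter are names invented for this example.

```python
import unicodedata

def collation_key(text, strength=3):
    """Build a toy multi-level sort key: base letters, then accents, then case."""
    decomposed = unicodedata.normalize("NFD", text)
    primary, secondary, tertiary = [], [], []
    for ch in decomposed:
        if unicodedata.combining(ch):
            # An accent is recorded against the base letter it follows.
            if secondary:
                secondary[-1] = secondary[-1] + (ord(ch),)
            continue
        primary.append(ch.lower())      # level 1: base character
        secondary.append(())            # level 2: accents (none so far)
        tertiary.append(ch.isupper())   # level 3: lower case before upper case
    # Lower levels are compared only when all higher levels are equal.
    return (primary, secondary, tertiary)[:strength]

words = ["roles", "r\u00f4le", "Role", "role"]
print(sorted(words, key=collation_key))  # ['role', 'Role', 'rôle', 'roles']

# At strength 1 only base letters count, so these compare as equal.
print(collation_key("Role", strength=1) == collation_key("r\u00f4le", strength=1))  # True
```

Passing the function as `key=` means each key is computed once per string rather than once per comparison, which helps with the performance concern raised earlier.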
The Microsoft .NET Framework provides the System.Globalization namespace to help developers build internationalized products more efficiently. This namespace provides the 'CultureInfo' class, which gives access to collation-related methods and properties (through its 'CompareInfo' property) in addition to various other locale-specific needs. This is in line with the Unicode Technical Standard defined by the Unicode Consortium.
The Java platform provides the 'Collator' class (in the 'java.text' package) to perform various locale-sensitive string operations. It addresses the levels of complexity discussed earlier by offering several strength levels for comparison (the 'strength' property of the Collator class), with methods for setting and reading the strength. It also provides overloaded methods for comparison. Related classes such as 'CollationKey' and the 'RuleBasedCollator' subclass can be leveraged as required to write more efficient code.
Similarly, for C/C++ developers, the ICU (International Components for Unicode) libraries are very handy for high-performance, efficient collation-related operations. ICU is an open source project containing widely portable C/C++ and Java libraries for software internationalization. It offers facilities for language-sensitive collation and searching to help with string processing operations.
Conclusion:
Collation is a very important consideration when building internationalized products and applications. There is a set of collation rules, guidelines and standards available to make implementing it a lot easier. There are also libraries and frameworks in place to implement these in various technologies.
But the onus is on developers to know which libraries are available, how to use them, and which one to use in a given context. It is crucial to choose the right set of libraries with the right set of parameters, and this can't be achieved unless one has a good grasp of the concepts of collation and of the available technologies.


