Deciding Optimal Unicode Solution for Globalization Database
The concept of Globalization and the estimation model has been explained very well by Aviraj Singh in his post Effort estimation for a Globalization project. Being a database person I always look at it from a different perspective, giving a bit extra weightage to database. There are lots of granular intricacies that one has to think of before deciding the solution for supporting Unicode data in databases. It can be achieved though Unicode database i.e. Upgrading database character set to one that supports UTF-8 encoded characters as SQL datatypes like CHAR/VARCHAR2 etc. Another option can be Unicode Datatype i.e. to support multilingual data only for certain columns by using Unicode national character set to store multilingual data in SQL NCHAR datatype attributes , without making any changes to database character set. The most confusing and key decision for Globalization project is whether one should opt for Unicode database or Unicode data types for supporting multiple languages in database. This is a key decision for the success of any Globalization project and will also have a considerable impact on effort estimations. An incorrect choice at this stage can lead to a lot of rework and end hour surprises.
It is always better to clarify business and technical requirements especially as regards the need for globalization , details of languages/geographies to be supported , the size of database and data distribution , application downtime available for upgrade(for existing application) , application code language. In addition, one should also understand the future business growth plan and geographies to be supported. Once specific details are available, a team of experts should study the requirements closely and brainstorm to arrive at an optimal Unicode solution to be implemented. Following are few key things that should be considered for decision making:
Ø Business Requirements
o Do I need to provide multilingual support for existing database or need to create it from scratch? Which languages do I need to support?
o How large is the existing database? If an incremental upgrade is required, then Unicode data type may be the better option. In case it needs a complete overhaul in a big bang, then changing database character set may be the better option.
o Application/Database downtime and future business requirements may also be factors that influence the final decision. Examples could be as regards the languages that application may require to support in near future etc.
Ø Business Data
o Type of data
Does the application needs to support Asian, European etc languages; do these languages have supplementary characters? This will help in deciding optimal Unicode encoding. For example, UTF-16 provides more compact storage for Asian languages whereas European scripts are more efficient with UTF-8 encoding.
o Data distribution
If multilingual fields are distributed across all databases, it’s better to go for Unicode database than Unicode data types.
o Data volume and application downtime
If the data to be migrated is huge and the application downtime does not allow for a big bang migration, it’s always better to opt for an incremental upgrade with Unicode datatypes; especially for a existing DB where database character set (say WE8ISO8859P1) is not a subset of any UTF-8 encoding. In this case database character set upgrade will require additional overhead for converting data from existing database character set to Unicode character set.
o Binary data ( BLOBS and CLOBS)
If there is a requirement to store different types of multilingual documents and search their content in BLOB data types, one should go for a Unicode database. BLOB data is converted into the database character set before being indexed. Hence, if your database character set is non-Unicode then there will be data loss if the documents contain characters that cannot be converted to the database character set.
Ø Performance
o A Unicode database comes with a performance overhead due to the non-optimal use of data storage (depending on language and database/national character set compatibility) and conversion that may be required before storing data in database. If feasible, it may be better to store multilingual (Unicode) data in NCHAR/NVARCHAR data types, without changing the database character set.
Ø Application code
o Application code too plays an important role in deciding the suitable optimal Unicode solution. For example VC /VC++ applications on MS windows may perform better with Unicode datatypes as the data lengths of wchar_t buffer in VC/VC++ match the length of SQL NCHAR data types in database. This will make data comparison more efficient and may avoid buffer overflow in client applications.
One can also consider selecting combination of Unicode database and Unicode datatypes depending on project requirements. This can be a ideal situation where the database character set (US7ASCII) of the existing database is an exact sub-set of Unicode database character set (AL32UTF8) and you have Java application code running on Windows. Both Java and Windows being better compatible with NCHAR data types of UTF-16 encoding will give better performance and will be easy to manage so national character set may be set to AL16UTF16. Database character set upgrade to superset Unicode character set will also be quite easier and faster as this will not require any conversions.
Unicode solution has a huge impact on design, implementation approach and hence has an impact on the effort estimates of a globalization project. Though it is bit difficult to generalize the globalization effort estimation framework as scope and intensity of application/database code changes will be largely driven by business requirements, it will still be better to focus on areas discussed above in the initial stage and then review the project estimates and implementation strategy accordingly.
I have tried to cover most of key focus areas for driving Unicode solution for globalization database. Any other thoughts on this are most welcome.

