

Effort estimation for a Globalization project

Effort estimation is the first step in undertaking any software project, and a Globalization project is no different. Effort estimation for a product or application that needs to be Globalized follows more or less the same principles as a regular maintenance project, yet there are no defined methods specifically for estimating the amount of I18N or L10N change required. While working on the proposal for a Globalization project for one of our clients, we faced a dilemma: adopt standard methodologies such as SMC-based or FP-based estimation, or create a hybrid and come up with our own estimation model that follows the same principles but is more tailored to globalization projects. We eventually came up with a raw estimation model that was fine-tuned over time and gave us estimates statistically in line with the results from other maintenance projects.

The first step in estimation is to understand the underlying product. Embarking on a project without complete information generally leads to disaster later. In the initial meetings with the client it is important to understand the current scope of the product. It is useful to know the target geographies where the product will be sold, the current degree of internationalization (if any), the platforms that need to be supported, the product architecture, and so on. Each requirement throws up more challenges in terms of estimation. The technical people involved in the estimation should have prior Globalization experience and understand the various I18N impact points in the code. They should be able to isolate code that needs I18N-related changes from the rest of the code. Of course, this is a very daunting task when the code base is huge, which is the typical scenario with a product; so we need tools and utilities that can find all the impact points in the code. There are static analysis tools available that can do this to a certain degree. They can help in finding the number of hard-coded strings in the product, the number of non-Unicode APIs and data types used, and so on, and produce reports that can be further analysed and used during estimation. At Infosys we use our in-house Internationalization tool, which is rule based and helps in analysing code against the specific set of rules that we set. This way the reports contain very relevant information that can be used directly in the estimation model.
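The in-house tool itself is proprietary, but the idea behind a rule-based impact scan can be sketched in a few lines. The rules below (a regex for string literals and a short list of narrow-string APIs and types) are illustrative placeholders, not the actual rule set:

```python
import re

# Hypothetical rule set, loosely modeled on what a rule-based I18N
# analysis tool might flag in C/C++ source (illustrative only).
RULES = {
    "hard-coded string": re.compile(r'"[^"]+"'),
    "non-Unicode API": re.compile(r'\b(strcpy|strlen|strcat|sprintf)\b'),
    "narrow char type": re.compile(r'\bchar\b'),
}

def scan(source: str) -> dict:
    """Count I18N impact points per rule; the counts feed the estimation model."""
    return {rule: len(pattern.findall(source))
            for rule, pattern in RULES.items()}

sample = '''
char name[32];
strcpy(name, "John");
printf("Hello %s", name);
'''

print(scan(sample))
```

In practice each rule would carry a per-hit remediation effort, so the report can be consumed directly by the estimation model.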

At the time of estimation, it is important for the architect to decide on the encoding that the product will support. This decision directly determines the impact points in the code. If the application has to support UTF-16, most of the APIs and data types in a C++ application have to be replaced with their wide-character equivalents, while if the application has to support UTF-8, only a handful of string-related APIs are impacted. The choice of encoding is critical: switching to a different encoding later, at the implementation stage, can prove very expensive and introduces risks to the quality and schedule of the project. Every encoding has its pros and cons, and these must be well debated before going ahead with the decision. If the product includes database support, the database layer should be analysed so that data flowing in and out of the database is in the required encoding. All internal and external interfaces of the application must be analysed so that data flowing between modules or applications uses an encoding the communication layer can understand. The tools that help in estimation have a limited scope; the rest depends on the expertise of the person analysing the code and design documents.
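A quick way to see why the encoding choice matters: UTF-8 is byte-compatible with ASCII, so existing narrow-string code paths largely survive, while UTF-16 widens every code unit and forces the wide-character API changes described above. A small Python illustration:

```python
# UTF-8 is a superset of ASCII: ASCII text is unchanged on encoding.
assert "Hello".encode("utf-8") == b"Hello"
# UTF-16 widens every unit, so even pure ASCII doubles in size.
assert "Hello".encode("utf-16-le") != b"Hello"

# Storage trade-off in bytes for a few sample strings:
for text in ("Hello", "café", "日本語"):
    print(text, len(text.encode("utf-8")), len(text.encode("utf-16-le")))
```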

The software estimation process breaks the requirements down into sub-requirements that are made as granular as possible. At a very granular level, if we know the number of APIs or data types we need to change, we can roughly estimate the effort required to make those changes. If we know the third-party tools the application interfaces with, we can estimate the effort required to internationalize the external interfaces or to upgrade the third-party tools to their Unicode-supported versions. Even a simple requirement like Unicode support for the UI translates into creating resource files for all locales, counting the strings that need to be externalized into those resource files, creating a library for reading and writing the resource files, and so on. In this way we estimate at very granular levels, always taking into account our past experience with similar changes and the organization-wide PCB (Process Capability Baseline) metrics. This estimation model is based on the bottom-up approach, where estimates at the very root level finally add up to give the total development effort. To this we add the usual project management and testing efforts and arrive at a final estimate.
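As a sketch of the bottom-up roll-up described above, assuming hypothetical impact-point counts and per-item effort figures (the real numbers would come from the tool reports and PCB metrics, and the testing/management percentages are placeholders):

```python
# (count, person-hours per item) -- all figures are illustrative assumptions
impact_points = {
    "non-Unicode API replacement": (350, 0.1),
    "hard-coded string externalization": (1200, 0.05),
    "resource-file infrastructure": (1, 40.0),
    "third-party interface upgrade": (4, 16.0),
}

# Bottom-up: granular estimates add up to the total development effort.
development = sum(count * effort for count, effort in impact_points.values())

testing = development * 0.30     # assumed testing overhead
management = development * 0.15  # assumed project-management overhead
total = development + testing + management

print(f"development={development:.1f}h, total={total:.1f}h")
```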

The key to the whole estimation process is understanding the product, coming up with an exhaustive list of I18N impact areas, and breaking them down into measurable entities that can be analysed manually or using tools. Like any other estimation process, this may not be very accurate at first, but after applying it to several Globalization projects, the model becomes better defined and the estimates much more accurate. I am sure there are other estimation models people have experimented with while estimating effort for Globalization projects. It would be interesting to discuss alternative models and understand the pros and cons of each.


Deciding Optimal Unicode Solution for Globalization Database

The concept of Globalization and the estimation model have been explained very well by Aviraj Singh in his post "Effort estimation for a Globalization project." Being a database person, I always look at it from a different perspective, giving a bit of extra weight to the database. There are many granular intricacies that one has to think through before deciding on a Unicode database solution. The most confusing and key decision in a Globalization project is whether to opt for a Unicode database or for Unicode data types to support multiple languages in the database. This decision is key to the success of any Globalization project and also has a considerable impact on effort estimates. A wrong decision at this stage can lead to a lot of rework and eleventh-hour surprises.

It is best to first clarify the business and technical requirements: the specific need for globalization, the languages and geographies to be supported, the size of the database and the data distribution, the application downtime available for the upgrade (for an existing application), the application code language, and the future business growth plan and geographies to be supported. Once these details are available, a team of experts should study the requirements closely and brainstorm to arrive at the optimal Unicode solution. The following are a few key things that should be considered in the decision making:

Business Requirements

• Do I need to provide multilingual support for an existing database, or do I need to create one from scratch? Which languages do I need to support?
• How large is the existing database? If an incremental upgrade is possible, Unicode data types may be the better option; if it has to be done in a big bang, changing the database character set may be better.
• Application/database downtime and future business requirements may also be deciding factors; e.g. any languages that the application may need to support in the near future should be considered during discussions.

Business Data

• Type of data
Does the application need to support Asian, European, or other languages? Do these languages have supplementary characters? This will help in deciding the optimal Unicode encoding; e.g. UTF-16 provides more compact storage for Asian languages, whereas European scripts are more efficient with UTF-8 encoding.

• Data distribution
If multilingual fields are distributed across the entire database, it is better to go for a Unicode database than for Unicode data types.

• Data volumes and application downtime
If the data to be migrated is huge and the application downtime does not allow a big-bang migration, it is better to go with an incremental upgrade using Unicode data types; especially for an existing database whose character set is not a subset of any Unicode encoding and therefore requires additional overhead to convert the data to Unicode.

• Binary data (BLOBs and CLOBs)
If there is a requirement to store different types of documents and search their content in BLOB data types, one should go for a Unicode database. BLOB data is converted to the database character set before being indexed, and there may be some data loss if the database character set is non-Unicode.
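One quick check relevant to the migration points above is whether the existing database character set maps directly into Unicode. Latin-1 does (its 256 codes coincide with the first 256 Unicode code points), so conversion is a straight widening with no mapping table; a character set like Shift-JIS does not, and needs table-driven conversion. A small illustration:

```python
# Latin-1 bytes map one-to-one onto the first 256 Unicode code points,
# so a Latin-1 column converts to Unicode without a mapping table.
legacy = bytes(range(256))
assert [ord(c) for c in legacy.decode("latin-1")] == list(range(256))

# Shift-JIS byte values do NOT coincide with Unicode code points,
# so migration requires a real (table-driven) conversion step.
assert "日".encode("shift_jis") != "日".encode("utf-16-le")
print("latin-1 is a direct subset of Unicode; shift_jis is not")
```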

Performance

• A Unicode database comes with a performance overhead, due to non-optimal use of data storage (depending on the language and on database/national character set compatibility) and the conversion that may be required before storing data in the database. If feasible, it may be better to store multilingual (Unicode) data in NCHAR/NVARCHAR data types without changing the database character set.

Application code

• Last but not least, application code plays an important role in deciding the optimal Unicode encoding for the database or data types. VC/VC++ applications on MS Windows perform better with Unicode data types, as the length of wchar_t strings matches the length of SQL Unicode data types (like NCHAR). This makes data comparison more efficient and avoids buffer overflows.
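The length-matching point can be illustrated by counting 16-bit code units, which is what a Windows wchar_t buffer or an NCHAR-style column effectively stores. Note that supplementary characters take two units (a surrogate pair), so unit counts and character counts can differ. The helper below is a hypothetical sketch:

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units, i.e. the length a Windows wchar_t
    buffer or an NCHAR-style column would need (excluding the NUL)."""
    return len(text.encode("utf-16-le")) // 2

print(utf16_units("café"))  # 4 units, same as the character count
print(utf16_units("𝄞"))     # 2 units: supplementary char -> surrogate pair
```

Sizing buffers in code units rather than characters is what keeps the C++ and SQL lengths in step and avoids the overflow risk mentioned above.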

One can also consider a combination of a Unicode database and Unicode data types, depending on project requirements. This can be ideal when the character set of the existing database is an exact subset of a Unicode encoding and the application code is Java running on Windows. Since both Java and Windows use UTF-16 natively, NCHAR data types in UTF-16 encoding will give better performance and be easier to manage. Upgrading the database character set to a superset Unicode character set will also be quite easy and fast, as it will not require any conversions.

The Unicode solution has a huge impact on the design and implementation approach, and hence on the effort estimates of a Globalization project. Though it is a bit difficult to generalize a globalization effort estimation framework, since the scope and intensity of application/database code changes are largely driven by business requirements, it is still better to focus on the highlighted areas at an early stage and review the project estimates and implementation strategy accordingly.

I have tried to cover most of the key focus areas for driving a Unicode solution for a globalization database. Any other thoughts on this are most welcome.
