Infosys delivers concept-to-market software engineering services across the engineering value chain. Our blog will discuss the latest trends in software product engineering, outsourcing, technologies, and address business challenges.


January 28, 2010

Internationalization and Performance considerations

Almost always, during the design discussions of an Internationalization project, the client asks, “So, will Internationalization have any impact on the performance of the application?” There is no denying that Internationalization does affect the performance of the application, whether the impact is big or small. In some situations the business benefits of Internationalization outweigh the performance criteria, and it then makes sense to go ahead even at the cost of some performance degradation. However, a good design can help you avoid severe performance hits.

To come up with a design that minimizes the performance impact, it is important to first understand the areas that contribute most to performance degradation. Some of these areas are listed below:

  • Character set conversions:

During Internationalization, the application is designed to support a particular encoding, which may be UTF-8, UTF-16 or any other commonly used encoding. However, this does not guarantee that the incoming data will be encoded using the same character set. To process the incoming data, the application first has to convert it to its own encoding; otherwise string processing within the application will go haywire. In a networking application, where huge amounts of data might reach the application from various kinds of devices, encoding conversions at the application’s interfaces will definitely degrade performance. The same happens if the application is expected to write its output to files that use a different encoding scheme, or to pass data to third party applications that support a different encoding. In general, character set conversions contribute to performance degradation, whether they are done at the application’s interfaces or within the application.

Choosing an appropriate encoding for the application can minimize the impact of character set conversions. If the majority of the incoming data is in a particular encoding, it makes sense for the application to support the same encoding internally, so that fewer conversions are needed at the interfaces. E.g. if the incoming data is UTF-8 or US-ASCII encoded, the application should use UTF-8 internally for better performance. If the majority of the input data is in a native encoding like Shift-JIS or EUC, it would probably be better to support that native encoding within the application, even though supporting Unicode might be the ideal choice. The choice of encoding also depends on many other factors, so a tradeoff is often required.
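
As a rough illustration of the point above, the following Python sketch converts incoming data only when its encoding differs from the internal one. The function name `to_internal`, the constant `INTERNAL_ENCODING` and the choice of UTF-8 as the internal encoding are assumptions made for this example, and the sketch trusts the encoding label supplied with the data.

```python
import codecs
from typing import Tuple

INTERNAL_ENCODING = "utf-8"  # assumption: the application works in UTF-8 internally


def to_internal(data: bytes, source_encoding: str) -> Tuple[bytes, bool]:
    """Return (bytes in the internal encoding, whether a conversion was needed)."""
    # codecs.lookup() maps aliases such as "UTF8" or "US-ASCII" to one canonical name.
    source = codecs.lookup(source_encoding).name
    internal = codecs.lookup(INTERNAL_ENCODING).name
    if source == internal or (source == "ascii" and internal == "utf-8"):
        # ASCII is a subset of UTF-8, so correctly labeled bytes pass through unchanged.
        return data, False
    # Any other encoding costs one decode and one encode per message.
    return data.decode(source).encode(internal), True


converted, needed = to_internal("日本語".encode("shift_jis"), "shift_jis")
print(needed, converted.decode("utf-8"))
print(to_internal(b"hello", "US-ASCII"))
```

If most of the traffic already arrives in the internal encoding, the fast path is taken for most messages and the conversion cost is paid only for the remainder.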

  • Memory requirements:

The memory requirements of an application generally change after it is internationalized. This affects both secondary and primary memory. E.g. keeping all the message strings in separate properties files increases disk usage as well as the access time for the messages. Also, since encodings differ in their memory requirements, the application may use more RAM because its data structures need more memory to store the internationalized data. E.g. for English and Western European languages, choosing UTF-16 over UTF-8 roughly doubles the memory requirement, while choosing UTF-32 roughly quadruples it. While UTF-16 and UTF-32 make string processing easier to some extent, it is also important to consider the performance hit due to the added memory requirements. The tradeoff is generally between memory and ease of processing.

The choice of an appropriate encoding depends on the locales that are to be supported. UTF-8 generally takes about 50% less space than UTF-16 for English or Western European languages, but it can take 50% more space than UTF-16 for some Asian scripts like Chinese, where most characters need 3 bytes in UTF-8 and 2 bytes in UTF-16. So if the majority of the data is in a Western European language, choosing UTF-8 would be the better option. Moreover, since UTF-8 is backward compatible with ASCII, it is easier to make internationalization changes. Similarly, for Far Eastern locales, choosing UTF-16 would give better performance since it uses about a third less space than UTF-8. By choosing the appropriate encoding, the memory requirements can be optimized, which leads to better performance.
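
The size differences can be checked directly. The short Python sketch below measures the same text in the three Unicode encoding forms; the helper name `encoded_sizes` and the two sample strings are chosen only for illustration, and byte-order marks are left out of the count.

```python
def encoded_sizes(text: str) -> dict:
    """Bytes needed to store text in each encoding form, without a byte-order mark."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


english = "Internationalization"  # 20 ASCII characters
chinese = "软件国际化"              # 5 CJK characters from the Basic Multilingual Plane

print(encoded_sizes(english))  # {'utf-8': 20, 'utf-16': 40, 'utf-32': 80}
print(encoded_sizes(chinese))  # {'utf-8': 15, 'utf-16': 10, 'utf-32': 20}
```

Real data is usually a mix of scripts, markup and digits, so measuring a representative sample is a safer basis for the decision than the rule of thumb alone.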

  • Message catalogs:

Typically, during the process of Internationalization, all user visible strings are moved into a message catalog or resource file. The application then has to retrieve the required strings from the message catalog using the corresponding string IDs. This loading of strings can happen at application start, which increases the loading time of the application, or while the application is running, which slows its response. In an application with a huge number of strings, this can contribute to performance degradation.

While creating message catalogs is an essential part of Internationalization, the application can be designed to minimize the performance impact. Instead of loading the entire message catalog when the application starts, it can load only the essential strings at load time and load the rest as and when the need arises. Along with this, a message caching mechanism can be implemented to enable faster access to frequently used message strings.
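
A minimal Python sketch of this design is shown below. The class `MessageCatalog`, the `loader` callback and the sample message IDs are hypothetical names invented for the example; a real application would read from its own resource files instead of the in-memory `STORE` dictionary.

```python
class MessageCatalog:
    """Loads essential strings eagerly and everything else on first use."""

    def __init__(self, loader, essential_ids=()):
        self._loader = loader  # callable: message ID -> translated string
        self._cache = {}
        for message_id in essential_ids:
            self._cache[message_id] = loader(message_id)

    def get(self, message_id):
        # A cache hit avoids going back to the resource file.
        if message_id not in self._cache:
            self._cache[message_id] = self._loader(message_id)
        return self._cache[message_id]


# Stand-in for a resource file, with a log of every load that actually happens.
STORE = {"app.title": "Editor", "menu.file": "File", "error.disk_full": "Disk is full"}
load_calls = []


def load_from_store(message_id):
    load_calls.append(message_id)
    return STORE[message_id]


catalog = MessageCatalog(load_from_store, essential_ids=["app.title"])
catalog.get("menu.file")
catalog.get("menu.file")
print(load_calls)  # ['app.title', 'menu.file']
```

The rarely needed string `error.disk_full` is never loaded in this run, and the repeated lookup of `menu.file` is served from the cache.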

  • Sorting multilingual strings:

Sorting strings is a very common feature in applications, and it does have a performance impact in a Unicode environment because Unicode sorting rules differ from non-Unicode sorting rules.

Though this impact is not very significant in most cases, the use of proper collations and collation keys can improve the performance of the application. This can be done at the application level (as in Java) or at the database level. The collation you choose can significantly affect the performance of database queries, and it also affects substring matching in queries. A collation can be chosen for the quickest possible performance or for the most accurate results, and both choices have their pros and cons. If you want accuracy, you can go with the Unicode Collation Algorithm, but it carries some performance overhead.
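
The idea behind collation keys is to derive a comparable key from each string once and then sort or compare the keys instead of re-applying the collation rules on every comparison. The Python sketch below illustrates this with a deliberately simplified key (base letters first, then accents, then the original text). It is not the Unicode Collation Algorithm, and the function name `collation_key` is an assumption made for this example.

```python
import unicodedata


def collation_key(text: str) -> tuple:
    """Simplified sort key: base letters, then accented form, then original text."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch)).casefold()
    return (base, decomposed.casefold(), text)


words = [
    "zebra",
    "\u00c9mile",   # Émile
    "apple",
    "eclair",
    "Zoo",
    "\u00e9clair",  # éclair
]

# Compute each key once; the keys could equally be stored next to the records.
keys = {word: collation_key(word) for word in words}
ordered = sorted(words, key=keys.__getitem__)
print(ordered)
```

A plain code point sort would place "Zoo" before "apple" and push the accented words to the end, whereas the keyed sort keeps "eclair" and "éclair" next to each other.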

In general, reducing the number of encoding conversions and the amount of string formatting helps to minimize the performance impact of Internationalization. The choice of an appropriate encoding also plays a very important role. In the end it is always a tradeoff between performance and ease of processing. What are your views on this subject, and how does your design team deal with it? It would be good to share some best practices for handling these issues.

January 27, 2010

Handling Data in Enterprise Mashups

Mashups are evergreen and so attract attention from all stakeholders, whether they create mashups or use them. Google Maps in particular has taken their popularity to the next level. A typical mashup is a web application that combines data or functionality from two or more external sources to create a new service. The term mashup implies easy, fast integration, frequently using open APIs and data sources. An example is the use of cartographic data to add location information to real estate data, thereby creating a new and distinct web API that was not originally provided by either source. Mashups have also found their way into enterprise business, where the term coined is “Enterprise Mashups”. Here, process comes into the picture in addition to data. If the enterprise is SOA enabled, the BPM engine can be used directly for process orchestration. An Enterprise Mashup consists of:

  • Web services
  • RSS Feeds
  • Platform services in a cloud
  • Data
  • Client Application

In this blog post, I will briefly touch upon the data part of mashups. Data in Enterprise Mashups can be in the form of:

  • XML data residing in RSS feeds or in web services
  • DB data
  • Unstructured data
  • JSON

In mashups, data is processed dynamically, so the time taken to process the data can increase the overall execution time of the mashup application. To tackle this problem, distributed computing can be applied to the different kinds of data mentioned above.

For XML and JSON data, parallel parsers can be used to build the mashup. Parsing could be multithreaded or could exploit the multicore architecture of Intel chips at the hardware level (http://www.intel.com/cd/software/products/asmo-na/eng/406212.htm). For unstructured data, on the other hand, we can use Hadoop’s HDFS and MapReduce.

Hadoop is a Java-based framework that supports distributed computing and scales very well for data-intensive applications. The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes (http://wiki.apache.org/hadoop/). One good example of an enterprise mashup is “CRM-gadget” (http://www.programmableweb.com/tag/enterprise), which searches for new accounts or validates accounts in Oracle On Demand using Google Local Search. This mashup could tap the potential of Hadoop HDFS and MapReduce to reduce the time taken to search the accounts.
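
To make the shape of the MapReduce model concrete, here is a small in-process Python sketch. It does not use Hadoop at all: a thread pool stands in for the cluster, and the function names `map_chunk`, `reduce_counts` and `word_count` as well as the sample text are assumptions made for this illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(chunk: str) -> Counter:
    """Map step: count the words in one chunk of unstructured text."""
    return Counter(chunk.lower().split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial results."""
    return left + right


def word_count(chunks, workers=4):
    # Each chunk is processed independently, which is what lets the work be distributed.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())


chunks = ["acme corp renewed", "Acme opened ticket", "globex corp"]
counts = word_count(chunks)
print(counts["acme"], counts["corp"], counts["globex"])
```

In Python, threads mainly help when the work is I/O bound; the point of the sketch is only the split, map and merge structure that a framework like Hadoop applies across many machines.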

 

To conclude, we need to build POCs and observe how dynamically splitting the data across parallel or distributed nodes can achieve almost linear speed-up. This will in turn reduce the total time taken to execute an Enterprise Mashup application.

January 24, 2010

Google File System

The Google File System (GFS) is a scalable distributed file system designed and developed by Google for distributed, data-intensive applications. GFS was born out of the need to meet Google’s rapidly growing data processing needs. The design of GFS shares many goals (e.g. concurrency, scalability, availability and reliability) with previous distributed file systems, but it departs from earlier file systems to meet the demands of the application workloads and technological environment at Google. Almost a decade later, most of Google’s applications rely on GFS to store and process data. Although Google has not published the GFS code, the design is discussed in detail in a paper titled “The Google File System”, published by Google engineers. To explore the design of GFS further, read the original paper, available at http://labs.google.com/papers/gfs.html.

In the early days, engineers at Google felt that the existing distributed file systems could not satisfy the demands of their applications, so they decided to design a new file system. Several key assumptions guided the architecture of GFS. Some of the assumptions were:
  1. Google uses thousands of storage machines built from cheap commodity hardware (typical Linux machines), and any of these machines could fail, some of them permanently. Hence the file system had to incorporate monitoring, error detection and recovery mechanisms.
  2. Google’s web applications generate and consume files ranging in size from a few hundred megabytes to several terabytes. Small files were almost non-existent. For example, the web crawlers employed by Google’s search engine continuously scan the internet and store information related to millions of web pages. Hence they needed a file system which could handle huge blocks of data.
  3. Most operations on the files were either large streaming reads (with very few random reads) or large sequential appends to the end of the file. Random writes within a file were almost non-existent. In large streaming reads, clients typically read 1 MB or more of data. Overall they needed a file system optimized for reading and writing huge chunks of data in streaming mode.
  4. The files were often used as producer-consumer queues, with multiple producers writing into the same file. Hence the file system had to provide APIs to support concurrent appends to files with minimum synchronization overhead.
  5. In the early days, Google had only the search engine, for which huge amounts of data were processed in the background (e.g. generating inverted indexes for web pages). At that time, they did not have user facing web applications like Gmail or YouTube, which require low latency. Hence they needed a file system tailored for batch oriented operations (higher throughput) rather than for latency sensitive operations (lower latency).
With these assumptions in mind, they designed and developed a file system consisting mainly of 3 components:
  1. A master server for maintaining the file system metadata. There is one active master per cluster.
  2. Chunk servers for storing the actual chunks of data. Each file is divided into chunks of 64 MB, and by default each chunk is stored as 3 replicas. For example, assume that a file “a.txt” has chunks c1, c2 and c3. Each of these chunks will have at least 2 more replicas, e.g. c1’ and c1’’, c2’ and c2’’, c3’ and c3’’. The replicas of a chunk are usually placed on different machines to ensure availability in case of machine failures. Users can override the default replication factor of 3 and specify their own replication factor for each file.
  3. A GFS client, which applications use for reading, writing or deleting data. The GFS client provides APIs like create, delete, open, close, read and write. Apart from these standard APIs, GFS provides a snapshot API for creating a copy of a file or directory tree, and a record append API for concurrent appends to the same file.
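
Because chunks have a fixed size, translating a byte range within a file into chunk indexes is simple arithmetic. The Python sketch below shows that calculation for the 64 MB chunk size mentioned above; the function names `chunk_count` and `chunks_for_range` are assumptions made for this example and are not part of any published GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as described above


def chunk_count(file_size: int) -> int:
    """Number of chunks needed for a file of file_size bytes (ceiling division)."""
    return -(-file_size // CHUNK_SIZE)


def chunks_for_range(offset: int, length: int) -> list:
    """Split a byte range into (chunk_index, offset_in_chunk, byte_count) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        index, within = divmod(offset, CHUNK_SIZE)
        take = min(CHUNK_SIZE - within, end - offset)
        pieces.append((index, within, take))
        offset += take
    return pieces


print(chunk_count(200 * 1024 * 1024))         # a 200 MB file needs 4 chunks
print(chunks_for_range(CHUNK_SIZE - 10, 30))  # a read that crosses a chunk boundary
```

A read that crosses a chunk boundary touches two chunks, which may live on different chunk servers.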

 

The master maintains the file system metadata, e.g. file namespaces and the mapping from file names to chunk locations. Chunk servers send regular heartbeat messages to the master, indicating their health and any changes in chunk status. For example, a chunk could be corrupted (this is detected using a checksum) or could hold outdated data (this is detected using the chunk version number). Whenever a chunk server dies, all the chunks present on that chunk server need to be re-replicated. The master learns of the death of a chunk server when it does not receive a heartbeat message within a configured interval. The master places chunks in such a way that the data is distributed evenly across all the machines within a cluster.
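
The heartbeat logic can be sketched in a few lines of Python. This is a toy model rather than Google’s implementation: the class `HeartbeatMonitor`, its method names and the server and chunk IDs are all invented for the example, and time is passed in explicitly so that the behavior is easy to follow.

```python
class HeartbeatMonitor:
    """Toy model of a master tracking chunk servers through heartbeats."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # server -> time of the last heartbeat
        self.chunks_on = {}  # server -> set of chunk IDs it reported

    def heartbeat(self, server, chunk_ids, now):
        self.last_seen[server] = now
        self.chunks_on[server] = set(chunk_ids)

    def dead_servers(self, now):
        # A server is presumed dead once its heartbeat is overdue.
        return sorted(s for s, t in self.last_seen.items() if now - t > self.timeout)

    def chunks_to_rereplicate(self, now, target=3):
        dead = set(self.dead_servers(now))
        live_replicas = {}
        for server, chunks in self.chunks_on.items():
            for chunk in chunks:
                live_replicas.setdefault(chunk, 0)
                if server not in dead:
                    live_replicas[chunk] += 1
        return sorted(c for c, n in live_replicas.items() if n < target)


monitor = HeartbeatMonitor(timeout_seconds=15)
for server, chunks in [("cs1", ["c1", "c2", "c3"]), ("cs2", ["c1", "c2", "c3"]),
                       ("cs3", ["c1", "c2"]), ("cs4", ["c3"])]:
    monitor.heartbeat(server, chunks, now=0)
for server in ["cs1", "cs2", "cs4"]:  # cs3 falls silent
    monitor.heartbeat(server, monitor.chunks_on[server], now=10)

print(monitor.dead_servers(now=20))
print(monitor.chunks_to_rereplicate(now=20))
```

Once cs3 misses its heartbeat window, chunks c1 and c2 drop below three live replicas and become candidates for re-replication.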

Certain metadata, e.g. file namespaces and the file-to-chunk mapping, is kept in persistent state on the master’s disk. After a crash, the master recovers by reading the metadata stored on the disk. This data is also replicated to shadow masters at regular intervals. If the master machine itself crashes and the master cannot be restarted, one of the shadow masters takes over.

GFS implements a lazy garbage collection mechanism for removing deleted data. Deleted files are not removed immediately; they are garbage collected at a later point in time. This makes it possible to undo accidental deletes, which could otherwise be costly considering the size of the data.
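
One way to picture lazy deletion is as a trash area with a grace period. The Python sketch below is a toy model under that assumption: the class `Namespace`, the `grace_seconds` parameter and the explicit `now` arguments are invented for illustration and say nothing about how GFS stores this state internally.

```python
class Namespace:
    """Toy model of lazy deletion: deleted files wait in a trash area."""

    def __init__(self, grace_seconds):
        self.grace = grace_seconds
        self.files = {}  # name -> data
        self.trash = {}  # name -> (data, time of deletion)

    def delete(self, name, now):
        # Nothing is reclaimed yet; the file is only moved out of sight.
        self.trash[name] = (self.files.pop(name), now)

    def undelete(self, name):
        data, _ = self.trash.pop(name)
        self.files[name] = data

    def collect_garbage(self, now):
        expired = [n for n, (_, t) in self.trash.items() if now - t >= self.grace]
        for name in expired:
            del self.trash[name]
        return sorted(expired)


ns = Namespace(grace_seconds=100)
ns.files = {"a.txt": b"alpha", "b.txt": b"beta"}
ns.delete("a.txt", now=0)
ns.delete("b.txt", now=0)
ns.undelete("a.txt")               # the accidental delete is reversed in time
print(ns.collect_garbage(now=50))  # nothing has expired yet
print(ns.collect_garbage(now=100)) # b.txt is reclaimed
```

Reclaiming space in a periodic background pass also keeps the delete operation itself cheap.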

A leasing mechanism is used to keep data consistent across all the replicas of a chunk. In the design described in the GFS paper, the master grants a lease on a chunk to one of its replicas, called the primary. Until the lease expires, the primary decides the order of every mutation to that chunk, whichever client it comes from. Each mutation is replicated to all the chunk replicas, and the mutations are applied to all the replicas in the same order. For example, if data blocks A, B and C are written to primary chunk c1, then secondary chunks c1’ and c1’’ also get the data in the same order, i.e. A, B and C. This ensures data consistency across all the replicas.
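
The ordering guarantee can be illustrated with serial numbers. In the Python sketch below, a primary replica assigns a serial number to each mutation and every replica applies mutations strictly in serial order, even when they arrive out of order. The classes `Replica` and `Primary` and the buffering of early arrivals are simplifications assumed for this example, not a description of the actual GFS protocol messages.

```python
class Replica:
    """Applies mutations strictly in serial-number order."""

    def __init__(self):
        self.data = []
        self.next_serial = 0
        self.pending = {}  # serial -> block that arrived early

    def apply(self, serial, block):
        self.pending[serial] = block
        # Drain every mutation that is now in sequence.
        while self.next_serial in self.pending:
            self.data.append(self.pending.pop(self.next_serial))
            self.next_serial += 1


class Primary(Replica):
    """The replica that decides the order of mutations for a chunk."""

    def __init__(self):
        super().__init__()
        self._serial = 0

    def mutate(self, block):
        serial = self._serial
        self._serial += 1
        self.apply(serial, block)
        return serial, block  # message to forward to the secondaries


primary = Primary()
secondary_1, secondary_2 = Replica(), Replica()

messages = [primary.mutate(block) for block in ["A", "B", "C"]]
for serial, block in messages:
    secondary_1.apply(serial, block)
for serial, block in reversed(messages):  # delivered in the opposite order
    secondary_2.apply(serial, block)

print(primary.data, secondary_1.data, secondary_2.data)
```

All three replicas end up with A, B and C in the same order, regardless of the order in which the messages were delivered.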

Application code is linked with the GFS client library. For any operation, the client first contacts the master to get the chunk locations and, for mutations, to find out which replica currently holds the lease on the chunk. Once the chunk locations are known, the client bypasses the master and contacts the chunk servers directly to read, write or delete the data.

Google’s publications on GFS and MapReduce (a programming model for distributed data processing) have inspired an open source project named Hadoop (http://wiki.apache.org/hadoop/HDFS?action=show&redirect=DFS). If you want to explore Hadoop, check: http://hadoop.apache.org/.

The exponential growth of the internet and the proportionate growth in data have exposed some of the drawbacks of GFS. This has prompted Google to rethink some of the initial design decisions. Some of the drawbacks of the earlier system are:
  • It was designed mainly for batch centric applications, i.e. applications which need to process huge amounts of data in batch mode and are not sensitive to latency. As Google’s search engine became immensely popular, Google added other applications like Gmail and YouTube, which are sensitive to latency. For these applications to use GFS, certain adjustments had to be made to the file system.
  • To simplify the design, GFS was implemented with a single master node, which maintains the file system metadata for the entire cluster. By initial estimates, GFS was expected to handle a few million files with sizes up to a few terabytes. But the demand for data grew from terabytes to petabytes. This increased the size of the metadata maintained by the single master, which in turn increased the processing time at the master node and limited the number of client requests that a master can handle within a given period of time.

Over the years, some of these drawbacks have been managed by tweaking the file system or the applications which use it. Engineers at Google have been working on a new distributed master system (as opposed to the single master design) to solve some of the problems of GFS. If you are interested in how the file system has evolved over the years, see this recently published ACM Queue article: http://queue.acm.org/detail.cfm?id=1594206.

January 11, 2010

Is Big Bang the right approach to Internationalization?

Over the years our project teams have matured in the way they handle the implementation of an Internationalization project, but things were not always so smooth. There were times when a project was tested and delivered to the client, only for it to refuse to work on the client’s machines, and the offshore team just couldn’t figure out why. A lot of fire fighting was then required to get things back on track and take corrective actions. Most of the problems were due to poor planning, lack of technical understanding and incorrect assumptions. Things are pretty much streamlined now, with an i18n Center of Excellence (CoE), i18n frameworks, analysis tools, POCs and best practices in place. Here I am going to recollect my earliest Internationalization experience and what we learnt from it.

Almost a decade ago, on one of our assignments, we were engaged with a Japanese client. They had an English product which they wanted us to internationalize and subsequently localize to Japanese. Internationalization was a known concept at that time, but we did not have adequate practical experience with such work. We had people who were familiar with the concept of Localization and they were brought into the team. Some relatively less experienced people were also added to the project. The product was written in Visual C++, so the objective in choosing the team was to get people with an adequate understanding of C++ and train them on Internationalization and Localization concepts.

The requirements were gathered, process documents were created and the team came up with the implementation and release plan. As with all Japanese projects, time to release was a critical factor and the offshore team did not have much time to ramp up their i18n skills. At the concept level, the team understood that anything shown in English on the UI must now be shown in Japanese. So the approach was to find all the hard coded strings in the source code and move them to an external resource file. Secondly, since the narrow-character C++ functions and data types do not support Unicode, they had to be replaced with their wide char equivalents. This means a ‘char’ variable should be replaced with ‘wchar_t’, and functions like ‘strcpy’ should be replaced with ‘wcscpy’. None of the team members understood the repercussions of making these changes. Since time was ticking away like a bomb, and analyzing the data flow in the source code to find the impacted areas would have taken too long, it was decided to follow the Big Bang approach and do a find-replace on the entire source code. Scripts were written to automate the whole process, substituting all the data types and functions with their wide char equivalents and all hard coded strings with resource bundle calls. With the approach neatly lined up, the team got busy making the substitutions and compiling the individual modules. Finally all the changes were complete and the source code was compiled. Since the Localization to Japanese was not yet done, the product was tested using the English resource files, and everything worked as expected on all the offshore machines. The product was delivered to the customer right on time. It was now time to sit back and wait for the appreciation mails to flow in.

The customer installed the product on one of their Japanese machines and tried to launch it. The application crashed. No matter what combinations they tried, the application refused to launch. The customer pressed the panic button. The offshore team could not figure out the reason for the crash. They got a machine with a Japanese OS and tried running the application on it. It worked fine. After understanding the customer’s environment, the team installed the product in a folder with a Japanese name. This time the product crashed on launch. The code was debugged and it was found that one of the replaced wide char functions was the culprit. Pointer arithmetic on data bytes had not been modified to reflect the fact that a character could now be represented by multiple bytes, and at some point this resulted in incorrect processing, corrupt data and eventually a crash. This happened because the team had followed the Big Bang approach and simply replaced all the impacted functions with their wide char equivalents without analyzing the data processing logic. It is not enough to use wide char functions; thought has to be given to how they are used as well. An extension was sought, corrective measures were taken and the project was eventually delivered in perfect working condition for the Japanese environment. The initial approach had backfired, and quite a few lessons were learnt from the experience:

  1. Have the right team - Your team might consist of people with 5+ years of experience, but when it comes to Internationalization, it is important to have a team which understands the concepts and technical aspects of Internationalization. It will shorten the development cycle and the end product will have fewer defects.
  2. Have the right processes in place - A Big Bang approach is always dangerous to start off with. A more mature implementation methodology is required. Checklists must be in place to ensure that when a particular change is made, all other changes related to that change are also dealt with. Internationalization changes can have cascading effects on other areas of the code. Changes should be made module by module or feature by feature, so that defects are caught earlier and stay contained, instead of taking the Big Bang approach and messing up the entire code.
  3. Analysis is more important than development - It is very important to have a team of experts who will analyze the source code to find all areas which need to be modified to support Unicode. It is quite possible that some functions and data types need no change because they will never handle Unicode data. In such cases replacing them with their wide char equivalents is an overhead and could contribute to a performance hit. It is also important to understand the data flow in the application, so that the code can be changed to handle encoding conversions and similar concerns in the functions or external interfaces. The memory usage of the application also increases when you support Unicode, so the code must be analyzed to increase memory allocations only in the impacted areas. The Big Bang approach checks none of these things, and it mostly leads to bloated code which uses more memory than desired and under-performs at runtime.
  4. Use the right tools - Using the right set of tools during development can speed up the development process. There are many commercial tools available in the market which can help with static analysis of the source code. Infosys has developed a set of in-house tools for Internationalization and Localization. Among other features, the tool set reduces analysis time by auto-detecting the areas in the source code where i18n changes may be required. It can also help later in assessing the i18n readiness of the product. However, it should be kept in mind that tools are not a substitute for experienced people. While they can help increase productivity, the developers should still understand the i18n concepts in order to interpret the output of the tools correctly.
  5. Do not make assumptions regarding the input data - In the scenario above, the team assumed that since the product was working with English inputs, it would also work with Japanese inputs. It is wrong to make such assumptions. A Japanese user can enter filenames in Japanese or try saving a file in a folder with a Japanese name. The code should anticipate such use cases.
  6. Have the right test environment - Not having a language translation expert in the team does not make it adequate to test the product with English data only. Doing so will definitely spring some nasty surprises later when the product is deployed in a pure Japanese environment. You should either plan for localization at the time of testing or use alternative approaches like pseudo-localization, and make sure the product is tested with Japanese strings as well.
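
Lessons 5 and 6 can be illustrated with a small Python sketch. It is not the code from the project described above: the functions `pseudo_localize`, `naive_truncate`, `safe_truncate` and `is_valid` are hypothetical, and Shift-JIS is used only as an example of a multi-byte encoding. The pseudo-localizer turns English test data into full-width characters, which is enough to expose code that assumes one byte per character.

```python
def pseudo_localize(text: str) -> str:
    """Replace ASCII letters with full-width forms so test data becomes multi-byte."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z" or "a" <= ch <= "z":
            out.append(chr(ord(ch) + 0xFEE0))  # 'A' (U+0041) -> 'Ａ' (U+FF21)
        else:
            out.append(ch)
    return "[" + "".join(out) + "]"  # brackets make clipped strings easy to spot


def naive_truncate(encoded: bytes, max_chars: int) -> bytes:
    """Buggy on purpose: treats a character count as a byte count."""
    return encoded[:max_chars]


def safe_truncate(text: str, max_chars: int, encoding: str) -> bytes:
    """Truncate by characters first, then encode."""
    return text[:max_chars].encode(encoding)


def is_valid(encoded: bytes, encoding: str) -> bool:
    try:
        encoded.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


folder = "Reports"
pseudo = pseudo_localize(folder)

english_ok = is_valid(naive_truncate(folder.encode("shift_jis"), 4), "shift_jis")
pseudo_ok = is_valid(naive_truncate(pseudo.encode("shift_jis"), 4), "shift_jis")
fixed_ok = is_valid(safe_truncate(pseudo, 4, "shift_jis"), "shift_jis")
print(english_ok, pseudo_ok, fixed_ok)
```

With English input the buggy function appears to work, which is exactly why English-only testing missed the defect; the pseudo-localized input splits a two-byte character and produces invalid data.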

The Big Bang approach is similar to cooking a dish by mixing all the ingredients into the pan at the same time. The outcome is unpredictable and in most cases will not get you the desired result. It is better to follow a systematic approach which will guarantee success as well as allow you to take corrective actions as and when something appears to be going wrong, rather than waiting for disaster to happen and start cooking all over again.
