Infosys delivers concept-to-market software engineering services across the engineering value chain. Our blog will discuss the latest trends in software product engineering, outsourcing, technologies, and address business challenges.

December 31, 2009

Deciding Optimal Unicode Solution for Globalization Database

Aviraj Singh has explained the concept of globalization and its estimation model very well in his post Effort estimation for a Globalization project. Being a database person, I look at it from a different perspective and give a bit of extra weight to the database. There are many fine-grained intricacies to think through before deciding how to support Unicode data in a database. One option is a Unicode database, i.e. upgrading the database character set to one that stores UTF-8 encoded characters in SQL datatypes such as CHAR and VARCHAR2. The other option is Unicode datatypes, i.e. supporting multilingual data only in certain columns by using the Unicode national character set to store it in SQL NCHAR datatype attributes, without making any change to the database character set. Whether to opt for a Unicode database or for Unicode datatypes is the most confusing decision in a globalization project. It is also key to the project's success and has a considerable impact on effort estimates. An incorrect choice at this stage can lead to a lot of rework and last-minute surprises.

 

It is always better to clarify the business and technical requirements first, especially the need for globalization, the languages and geographies to be supported, the size of the database and its data distribution, the application downtime available for an upgrade (for an existing application), and the application code language. One should also understand the future business growth plan and the geographies it will bring in. Once specific details are available, a team of experts should study the requirements closely and brainstorm to arrive at the optimal Unicode solution. The following are a few key things to consider when making the decision:

Ø  Business Requirements

o   Do I need to provide multilingual support for an existing database, or do I need to create the database from scratch? Which languages do I need to support?

o   How large is the existing database? If an incremental upgrade is required, then Unicode datatypes may be the better option. If it needs a complete overhaul in a big bang, then changing the database character set may be the better option.

o   Application/database downtime and future business requirements may also influence the final decision. An example is the set of languages the application may need to support in the near future.

Ø    Business Data

o   Type of data 

Does the application need to support Asian or European languages, and do these languages have supplementary characters? The answer helps in choosing the optimal Unicode encoding. For example, UTF-16 provides more compact storage for Asian languages, whereas European scripts are stored more efficiently in UTF-8.
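The storage difference between the two encodings is easy to check. The following is a minimal Python sketch, not Oracle code; the sample strings and the helper name encoded_sizes are assumptions chosen purely for illustration.

```python
# Illustrative sample strings (assumptions, not taken from any real schema).
SAMPLES = {
    "English": "database",
    "German": "Gr\u00f6\u00dfe",          # "Größe"
    "Japanese": "\u65e5\u672c\u8a9e",     # "日本語"
}

def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for language, text in SAMPLES.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{language}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")
```

For these samples the European strings are smaller in UTF-8, while the Japanese string is smaller in UTF-16. Real data is usually a mix, so it is worth measuring a representative extract before deciding.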

o   Data distribution

If multilingual fields are distributed across all databases, it’s better to go for Unicode database than Unicode data types.

o   Data volume  and application downtime

If the data to be migrated is huge and the application downtime does not allow for a big bang migration, it is better to opt for an incremental upgrade with Unicode datatypes, especially for an existing DB whose database character set (say WE8ISO8859P1) is not a binary subset of a UTF-8 character set. In that case a database character set upgrade carries the additional overhead of converting data from the existing database character set to the Unicode character set.

o   Large object data (BLOBs and CLOBs)

If there is a requirement to store different types of multilingual documents and search their content in BLOB data types, one should go for a Unicode database. BLOB data is converted into the database character set before being indexed. Hence, if your database character set is non-Unicode then there will be data loss if the documents contain characters that cannot be converted to the database character set.
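The kind of data loss described above can be illustrated outside the database. This is a minimal Python sketch that uses the ISO-8859-1 codec as a stand-in for a non-Unicode database character set; the function names and the sample document are assumptions, and the replacement character a real database substitutes may differ.

```python
def convert_to_charset(text, charset):
    """Encode text into charset, substituting '?' for characters it cannot represent."""
    return text.encode(charset, errors="replace")

def is_lossless(text, charset):
    """Return True if text survives a round trip through charset unchanged."""
    try:
        return text.encode(charset).decode(charset) == text
    except UnicodeEncodeError:
        return False

# Hypothetical document content containing two Japanese characters.
document = "Invoice for \u6771\u4eac office"

print(convert_to_charset(document, "iso-8859-1"))
print(is_lossless(document, "iso-8859-1"))
print(is_lossless(document, "utf-8"))
```

A check like is_lossless, run over a sample of the documents, gives an early indication of whether the current character set can hold the content at all.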

Ø  Performance

o   A Unicode database comes with a performance overhead, due to non-optimal use of data storage (depending on the language and on database/national character set compatibility) and to the conversion that may be required before data is stored. If feasible, it may be better to store multilingual (Unicode) data in NCHAR/NVARCHAR datatypes without changing the database character set.

Ø  Application code

o   Application code also plays an important role in deciding the optimal Unicode solution. For example, VC/VC++ applications on MS Windows may perform better with Unicode datatypes, because the data length of a wchar_t buffer in VC/VC++ matches the length of SQL NCHAR datatypes in the database. This makes data comparison more efficient and may avoid buffer overflows in client applications.

One can also consider a combination of a Unicode database and Unicode datatypes, depending on project requirements. This can be ideal when the database character set of the existing database (US7ASCII) is an exact subset of the Unicode database character set (AL32UTF8) and you have Java application code running on Windows. Both Java and Windows work well with NCHAR datatypes in UTF-16 encoding, which gives better performance and is easier to manage, so the national character set may be set to AL16UTF16. Upgrading the database character set to a superset Unicode character set is also easier and faster, as it does not require any data conversion.
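Why a move from 7-bit ASCII to UTF-8 needs no conversion, while a move from an 8-bit Western European character set does, can be shown with a small Python sketch. It uses Python's ascii and iso-8859-1 codecs as stand-ins for US7ASCII and WE8ISO8859P1; the helper name same_bytes_in_utf8 is an assumption.

```python
def same_bytes_in_utf8(charset, code_points):
    """Return True if every code point encodes to identical bytes in charset and in UTF-8."""
    return all(
        chr(cp).encode(charset) == chr(cp).encode("utf-8")
        for cp in code_points
    )

# All 128 ASCII characters are stored byte-for-byte the same way in UTF-8.
print(same_bytes_in_utf8("ascii", range(128)))        # True

# ISO-8859-1 characters above 127 take one byte there but two bytes in UTF-8.
print(same_bytes_in_utf8("iso-8859-1", range(256)))   # False
```

When the check is true for the whole source character set, existing data is already valid in the target encoding, which is what makes the superset upgrade quick.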

The choice of Unicode solution has a huge impact on design and implementation approach, and hence on the effort estimates of a globalization project. It is a bit difficult to generalize the globalization effort estimation framework, as the scope and intensity of application/database code changes are largely driven by business requirements. It is still better to focus on the areas discussed above in the initial stage and then review the project estimates and implementation strategy accordingly.

I have tried to cover most of the key focus areas that drive the Unicode solution for a globalization database. Any other thoughts on this are most welcome.

December 30, 2009

Google Public DNS

A month ago, Google announced the release of Google Public DNS, a free DNS (Domain Name System) resolution service. DNS is used to translate human-friendly computer names into IP addresses. When a user types the name of a website, the Domain Name Servers convert this name into an IP address, and the user's machine uses this IP address to send requests. A DNS network contains a set of servers which maintain a cache of domain name to IP address mappings. Usually these Domain Name Servers are maintained by your Internet Service Provider (ISP). With the Public DNS service, Google wants to provide an alternative to your ISP’s service. Public DNS leverages the existing infrastructure used by Google’s search engine, which uses crawlers to scan through millions of websites. The DNS information cached by these web crawlers is used by Public DNS. A company named OpenDNS already offers a similar, popular DNS resolution service.

These DNS services claim to be faster (by caching relevant DNS information and hence speeding up page retrieval) and safer (by preventing spoofing and denial of service (DoS) attacks) than your ISP's service.

Delay in loading a webpage can be caused by factors like geographical distance between the client and the resolving servers (which can result in a longer round trip time, or packet loss due to network congestion), cache misses (where a resolving server does not have information about the requested domain name and needs to recursively query other servers to get it) and heavy load on resolving servers due to under-provisioning or denial of service attacks (deliberate overloading of servers by malicious users, to deny service to genuine users). Public DNS claims to mitigate these delays with the following approaches:

1.     Adequate provisioning of servers to handle both the genuine requests and denial of service attacks.

2.     Usually DNS lookup queries are load balanced amongst several name resolving servers. If the resolving servers are over-provisioned (as described in point 1, over-provisioning is necessary to withstand DoS attacks) and the load balancer selects servers randomly, different servers can end up with entirely different sets of cached information (a fragmented cache). This results in a high percentage of cache misses and hence increased traffic between the servers, especially for popular domain names (remember that whenever a server cannot find the requested information in its cache, it has to query other servers). Public DNS handles this problem by splitting servers into two categories. One category of servers uses a global cache which contains popular domain names (e.g. Google.com). Since popular names are requested frequently, this global cache stays fresh at all times, resulting in quicker resolution. The other category of servers uses a local cache (i.e. each server maintains its own cache), which holds less popular domain names. Since these less popular domain names are requested infrequently, cache misses will not result in increased network traffic. But to serve these less popular domain names as efficiently as popular ones, Public DNS always forwards requests for a given domain name to the same server. For example, a request for www.indya.com is always forwarded to server A, and a request for www.sify.com is always forwarded to server B. So, if users request www.indya.com repeatedly, the cached information at server A results in quicker resolution.

3.     To ensure faster resolution of domain names, Public DNS pre-fetches and periodically refreshes names irrespective of whether users request them. This is implemented using an offline component which periodically selects and ranks domain names based on factors like popularity and hit rate (the number of times a name is requested). Another runtime component resolves these pre-fetched names and refreshes them based on their time to live value. This ensures that frequently requested domain names are served quickly (even if they are not universally popular domain names like www.google.com).

4.     Google hosts Public DNS in its data centers across the world and routes requests to the geographically closest mirror sites (e.g. google.co.in for requests from India), thus resulting in a faster browsing experience.
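The idea in point 2 of always sending the same domain name to the same server can be sketched with a hash of the name. This is only an illustration in Python and not Google's actual implementation; the server names and the function pick_server are hypothetical.

```python
import hashlib

# Hypothetical pool of resolving servers that each keep a local cache.
SERVERS = ["server-A", "server-B", "server-C"]

def pick_server(domain, servers=SERVERS):
    """Map a domain name to one server so that repeat requests hit the same cache."""
    # DNS names are case-insensitive and may carry a trailing dot, so normalize first.
    normalized = domain.lower().rstrip(".")
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]

for name in ["www.indya.com", "www.sify.com", "WWW.INDYA.COM"]:
    print(name, "->", pick_server(name))
```

A plain modulo mapping reshuffles most names whenever the pool size changes, so a production system would more likely use something like consistent hashing. The sketch only shows why repeated requests for one name land on one cache.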

Another consideration for a DNS service is security. DNS servers can become targets of spoofing (redirecting users to malicious sites) and denial of service (DoS) attacks. Public DNS has implemented the following approaches to counter these threats:

1.     To prevent the DoS attacks:

a.     Public DNS enforces rate control over the amount of traffic that can be directed to other name servers. Thus it is not possible for attackers to flood name servers with a high volume of malicious traffic. The rate control is also enforced on the responses that are sent back.

b.     To prevent amplification attacks, the response traffic is limited by applying a “maximum average amplification factor” to each client IP. (Amplification attacks exploit the high response-to-request ratio of name servers. Attackers can inject large responses into a name server’s cache, thus flooding the network with traffic.)

If requests or responses exceed any of the above-mentioned limits, an error is returned. In some cases, no response is sent for such requests.

2.     To prevent cache poisoning, basic validity checks, like rejecting the malformed responses or responses which don’t match the attributes of the requests (e.g. source IP, port), are enforced.

3.     To make it difficult for attackers to predict and match a combination of name servers, ports and query names, these attributes are randomized. For example, requests are sent out on different port numbers and to different name servers (not always to the nearest name server) to add some unpredictability. Also, the letter case in the queried domain names is varied to prevent forged responses, for example wwW.gooGLE.com or WwW.gOoGlE.cOm.

4.     To prevent attackers from injecting multiple duplicate requests for the same name resolution, Public DNS does not allow more than one request with the same query attributes (port number, destination IP).
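The case randomization described in point 3 can be sketched as follows. This is a simplified Python illustration of the idea, not the resolver's real code; the function names are assumptions, and it relies on a genuine name server echoing the query name back with its case unchanged.

```python
import random

def randomize_case(name, rng=random):
    """Return name with each letter independently upper- or lower-cased."""
    return "".join(
        ch.upper() if rng.random() < 0.5 else ch.lower()
        for ch in name
    )

def response_matches(sent_name, echoed_name):
    """Accept a response only if it echoes the query name with the exact same case."""
    return sent_name == echoed_name

query = randomize_case("www.google.com")
print(query)
print(response_matches(query, query))               # genuine server echoes the case
print(response_matches("wwW.gooGLE.com", "www.google.com"))  # forged guess is rejected
```

An attacker forging a response now has to guess the case pattern in addition to the port and the query ID, which makes a blind forgery much less likely to be accepted.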

If you want to try out Public DNS, follow the instructions mentioned at: http://code.google.com/speed/public-dns/docs/using.html

To try out the free basic version of OpenDNS, check http://www.opendns.com/start/.

December 11, 2009

Don’t think local, think locale

Imagine yourself going to Japan to open a restaurant. Your market research says that your burgers are going to sell like hot cakes there, so you have planned a major investment and drawn up plans for expansion. You land at Narita airport and are absolutely clueless about how to get out of there. You look around and find that all directions and signs are in Japanese. You try to ask for directions, but all you get is blank stares because no one understands English. Somehow you manage to find your way out and get busy with your work. After a lot of hard work, you finally open your restaurant, but you don’t find many people walking in. Your business goes dry and it’s difficult to survive with so much local competition around. What is really going wrong? Didn’t your market research say that you are bound to succeed?

This is a big dilemma for a lot of entrepreneurs when they try to enter emerging markets. You need to cross the language and cultural barrier in order to succeed beyond your neighborhood. If you open a restaurant in Japan, you have to ensure that your menu is customized for local tastes. You have to ensure that you have a menu in Japanese as well. All posters and signboards inside and outside your restaurant must be in Japanese, else how will Japanese people know what you are trying to sell? McDonald’s sells its burgers in many countries, but it has customized them for each target market. While you may find a vegetarian burger in India, you will probably not find it in Japan. Instead they have a teriyaki chicken burger in Japan which they don’t sell in Canada. Over the years, companies such as McDonald’s, Microsoft, Apple and IBM have realized the importance of customizing their offerings for global markets.

Localization is critical when entering new markets. Localization is more than just translation of your user interfaces or help documents. It also takes into consideration cultural, legal and regulatory issues. It makes sense to invest in global markets only when you foresee an ROI from the opportunity. So which are the emerging markets in 2009 and beyond? Which geographies should you target to increase your revenues? These are the most common questions that come up, and firms like Forrester and Gartner have extensive market research data to answer them. Research by Byte Level Research says that non-English speakers will represent 79% of all internet users by 2010. So which language will dominate the internet in future? German is currently the most popular language on the internet, but Spanish is expected to overtake it and Chinese (simplified) is quickly gaining ground. According to the World Intellectual Property Organization (WIPO) and the International Telecommunications Union (ITU), Chinese will outrank English as the most-used language on the internet. Today the number of online users from China and Europe far exceeds the number from the United States.

The numbers don’t lie. All these statistics have been generated by collecting information and data from hundreds of small, medium and large corporations across the globe. Many companies are expanding their already established businesses into other geographies. New players in the market are already making plans to expand into other geographies. According to Byte Level Research data from 2007, on average 80% of the companies interviewed see their competitors taking their business global. It is imperative for them to be proactive in such a scenario and make plans for going global themselves. It’s a case of ‘Go global or perish’. Intel generates around 70% of its revenue from outside the US. Microsoft makes around one third of its revenue from outside the US. Google had already reached the 50% mark by 2008. As I have mentioned in one of my previous blogs, it’s not enough to be the best in your neighborhood anymore. Don’t think local, think locale…

December 09, 2009

Green Computing and Virtualization

While contemplating the importance of virtualization in achieving green computing standards, especially in organizations hosting data centers, I came across an interesting article here on how heat emitted by data centers can be used to warm homes in Scandinavian countries.

The article mentions that in a typical data center only 40-45% of the energy is used in actual computing, while the rest is used to power the equipment that cools the servers. Besides, data centers run by a search giant already seem to be using around 1% of the world's energy, and their demands seem to be rising fast every year.

This is interesting information in the context of the fact that the world has finally woken up to the need to put their heads together to resolve issues related to global warming at the UN Climate Summit at Copenhagen, Denmark.

There is a general feeling that the summit shows the world is conscious that technological advancements have contributed to increased emissions, and that the need of the hour is another set of eco-friendly technology advancements. This means 'virtualization' is going to be part of everyday conversation in enterprises, as the advantages of 'virtualizing' the data center are well known. As a by-product, the focus is also going to be on developing tools that make virtualizing the infrastructure, and managing it, much easier.

December 08, 2009

Testing Cycles and Product Stability

Years of experience in software development have not helped reduce anxiety levels whenever a project enters the 'Testing' phase of the Software Development Life Cycle. It feels the same as when your parents accompanied you to school to collect your results at the end of term examinations. There is always the anxiety of whether the output of the design and coding phase will withstand the test case bombardment. Besides, you would also be anxious to know whether there are enough test cases to traverse all paths of the software while testing functionality and QoS parameters, so as to be confident that all loose ends are covered. An even more bitter pill to swallow is a situation where you realize that a number of your tests are failing, and you have to get back to the customer with the bad news and request an extension. But even after that, how do you guarantee that your product will be defect free? How do you guarantee that you have not unknowingly introduced defects while fixing the known ones? Experts would of course recommend a 'thorough peer code review', but even after that, you would still need a 'tested and passed' certificate before the software is passed to the customer for acceptance tests.

I have often felt the need to be able to estimate upfront the number of test cycles that should be executed during the 'Testing' phase. What projects often do is estimate for two cycles of testing as an approximation. In Test-Cycle 1, all the test cases are run and all the defects in the software are assumed to have been detected. These defects are then fixed. Test-Cycle 2 is run to make sure that all the defects detected in Test-Cycle 1 are fixed. It is quite common that some defects are not completely fixed, especially when the number of defects detected is large. In addition, new defects may have crept in as a by-product of the defect fixes. Naturally, this calls for another test cycle once the Test-Cycle 2 defects are fixed, and you end up squeezing in a Test-Cycle 3 with the hope that it will be defect-free. (By-product defects are often seen in GUI-intensive products where, for example, regression defects are common when controlling state changes of UI elements.)

One can relate the above situation to what Bruce Powel Douglass mentioned during his July 2009 visit to the Bangalore campus. In a different context, Bruce had said that "The number of defects in the software is proportional to the number of defects that you know about", which means testing and quality have to be a continuous process.

Every project in its estimation phase has to delve deeper into the possible ways to counter deficiencies in development during the SDLC. One such problem is to estimate the number of defects that would need to be ironed out during the various phases of the project, and the effort required to counter them, so that the goal of a zero-defect product is achieved before delivery to the customer. The measure is a way of checking that the output of a particular phase has been reviewed and tested and that it will contribute to the final quality goal.

Most leading companies have software quality management metrics, created by observing trends in defect data across projects in similar technologies with similar complexities. One such metric would be, for example, the number of defects per KLOC of code that are expected to be detected during the complete life cycle (in today's world of function point based estimation, this might be a little primitive).

Let us assume that, as per the expected standards, the total number of defects per KLOC for both GUI and non-GUI applications (developed in C or C++) ranges between 15 and 20. Taking 16 defects per KLOC from this range, for an application of estimated size 25 KLOC, the total number of defects that could be expected over the complete life cycle of the project is 25 * 16 = 400.

The total number of defects estimated to be detected during the SDLC is divided across the various phases of the project. The confidence that justice has been done to the review of the product, and that the product is measuring up to the required quality, can be obtained by measuring defects at the end of each phase. Assuming a general spread of detected defects of 30%:40%:30% across the Requirements and Design, Coding and Testing phases (please check your company metrics for the spread, if any), 30% of the estimated defects still exist when the software enters the Component/Integration Testing phase.
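The arithmetic above can be written down as a small calculation. This is a minimal Python sketch using the figures from this post (25 KLOC, 16 defects per KLOC, a 30:40:30 spread); the function name estimate_defects is an assumption, and your own company metrics should replace the numbers.

```python
def estimate_defects(size_kloc, defects_per_kloc, spread_percent):
    """Return (total defects, defects expected per phase) for a given percentage spread."""
    if sum(spread_percent.values()) != 100:
        raise ValueError("phase percentages must add up to 100")
    total = size_kloc * defects_per_kloc
    per_phase = {
        phase: total * percent // 100
        for phase, percent in spread_percent.items()
    }
    return total, per_phase

spread = {"Requirements and Design": 30, "Coding": 40, "Testing": 30}
total, per_phase = estimate_defects(25, 16, spread)

print(total)                 # 400
print(per_phase["Testing"])  # 120 defects expected to remain at the start of testing
```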

The Component Testing (CT) or Integration Testing phase is high on the list of defect detection stages, simply because it integrates a number of independently developed modules and is more or less the environment in which the user is expected to use the application. This is the last stage of validation and verification before the application passes out of the hands of the development team.

Ideally, CT should be divided into cycles to make sure that defects creeping in because of defect fixes are caught and cleaned.

It is quite common to see three cycles of CT being planned for certain products. The first cycle should aim at detecting 60-65% of the defects estimated for the CT stage. Assuming that your defect estimate calculations reveal that there are still 120 defects lingering in the software as you begin your CT, you should be detecting roughly 72 to 78 defects in the first cycle of CT and the remaining 42 to 48 defects in the second cycle. The third cycle of CT should be more of a confirmation cycle, to confirm that there are no regression defects.

But is this approach enough to guarantee good software? Pressed for time, you are assuming and hoping that you will have neatly fixed the 40-odd defects detected in the second cycle of CT, thus allowing CT-3 to be more of a confirmation cycle, which is quite a risk!

A better approach would probably be to consider the "divide by two" approach to determining the number of cycles of CT. In this approach, the number of defects estimated to be lurking in the software at the start of CT, may be recursively divided by 2 (until the defects are single digit) to determine the number of cycles.

Hence, in this case – 120, 60, 30, 15, 7 are the number of defects expected in each cycle with the sixth cycle being a confirmation cycle. However, in this approach care must be taken to change the test cases being executed – and also to focus more on problem areas (considering the 80-20 rule) so as to not make testing redundant and wasted by repeatedly testing defect free areas.
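The "divide by two" sequence can be generated mechanically. This is a minimal Python sketch of the rule as described here, using integer division and stopping at the first single-digit value; the function name halving_cycles is an assumption.

```python
def halving_cycles(initial_defects):
    """Return the expected defect count for each CT cycle, halving until single digit."""
    expected = []
    remaining = initial_defects
    while remaining >= 10:
        expected.append(remaining)
        remaining //= 2
    expected.append(remaining)  # the first single-digit value ends the sequence
    return expected

cycles = halving_cycles(120)
print(cycles)           # [120, 60, 30, 15, 7]
print(len(cycles) + 1)  # 6 cycles in total, the last being the confirmation cycle
```

Running it with the estimate from your own project gives a quick first figure for how many CT cycles to plan and budget for.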

The "divide by two" principle might not be acceptable in all situations, and it is understandably difficult to argue for it in a development process that contains strong design and code review phases. But it is a good method to foresee upfront how much time and effort you will need during your testing phase to deliver an efficient and defect-free product.

P.S. External Testing is a phase where the development team completes its CT and hands the application to an external team within the unit for independent, unbiased testing. This means that even after an OK from the development team, there could still be defects lingering in the system, hidden from familiar eyes, which only become visible to an unrelated tester. This team should not be involved in any way in the design or development of the product. The test cases developed by this team should be based completely on the FS/FD.
