The Infosys Labs research blog tracks trends in technology with a focus on applied research in Information and Communication Technology (ICT)


June 24, 2010

Parallelism - Scalability and Amdahl's Law

Scalability of a system, from a performance engineering point of view, is its ability to use additional resources judiciously and keep its performance parameters within acceptable limits. Load testing can be used to determine whether a system is able to progressively use additional hardware available to it and maintain Non-Functional Requirement (NFR) metrics at a constant level under increased user load. A performance test engineer can determine whether the software is scaling well by looking at how three parameters of the system behave, namely, Resource Utilization, Throughput and Response Time.

Ideally, with an increase in user load, the system should be able to progressively increase Resource Utilization, so a graph that plots Resource Utilization vs. User Load should have a positive slope. Likewise, for a scalable application, Throughput should also increase with an increase in user load. The response time, on the other hand, should remain more or less constant, adhering to the NFR, so its graph will ideally have a slope of zero. In a real-world scenario, however, a deviation of up to 15 percent is considered acceptable. MSDN has an excellent article on Scalability.
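The three checks described above can be sketched in a few lines of Python. This is only an illustration, not a standard tool: the function name `looks_scalable`, its parameters and the sample numbers are all hypothetical, and it assumes the 15 percent deviation is measured against the response time at the lowest load level, which the discussion above does not spell out.

```python
def looks_scalable(loads, utilization, throughput, response_time, tolerance=0.15):
    """Check load-test series against the ideal pattern for a scalable system.

    Assumption (not from the post): response-time deviation is measured
    relative to the value observed at the lowest user load.
    """
    series = (utilization, throughput, response_time)
    if any(len(s) != len(loads) for s in series):
        raise ValueError("all series must have the same length as loads")
    if len(loads) < 2:
        raise ValueError("need at least two load levels")

    # Order every series by increasing user load.
    order = sorted(range(len(loads)), key=lambda i: loads[i])
    util = [utilization[i] for i in order]
    thr = [throughput[i] for i in order]
    rt = [response_time[i] for i in order]

    def rises(values):
        # Never drops between consecutive load levels and grows overall.
        steps_ok = all(b >= a for a, b in zip(values, values[1:]))
        return steps_ok and values[-1] > values[0]

    # Response time should stay within the tolerance band around the baseline.
    baseline = rt[0]
    rt_flat = all(abs(r - baseline) <= tolerance * baseline for r in rt)

    return rises(util) and rises(thr) and rt_flat
```

For example, with made-up measurements at 100, 200 and 400 users, `looks_scalable([100, 200, 400], [20, 40, 75], [50, 100, 195], [1.0, 1.05, 1.12])` passes all three checks, while a run whose response time climbs from 1.0 to 1.3 seconds does not.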

Today's software is radically distributed, and it is important for it to be intrinsically scalable. When software is designed for parallel performance, its scalability is determined by the percentage of code that can be parallelized. No software is fully scalable, i.e., there will always be a small amount of code which cannot be parallelized. Amdahl's Law states that

Speedup = 1 / (s + p / N)

where N is the number of processors, s is the fraction of the run time (on a serial processor) spent on the serial parts of a program and p is the fraction of the run time (on a serial processor) spent on parts of the program that can be done in parallel, so that s + p = 1.

As you can see, increasing the number of processors (N) only shrinks the p / N term, so the gain is limited to the portion of code represented by 'p'. The serial part 's' is unaffected, and as N grows the speedup approaches 1 / s. Systems designed to operate in parallel environments should therefore keep the code under 's' to a minimum.
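To make the formula concrete, here is a minimal Python sketch. It assumes s and p are expressed as fractions of the single-processor run time with s + p = 1; the function names `amdahl_speedup` and `max_speedup` are hypothetical and chosen only for illustration.

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup = 1 / (s + p / N), assuming p = 1 - s."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def max_speedup(serial_fraction):
    """Limit of the speedup as the number of processors grows without bound."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if serial_fraction == 0.0:
        return float("inf")  # fully parallel code has no upper bound
    return 1.0 / serial_fraction


if __name__ == "__main__":
    # Show how the speedup flattens out for a program that is 10% serial.
    for n in (1, 2, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.1, n), 2))
```

With s = 0.1, ten processors give a speedup of only about 5.26, and no number of processors can push it past 10.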

June 2, 2010

NextGen Data Warehousing Trends - Part I

"Necessity is the mother of all inventions" - this quote holds true today as well, except that we are now starting to recognize our necessities based on the inventions that are taking shape. Data Warehousing is certainly no exception, and over the past years we have seen various avatars of Data Warehousing shaping organizations and driving their growth. To name a few - Enterprise Information Management, Operational Intelligence, Real-time/Near Real-time Data Warehousing, BI as a Service (BIaS), In-Memory Analytics, Master Data Management, etc.

And mind you, this is not going to stop here. In this series I outline a few trends that are likely to shape the Next Generation of Data Warehousing in the coming years. Each trend gets its own post, so that each platform/trend receives due credit and focus.

The trends shaping NextGen Data Warehousing are listed below; this is not an exhaustive list:
1. Data Warehouse Appliances
2. Open Source Databases, Integration and Reporting solutions
3. Advanced Analytics - Predictive Analytics
4. Massively Parallel Processors (MPP) architectures
5. In-Memory solutions with larger data caches leveraging 64-bit platforms
6. XML based/Web Services or SOA based Interfaces
7. Columnar Databases
8. Real-time Integration between Data Warehouses and Operational systems
9. SaaS and Cloud Computing transforming Data Warehousing and BI reporting
10. Multi-domain Master Data Management, Model Driven MDM solutions

On careful observation of these trends, a common focus is quite visible: improving query performance and platform scalability. In addition, there is a strong need to loosely couple the systems/platforms/applications without compromising on data integration and quality, so that existing investments can be put to the best use. The idea is to expose the services and work in a collaborative model.

Let's start with one of the trends, "Data Warehouse Appliances"; the other trends will follow in successive posts. The term "Data Warehouse Appliances" was coined by Foster Hinshaw, founder of Netezza. These appliances are typically used in large Data Mart implementations where people expect to use multiple terabytes of live data.

What is a Data Warehouse Appliance? - An integrated set of servers, storage media, operating systems, database systems and ETL/reporting/metadata software, pre-installed and configured as a Data Warehousing platform. The platform also includes the underlying networking layer.

Who provides Data Warehouse Appliances? - Teradata, Netezza, DATAllegro, Kickfire, Kognitio, IBM InfoSphere Balanced Warehouse and Oracle Optimized Warehouse, to name a few.

What are the benefits of Data Warehouse Appliances? - There are several; a few of the key ones are:
1. Out-of-the-box performance - The entire platform is delivered tuned and ready for use, with no need to go on a separate shopping spree for hardware, software, ETL, reporting environment, etc.
2. Offload demanding ad-hoc queries from your Enterprise DWH platform to the appliance, thereby freeing the Enterprise Data Warehouse for the power users
3. Single vendor - translating to a single point of contact for any administration needs, with support services handled by one support center
4. MPP Architectures help achieve high query performance, high availability & scalability options - all built in

What are the areas of application of "Data Warehouse Appliances"? - A few examples where DWH Appliances fit well: Data Marts with heavy analytical querying needs that would typically put the Enterprise Data Warehouse under performance pressure; short-term deployment projects requiring little data integration; and isolated, query-intensive, ad-hoc analytical solutions requiring terabytes of live data.

That's it for Part I; watch out for the other trends in upcoming posts.