The Infosys Labs research blog tracks trends in technology with a focus on applied research in Information and Communication Technology (ICT)


June 25, 2012

The world's first Virtualized GPU

Desktop virtualization has been around for quite a long time now. Users have been able to take advantage of the compute infrastructure of high-end machines in a virtualized fashion, while enterprises have been able to better optimize server utilization and cost across users. But until now this virtualization has been limited to hardware resources such as the CPU, RAM and hard disk of the remote machines. There was no means of virtualizing the Graphics Processing Units (GPUs). So if a gaming application or a graphics-intensive simulation had to be run on a virtual machine, only one virtual machine instance could access the GPU at a time, while the others had to wait until the GPU was freed up. There was no load distribution capability across multiple VM instances. So even if a computation-hungry GPU with gigaflops of performance was ready to take up more work, it was unable to do so. Not anymore! Welcome, Nvidia Kepler GPUs.

At last month's GTC conference, Jen-Hsun Huang, Nvidia's CEO, spoke about the latest Kepler-architecture GPUs from Nvidia. Kepler is claimed to be the world's first GPU with virtualization capability. The key feature driving GPU-accelerated desktop virtualization is the new VGX Hypervisor technology from Nvidia. It manages the GPU resources to allow multiple users to access and share the GPU hardware for their computing needs. With this feature, graphics processing on the remote machine can be offloaded from the CPU to the GPU, leaving the CPU freer for I/O and other task-intensive operations. As a result, higher user density across virtual machines is possible, and resources can be allocated dynamically as needs change. A rich, GPU-accelerated experience can be delivered remotely, something many graphics and HPC applications have long been asking for.

Below is the GPU virtualization architecture using VGX Hypervisor as shared by Nvidia.

[Image: GPU virtualization architecture with the Nvidia VGX Hypervisor]

Another interesting bit is that Nvidia VGX works with the Windows Server 2012 RemoteFX feature to accelerate DirectX 9 and DirectX 11 applications. RemoteFX shares the GPU across virtual machines, and a VGX card is expected to serve 100 RemoteFX users simultaneously.

GPU virtualization is exciting news for gamers and HPC users. Online games will get a major boost, as gaming applications can now be hosted on cloud infrastructure (there are a few such services already) with a virtualized GPU setup. They will be able to deliver impressive graphics performance, possibly close to that of consoles; of course, bandwidth and latency issues remain. Easier access to online games will bring more users on board. HPC users, meanwhile, have so far been constrained to on-premise GPUs to run their GPU-accelerated applications. This has been a concern, since buying GPU hardware and justifying the investment is still a daunting task. Virtualized GPUs available in the cloud remove this bottleneck to GPU adoption.

May 15, 2012

Monte Carlo Integration On the GPU - VEGAS Algorithm

The subject of this blog post is the first of the two research talks that I will be presenting at the GPU Technology Conference in San Jose this week. The talk is titled "Fast Adaptive Sampling Technique For Multi-Dimensional Integral Estimation Using GPUs".

Few numerical methods bring as much delight to an HPC programmer as Monte-Carlo integration does, even more so when the platform is a GPU. Their relative ease of implementation, their inherently parallel nature and the knack with which they find solutions to problems considered tough nuts to crack place these methods at the top of a statistical programmer's toolbox. Be it pricing complex derivative products in finance or areas of modern physics such as Quantum Chromodynamics, Monte-Carlo methods are often the only way a reasonable answer to the problem can be found. However, there is no free lunch. Attractive as Monte-Carlo integration is, it is not all smooth sailing. The same law of large numbers that underlies the Monte-Carlo method's success is also sometimes the reason these methods become computationally demanding and hence impractical. Frequently there arise scenarios in which the simulation just does not converge fast enough; rephrased, the number of samples required for the simulation to converge may be prohibitively large for practical purposes. It is here that variance reduction techniques come to the rescue. Variance reduction techniques exploit the structure of the problem at hand and impart direction to what is otherwise an absolutely, blindly random numerical method. VEGAS is one such variance reduction technique. It can be thought of as a hybrid of importance sampling and stratified sampling. It is adaptive in the sense that the algorithm iteratively learns the distribution of the function at hand and works towards generating random samples that closely mirror that distribution. VEGAS greatly improves the accuracy and speed with which Monte-Carlo integrals can be computed.
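For reference, the importance-sampling idea at the heart of VEGAS can be written as follows. This is the standard textbook form, not anything specific to our implementation:

\[
I = \int f(x)\,dx = \int \frac{f(x)}{p(x)}\,p(x)\,dx \;\approx\; \frac{1}{N}\sum_{i=1}^{N}\frac{f(x_i)}{p(x_i)}, \qquad x_i \sim p .
\]

VEGAS adapts p(x), through the bin widths described below, towards a shape proportional to |f(x)|, which is the choice that minimizes the variance of this estimator.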

As an example, consider the following diagram, which shows the graph of the function that needs to be integrated.

 

[Image: MCIntegration1.png]

The algorithm proceeds in the following fashion:

1) The area within the limits of integration is subdivided into equal-sized blocks called bins. Such a set of blocks can be set up using a grid, as shown in the figure above.

2) A large number of random samples is generated such that there is an equal number of samples in each bin.

3) The integral is now evaluated over each bin's samples. Bins are then weighted by the contribution they make to the integral's value.

4) Using the weights obtained in the previous step, the grid is resized to reflect those weights. The resize ensures that there are more bins in the region that forms the meat of the function.

5) We go back to step 3.

6) Steps 3, 4 and 5 are repeated until the necessary confidence interval is achieved.

Grid resizing is shown in the picture below.

 

[Image: MCIntegration3.png]
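To make the resize in steps 3 and 4 concrete, here is a minimal host-side sketch, assuming a one-dimensional integrand and per-bin weights that have already been computed; the names and structure are illustrative, not our production code.

#include <vector>

// 'edges' holds nbins+1 bin boundaries; 'weight' holds nbins positive per-bin
// contributions. New boundaries are placed so that every new bin carries an
// equal share of the total weight, interpolating linearly inside old bins.
void resize_grid(std::vector<double>& edges, const std::vector<double>& weight)
{
    const int nbins = static_cast<int>(weight.size());

    // Cumulative weight at each old bin boundary.
    std::vector<double> cum(nbins + 1, 0.0);
    for (int i = 0; i < nbins; ++i) cum[i + 1] = cum[i] + weight[i];
    const double total = cum[nbins];

    std::vector<double> new_edges(nbins + 1);
    new_edges[0]     = edges[0];
    new_edges[nbins] = edges[nbins];

    int j = 0;
    for (int i = 1; i < nbins; ++i) {
        const double target = total * i / nbins;   // equal share per new bin
        while (cum[j + 1] < target) ++j;           // old bin containing 'target'
        const double frac = (target - cum[j]) / (cum[j + 1] - cum[j]);
        new_edges[i] = edges[j] + frac * (edges[j + 1] - edges[j]);
    }
    edges.swap(new_edges);
}

Production VEGAS implementations additionally smooth and damp the weights before resizing, so that the grid does not oscillate between iterations.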

The most straightforward strategy for running this algorithm in parallel is to evaluate the integrand at each of these sample points in parallel. The unbiased estimator that gives us the value of the integral can also be computed in parallel using a parallel reduction (sum); a rough sketch follows.
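Below is a minimal CUDA sketch of that per-sample evaluation plus reduction. The integrand f() and the flat sample layout are placeholders; the block size is assumed to be a power of two, and double-precision atomicAdd assumes a GPU of compute capability 6.0 or later (older GPUs need a different final reduction step).

#include <cuda_runtime.h>

__device__ double f(double x) { return x * x; }   // example integrand

__global__ void eval_and_reduce(const double* samples, int n, double* sum)
{
    extern __shared__ double partial[];
    const int tid = threadIdx.x;
    const int gid = blockIdx.x * blockDim.x + tid;

    // Each thread evaluates the integrand at one sample point.
    partial[tid] = (gid < n) ? f(samples[gid]) : 0.0;
    __syncthreads();

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // One atomic add per block accumulates the global sum.
    if (tid == 0) atomicAdd(sum, partial[0]);
}

// Host launch (error handling omitted):
//   eval_and_reduce<<<blocks, threads, threads * sizeof(double)>>>(d_samples, n, d_sum);
// The Monte-Carlo estimate of the integral is then (volume / n) * sum.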

The iterative nature of the algorithm means that this task has to be carried out a number of times in succession until the desired convergence criterion is met.

Since the GPU implementation and the optimizations are the primary subject of my talk on Wednesday, I will hold back on writing about those until then. I will then re-edit this post and put in the details of the strategies we took to exploit the full power of the GPUs, the challenges we faced along the way and how we overcame them.

 

May 14, 2012

Infosys @ Nvidia GPU Technology Conference 2012

Hi There, 

I am super excited to tell you that I will be presenting some of the work that the High Performance Computing team @ Infosys has been doing using GPUs at the annual Nvidia GPU Technology Conference at the McEnery Convention Center in San Jose, CA. While the conference itself kicks off in a few hours from now, the Infosys talks are scheduled for the 16th, i.e. Wednesday.

The first talk is titled "Fast Adaptive Sampling Technique For Multi-Dimensional Integral Estimation Using GPUs". This is happening in Marriott Ballroom 3 at 2:30 PM.

The second talk is titled "GPU Based Stacking Sequence Optimization For Composite Skins Using GA". This talk is happening in Room K at 3 PM.

The subject of the first talk is an algorithm called VEGAS. VEGAS is a variance reduction technique that hastens convergence of a Monte-Carlo integration. This algorithm has wide applications from Computational Finance to High Energy Physics.

The subject of the second talk is a genetic algorithm that's at the heart of aircraft wing manufacturing. Modern aircraft wings are manufactured using composite materials. Sheets of these materials have to be overlaid on top of one another such that the ability of the wing to sustain high stress in flight is maximized, while at the same time minimizing violations of the constraints that dictate what is an admissible ordering of the materials.

I will elaborate on these two talks in subsequent blog posts over the next couple of days.

If you are going to be at GTC, do make it convenient to attend these talks. I will be glad to meet you and tell you about all the good work that we have been doing in the area of GPU computing in our labs, and I would be equally excited to know about some of the coolest ways in which you are using GPUs too. Or else, leave us a comment here on the blog; I will get back to you and we can engage in some geekery.

Cheers...

April 25, 2012

Uncomplicating HPC using technology aids

OK, so we understand that HPC is noteworthy. But if we said parallel computing is complex, then achieving HPC is definitely no easy game. The industry offering to simplify HPC is growing, and HPC cluster management software is an interesting technology that is doing its bit to ease HPC adoption. To put it simply, HPC typically comes into play when there is a cluster of parallel hardware that needs to be used efficiently, and cluster management becomes crucial in order to use and administer the cluster effectively.


Amongst the key players in the HPC cluster management space is Microsoft with its Windows HPC Server 2008. This user-friendly and powerful solution from Microsoft comprises a job scheduler, MPI support and cluster administration, including monitoring facilities, for a multicore environment. Built on the 64-bit Windows Server 2008 OS, HPC Server can scale to thousands of processing cores, efficiently scheduling jobs on the cluster and providing a user-friendly console to monitor and manage it. As a scheduler, it can balance the load based on one of these resource granularities in the HPC cluster:
1) Node wise
2) Socket wise
3) Core wise


HPC Server comes as a free add-on to Windows Server 2008 R2 and is very handy for easily bringing HPC to an embarrassingly parallel application that aims to leverage the full power of the underlying cluster. But wait, let me clarify. Sequential applications whose operations are embarrassingly parallel can be HPC-enabled by employing HPC Server. When there is inherent parallelism in a sequential application, it can run effectively on an HPC cluster with the help of HPC Server, without having to be rewritten as a parallel program. That's a treat, I must say. I hope you find this as awesome as I do.

April 23, 2012

Is Parallel Computing HPC?

Oftentimes I use the terms parallel computing and High Performance Computing rather loosely, interchanging the two and substituting one for the other. But for the purpose of clearly understanding both, it can be stated that if HPC is the end goal, then parallel computing is the means. Parallel computing is independent of HPC, meaning that the end goal of parallel computing need not be HPC. Parallel computing using supercomputers is typically what is called HPC. But with massively parallel hardware such as GPUs now commonly available, this definition seems to have been diluted a little, and colloquially speaking, parallel computing and HPC are not distinguished.


HPC is a growing, niche technology area, and it is interesting to note that the U.S. government considers it an important technology that will help U.S. businesses, primarily manufacturing, to compete effectively by accelerating innovation. For instance, Ron Bloom, special assistant for manufacturing to the U.S. President, Mr. Obama, participated in a meeting organized by the Council on Competitiveness Technology Leadership and Strategy Initiative advisory committee on HPC to discuss how HPC can help U.S. manufacturers innovate and compete more effectively in the global market. Other nations are looking to use HPC for innovation with the same enthusiasm.
HPC needs are definitely growing and here are some of the key drivers for HPC:
• Reduce computation time - There are applications so complex that it takes a day to a week to get answers. With changing business dynamics, these applications, which enable key business decisions, need to be tuned to produce their results in much less time for faster decision making. Despite optimizations, it would not be possible to get higher application performance simply because these applications are sequential.
• Real time computations - It is becoming crucial for several core business applications to deliver real time or near real time results. This is simply not possible given the sequential nature of these applications.
• High throughput - Sometimes the need is to get applications to do much more within the same time window. Again, unless the application is adapted to parallel hardware, it will simply not be possible to deliver high throughput.


HPC is slowly moving mainstream and is seeing adoption in the analytics and business intelligence space, as well as in planning and forecasting. As businesses target real time and near real time applications, HPC will become imperative.

March 9, 2012

Engineering High Performance Applications

Here is a peek into the work that the Infosys HPC research team is excitedly working on. They are studying efficient kernel composition techniques with the aim of delivering optimized application performance. To state it simply, kernels are GPU programs. The HPC industry today is busy discovering ways to write optimized kernels, but taking this to the next level means asking how best to build an application that is made of several well-optimized kernels. Can I simply bundle these highly optimized kernels to create a high-performance application?


Component-based design is a well-researched and mature area aimed at reusability. The same reusability concept can also be used to build applications in the HPC world. Using composition, it is possible to build an application that is composed of kernels. These kernels each perform a specific task and are highly optimized, making efficient use of the GPU. And since GPUs are used primarily for high performance, it becomes imperative that such a composition optimizes not just reusability but also performance.
Kernel developers characterize the performance of their kernels through a performance signature. The application designer combines these kernels with the objective that the refactored, combined kernel performs better than the individual kernels would if simply run one after another. But there is more to this than just putting the kernels together. What makes this interesting, and also difficult, is that different kernels may make unbalanced use of different GPU resources, such as the different types of memory. Kernels may also have the potential to share data. Refactoring the kernels, combining them and scheduling them suitably improves performance. The research team has studied different types of potential design optimizations and has evaluated their effectiveness on different types of kernels.
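As a simplified illustration of why composition matters (a toy example, not the team's actual technique): two individually optimized kernels that each stream the same array through global memory can be fused so that the intermediate value stays in a register, halving the global-memory traffic.

__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                    // one read and one write of x
}

__global__ void add_bias(float* x, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b;                    // another read and write of x
}

// Fused kernel: one global-memory read and one write instead of two of each.
__global__ void scale_add_bias(float* x, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * a + b;
}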

The team shares that by applying their kernel composition techniques, the composed application's performance increases considerably compared to just naively tying the kernels together.


Now, I think this is going to be very useful soon, when the focus shifts from developing isolated GPU programs to building applications that consume these individual high-performing computation units. Going by the evolution of software engineering, which began with writing small programs and has reached present-day SOA, I am quite sure that HPC and parallel computing will gain enough momentum to propel software engineering methodologies for HPC. What say?

February 29, 2012

OpenCL Compiler from PGI for multicore ARM processors

Here is some great news for those looking to accelerate applications on the Android platform using OpenCL. The Portland Group (PGI) has announced an OpenCL framework for multicore ARM-based processors. What this means is that we now have an OpenCL compiler that targets ARM-based CPUs as a compute device, in addition to the existing ones for x86 CPUs and GPUs. With this announcement, PGI OpenCL becomes the first OpenCL compiler for Android targeting multicore ARM processors.

OpenCL being an open-standard programming model for heterogeneous processor systems, developers can now build portable multicore applications that run across various mobile platforms using PGI's new framework. The initial release supports the OpenCL 1.1 Embedded Profile specification and is currently targeted at ST-Ericsson NovaThor ARM-based processors.

As specified by PGI, the following core components make up the PGI OpenCL framework:
1. PGI OpenCL device compiler--compiles OpenCL kernels for parallel execution on multi-core ARM processors
2. PGCL driver--a command-level driver for processing source files containing C99, C++ or OpenCL program units, including support for static compilation of OpenCL kernels
3. OpenCL host compilers--the PGCL driver uses the Android native development kit versions of gcc and g++ to compile OpenCL host code
4. OpenCL Platform Layer--a library of routines to query platform capabilities and create execution contexts from OpenCL host code
5. OpenCL Runtime Layer--a library of routines and an extensible runtime system used to set up and execute OpenCL kernels on multi-core ARM

More details on the framework can be found on the PGI site.

February 28, 2012

Is Parallel Computing a Rocket Science or Esoteric? Part 3

Having said a lot in my previous posts (Part 1 & Part 2) about the hardware evolution and intricacies that have influenced parallel computing, the question now to ask would be: is parallel computing really rocket science? Is parallel computing esoteric? The answer may be both yes and no. Bill Gates' keynote at the Supercomputing 2005 conference was titled "The Future of Computing in the Sciences"; the title seems apt, as parallel computing evolved mainly owing to the computational requirements of solving complex and advanced problems in the sciences that demanded high performance. This involved the use of huge clusters and supercomputers, and this class of computing is thus rightly named High Performance Computing (HPC). Understandably, this class of applications is aimed at solving the toughest and most convoluted problems of diverse sciences like astronomy, biology, mathematics and so on. Owing to the complex nature and specialization those subjects entail, HPC does seem esoteric here.

But with today's advances in hardware technology, we have servers approaching teraflop speeds, so the realization of a "supercomputer on your desktop" may not be too far from reality in the near future. Desktops today have multicore processors, along with languages that support porting functionality from legacy serial applications to parallel ones. These parallel languages are powerful yet simple, and intelligible to a novice programmer. So it would not be condescending to the power that parallel computing brings to say that parallel programming is becoming easier. The problem, however, rests in migrating the complex logic inherent in a legacy application while porting it from serial to parallel. Thus, owing to ever simpler paradigms for parallel programming, parallel computing is not rocket science after all, the hard HPC problems aside.

The future looks to be an extremely adventurous ride given present technology trends. We are in times of shaping new horizons and touching upon new frontiers. Let's not ostracize parallel computing thinking it to be rocket science and esoteric; let's embrace it with open arms, because there always rests a middle path for us to choose.

 

February 24, 2012

IDC Top 10 HPC Market Predictions for 2012

International Data Corporation (IDC), the premier provider of market intelligence and advisory services, has come up with its top 10 predictions for the HPC market in 2012. IDC's HPC team includes Earl Joseph, Steve Conway, Chirag DeKate, Lloyd Cohen, Beth Throckmorton, Charlie Hayes, and Mary Rolph. Their predictions offer insight into how trends in the HPC market could drive future changes and developments in this field.


1. The HPC Market Will Continue to Benefit from the Global Economic Recovery
2011 saw HPC server revenue of about $10 billion, a significant rebound to the pre-recession high point. IDC's forecast for 2012 is that this will reach $10.6 billion, with projected revenue of $13.4 billion by 2015.

2. The Worldwide High End Race Will Accelerate
The geographic breadth and diversity of the HPC vendor market has increased: India, China, France, Italy, Russia and the US are all now in this space. North America still leads in HPC server share with 45%, and US vendors account for 94% of high-end ($500K/system) revenue. With the largest supercomputers now costing $100-400 million, there will be increasing pressure from political circles to justify the ROI. (Did we hear Japan?)

3. Exascale Decisions Could Shift Future Leadership
IDC predicts that nations which under-invest in exascale software will lose ground in the HPC market. Improvements and advances in software and tools for the effective use of HPC platforms will be more important than hardware progress. Hence we are seeing an increasing number of vendors entering the HPC software market, thereby driving commoditization. Maintaining an optimal balance among performance, power consumption and reliability will continue to be a challenge in architecting HPC solutions.

4. Software Leadership Will Become the New Battleground
The US has predominantly led the HPC software sector, but others are surely catching up. The European Commission is making big investment plans for HPC software and hardware, and Japan plans to invest $35-40 million in exascale software development.

5. The Processor Arena Will Become More Crowded
x86 processors remain the dominant HPC processor market with about an 82% share, and IBM Power is also a prominent player with 11%. But it is accelerators like GPUs and FPGAs that are gaining enormous ground: 28% of HPC sites worldwide are now enabled with GPU acceleration. Low-power processors such as ARM have found favour with many hardware vendors (such as Nvidia) for building heterogeneous processors. The challenge, of course, is providing programmers with the right tools and software to build applications targeting these new hardware platforms.

6. National/Regional Technology Development Will Gain Momentum
The worldwide sentiment continues to be that HPC technology is strategic and not something to be outsourced. Well, we know India built its own homegrown supercomputer back in the 1980s because it was denied access to HPC systems from abroad, and Europe and Russia are on the path to developing indigenous HPC technologies. A growing view among scientists and engineers is that credible HPC technology development can only happen in an environment that gives users more choices and avoids protectionism.

7. Big Data Methods Will Start to Transform the HPC Market, Including Storage
Existing commercial Big Data vendors are understanding the importance of HPC, and the two fields are colliding. Enormous interest is building in Big Data applications built on HPC technologies. Storage revenue will continue to grow 2-3% faster than servers. Data transfer technologies are of utmost importance to HPC application performance, so faster interconnects and improved memory designs that minimize data movement are on the feature lists of most hardware vendors today.

8. Cloud Computing Will Make Steady Progress
A fair bit of HPC adoption is happening in the private cloud, but the same is not true for the public cloud because of concerns over security, latency and pricing. The workloads suggested for public HPC clouds are those that do not have significant communication overheads. This includes pre-production R&D projects and workloads from small and medium businesses that cannot afford large data centers. Early adopters of HPC in the cloud have been the government sector, manufacturing industries, bio and life sciences, oil and gas, and financial companies.

9. There Will Be Shifting Sands in the Networking Market
InfiniBand has had more momentum in the HPC interconnect market, but Ethernet is nevertheless poised to expand its share. The HPC interconnect market is forecast to reach $2 billion by 2014. As per IDC, for the proprietary interconnect market to grow, vendors will have to differentiate on top of emerging and advanced standards to compete with Ethernet and InfiniBand.

10. Petascale Performance on Big Systems Will Create New Business Opportunities
The advances happening in the HPC server, processor, storage and networking markets have opened up opportunities for a wide class of business applications to benefit. Application software will benefit from the higher performance that can be derived from heterogeneous, power-aware systems, and Big Data methods will see wider application. On the system software side, smarter compilers and runtime systems become possible, and efficient power management is another critical design goal that can be achieved.

February 22, 2012

Is Parallel Computing a Rocket Science or Esoteric? Part 2

The preceding part spoke about the advent and history of the realm of parallel computing. This part speaks further about the evolution of parallel computing and attempts to answer the question: why is parallel computing so important all of a sudden today?

Over the last half century, parallel computing has evolved in a rather covert fashion, without the renown that it has managed to amass in the last decade. We see so many advances on the parallel computing front today that it may seem overwhelming at times, but it is not that hard to discern why there has been an upsurge on this technology frontier.

So why so much interest in parallel computing now, when it existed 50 years ago? As Prof. John Kubiatowicz says, "parallelism is everywhere": modern microprocessors carry a billion transistors, even in handheld mobile devices, and clearly one must make them work in parallel. Put another way, parallel computing is a trend today because it is forced upon us rather than being our fancy. Given the present properties of chip raw materials such as silicon, we have now hit the upper limit of making a single processor any faster than it is today. Adding more than one processor to the chip die seems to be the only answer, expanding the scope for parallel programming enormously. Another reason for this astronomical boom is the advancement and evolution of computer hardware, which has brought a new breed of processor into mainstream computing beyond ancillary graphics processing: the Graphics Processing Unit (GPU).

General-purpose computing on graphics processing units (GPGPU), which in the past involved hoodwinking the GPU into doing computation, has now evolved into GPU computing, which inherently caters to complex computation owing to changes in GPU hardware that facilitate parallel programming. Today we see the giant microprocessor manufacturers racing to make their chips massively parallel, and new parallel programming languages are finding mass appeal every day. Also, the GPU is no longer considered a subordinate; this outlook has led to a new phase in multiprocessor technology, influencing the creation of the Accelerated Processing Unit (APU).

To be Continued... 

February 20, 2012

What is your "parallel" style?

Well, starting from where I left off last time, parallel computing is a complex business, but the computing industry is working to simplify this daunting task. There is a rich variety of solutions on the shelves already: near-automatic parallelizers, accelerators, a wide collection of libraries and other high-level abstractions. There is also traditional low-level support, along with programming languages that offer rich customization and optimization. This post gives a neat classification of the various tools and programming abstractions in GPU computing.

So it's really the need, as is the case for any decision, that drives the style of the parallel code. You could do it all yourself by custom coding, or achieve parallelism using some automatic means. You could be building a completely fresh parallel program, or perhaps converting an existing serial program to a parallel one. The diagram below puts some structure around deciding your programming style:

[Image: decision diagram for choosing between custom and rapid parallelization]

I have listed only the most popular methods available today, which cater to two different objectives. It is quite possible that the need is to achieve high performance by fine-tuning the code to work very well on particular parallel hardware. This requires intimate knowledge of the target hardware. This, as shown in the diagram, is the Custom Parallelization path, which, if done correctly, can yield high performance. This path will be opted for by a mature parallel computing programmer. Such a style is most suitable for core scientific applications and for frameworks or libraries, where it is essential to achieve the highest performance possible on given hardware. On the other hand, it is possible that the intent is to achieve parallelization rapidly and that reasonable application performance is acceptable. This, as shown in the diagram, is Rapid Parallelization, which yields accelerated parallelization with perhaps reasonable performance; the performance optimization is left to the software aid or accelerator used. I have categorized the available software accelerators, programming languages, libraries and other utilities that can help achieve both the custom parallelization and the rapid parallelization objectives.
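To make the two paths concrete, here is a small sketch in CUDA of the same element-wise addition done both ways; Thrust is used purely as one example of the rapid path, and the function names are hypothetical.

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

// Rapid parallelization: a library call performs the element-wise add and the
// tuning is left to the library.
void rapid_add(thrust::device_vector<float>& x, thrust::device_vector<float>& y)
{
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                      thrust::plus<float>());          // y[i] = x[i] + y[i]
}

// Custom parallelization: the same operation written by hand, where the
// programmer picks grid and block sizes and can later tune or fuse the kernel
// for a specific GPU.
__global__ void custom_add(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += x[i];
}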

HPC is a fertile ground for research, and I am sure that there are going to be rapid advancements in the next few years. Every vendor is aiming to make parallel programming simpler, and this will spur HPC adoption greatly, especially in the enterprise world. At Infosys research too, we are building tools that will smoothen the steep learning curve of parallel computing and accelerate the conversion of sequential applications to their parallel counterparts. We are also working on the software engineering aspects of building new HPC applications.

Hmm, I have already said too much. Just like the TV sitcoms that leave the audience each day with enough anticipation to watch the next episode, I will end with this whiff of our work. In the meantime, you could think about what your parallel programming style is.

February 17, 2012

HPC tools classification

Gone are the days when only a few parallel programming frameworks were available, such as OpenMP, MPI and Threading Building Blocks. Now, with the advent of GPU computing and manycore architectures, there are many High Performance Computing (HPC) languages and tools available that help speed up our applications.

The HPC tools can be classified into 4 categories:

1. HPC Migration Language / HPC Language Extensions:

These HPC tools are basically extensions of existing languages such as C/C++. NVIDIA CUDA (Compute Unified Device Architecture) is one of them. To migrate an application using CUDA, the source code needs to be re-engineered; the algorithm must be modified so that a large number of GPU threads can be utilized to achieve the desired speed-up, as sketched below. OpenCL (Open Computing Language) is another such language extension; it is a lower-level language than CUDA. The latest addition to this category is Microsoft C++ AMP.
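A minimal sketch (not tied to any specific application) of the kind of re-engineering CUDA asks for: a sequential loop becomes a kernel in which a large number of GPU threads each handle one or more iterations.

__global__ void square(const float* in, float* out, int n)
{
    // Grid-stride loop: works for any n and any launch configuration.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        out[i] = in[i] * in[i];
    }
}

// Launch example: square<<<256, 256>>>(d_in, d_out, n);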

2. Parallel Coding Assistant:

The tools in this category help us while coding in an IDE such as Microsoft Visual Studio. Intel Parallel Studio 2011 and Intel Parallel Studio XE 2011 can be integrated with Microsoft Visual Studio. They provide features that help a programmer analyze the code to find hotspots for adding parallelism and compose the source code by adding Intel Threading Building Blocks or Intel Cilk Plus constructs to exploit parallelism. They also have features to find memory errors and threading errors. The modified source code can be executed on any multi-core CPU or on Intel's Many Integrated Core (MIC) architecture.

3. Directives based Accelerator Models:

These are programming models that help a programmer exploit parallelism by adding directives, such as C pragma directives, to potentially parallel portions of a sequential source code. The PGI Accelerator compiler from The Portland Group, Inc. (PGI) is based on such a programming model. It also provides compiler feedback for the portions that could not be parallelized owing to the dependencies involved. For computations to be performed on GPUs to get the desired speed-up, the data needs to be copied to GPU device memory and the results copied back to the CPU; this data transfer activity is taken care of by the PGI Accelerator compiler. The portion of the source code marked as a parallel region gets executed directly on the GPU device, thus accelerating the application. HMPP from CAPS Enterprise is a similar accelerator model, and OpenACC is an upcoming accelerator model supported by Cray, CAPS Enterprise, NVIDIA and PGI.

4. Library assisting in HPC migration:

There are many libraries available that make GPU programming easier. Common algorithms such as reduction and scan, which are frequently needed in GPU programming, are available in forms optimized for execution on GPU devices. CUBLAS, CUFFT and CURAND are such libraries for the CUDA platform. Thrust is a library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL) that greatly enhances developer productivity. Libra SDK is a C++ programming API for creating high-performance applications, and ArrayFire is a GPU software acceleration library.
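As a small sketch of the library route, here is a single-precision matrix multiply C = A * B done with CUBLAS; dA, dB and dC are assumed to be device pointers to column-major n x n matrices that have already been filled with data.

#include <cublas_v2.h>

void gemm_with_cublas(const float* dA, const float* dB, float* dC, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,              // m, n, k
                &alpha, dA, n,        // A and its leading dimension
                        dB, n,        // B and its leading dimension
                &beta,  dC, n);       // C and its leading dimension

    cublasDestroy(handle);
}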

The HPC tools mentioned above are not an exhaustive list. Through this post, I have tried to classify the HPC tools and write very briefly about them. More details on HPC tools in my coming posts.

Is Parallel Computing a Rocket Science or Esoteric? Part 1

My association with the field of High Performance Computing has been an intriguing journey of revelations, in which I have tried to understand the intricacies of a subject that has long stayed under the hood. It seems only recently that it has been rewarded with its much-awaited glory.

I make here a humble attempt to bring to you my understanding of this so-called "dark science", considered by many to be only an esoteric craft. I bring to you a three-part series describing the past, present and future of parallel computing through the eyes and experiences of a commoner.

You pour yourself a cup of hot, freshly brewed coffee and sink into a chair, sipping it while reading your early-morning dose of news. A common daily routine for each one of us, so what's so special? We seldom appreciate how much the trifles that surround us influence our broader picture of life. The coffee with the newspaper is a classic example of multi-tasking, or doing things in parallel. And this is exactly what we do in parallel computing, aka parallel processing: we go about doing, or trying to do, computation simultaneously.

From ancient ruins dating to around 100 BC, which gave us tablets and abacuses capable of doing computation in parallel, to the cutting-edge many-core parallel computer architectures of today, the journey has been intriguing and full of realization. Each milestone reached in this journey involved imbibing something from simple, real life to make it a breakthrough in the technology world; for example, Prof. Dave Patterson's laundry example, outlining the principles of pipelining in parallel computer architecture. Which goes to say that we all know parallel processing, aka parallel computing; it's just that we never realized we did.

Though the IBM 704, with Gene Amdahl as its principal architect, has been regarded as the first commercial breakthrough in creating a machine with floating-point hardware in 1955, Wikipedia traces the true origins of parallel computing (specifically MIMD parallelism) back to Federico Luigi, Conte Menabrea, and his 1842 "Sketch of the Analytic Engine Invented by Charles Babbage". This work can be regarded as the first treatise describing many aspects of computer architecture and programming.

To be continued...

 

February 15, 2012

Hello Parallel Computing!

Computing applications are popularly serial in nature, meaning the program is designed to run as a single stream of instructions; at any given instant there is just one program instruction being executed. Thus the instructions in a serial program are executed one at a time and one after the other, in the sequence in which they appear in the program. Given today's parallel infrastructure, it is a shame to run such serial programs on this powerful hardware; the hardware is quite underutilized. And this is exactly what parallel computing solves. To define it simply, parallel computing allows more than one instruction to be executed concurrently and hence is able to use the dedicated computing resources effectively. So, for example, if a parallel program is run on a quad-core processor, it is possible to execute four instructions simultaneously at any given instant in time. A serial program, on the other hand, will only be able to utilize one of the four cores at any instant.
Now that I (hopefully) have your attention, let me pop a question: is parallel computing as complex as it sounds? Unfortunately, the answer is yes. It is a hard job to design and develop an error-free parallel program, hard because thinking in parallel and designing parallel programs is not something most of us are trained for, and also because the process takes a good amount of time.
In the world of HPC, parallel computing is a must in order to use computing resources like multi-core CPUs, GPUs and other accelerators. And the program design varies from one computing resource to another, because each computing resource works best for a particular class of parallelism. At a high level, parallelism can be classified as task parallelism and data parallelism. To design a well-optimized parallel program, it is essential to identify whether the problem is task parallel, data parallel or a mix of both. The next step is to identify the appropriate computing resource: a data-parallel program is well suited to a GPU, while the CPU is best for task-parallel programs. With the decision on the hardware made comes the next step of actually designing the parallel program, developing it using the appropriate programming model and running it to check for correctness. This is a mammoth step and an important achievement. And then comes the star step: to quench the thirst for speed, it is essential to fine-tune or optimize the program to achieve the much-sought-after speed-up.
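To make the two classes concrete, here is a small sketch in CUDA, purely for illustration: a data-parallel kernel applies the same operation to many elements at once, while two independent pieces of work issued on separate streams stand in for task parallelism (on a CPU, the same task parallelism would typically use threads). The arrays and functions here are hypothetical.

#include <cuda_runtime.h>

// Data parallelism: the same operation applied to many elements at once.
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

__global__ void add_offset(float* y, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += b;
}

// Task parallelism: two independent pieces of work issued on separate CUDA
// streams so they can run concurrently. d_x and d_y are independent device arrays.
void run_two_tasks(float* d_x, float* d_y, int n)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads, 0, s1>>>(d_x, 2.0f, n);
    add_offset<<<blocks, threads, 0, s2>>>(d_y, 1.0f, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}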
Things get really exciting on my side of the world, the HPC world. Keep watching this space to read about ways to solve the mysteries of parallel computing.

February 7, 2012

Microsoft C++ AMP is now 'Open'

A significant announcement was made by Microsoft last week regarding C++ AMP: the technology has now been made an open specification. This was announced at the GoingNative 2012 event. It means any C++ compiler developer can now come up with an implementation of the specification and support C++ AMP features on a wide array of heterogeneous hardware.

For those new to C++ Accelerated Massive Parallelism (AMP), it is a native programming model that enables C++ code to be accelerated on data-parallel hardware such as GPUs; the data-parallel hardware is referred to as an accelerator. Just as with CUDA, using C++ AMP one can parallelize the data-intensive portions of a program on the accelerator and explicitly control the data communication between the CPU and the accelerator.

Microsoft C++ AMP is part of the Visual Studio 11 release. The C++ AMP open specification can be found here.

February 1, 2012

High Performance Computing: An Overview

High Performance Computing (HPC) is today one of the fastest-growing fields in the IT industry. In a recent survey conducted by IDC, companies across many industry verticals, such as aerospace, energy, life sciences and financial services, were asked how they would be impacted without access to HPC. An interesting result from the survey: 47% of the companies said they could not exist as a business without HPC, and another 34% said they would not be able to compete and would face issues in terms of cost and time to market in the absence of HPC. So it is obvious that HPC is one of the key technologies for every company to invest in, in order to innovate and compete.

What is HPC? HPC is a system that brings three elements together: computers, software and the right expertise to utilize them. HPC technologies are used to solve problems that have traditionally been known to be difficult to solve. When I say difficult problems, think of the complex simulation problems encountered in financial systems for risk assessment of portfolios, in the life sciences for gene sequence matching, or in the seismic imaging of the earth's sub-surface by an oil services company. These are very time-consuming applications and require enormous computational power. The need in such systems is to accelerate execution by making the best use of the computer hardware resources at hand. This is where modern multi-core and many-core processors, such as those from Intel, AMD and Nvidia, come to the aid of compute- and data-intensive applications. With 8-16 cores on an Intel multi-core CPU and somewhere around 400-500 cores on an Nvidia GPU chip, there is enormous compute power available for use; their processing power is of the order of gigaflops to teraflops. To utilize such enormous compute power, one has to design applications to use the hardware resources effectively. But most current mainstream and scientific applications have been developed using sequential algorithms and cannot effectively utilize the large computation power at their disposal. The only way to make the best use of this processing power is by porting or rewriting these applications using parallel algorithms. By multithreading the applications across the multiple cores of the processor, thereby dividing the work, enormous speed-ups can be achieved and benefits derived in terms of throughput and performance. This is the essence of HPC: break problems into pieces and work on those pieces at the same time.

With such large-capacity processors available, we need the right set of technologies to harness them effectively. This is where the 'software' element of HPC that I mentioned earlier comes to the fore. There is a vast set of HPC technologies to choose from. One could categorize them in multiple ways, but based on the architectural differences of the underlying hardware, two categories emerge: multicore CPU technologies and many-core GPU technologies.
1. Multicore Technologies: This category includes all the software tools and libraries that help utilize multicore CPU infrastructure efficiently. This class of technologies is suitable for compute-intensive workloads where the computation can be divided into individual tasks, each performing a specific function. Some of the multicore technologies include the Microsoft .NET 4 parallel framework, Windows HPC Server 2008 and Intel's Parallel Studio. Windows HPC Server is a cluster-based solution that helps divide the work into many chunks and distributes the workload to all the compute machines in the cluster; here the granularity of processing can go down to an individual core of a processor.
2. Many-core Technologies: The most prominent and revolutionary technology in this category is GPU computing, which is widely used by the scientific computing community and many industry verticals. With hundreds of processing cores on a single chip, GPUs have vast potential for parallelism. GPU computing refers to using GPUs to perform general-purpose computation instead of the graphics and visualization they have traditionally been known for. This class of technologies is suitable for data-intensive applications where computation has to be performed on large datasets and each data element can be processed in parallel. GPUs are also more energy efficient than CPUs of similar compute capacity, which makes them more attractive for HPC applications given the large amounts of data processing involved. Some of the important GPU technologies include Nvidia's CUDA, Microsoft's C++ AMP and OpenCL.

HPC today is an indispensable area for companies looking to gain competitive advantage, innovate and build products in a space once considered solely the province of big science. The wide variety of tools available in the market is helping democratize HPC usage.