Genome sequencing technologies - Part 2
In this second part, following my previous blog post, I discuss the data analytics challenges faced by various genome sequencing technologies.
Big data, as we understand it, needs specialized processing algorithms to make sense of it within a reasonable time frame. The size of big data is not a static quantity; it constantly grows, from terabytes to multiple petabytes for a single data set.
Life sciences and biomedical research, as a growing branch of science, are not lagging behind; in fact, they already face obstacles in processing and analyzing the large volumes of data produced by next-generation sequencing (NGS) technologies, which have been estimated to be growing at a rate of 1.2-2.4 exabytes per year.
The NGS market is one of the most rapidly evolving markets, with estimated growth of 16.3% per year. In 2012, the global NGS market was valued at $1.3 billion, and it has been estimated to reach $2.7 billion by 2017. For this reason, many US government funding agencies are funding projects, both in academia and in industry, related to next-generation sequencing platforms and technologies.
In spite of this staggering growth of the biomedical research industry, the storage, processing, management, analysis, and interpretation of the data generated by NGS technologies remain a big challenge. This exponential growth of the genomics market is mostly driven by the contributions and collaboration of scientists and data scientists in developing innovative systems and processes that lead to high-throughput data generation with increased accuracy and affordable cost.
Another challenge lies in turning this large amount of life sciences research data into insights in the lab and therapies in the clinic. To design big-data platforms for pharmaceutical R&D and succeed, the life sciences industry must overcome several challenges. Not only do these massive datasets need to be managed and analyzed, but the insights acquired must also be delivered in intuitive forms that help healthcare professionals and patients understand them.
Integration of different data sources in a biologically meaningful way is another big challenge. Major large-scale biological integration efforts are merely collections of datasets and need sophisticated data analysis and signal detection to yield insights. Essentially, biologists and programmers should work closely together to facilitate the development of tools and systems that can improve biological understanding. In life sciences research, scientific data comes in varied formats and follows different file conventions, which cannot all be defined in advance; software developers therefore need to build systems that are highly flexible and easily updatable.
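One way to keep such a system flexible is to decouple the core pipeline from the individual file formats. The sketch below (illustrative only; the registry design and the minimal FASTA reader are my assumptions, not a description of any particular tool) shows how new formats could be supported by registering a parser function, without touching the rest of the code:

```python
# Hypothetical pluggable parser registry: maps a file extension to a
# parser function, so new formats can be added without changing callers.
PARSERS = {}

def register(extension):
    """Decorator that registers a parser for a given file extension."""
    def wrap(fn):
        PARSERS[extension] = fn
        return fn
    return wrap

@register(".fasta")
def parse_fasta(text):
    """Very small FASTA reader: returns {sequence_id: sequence}."""
    records, seq_id, chunks = {}, None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if seq_id is not None:
                records[seq_id] = "".join(chunks)
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line.strip())
    if seq_id is not None:
        records[seq_id] = "".join(chunks)
    return records

def load(filename, text):
    """Dispatch to whichever parser is registered for the extension."""
    ext = filename[filename.rfind("."):]
    return PARSERS[ext](text)

demo = ">geneA some description\nACGT\nTTGA\n>geneB\nGGCC\n"
print(load("sample.fasta", demo))  # {'geneA': 'ACGTTTGA', 'geneB': 'GGCC'}
```

Supporting another format (say, a tabular variant file) would then mean writing one new parser function and registering it, leaving the pipeline itself untouched.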
Data visualization of life sciences data and results is an integral part of solving any biological question. Without visualization, researchers depend on attribute values, which most often represent basic statistics and relationships between different attributes. These can surface basic information hidden in the data but are often too specific to reveal hidden relationships between the attributes. Visualization brings new insights, which lead to hypothesis generation and validation. Better visualization techniques are the need of the hour: ones that can deal with large datasets, do not take too much time to generate results, and can reduce data size through techniques such as subsetting, dimensionality reduction, and sampling.
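As a concrete example of the sampling idea, reservoir sampling draws a uniform random subset from a stream of any length using only O(k) memory, so a dataset far too large to plot point-by-point can be downsampled before visualization. This is a minimal sketch of the standard technique, not code from any specific visualization tool:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of
    unknown length, using memory proportional to k only."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # replace with decreasing odds
            if j < k:
                sample[j] = item
    return sample

# Downsample a million "measurements" to 1,000 points before plotting.
points = reservoir_sample(range(1_000_000), 1000)
print(len(points))  # 1000
```

Because each item survives with probability k/n, the downsampled scatter plot preserves the overall shape of the data while being cheap to render.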
In spite of tremendous effort from agencies like NCBI and the FDA, research and clinical data are still unstructured in comparison with datasets from other branches of science. Data scientists often find that biological data is generally not big data in the technical sense, but it is trickier to organize. Biologists work with very complex problems, combining genomic sequence data to see which genes are turned off or on, what types of RNAs and proteins are being produced, and what clinical symptoms were recorded, together with any chemical or other exposures.
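The organizational difficulty is essentially a join problem: observations about the same gene arrive from several unrelated sources and must be merged into one coherent view. The toy records and field names below are assumptions made up for illustration, but the merge pattern is the general one:

```python
# Hypothetical per-gene records from three separate sources; the gene
# symbols and field names here are invented for illustration only.
expression = {"BRCA1": {"state": "on"}, "TP53": {"state": "off"}}
proteins   = {"BRCA1": {"protein": "detected"}}
clinical   = {"BRCA1": {"symptom": "reported"}}

def combine(*sources):
    """Merge per-gene records from several sources into one dict per
    gene; genes missing from a source simply lack those fields."""
    merged = {}
    for source in sources:
        for gene, fields in source.items():
            merged.setdefault(gene, {}).update(fields)
    return merged

combined = combine(expression, proteins, clinical)
print(combined["BRCA1"])
# {'state': 'on', 'protein': 'detected', 'symptom': 'reported'}
```

Real integration is much harder than this sketch suggests, mainly because identifiers and schemas rarely line up this cleanly across sources, which is exactly the unstructured-data problem described above.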
In an effort to deal with some of these challenges, in 2012 the National Institutes of Health launched the Big Data to Knowledge (BD2K) initiative, which aims, in part, to create data-sharing standards and to develop data analysis tools that can be easily distributed. This program is still under discussion.
The last, but not least, challenge concerns the mindset of big pharmaceutical companies, which do not want to invest in improving big-data analytical capabilities because they do not see an ideal future state for themselves. Another hesitation stems from the possibility of increased interactions with regulatory authorities if they pursue this path. Pharmaceutical companies should explore by conducting small-scale pilots in order to see the value in this direction; the experience gained might provide long-term benefits toward realizing that future state.
Figure 1 Data Analytics related challenges with Next Generation Sequencing data.
Only our ability to interpret and visualize the scientific biological data generated by next-generation sequencing technologies and clinical studies will determine how successful biomedical research will be in the future. Many IT giants and social networking companies, such as Google, Facebook, Twitter, Amazon, Oracle, and Microsoft, are leaders in interpreting huge data sets. Life sciences and medicine need to implement similarly scalable systems in order to deal with high volumes of sequencing data. The life sciences industry will need to focus more on adopting the advances in data analytics and informatics to successfully address the problems arising from next-generation sequencing technologies.