Genome sequencing technologies - Part 2
In this second part, following up on my previous blog, I discuss the data analytics challenges faced by various genome sequencing technologies.
Big data, as we understand it, needs specialized processing algorithms in order to make sense of it within a reasonable time frame. The size of big data is not a static quantity; a single data set can range from terabytes to multiple petabytes.
Life sciences and biomedical research, as a growing scientific industry, are not lagging behind; they are already facing obstacles in processing and analyzing the large volumes of data produced by Next Generation Sequencing (NGS) technologies, estimated to be growing at a rate of 1.2-2.4 exabytes per year.
The NGS market is one of the most rapidly evolving markets, with an estimated growth of 16.3% per year. In 2012, the global NGS market was valued at $1.3 billion, and it is estimated to reach $2.7 billion by 2017. For this reason, many US government funding agencies are funding projects, in both academia and industry, related to next generation sequencing platforms and technologies.
In spite of this staggering growth of the biomedical research industry, the interpretation of data generated by NGS technologies, along with their storage, processing, management, and analysis, remains a big challenge. The exponential growth of the genomics market is mostly driven by the contributions and collaboration of scientists and data scientists in developing innovative systems and processes that lead to high-throughput data generation with increased accuracy at an affordable cost.
Another challenge lies in turning this large amount of life sciences research data into insights in the lab and therapies in the clinic. To design big-data platforms for pharmaceutical R&D and succeed, the life sciences industry must overcome several challenges. Not only do these massive datasets need to be managed and analyzed, but the insights acquired must also be delivered in intuitive forms to healthcare professionals and patients to help them understand the results better.
Integration of different data sources in a biologically meaningful way is another big challenge. Major large-scale biological integration efforts are merely collections of datasets and need sophisticated data analysis and signal detection to yield insights. Essentially, biologists and programmers should work closely to facilitate the development of tools and systems that can improve biological understanding. In life sciences research, scientific data comes in varied formats and follows different file conventions, which cannot all be defined in advance. Software developers need to build systems that are highly flexible and easily updatable; one minimal sketch of such a design is shown below.
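As a rough illustration (not from any specific pipeline, and all names here are hypothetical), one way to stay flexible about file formats is a small parser registry: new formats can be plugged in later without changing the core code. Only a FASTA parser is shown; other formats would register the same way.

```python
# Hypothetical sketch of a plug-in parser registry for varied genomics file
# formats. Names are illustrative assumptions, not a real library API.
from typing import Callable, Dict, Iterator, Tuple

PARSERS: Dict[str, Callable[[str], Iterator[Tuple[str, str]]]] = {}

def register_parser(extension: str):
    """Decorator that registers a parser function for a file extension."""
    def wrapper(func):
        PARSERS[extension] = func
        return func
    return wrapper

@register_parser(".fasta")
def parse_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from a FASTA file."""
    seq_id, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seq_id is not None:
                    yield seq_id, "".join(chunks)
                seq_id, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if seq_id is not None:
            yield seq_id, "".join(chunks)

def load_records(path: str):
    """Dispatch to whichever registered parser matches the file extension."""
    for extension, parser in PARSERS.items():
        if path.endswith(extension):
            return list(parser(path))
    raise ValueError(f"No parser registered for {path}")
```

Adding support for, say, VCF then only requires registering one more function, which is the kind of easy updatability the paragraph above argues for.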
Data visualization of life sciences data and results is an integral part of solving any biological question. Without visualization, researchers depend on attribute values, which most often represent basic statistics and relationships between different attributes. These can reveal basic information hidden in the data but are often too specific to show hidden relationships between the attributes. Visualization of data brings new insights, which leads to hypothesis generation and validation. Better visualization techniques are the need of the hour: techniques that can deal with large datasets and do not take too much time to generate results. They should be capable of reducing the data size through techniques like subsetting, dimensionality reduction, and sampling, as sketched below.
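A minimal sketch of that idea, assuming a large samples-by-genes matrix and standard Python tooling (numpy, scikit-learn, matplotlib), with synthetic data standing in for real measurements:

```python
# Sketch: subsample a large matrix, reduce its dimensionality with PCA,
# then plot the reduced view instead of the full dataset.
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Stand-in for a large samples-by-genes expression matrix (random data here).
expression = rng.normal(size=(50_000, 200))

# 1. Subsetting / sampling: keep a random subset of rows for plotting.
rows = rng.choice(expression.shape[0], size=2_000, replace=False)
subset = expression[rows]

# 2. Dimensionality reduction: project the 200 gene dimensions down to 2.
coords = PCA(n_components=2).fit_transform(subset)

# 3. Visualize the reduced data rather than the full matrix.
plt.scatter(coords[:, 0], coords[:, 1], s=2, alpha=0.3)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Subsampled, PCA-reduced view of a large dataset")
plt.show()
```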
In spite of tremendous effort from agencies like NCBI and the FDA, research and clinical data are still unstructured compared to datasets from other branches of science. Data scientists often find that biological data is not always big data; it can be technically small, yet much trickier to organize. Biologists often work on very complex problems in which they combine genomic sequence data to see which genes are turned on or off, what types of RNAs and proteins are being produced, what clinical symptoms were recorded, and whether there were any chemical or other exposures. A simple illustration of how such a combined record might be organized follows.
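Purely as a hypothetical illustration (field names and gene symbols are assumptions, not a standard schema), one could gather these heterogeneous pieces into a single per-patient record:

```python
# Hypothetical per-patient record combining variants, expression, clinical
# symptoms and exposures. Illustrative only; not a real data model.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class PatientRecord:
    patient_id: str
    # Genomic variants keyed by gene symbol, e.g. {"TP53": "p.R175H"}.
    variants: Dict[str, str] = field(default_factory=dict)
    # Normalized expression values per gene (e.g. from RNA-seq).
    expression: Dict[str, float] = field(default_factory=dict)
    # Clinical symptoms recorded for the patient.
    symptoms: List[str] = field(default_factory=list)
    # Chemical or environmental exposures, if any.
    exposures: List[str] = field(default_factory=list)

    def is_expressed(self, gene: str, threshold: float = 1.0) -> Optional[bool]:
        """Return whether a gene looks 'on' relative to a threshold, if measured."""
        value = self.expression.get(gene)
        return None if value is None else value >= threshold

record = PatientRecord(
    patient_id="P001",
    variants={"TP53": "p.R175H"},
    expression={"TP53": 3.2, "MDM2": 0.4},
    symptoms=["fatigue"],
    exposures=["tobacco smoke"],
)
print(record.is_expressed("TP53"))  # True
```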
In an effort to deal with some of these challenges, in 2012 the National Institutes of Health launched the Big Data to Knowledge (BD2K) initiative, which aims, in part, to create data-sharing standards and to develop data analysis tools that can be easily distributed. This program is still under discussion.
Last but not least, a challenge lies in the mindset of big pharmaceutical companies, which do not want to invest in improving big-data analytical capabilities because they do not see an ideal future state for themselves. Another hesitation stems from the possibility of increased interactions with regulatory authorities if they pursue this path. Pharmaceutical companies should explore by conducting small-scale pilots in order to see the value in this direction. The experience gained might provide long-term benefits toward realizing that future state.
Figure 1: Data analytics-related challenges with Next Generation Sequencing data.
How successful biomedical research will be in the future depends on our ability to interpret and visualize the scientific data generated by next generation sequencing technologies and clinical studies. Many IT giants and social networking companies such as Google, Facebook, Twitter, Amazon, Oracle, and Microsoft are leaders in interpreting huge data sets. Life sciences and medicine need to implement similarly scalable systems in order to deal with high volumes of sequencing data. The life sciences industry will need to focus more on adapting to advances in data analytics and informatics to successfully address the problems raised by Next Generation Sequencing technologies.