Imagine Shopping for Your Data...

People enjoy the convenience of shopping online, which is more than you can say for most employees of large companies trying to find the data they need to do their jobs. Maybe what companies need is an online data marketplace? Can shopping at Amazon teach us how to manage the Big Data environments we have and how to give our employees an enjoyable "data" shopping experience?

I think I can safely assume that most people reading this article have shopped on Amazon.com. Whether you were looking for a good book or something else, Amazon is a popular "place" (if I can use that term). Amazon began as a virtual bookstore, then diversified, expanded and disrupted retail sales channels. Visitors to the Amazon site will account for about 7% of North American retail sales in 2018 (versus 10.6% for Walmart, the largest brick-and-mortar retailer). Part of this can be attributed to the 1.8 million items Amazon offers, versus an average of 120,000 for a Walmart Supercenter. Amazon also produces consumer electronics and cloud infrastructure services (IaaS and PaaS), and recently purchased Whole Foods (for $13.4 billion) to expand its grocery business.

How does shopping online relate to data management, and more specifically to the challenge of data access? Every enterprise seems to have an established data lake (or more than one) to go along with its data warehouse, a host of specific data marts, systems of record for structured data, enterprise content management systems for documents and many other data sources. Whether they call their strategy a data foundation or a data ecosystem, the amount and variety of data is growing. Yet just finding what you are looking for, verifying that the data quality is acceptable and confirming that the data comes from the right source (data governance) seems to take up most of your day.

Organizations looking to modernize their analytics and data infrastructure, to enable both data science teams and self-service for the average employee, are running into challenges with islands of data and heterogeneous technology landscapes. This leads to perpetual data integration efforts, making analytics difficult and restricted to the very few with the time and skills to mine the large data collections available. The vast majority of organizations building data lakes are struggling to unlock the maximum value in their data. The efforts to deploy this new technology are falling short due to multiple barriers, including:

  • Usability barriers: users don't know what is in the data lake. They need a guide to help them navigate their NoSQL environment
  • Access barriers: most companies have a heterogeneous technology landscape and need data access across the board (horizontal as well as vertical) for their analytics workloads
  • Technology barriers: users don't have the required tools to get value out of the data and enable self-service. Just deploying Hadoop is not enough
  • Skills barriers: hiring data scientists is difficult and expensive. Not every problem requires a specialist, and not everyone has to be an R or Python programmer
  • Productivity barriers: data scientists spend much of their time hunting for data and preparing a data set for analysis
  • Performance barriers: running analytical models against Hadoop scale data using traditional methods takes too long. Companies need analytical engines that scale along with the data

But before we start selecting technology solutions for each specific problem, we need a better idea of what kind of data marketplace we are trying to develop. Here are four example marketplace options; you can probably think of others that might fit your company's culture and requirements:

  • Gate Keeper: this marketplace is under the tight control of IT, which controls access and registration to data sources. Gatekeepers maintain high standards but at a high cost
  • Shop Keeper: users only have one choice for each information object and the choice offered is usually a high-quality choice, blessed by the business and IT. Integration and data quality are high, though you only have the choice offered and getting that changed can be a chore
  • Outlet Mall: in this option, you have more choice but less integration and standardization. There is some oversight on approved choices and data quality/standards but the user is responsible for the consequences when she picks an "off-brand" data source. It can also sprout a cottage industry when data stewards from the variety of data sources across your organization set up shop to attract business
  • Wild West: you may not consider this a choice for your company as there is little oversight, and little to no governance, but at least all the data is under one roof. It can be the place to start modifying behaviors, improving content and services through social pressure (users don't go back to sources that don't provide good data)

After you choose which kind of marketplace you want to promote, there are technologies and methods that can help you. The application of data catalogs, glossaries and metadata management tools will:

  • Improve efficiency of data discovery: support self-service business intelligence interests by directing engineers, analysts, operators to the best source of data for their projects. With the wide diversity of data sources (could be hundreds including informal sources with strong resistance to changing local business unit preferences), they will be grateful for the increased productivity, reliability and direct path to the best source of data
  • Accelerate the acceptance of standard metadata: increase deployment of existing standards and reinforce stewardship of information objects in metadata for data source discovery
  • Accelerate the acceptance of systems of record: use data catalogs to influence adoption of best available "systems of record" where data quality and data management best efforts are focused. Identify the important shadow IT data sources, expose them to projects and see if they can be moved to the existing official systems of record
  • One-stop shopping for data: establish a single go-to "marketplace" where projects discover data, to overcome data governance that is loosely defined and often informal
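
To make the catalog idea above a little more concrete, here is a minimal sketch in Python of what a searchable catalog might look like. It is an illustration only, not a description of any particular product: the `CatalogEntry` fields, the `search` function and the sample data sources are all hypothetical names chosen for this example. The one design choice it shows is that search results list the official system of record ahead of informal sources, which is one simple way a catalog could nudge users toward the best available source.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data source listed in the marketplace (all fields are illustrative)."""
    name: str
    description: str
    steward: str
    tags: set[str] = field(default_factory=set)
    system_of_record: bool = False  # True for the officially blessed source


def search(catalog: list[CatalogEntry], term: str) -> list[CatalogEntry]:
    """Return entries whose name, description or tags match the term."""
    term = term.lower()
    hits = [
        e for e in catalog
        if term in e.name.lower()
        or term in e.description.lower()
        or term in {t.lower() for t in e.tags}
    ]
    # Systems of record sort first, then alphabetically by name.
    return sorted(hits, key=lambda e: (not e.system_of_record, e.name))


# Hypothetical sample catalog: one official source, one shadow IT source,
# and one unrelated source.
catalog = [
    CatalogEntry("well_tests_spreadsheet", "Local copy of well test results",
                 steward="field office", tags={"well", "production"}),
    CatalogEntry("production_warehouse", "Daily production volumes by well",
                 steward="data management", tags={"well", "production"},
                 system_of_record=True),
    CatalogEntry("hr_directory", "Employee contact list",
                 steward="HR", tags={"people"}),
]

for entry in search(catalog, "well"):
    label = "system of record" if entry.system_of_record else "unofficial"
    print(f"{entry.name} ({label}, steward: {entry.steward})")
```

A real catalog would add data quality scores, lineage and access controls, but even this small structure depends on the stewardship and tagging discipline described in the list above.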

Just like at Amazon, a data marketplace is supported by an integrated platform behind the scenes. The platform should consist of best-in-class capabilities for data discovery, data sampling, data profiling, data wrangling, data blending, data lineage, data cataloging, data preparation/transformation, analytical modeling, guided analytical modeling, model management and visualization, augmenting the data platform to deliver end-to-end self-service capabilities to the data analyst and data scientist community.

This is not an easy journey to embark on, but you can start with simple steps: focus on targeted communities (such as the data science team), targeted data environments (understanding what is in your data lake), select data sources (ones your business is already devoting time to getting into better shape) and a subset of the technology components in the full data platform.

"A journey of a thousand miles begins with a single step," according to a Chinese proverb. What is keeping you from taking the first step?

Jim Crompton is a thought leader for Noah Consulting, an Infosys Company, helping pioneer the relationships between complex Upstream processes and enterprises with automation to create competitive advantage.  His experience over numerous decades combined with the development capability of Infosys is working to ensure successful alignment of man and machine.


Jim you always tell the best stories. I LOVE this concept and have struggled telling the data/information story and your analogy works so well. Now if we could only get everybody to classify / tag data correctly we could make this work
