Traditional Data versus Machine Data: A closer look
(Posted on behalf of Pranit Prakash)
You have probably heard this one a lot - Google's search engine processes approximately 20 petabytes (1 PB=1000 TB) of data per day and Facebook scans 105 terabytes (1 TB=1000 GB) of data every 30 minutes.
Predictably, very little of this data fits into the rows and columns of conventional databases, given its unstructured nature and sheer volume. Data of this scale and complexity is commonly referred to as Big Data.
The question then arises - how is this type of data different from system-generated data? What happens when we compare system-generated data - logs, syslogs and the like - with Big Data?
We all understand that conventional data warehouses are ones where data is stored in table-based structures, and useful business insights can be drawn from this data by employing a relational business intelligence (BI) tool. However, Big Data cannot be analyzed with conventional tools owing to the sheer volume and complexity of the data sets.
Machine or system-generated data refers to the data produced by IT operations and by infrastructure components such as servers, applications, APIs and firewalls, typically in the form of server logs and syslogs. This data also requires special analytics tools to provide smart insights into infrastructure uptime, performance, threats and vulnerabilities, usage patterns etc.
So where does system data differ from Big Data or traditional data sets?
1. Format: Traditional data is stored as rows and columns in a relational database, whereas system data is stored as text that is loosely structured or even unstructured. Big Data remains highly unstructured and even includes raw data that is generally not categorized, but is partitioned so that it can be indexed and stored.
2. Indexing: In traditional data sets, each record is identified by a key, which is also used as the index. In machine data, each record carries a time-stamp that is used for indexing. Big Data, by contrast, has no fixed criteria for indexing.
3. Query Type: In traditional data analysis, pre-defined questions and searches are expressed in a structured query language. In system or machine data, there is a wide variety of queries, mostly based on source type, logs and time-stamps. In Big Data, there is no limit to the number of queries, and what can be asked depends on how the data is configured.
4. Tools: Typical SQL and relational database tools are used to handle traditional data sets. For machine data, there are specialized log collection and analysis tools such as Splunk, Sumo Logic and eMite, which install an agent/forwarder on the devices to collect data from IT applications and devices and then apply statistical algorithms to process this data. In Big Data, there are several categories of tools, ranging from storage and batch processing (such as Hadoop) to aggregation and access (such as NoSQL) to processing and analytics (such as MapReduce).
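To make the indexing and query-type contrast above a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the user table, the host names and the log lines are all made up, and the simple "time-stamp followed by message" layout is an assumption rather than the format of any particular log tool.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime

# Traditional data: each record is looked up by its key
# (hypothetical user table, invented for this example).
users_by_id = {
    101: {"name": "Asha", "age": 31},
    102: {"name": "Ben", "age": 27},
}

# Machine data: loosely structured text lines, each starting with a
# time-stamp (made-up sample events, not real logs).
raw_events = [
    "2014-03-01T09:15:02 web01 login user=101",
    "2014-03-01T09:47:40 web02 login user=102",
    "2014-03-01T10:05:11 web01 error code=500",
    "2014-03-01T11:30:00 fw01 blocked src=10.0.0.7",
]


def parse_event(line):
    """Split a line into (time-stamp, rest of the message)."""
    stamp, message = line.split(" ", 1)
    return datetime.fromisoformat(stamp), message


# Index the events by time-stamp, kept in sorted order.
events = sorted(parse_event(line) for line in raw_events)
stamps = [stamp for stamp, _ in events]


def events_between(start, end):
    """Return the messages whose time-stamp falls in [start, end]."""
    lo = bisect_left(stamps, start)
    hi = bisect_right(stamps, end)
    return [message for _, message in events[lo:hi]]
```

A key lookup such as `users_by_id[102]` answers a pre-defined question about one record, while `events_between(datetime(2014, 3, 1, 9, 30), datetime(2014, 3, 1, 10, 30))` is the kind of time-window query that is more typical of machine data.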
When a user logs in to a social networking site, details such as name, age and other attributes entered by the user get stored as structured data and constitute traditional data - i.e. data stored in the form of neat tables. On the other hand, data that is generated automatically during a user transaction, such as the time-stamp of a login, constitutes system or machine data. This data is amorphous and cannot be modified by end users.
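The login example can be sketched in a few lines of Python. This is a rough illustration, not how any real social networking site works: the `users` table, the `register` and `record_login` helpers and the log-line layout are all hypothetical names chosen for this example.

```python
import sqlite3
from datetime import datetime, timezone

# Traditional data: attributes the user types in go into a table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
)


def register(name, age):
    """Store user-entered attributes as one structured row."""
    cur = conn.execute(
        "INSERT INTO users (name, age) VALUES (?, ?)", (name, age)
    )
    conn.commit()
    return cur.lastrowid


# Machine data: the system writes this on its own at login time.
login_log = []


def record_login(user_id, now=None):
    """Append a system-generated log line; the user supplies none of it."""
    now = now or datetime.now(timezone.utc)
    line = f"{now.isoformat()} login user_id={user_id}"
    login_log.append(line)
    return line


uid = register("Asha", 31)  # made-up sample user
record_login(uid)
```

The row in `users` can be edited by the user later, whereas the entries in `login_log` are produced automatically as a side effect of the login.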
While analysis of some of the obvious attributes - name, age etc. - gives an insight into consumer patterns, as evidenced by BI and Big Data analysis, system data can also yield information at the infrastructure level. For instance, server log data from internet sites is commonly analyzed by webmasters to identify peak browsing hours, heat maps and the like. The same can be done for an application server as well.
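As a small illustration of the peak-hours analysis mentioned above, the following Python sketch counts requests per hour of day. The access-log lines are invented and use a simplified "time-stamp path" layout, which is an assumption for this example; real web server logs would need their own parsing.

```python
from collections import Counter
from datetime import datetime

# Made-up access-log lines in a simplified "time-stamp path" layout.
access_log = [
    "2014-03-01T09:15:02 /home",
    "2014-03-01T09:47:40 /products",
    "2014-03-01T14:05:11 /home",
    "2014-03-01T14:20:45 /cart",
    "2014-03-01T14:59:59 /checkout",
    "2014-03-01T21:30:00 /home",
]


def requests_per_hour(lines):
    """Count requests for each hour of the day (0-23)."""
    counts = Counter()
    for line in lines:
        stamp = line.split(" ", 1)[0]
        counts[datetime.fromisoformat(stamp).hour] += 1
    return counts


def peak_hour(lines):
    """Return (hour, request count) for the busiest hour."""
    return requests_per_hour(lines).most_common(1)[0]


print(peak_hour(access_log))  # (14, 3)
```

The same counting approach would apply to application server logs, provided each line carries a time-stamp that can be parsed.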