The Three Vs of Big Data

Volume

Big Data analytics offers numerous advantages but also poses challenges. Consider, for example, scientific research, where more data points lead to better and more accurate results and models. At the same time, voluminous data presents several processing challenges, particularly where unstructured data is involved, which accounts for roughly 90% of Big Data, thanks largely to contributions from sources such as sensors, tweets, social media, geospatial feeds, cybersecurity logs, text messages, research organizations, historical archives, financial transaction records, and so on. Thanks also to storage area networks and cloud computing, we need no longer worry about data storage capacity, scalability and accessibility. Several organizations offer scale-out network attached storage (NAS) solutions for storing, managing and retrieving Exabytes (10^18 bytes, or 1 billion gigabytes) of data. Object-based storage vendor Cleversafe, which received CIA funding, has developed a 10 Exabyte storage system capable of housing data in a single pool of capacity. Bear in mind that 1,000 gigabytes is a terabyte, and a terabyte of storage can hold about 300 hours of video; Cleversafe's storage system could hold 10 million times as much data.
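As a quick sanity check on those figures, the short calculation below works through the unit conversions (using decimal units, as the text does).

```python
# Quick check of the storage arithmetic above (decimal units, not binary).
terabyte = 10**12                  # 1,000 gigabytes
exabyte = 10**18                   # 1 billion gigabytes
capacity = 10 * exabyte            # the 10-Exabyte pool described above
print(capacity // terabyte)        # 10,000,000 -> 10 million terabytes
print(capacity // terabyte * 300)  # roughly 3 billion hours of video at 300 h/TB
```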

Given the gap between Big Data’s potential and its realization, the question is: how do we process Big Data, and how do we extract value from it?

How can we revolutionize education by customizing approaches to individual students’ needs?

How do we improve the quality of healthcare by providing personalized care and prevention through remote and continuous monitoring, anywhere and at any time?

How can we translate billions of power meter readings a year into better power consumption forecasts and models?

It is safe to assume that everyone who deals with Big Data is familiar with Hadoop and the Hadoop Distributed File System (HDFS). Besides HDFS, Hadoop at its core also includes MapReduce. The MapReduce framework makes it possible to break large volumes of data into smaller chunks and process them separately. A cluster of computing nodes, each built on commodity hardware, scans the batches and aggregates their data.

Then the output from these multiple nodes is merged to generate the resultant data set, on which analytics and data mining algorithms are executed.
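As a concrete illustration of this map-and-merge pattern, the toy Python sketch below counts words across several chunks: each chunk is mapped independently, as a node in the cluster would do, and the partial counts are then merged. It illustrates the concept only and is not Hadoop code; the input chunks are invented.

```python
# Minimal sketch of the MapReduce idea in plain Python (not Hadoop itself):
# each "node" maps over its own chunk of text, and the partial counts are
# then merged, which is the role of the reduce stage.
from collections import Counter
from functools import reduce

def map_chunk(chunk):
    """Map stage: count words in one chunk, as a single node would."""
    return Counter(chunk.split())

def merge_counts(a, b):
    """Reduce stage: combine the partial results from two nodes."""
    return a + b

chunks = [
    "big data big volume",
    "volume velocity variety",
    "big data velocity",
]

partial_results = [map_chunk(c) for c in chunks]   # done in parallel on a cluster
total = reduce(merge_counts, partial_results)      # merged into one result set
print(total.most_common(3))
```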

But Big Data is not all about MapReduce. There is another computational approach to distributed query processing, called Massively Parallel Processing, or MPP. MPP has a lot in common with MapReduce: as in MapReduce, processing is distributed across a bank of compute nodes, the separate nodes process their data in parallel, and the node-level output sets are assembled to produce a final result set. We therefore have two possible computational approaches for processing Big Data, Hadoop-based MapReduce and MPP systems such as Greenplum. However, the choice between them is significantly influenced by the Variety of Big Data. Data warehousing approaches typically involve predetermined schemas, suiting a regular and slowly evolving dataset; Apache Hadoop, on the other hand, places no conditions on the structure of the data it can process.

At its core, Hadoop is a platform for distributing computing problems across a number of servers. First developed and released as open source by Yahoo, it implements the MapReduce approach pioneered by Google in compiling its search indexes. Hadoop’s MapReduce involves distributing a dataset among multiple servers and operating on the data: the “map” stage. The partial results are then recombined: the “reduce” stage. To store data, Hadoop utilizes its own distributed filesystem, HDFS, which makes data available to multiple computing nodes.

A typical Hadoop usage pattern involves three stages: loading data into HDFS, MapReduce operations, and retrieving results from HDFS. This process is by nature a batch operation, suited to analytical or non-interactive computing tasks. Because of this, Hadoop is not itself a database or data warehouse solution, but it can act as an analytical adjunct to one. One of the best-known Hadoop users is Facebook. A MySQL database stores the core data, which is then reflected into Hadoop, where computations occur, such as generating recommendations for a user based on their friends’ interests. Facebook then transfers the results back into MySQL for use in the pages served to users.
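As a rough sketch of that three-stage pattern, the Python snippet below drives the standard hdfs and hadoop command-line tools to load a file, run a Hadoop Streaming job, and pull the merged results back. The file names, HDFS paths, mapper/reducer scripts, and the location of the streaming jar are placeholders that depend on the local installation.

```python
# Rough sketch of the load -> MapReduce -> retrieve pattern using the
# standard command-line tools; paths and jar location are placeholders.
import subprocess

STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"  # install-specific

def run(cmd):
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Load raw data into HDFS.
run(["hdfs", "dfs", "-put", "-f", "weblogs.txt", "/user/analyst/input/"])

# 2. Run a MapReduce job (here via Hadoop Streaming, with mapper.py and
#    reducer.py supplied by the analyst).
run(["hadoop", "jar", STREAMING_JAR,
     "-input", "/user/analyst/input/weblogs.txt",
     "-output", "/user/analyst/output",
     "-mapper", "mapper.py",
     "-reducer", "reducer.py",
     "-file", "mapper.py", "-file", "reducer.py"])

# 3. Retrieve the merged result set for downstream analytics.
run(["hdfs", "dfs", "-getmerge", "/user/analyst/output", "results.txt"])
```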

Velocity

According to Gartner, velocity "means both how fast data is being produced and how fast the data must be processed to meet demand." The Velocity dimension represents the speed at which Big Data must be processed. In rough terms, the second dimension is directly proportional to the first: data must be processed at the velocity at which it flows in. Financial services organizations such as Goldman Sachs, JP Morgan Chase, Citi Financial and others have long turned fast-moving data to their advantage. But the enormous increase in the Volume of structured and unstructured data presents both opportunities and challenges for these organizations. Data storage, the speed of processing, and the relationships between structured and unstructured data are some of the challenges that must be addressed for real-time analysis and discernment. A delay of even a few seconds can mean enormous financial losses where trade transactions are concerned.

For time-sensitive processes such as catching fraud, Big Data must be used as it streams into the enterprise in order to maximize its value. Traditional data processing technology is inadequate for applications requiring low-latency, high-volume, real-time data processing. For example, in electronic trading a latency of even one second is unacceptable, and the trading operation whose engine produces the most current results maximizes arbitrage profits. This is driving financial services companies to require very high-volume processing of feed data with very low latency. Real-time fraud detection, in areas ranging from financial services networks to cell phone networks, exhibits similar characteristics, and similar requirements arise in monitoring computer networks for denial-of-service and other security attacks. Organizations face the challenge of reacting quickly enough to keep pace with the velocity of the data while still producing realistic and useful insights.
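The toy sketch below shows the flavor of such low-latency processing: each transaction is scored the moment it arrives, against a running per-account baseline, rather than being stored first and analyzed later. The record fields, window size, and threshold are invented for the example.

```python
# Toy sketch of low-latency stream processing: each transaction is scored as
# it arrives, against a running per-account baseline, instead of being stored
# first and analyzed later. Fields and thresholds are invented for the example.
from collections import defaultdict, deque

WINDOW = 20           # number of recent transactions kept per account
SPIKE_FACTOR = 5.0    # flag amounts far above the recent average

recent = defaultdict(lambda: deque(maxlen=WINDOW))

def process(txn):
    """Handle one transaction the moment it arrives."""
    history = recent[txn["account"]]
    if history:
        avg = sum(history) / len(history)
        if txn["amount"] > SPIKE_FACTOR * avg:
            print(f"ALERT: account {txn['account']} spent {txn['amount']:.2f}, "
                  f"~{txn['amount'] / avg:.0f}x its recent average")
    history.append(txn["amount"])

# A stream would normally come from a message bus; a list stands in for it here.
stream = [
    {"account": "A1", "amount": 12.50},
    {"account": "A1", "amount": 9.75},
    {"account": "A1", "amount": 480.00},   # suspicious spike
]
for txn in stream:
    process(txn)
```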

Stream processing is normally considered for one of two main reasons: the velocity of the incoming data is so high that it is difficult to store it in its entirety, or the results of the analytics are required in real time. For example, the US Army has been investigating putting vital-signs monitors on all soldiers. In addition, many military vehicles already carry a GPS system, but it is not yet connected into a closed-loop system. Using this technology, the Army would like to monitor the position of all vehicles and determine, in real time, whether they are off course. Other sensor-based monitoring applications are making inroads into non-military domains. The processing of real-time data from existing and newly emerging monitoring applications presents a major stream processing challenge and opportunity. Obviously, real-time feedback generates a greater competitive advantage for the organization.
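To illustrate the kind of closed-loop check described above, the sketch below flags a vehicle as off course whenever an incoming GPS fix falls too far from every waypoint on its planned route. The route, the fixes, and the tolerance are invented, and the distance formula is only a flat-earth approximation.

```python
# Illustration of an off-course check on a stream of GPS readings: each new
# position is compared against the nearest waypoint on the planned route.
# Route, readings, and tolerance are invented for this example.
import math

def distance_m(p, q):
    """Approximate ground distance in metres between two (lat, lon) points."""
    lat = math.radians((p[0] + q[0]) / 2)
    dy = (p[0] - q[0]) * 111_320                    # metres per degree latitude
    dx = (p[1] - q[1]) * 111_320 * math.cos(lat)    # shrink longitude by latitude
    return math.hypot(dx, dy)

def off_course(position, route, tolerance_m=500):
    """A vehicle is off course if no route waypoint is within the tolerance."""
    return min(distance_m(position, wp) for wp in route) > tolerance_m

planned_route = [(34.000, -117.000), (34.010, -117.020), (34.020, -117.040)]
incoming_fixes = [(34.0005, -117.0010), (34.0500, -117.1000)]  # second one strays

for fix in incoming_fixes:
    status = "OFF COURSE" if off_course(fix, planned_route) else "on course"
    print(fix, "->", status)
```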

Commercial tools for processing streaming data include IBM InfoSphere, which can capture and analyze data in motion; emerging alternatives include open source frameworks such as Twitter’s Storm, Yahoo’s S4 and spChains. The results are pumped directly into user displays and dashboards, or fed as input to other applications.

Variety

Variety refers to the many different formats in which Big Data arrives. For example, structured data could be dbf files from relational database systems, while unstructured data could be text, Word documents, PDF documents, Excel spreadsheets, Visio drawings, PowerPoint presentations, sensor data, audio, video, click streams, log files, e-mails, images, and so on. These data must be analyzed as a whole, and both numeric and non-numeric data must be processed to obtain useful information.

The Variety of Big Data is well illustrated by the utilities. The number of different data types is very high, each corresponding to a different type of device and/or sensor on the grid. The challenge this poses is aggregating the different types of data into a form on which analytics can be performed. Often, different divisions within an organization use different systems, giving rise to yet more data types and compounding the complexity of Big Data. The challenge for the future is to establish interchange standards for unstructured Big Data, just as HL7 has done for health data interchange.
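One common tactic for the aggregation problem described above is to map records from each differently shaped device feed into a single common schema before running analytics. The sketch below is a minimal illustration of that idea; the device formats and field names are invented.

```python
# Toy sketch of normalizing heterogeneous utility feeds into one common schema
# before analytics. The device formats and field names are invented.
from datetime import datetime, timezone

def from_smart_meter(rec):
    """Smart meters (in this example) report epoch seconds and kWh."""
    return {"device_id": rec["meter_id"],
            "timestamp": datetime.fromtimestamp(rec["ts"], tz=timezone.utc),
            "metric": "energy_kwh",
            "value": rec["kwh"]}

def from_line_sensor(rec):
    """Line sensors (in this example) report ISO timestamps and voltage."""
    return {"device_id": rec["sensor"],
            "timestamp": datetime.fromisoformat(rec["time"]),
            "metric": "voltage_v",
            "value": rec["volts"]}

feeds = [
    (from_smart_meter, [{"meter_id": "M-17", "ts": 1700000000, "kwh": 3.2}]),
    (from_line_sensor, [{"sensor": "L-04", "time": "2023-11-14T22:13:20+00:00",
                         "volts": 239.5}]),
]

# Apply each feed's normalizer, producing rows that analytics can treat alike.
unified = [normalize(rec) for normalize, records in feeds for rec in records]
for row in unified:
    print(row)
```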