Research and development in Big Data technology will define the future of how data, and the intelligence derived from it, can be applied to defend and protect our planet from man-made and natural disasters. From the emergence of the knowledge economy to the paradigm shift of today's information age, research in Big Data technology has produced progressive outcomes for the developed world and, more recently, for the developing world, playing a crucial role in addressing a variety of issues and advancing economies. However, several challenges lie ahead, such as sustaining continuous innovation across global economies and safeguarding democracy and privacy in an ever-changing scientific, political, and socio-economic environment.

Real-time data processing refers to data that is streamed through multi-core computing hardware and processed and visualized as it is being collected. Federal agencies that deal with real-time data include NASA, DoD, NOAA, NGA, SEC, and a host of others. Real-time data applications include weather and meteorological applications, astronomy and space science, environmental monitoring, financial applications such as electronic securities trading on stock exchanges, network and infrastructure monitoring for performance and security vulnerabilities, fraud detection, command, control and communications, oceanography, acoustic signal processing, the census population clock, and more.

The most significant requirement of real-time data is zero to low latency. Traditional data processing technology is inadequate for applications requiring low-latency, high-volume real-time data processing. The constant escalation in data feed volumes for these applications is causing traditional feed processing systems to break. For example, in electronic trading, a latency of even one second is unacceptable, and the trading operation whose engine produces the most current results will maximize arbitrage profits. This is driving financial services companies to require very high-volume processing of feed data with very low latency. Real-time fraud detection, in areas ranging from financial services networks to cell phone networks, exhibits similar characteristics. Similar requirements arise in monitoring computer networks for denial-of-service and other kinds of security attacks.

The U.S. Army was the first branch of the Armed Services to embrace wireless sensor network technology, implementing it to monitor militant activity in remote areas of Afghanistan. The Army has also placed vital-signs monitors on soldiers, and GPS systems exist in the majority of military vehicles. Such technology enables the Army to monitor the location of vehicles in real time and determine whether they are off course or in danger of attack in enemy-infested areas. Beyond protecting its own soldiers and vehicles, the Army can use wireless sensor technology to monitor and detect enemy positions and movement. Other sensor-based monitoring applications, such as environmental monitoring and crime monitoring, are also making inroads in non-military settings. The processing of real-time data from existing and emerging monitoring applications presents both a challenge and an opportunity for stream processing and visualization.

In real-time digital signal processing (DSP), the analyzed (input) and generated (output) samples can be processed continuously in the time it takes to input and output the same set of samples, independent of the processing delay. This means the processing delay must be bounded even if the processing continues for an unlimited time, and that the average processing time per sample can be no greater than the sampling period, which is the reciprocal of the sampling rate. Consider an audio DSP example: if a process requires 2.01 seconds to process 2.00 seconds of sound, it is not real-time; if it takes 1.99 seconds, it is, or can be made into, a real-time DSP process. A signal processing algorithm that cannot keep up with the flow of input data, with output falling farther and farther behind the input, is not real-time. But if the delay of the output relative to the input is bounded during a process that operates over an unlimited time, then that signal processing algorithm is real-time, even if the throughput delay is very long.
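To make the arithmetic concrete, the sketch below checks the real-time condition by comparing the average processing time per sample against the sampling period. This is a minimal illustration; the 44.1 kHz sampling rate is an assumed example rate, not taken from the text.

```python
# Minimal sketch: checking whether a DSP stage can run in real time.
# The 44.1 kHz rate is an illustrative assumption.

def is_real_time(samples_processed: int, processing_seconds: float,
                 sampling_rate_hz: float) -> bool:
    """A process is real-time if its average processing time per sample
    does not exceed the sampling period (1 / sampling rate)."""
    avg_time_per_sample = processing_seconds / samples_processed
    sampling_period = 1.0 / sampling_rate_hz
    return avg_time_per_sample <= sampling_period

# The audio example from the text: 2.00 s of sound at 44.1 kHz is 88,200 samples.
samples = int(2.00 * 44_100)
print(is_real_time(samples, 2.01, 44_100))  # False: output falls behind input
print(is_real_time(samples, 1.99, 44_100))  # True: can keep up indefinitely
```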

Characteristics of a Real-Time System:

Continuous Data Streaming

To achieve low latency, a system must be able to perform message processing without a costly storage operation in the critical processing path. A storage operation adds a great deal of unnecessary latency to the process (e.g., committing a database record requires a disk write of a log record). For many stream processing applications, it is neither acceptable nor necessary to require such a time-intensive operation before message processing can occur. Instead, messages should be processed “in-stream” as they fly by. An additional latency problem exists with passive systems, which wait to be told what to do by an application before initiating processing. Passive systems require applications to poll continuously for conditions of interest. Unfortunately, polling adds overhead to both the system and the application, as well as additional latency, because (on average) half the polling interval is added to the processing delay. Active systems avoid this overhead by incorporating built-in event- and data-driven processing capabilities.
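The contrast between passive and active designs can be sketched as follows. This is a minimal Python illustration, not a description of any particular stream processing product; the queue, threshold, and handler names are assumptions made for the example.

```python
# Minimal sketch contrasting a passive, polling consumer with an active,
# event-driven one. Names (events, ALERT_THRESHOLD) are illustrative.
import queue
import threading
import time

ALERT_THRESHOLD = 100.0
events: "queue.Queue[float]" = queue.Queue()

def passive_poller(interval: float) -> None:
    """Passive: wakes every `interval` seconds to check for work, so on
    average half the polling interval is added to the processing delay."""
    while True:
        try:
            reading = events.get_nowait()
        except queue.Empty:
            time.sleep(interval)       # latency accumulates here
            continue
        if reading > ALERT_THRESHOLD:
            print(f"alert: {reading}")

def active_consumer() -> None:
    """Active: blocks until a message arrives and processes it in-stream,
    with no storage operation in the critical path."""
    while True:
        reading = events.get()         # wakes as soon as data arrives
        if reading > ALERT_THRESHOLD:
            print(f"alert: {reading}")

threading.Thread(target=active_consumer, daemon=True).start()
events.put(120.5)                      # handled immediately, no polling delay
time.sleep(0.1)                        # let the consumer thread run
```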

Integration of Stored and Streaming Data

For many stream processing applications, comparing “present” with “past” is a common task. Thus, a stream processing system must also provide for careful management of stored state. For example, in on-line data mining applications (such as detecting credit card or other transactional fraud), identifying whether an activity is “unusual” requires, by definition, gathering the usual activity patterns over time, summarizing them as a “signature”, and comparing them to the present activity in real time. To realize this task, both historical and live data need to be integrated within the same application for comparison.
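A minimal sketch of the signature idea follows. The per-account running mean and variance and the three-sigma test are assumed stand-ins for whatever summary a real fraud detector would maintain; they are chosen here only to illustrate comparing present activity against accumulated history.

```python
# Minimal sketch of the "signature" idea: summarize usual activity per
# account as a running mean/variance, then flag live transactions that
# deviate sharply. The 3-sigma rule and field names are assumptions.
import math
from collections import defaultdict

class Signature:
    """Incremental (Welford) mean/variance of past transaction amounts."""
    def __init__(self) -> None:
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, amount: float) -> None:
        self.n += 1
        delta = amount - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (amount - self.mean)

    def is_unusual(self, amount: float) -> bool:
        if self.n < 10:                   # not enough history yet
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(amount - self.mean) > 3 * std

signatures: dict[str, Signature] = defaultdict(Signature)

def on_transaction(account: str, amount: float) -> None:
    sig = signatures[account]
    if sig.is_unusual(amount):            # compare present with past
        print(f"possible fraud on {account}: {amount}")
    sig.update(amount)                    # fold the present into history
```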

A very popular extension of this requirement comes from firms with electronic trading applications, who want to write a trading algorithm and then test it on historical data to see how it would have performed on alternative scenarios. When the algorithm works well on historical data, the customer wants to switch it over to a live feed seamlessly; i.e., without modifying the application code.
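One way to realize this seamless switch is to have the algorithm consume a generic stream of ticks, so that a recorded file and a live feed are interchangeable. The following is a schematic sketch under assumed interfaces; the tick format, file layout, and socket calls are illustrative, not a real market-data API.

```python
# Minimal sketch of "write once, backtest, then go live": the trading
# algorithm consumes any iterable of ticks, so historical and live
# sources are interchangeable. Tick format and names are assumptions.
import csv
from typing import Iterable, Iterator

def trading_algorithm(ticks: Iterable[float]) -> None:
    """Application code: identical for historical and live runs."""
    last = None
    for price in ticks:
        if last is not None and price < last * 0.99:
            print(f"buy signal at {price}")
        last = price

def historical_feed(path: str) -> Iterator[float]:
    with open(path) as f:
        for row in csv.reader(f):
            yield float(row[0])

def live_feed(sock) -> Iterator[float]:
    while True:
        yield float(sock.recv(64))   # schematic; real feeds need parsing

# Backtest, then switch to live without touching trading_algorithm:
# trading_algorithm(historical_feed("ticks_2023.csv"))
# trading_algorithm(live_feed(market_socket))
```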

Another reason for seamless switching is the desire to compute a business analytic starting from a point in the past (say, two hours ago), “catch up” to real time, and then seamlessly continue the calculation on live data. This capability requires switching automatically from historical to live data, without human intervention. For low-latency streaming applications, interfacing with a client-server database connection to store and access persistent state adds excessive latency and overhead. Therefore, state must be stored in the same operating system address space as the application, using an embedded database system, and the scope of a StreamSQL command should be either a live stream or a stored table in that embedded database system.
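The catch-up pattern can be sketched by replaying stored tuples from the chosen starting point and then chaining directly onto the live stream, so the downstream analytic never sees a seam. This is a simplified illustration: the in-memory list stands in for an embedded database, and handling records that arrive during the switchover is omitted.

```python
# Minimal sketch of the "catch up, then continue live" pattern. The
# in-memory store is a stand-in for an embedded, in-process database;
# all names are illustrative assumptions.
import itertools
from typing import Iterable, Iterator

def replay_since(store: list[tuple[float, float]],
                 start_ts: float) -> Iterator[tuple[float, float]]:
    """Historical (timestamp, value) tuples from an in-process store;
    no client-server round trip in the critical path."""
    return (rec for rec in store if rec[0] >= start_ts)

def catch_up_then_live(store: list[tuple[float, float]], start_ts: float,
                       live: Iterable[tuple[float, float]]) -> Iterator:
    # Downstream code cannot tell where history ends and live data begins.
    return itertools.chain(replay_since(store, start_ts), live)

# running_total = 0.0
# for ts, value in catch_up_then_live(store, now - 7200, live_stream):
#     running_total += value    # e.g., an analytic starting two hours ago
```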

High-Availability

To preserve the integrity of mission-critical information and avoid disruptions in real-time processing, a stream processing system must use a high-availability (HA) solution. High availability is a critical concern for most stream processing applications. For example, virtually all financial services firms expect their applications to stay up all the time, no matter what happens. If a failure occurs, the application needs to fail over to backup hardware and keep going. Restarting the operating system and recovering the application from a log incur too much overhead and are thus not acceptable for real-time processing. Hence, a “Tandem-style” hot backup and real-time failover scheme [6], whereby a secondary system frequently synchronizes its processing state with a primary and takes over when the primary fails, is the most reasonable alternative for these types of applications.
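A hot-backup scheme of this kind can be sketched as a secondary that applies periodic state checkpoints from the primary and promotes itself when heartbeats stop arriving. The timeout value, class, and method names below are illustrative assumptions, not a description of the cited Tandem design.

```python
# Minimal sketch of a hot-backup/failover scheme: the primary ships state
# checkpoints and heartbeats; the secondary takes over when the heartbeat
# goes silent. Timing values and names are assumptions.
import time

HEARTBEAT_TIMEOUT = 2.0   # seconds of silence before failover (assumed)

class Secondary:
    def __init__(self) -> None:
        self.state: dict = {}
        self.last_heartbeat = time.monotonic()
        self.active = False

    def on_checkpoint(self, state: dict) -> None:
        """Primary frequently synchronizes its processing state."""
        self.state = dict(state)
        self.last_heartbeat = time.monotonic()

    def check(self) -> None:
        """Called on a timer: promote to primary instead of restarting and
        recovering from a log, so processing continues with little pause."""
        if (not self.active and
                time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT):
            self.active = True
            print("primary presumed failed; resuming from last checkpoint")
```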