Streaming Databases Enable Continuous Analysis and Data Persistence

Written by Matt Aslett | Mar 23, 2023 10:00:00 AM

Success with streaming data and events requires a more holistic approach to managing and governing data in motion and data at rest. The use of streaming data and event processing has been part of the data landscape for many decades. For much of that time, however, data streaming was a niche activity, with standalone data streaming and event-processing projects run in parallel with existing batch-processing initiatives that utilized operational and analytic data platforms. As I have noted, there has been an increased focus on unified approaches that enable the holistic management and governance of data in motion alongside data at rest. One example is the recent emergence of streaming databases designed to combine the incremental processing capabilities of stream-processing engines with the SQL-based analysis and persistence capabilities of traditional databases.

Ventana Research’s Streaming Data Dynamic Insights enables organizations to assess their relative maturity in achieving value from streaming data. Data from Ventana Research’s Analytics and Data Benchmark Research indicates that there are distinct benefits for organizations utilizing streaming data. As might be expected, organizations making use of event-streaming technologies are more confident in their ability to analyze data in motion. Almost two-thirds of organizations (64%) in production with event-streaming technologies are confident in their ability to analyze high-velocity data, compared to less than one-half of organizations (43%) that are not in production with event streaming. Organizations utilizing event-streaming technologies also have greater confidence in their ability to analyze large volumes of data, as well as the variety of data needed to make informed business decisions. However, even among organizations that are making good use of streaming data and events, many are doing so via initiatives that are separate from the traditional data platforms designed to support the batch processing of transactional data. The result can be silos of batch and stream data that then need to be combined and integrated to provide a holistic view. This is not the most efficient way to manage data related to an organization, particularly given the increased importance of streaming data produced by various use cases, including real-time e-commerce and internet of things (IoT).

Stream-processing systems have, to date, primarily been used for ingestion of streaming data and events and for streaming analytics. Streaming data ingestion enables event data to be cleaned and transformed as it is ingested, while streaming analytics allows users to query the event data in flight, enabling low-latency continuous analysis of data as it is generated. In both cases, once processed, the historical event data can then be stored in an external relational or non-relational data platform for batch processing and analysis as well as integration with transactional data. Streaming analytics provides a real-time view, and batch processing provides a historical view. If an organization is to gain a complete picture, batch and streaming analytics need to be combined. A prime example is combining transactional and user behavior data to fully understand customer behavior in an online retail environment. Our research leads us to assert that, by 2025, more than 7 in 10 organizations’ standard information architectures will include streaming data and event processing, allowing organizations to be more responsive and provide better customer experiences.

The emergence of a new breed of streaming databases could reduce operational inefficiencies compared to approaches that rely on separate platforms for stream and batch processing. Streaming databases, including the likes of Confluent ksqlDB, DeltaStream, Materialize, RisingWave and Timeplus, are designed to continually process streams of event data using SQL queries and real-time materialized views and also persist historical event data for further analysis. Unlike streaming compute engines that persist the data in an external database, streaming databases are designed to provide native processing and persistence. As such, a single streaming database could be used as an alternative to a combination of (for example) Apache Flink and Apache Cassandra, reducing deployment, configuration, integration and management complexity.

I recently wrote about how real-time analytic data platforms could be used to develop and support data-intensive operational applications requiring sub-second latency. Streaming databases serve a similar purpose, with latency in the milliseconds. For use cases requiring query response times in the hundreds of milliseconds, either approach could theoretically fit the bill. However, the two technologies remain differentiated by their respective approaches to data processing and persistence. In an analytic database, the data is ingested and stored prior to the batch execution of analytic queries. These queries could be performed with sub-second latency, and therefore deliver “real-time” performance from the perspective of the user, but they still occur after the data is ingested and stored. In comparison, streaming databases perform incremental processing and analysis of the ingested data prior to it being stored. As such, while real-time analytic databases can be used to answer a query in real time, providing a snapshot of the data at the time of the query, streaming databases enable the data to be queried continuously as new data is added, providing a constantly updated, real-time view that is triggered by the ingestion of new data and could therefore be described as truly data-driven. The use of these streaming databases is nascent, but I assert that by 2026, more than one-half of organizations will investigate the use of streaming database products to develop and support real-time applications based on streaming data and events.
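
The distinction between the two approaches can be sketched in a few lines of Python. This is only an illustrative toy, not a representation of how any of the products named above are implemented: the class names `SnapshotStore` and `StreamingView`, the per-customer running total and the subscriber callback are all hypothetical choices made for the example. The first class stores events and scans them when a query arrives; the second updates its result and notifies listeners as each event is ingested, before appending the event to its history.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SnapshotStore:
    """Analytic-database style (hypothetical): store first, compute at query time."""
    rows: list = field(default_factory=list)

    def ingest(self, customer, amount):
        # Ingestion only persists the event; no analysis happens here.
        self.rows.append((customer, amount))

    def total_by_customer(self):
        # Each query scans the stored rows and returns a point-in-time snapshot.
        totals = defaultdict(float)
        for customer, amount in self.rows:
            totals[customer] += amount
        return dict(totals)


@dataclass
class StreamingView:
    """Streaming-database style (hypothetical): update the view per event, then persist."""
    history: list = field(default_factory=list)
    totals: dict = field(default_factory=dict)
    subscribers: list = field(default_factory=list)

    def ingest(self, customer, amount):
        # Incrementally update the materialized result for the affected key only.
        self.totals[customer] = self.totals.get(customer, 0.0) + amount
        # Push the updated value to listeners: the new data triggers the result.
        for notify in self.subscribers:
            notify(customer, self.totals[customer])
        # Persist the raw event afterwards so history stays available for later analysis.
        self.history.append((customer, amount))


if __name__ == "__main__":
    events = [("ann", 10.0), ("bob", 5.0), ("ann", 2.5)]

    store = SnapshotStore()
    view = StreamingView()
    view.subscribers.append(lambda c, t: print(f"update: {c} -> {t}"))

    for customer, amount in events:
        store.ingest(customer, amount)
        view.ingest(customer, amount)

    # The snapshot is computed on request; the view is already up to date.
    print(store.total_by_customer())
    print(view.totals)
```

Both classes arrive at the same totals. What differs is when the work is done and who initiates it: the snapshot store waits to be asked, whereas the streaming view produces an updated answer each time an event arrives.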

Stream and batch processing both have advantages that lend themselves to specific use cases with different performance requirements. As such, they will continue to co-exist. However, there are also potential advantages to be had from unification. Organizations should evaluate vendors on their ability to deliver a combination of those capabilities, whether as one product or several. I recommend that the potential use cases for streaming databases be part of that evaluation process.

Regards,

Matt Aslett
