Ventana Research Analyst Perspectives

DataStax Adds Vector Search to Address Generative AI

Written by Matt Aslett | Aug 3, 2023 10:00:00 AM

As I have previously explained, we expect an increased demand for intelligent operational applications infused with the results of analytic processes, such as personalization and artificial intelligence-driven recommendations. These systems rely on the analysis of data in the operational data platform to accelerate worker decision-making or improve customer experience.

AI-driven intelligent applications require a new approach to data processing that enables real-time performance of machine learning on operational data to deliver instant, relevant information for accelerated decision-making. I explained last year how NoSQL database provider DataStax added streaming data capabilities to its portfolio to address the processing of data in motion and at rest and support the development of interactive, real-time, data-driven applications. Since then, the company has further expanded its portfolio with the addition of open-source temporal event processing for machine learning and vector search capabilities to support the development of generative AI applications.

DataStax was founded in 2010 to build a business around the Apache Cassandra open-source distributed, non-relational database. The company is still best known as a provider of operational data platforms, both on-premises and in the cloud, and has continued to contribute to the development of Apache Cassandra in addition to its DataStax Enterprise distribution. DataStax has also acquired capabilities for graph data processing as well as cloud management services, and in 2020 launched the Astra DB managed database-as-a-service offering. The company has expanded its purview further through two recent acquisitions: DataStax acquired messaging and event-streaming cloud service provider Kesque in January 2021, followed by the acquisition of machine learning specialist Kaskada in January 2023. Both acquisitions addressed growing requirements for real-time data processing and intelligence through a combination of operational and analytic processing. I assert that through 2026, operational data platform providers will continue to invest in hybrid operational and analytic processing capabilities to support growing demand for data-intensive intelligent operational applications.

As a result of acquiring Kesque, DataStax added the ability to process streaming data using the Apache Pulsar open-source project to support the development of interactive, real-time, data-driven applications. The addition of Kaskada enhanced DataStax’s ability to support customers in the development of real-time AI applications. The company subsequently relicensed Kaskada’s machine learning feature engine as open-source software. More recently, DataStax responded to the popularization of large language models (LLMs) and generative AI by adding vector search capabilities to its Astra DB database-as-a-service to complement LLMs with approved enterprise content and data.

Data platforms continue to be DataStax’s primary focus, with the company offering Luna for Apache Cassandra, a commercial support subscription for open-source Apache Cassandra. It also offers the DataStax Enterprise commercial distribution, with added security and other enterprise features, and the Astra DB database-as-a-service. For stream and event processing, DataStax offers Luna Streaming, a commercial support offering for Apache Pulsar, and the Astra Streaming managed service.

The addition of Kaskada to DataStax’s product portfolio enables organizations with real-time data to adopt real-time AI through a combination of data platform, data streaming and AI/ML products and services. Kaskada is a unified batch and event-stream processing engine with a declarative query language that performs aggregations, joins and windowing to support analytics applications, dashboards and machine learning. Kaskada enables the processing of temporal event data, facilitating the development of applications providing real-time ML on event data. Luna ML is a commercial support offering for Kaskada Open Source, with additional Real-Time AI functionality and services from DataStax to help customers develop and deploy applications that maximize predictive and generative AI. Real-Time AI from DataStax provides additional capabilities to enable the development of AI applications, including the recently introduced vector search capabilities to support generative AI applications based on LLMs.
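To illustrate the kind of windowed aggregation over temporal event data described above, here is a minimal plain-Python sketch. It does not use Kaskada or its query language; the function name `trailing_window_counts`, the event layout and the 60-second window are assumptions chosen purely for illustration.

```python
from collections import deque


def trailing_window_counts(events, window_seconds):
    """Count each entity's events inside a trailing time window.

    `events` is a list of (timestamp_seconds, entity) pairs sorted by
    timestamp. For every event, the result holds the number of events
    for the same entity in the interval (timestamp - window, timestamp].
    This is an illustrative stand-in, not Kaskada's implementation.
    """
    recent = {}  # entity -> deque of timestamps still inside the window
    results = []
    for timestamp, entity in events:
        window = recent.setdefault(entity, deque())
        window.append(timestamp)
        # Evict timestamps that have aged out of the trailing window.
        while window[0] <= timestamp - window_seconds:
            window.popleft()
        results.append((timestamp, entity, len(window)))
    return results


# Hypothetical click events: (seconds since start, user id).
EVENTS = [(0, "a"), (10, "a"), (20, "b"), (65, "a"), (70, "a")]

if __name__ == "__main__":
    for row in trailing_window_counts(EVENTS, window_seconds=60):
        print(row)
```

A count such as this, computed as of each event's timestamp, is the sort of feature a real-time ML model might consume. A dedicated engine adds the declarative language, joins and unified batch and streaming execution that this sketch leaves out.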

Although we are at a very early stage of identifying enterprise use cases for generative AI, we expect adoption to grow rapidly. We assert that through 2025, one-quarter of organizations will deploy generative AI embedded in one or more software applications. The ability to trust the output of generative AI models will be critical to their adoption by enterprises. There are multiple approaches to addressing accuracy and trust concerns, one of which is using vector embeddings and vector search to augment generic models with enterprise information and data.

Vector embeddings are multi-dimensional mathematical representations of features or attributes of raw data, which could include text, images, audio or video. Vector search uses these embeddings to perform similarity searches, enabling rapid identification and retrieval of similar or related data. Potential applications for vector search include natural language processing and recommendation systems that find and recommend products similar in function or style, either visually or based on written descriptions. Vector embeddings and vector search complement large language models by incorporating embeddings that represent approved enterprise content and data, which helps to address accuracy and trust concerns. Astra Vector Search utilizes DataStax’s Storage Attached Indexing, which enables the creation of multiple secondary indexes on Astra DB database tables. DataStax has also released CassIO, an open-source library to integrate the Cassandra database with frameworks such as LangChain, making it easier for developers to access Cassandra’s capabilities, including vector search.
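To make the idea of similarity search concrete, here is a minimal sketch in plain Python. It is not Astra Vector Search or CassIO code: the three-dimensional vectors are hand-made stand-ins for real embeddings, the names `cosine_similarity`, `top_k` and `CATALOG` are hypothetical, and cosine similarity is assumed as the distance measure. A production system would use an approximate index rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def top_k(query, embeddings, k=2):
    """Brute-force search: score every item, return the k best matches."""
    scored = [
        (item, cosine_similarity(query, vector))
        for item, vector in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" invented for illustration; real ones come from a model
# and typically have hundreds or thousands of dimensions.
CATALOG = {
    "running shoes": [0.9, 0.1, 0.0],
    "trail sneakers": [0.8, 0.2, 0.1],
    "espresso machine": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    # A query vector that sits close to the footwear items.
    for item, score in top_k([1.0, 0.0, 0.0], CATALOG, k=2):
        print(f"{item}: {score:.3f}")
```

In a generative AI application, the items returned by a search like this would be the approved enterprise content passed to the LLM alongside the user's question.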

DataStax has expanded its addressable market considerably in recent years by adding streaming data and machine learning capabilities and expertise to its existing data platform focus. The company must still work to organize these capabilities into a combined offering, but the expanded portfolio puts DataStax in a stronger position to compete to support the next generation of intelligent operational applications. I recommend that any organization considering options for data platform, streaming and operational AI include DataStax in evaluations. The addition of vector search to address generative AI use cases illustrates how the company continues to adapt its platform in response to evolving requirements and use cases.

Regards,

Matt Aslett