Ventana Research Analyst Perspectives

Databricks Utilizes Unity Catalog to Support Generative AI Development

Written by Matt Aslett | Oct 17, 2023 10:00:00 AM

I previously described how Databricks had positioned its Lakehouse Platform as the basis for data engineering, data science and data warehousing. The lakehouse design pattern provides a flexible environment for storing and processing data from multiple enterprise applications and workloads for multiple use cases. I assert that by 2025, 8 in 10 current data lake adopters will invest in data lakehouse architecture to improve the business value generated from the accumulated data.

Databricks is perhaps still best known for the Apache Spark open-source distributed data processing framework, which is a core component of the Databricks Lakehouse Platform. Just as important to the Lakehouse Platform, however, are the company’s structured data management and unified governance functionality, as reinforced by recent announcements expanding its workloads to include generative AI. Databricks’ Unity Catalog is a core enabler of the Lakehouse AI announcements and the launch of LakehouseIQ.

Databricks was founded in 2013 and initially built a business providing a managed service that enables data engineers, data scientists and developers to create and maintain data engineering and machine-learning workloads using Apache Spark. The platform is also compatible with other open-source projects, including TensorFlow, MLflow and PyTorch.

Databricks expanded its focus to data warehousing, with dashboarding and visualization functionality provided by the 2020 acquisition of Redash. This functionality is supported by the Delta Lake table storage project, the Delta Engine high-performance query engine and Unity Catalog, which provides unified governance for files, tables, dashboards and machine learning models. Almost one-half (46%) of participants in Ventana Research’s Data Lake Dynamic Insights research use a data catalog to manage data in the data lake.

Databricks’ combined functionality is offered as the Databricks Lakehouse Platform on Amazon Web Services and Google Cloud, and as Azure Databricks on Microsoft Azure. It is used by more than 10,000 customers, providing the company with a revenue run rate of over $1.5 billion and a valuation of $43 billion, according to its recent $500 million Series I funding round. Databricks has expanded its focus to address generative AI through internal research and development and through the acquisition of generative AI platform provider MosaicML. Best known for developing and training its own large language models, MosaicML also provides a platform that enterprises can use to build, train and deploy LLMs. The MosaicML platform remains available on a standalone basis while Databricks works on closer integration with the Databricks Lakehouse Platform.

Announcements from the company’s recent Data + AI Summit user conference focused on combining generative AI with the Databricks Lakehouse Platform. They position the platform as a foundation for developing generative AI-based applications and introduce generative AI-based services that facilitate the generation of insights from data in the platform.

Lakehouse AI is not a new product offering but a set of functionality announcements that facilitate the use of the Databricks Lakehouse Platform for developing and training generative AI. The features introduced at Data + AI Summit include support for fine-tuning LLMs using Databricks AutoML, optimization of Databricks Model Serving for a curated list of open-source LLMs available within Databricks Marketplace, and enhancements to the MLflow machine learning operationalization platform to support LLMs. The MLflow enhancements include MLflow AI Gateway, which manages access to hosted model providers and their application programming interfaces (APIs), and MLflow prompt tools for comparing the output of multiple models.
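To make the MLflow enhancements concrete, here is a minimal sketch of logging an open-source LLM with MLflow’s transformers flavor. It assumes MLflow 2.3 or later, and the small model used ("gpt2") is an arbitrary illustration, not one of the curated Marketplace models mentioned above.

```python
# Minimal, illustrative sketch: logging a Hugging Face text-generation
# pipeline with MLflow's transformers flavor (available since MLflow 2.3).
# The model choice ("gpt2") is an arbitrary example, not a Databricks default.
import mlflow
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=generator,
        artifact_path="llm",
    )

# The logged model can be reloaded as a generic pyfunc for inference,
# or registered for serving.
loaded = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded.predict(["The lakehouse pattern combines"]))
```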

Databricks also announced the addition of vector search, providing the ability to create and manage vector embeddings from files in Unity Catalog and improving the accuracy of generative AI responses with proprietary data and content. Unity Catalog now serves as the backend of Databricks Feature Store, serving data to models in training and production as well as providing the governance and lineage capabilities that support model monitoring with Databricks Lakehouse Monitoring. This extends the company’s monitoring and management capabilities to all data and AI assets within the Lakehouse.
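As a rough illustration of the vector search capability, the sketch below uses the databricks-vectorsearch Python client to create a Delta Sync index over a Unity Catalog-governed table and query it for context. The endpoint, table, index and embedding-model names are hypothetical placeholders, and the exact client API may vary by release.

```python
# Hedged sketch using the databricks-vectorsearch client. All endpoint,
# table, index and model names below are hypothetical placeholders.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Create a Delta Sync index that keeps embeddings in step with a
# Unity Catalog-governed Delta table.
index = client.create_delta_sync_index(
    endpoint_name="vs_endpoint",                     # hypothetical endpoint
    index_name="main.docs.support_articles_index",   # three-level UC name
    source_table_name="main.docs.support_articles",  # governed source table
    pipeline_type="TRIGGERED",                       # sync on demand
    primary_key="id",
    embedding_source_column="body",                  # text column to embed
    embedding_model_endpoint_name="bge_embedding",   # hypothetical model endpoint
)

# Retrieve proprietary context to ground a generative AI prompt.
results = index.similarity_search(
    query_text="How do I rotate access tokens?",
    columns=["id", "body"],
    num_results=3,
)
```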

Unity Catalog also provides governance capabilities to support LakehouseIQ, which is described as a knowledge engine. It provides a natural language interface for users to discover, understand and query data in the Lakehouse Platform, including Unity Catalog as well as dashboards, notebooks, data pipelines and documents. LakehouseIQ enhances search by adding business context and supports automated query generation through the Assistant capability in Databricks’ SQL editor and notebooks. It is also used to debug jobs, data pipelines, and Spark and SQL queries.

In addition to building on Unity Catalog, Databricks used Data + AI Summit to announce enhancements to the product itself. These include Lakehouse Federation capabilities that better enable the Lakehouse Platform to be used as part of a data mesh environment. Lakehouse Federation includes query federation, enabling users to query data in data platforms outside the Databricks Lakehouse Platform, such as MySQL, PostgreSQL, Amazon Redshift, Snowflake, Azure SQL Database, Azure Synapse and Google BigQuery.
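The pattern Databricks documents for query federation is a two-step setup: define a connection to the external platform, then expose one of its databases as a foreign catalog governed by Unity Catalog. The sketch below wraps that SQL in spark.sql() from a notebook; the host, credential scope and object names are hypothetical placeholders.

```python
# Hedged sketch of Lakehouse Federation setup from a Databricks notebook,
# where `spark` is predefined. Host, database and credential values are
# hypothetical placeholders.

# 1. Define a connection to an external PostgreSQL instance.
spark.sql("""
    CREATE CONNECTION IF NOT EXISTS pg_conn TYPE postgresql
    OPTIONS (
      host 'pg.example.com',
      port '5432',
      user secret('federation', 'pg_user'),
      password secret('federation', 'pg_password')
    )
""")

# 2. Expose a database from that connection as a foreign catalog in Unity Catalog.
spark.sql("""
    CREATE FOREIGN CATALOG IF NOT EXISTS pg_sales
    USING CONNECTION pg_conn
    OPTIONS (database 'sales')
""")

# 3. Query the external data alongside lakehouse tables.
spark.sql("SELECT COUNT(*) FROM pg_sales.public.orders").show()
```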

Additionally, Unity Catalog users can now define data access and governance policies in Unity Catalog and apply those policies to data stored in external data warehouses. Databricks previously added a Hive Metastore interface to Unity Catalog, enabling it to connect to any software compatible with Apache Hive, including Amazon EMR, open-source Apache Spark, Amazon Athena, Presto and Trino. Databricks also recently announced version 3.0 of the open-source Delta Lake table storage project, including the Universal Format (UniForm), which provides compatibility with tools designed for alternative table formats by allowing data stored in Delta tables to be read as if it were Apache Iceberg or Apache Hudi.
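As announced, UniForm is enabled through a Delta table property. A minimal sketch follows, again wrapping the SQL in spark.sql() with hypothetical table and column names; depending on the Delta Lake release, additional compatibility properties may be required.

```python
# Hedged sketch of enabling UniForm on a Delta table (Delta Lake 3.0).
# Table and column names are hypothetical placeholders; some releases may
# require additional Iceberg-compatibility table properties.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.analytics.events (
      id BIGINT,
      payload STRING
    )
    USING DELTA
    TBLPROPERTIES (
      'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
# With UniForm enabled, Iceberg metadata is generated alongside the Delta
# transaction log, so Iceberg-compatible readers can consume the same data files.
```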

Multiple competing table formats had the potential to fragment the data lakehouse platform and tool ecosystem, and Databricks continues to develop differentiating capabilities. Its investment in UniForm indicates that the company’s focus is on higher-level business value differentiation via capabilities such as Unity Catalog rather than on lower-level technical differentiators like table formats. Either way, Databricks is now well-established as an analytic data platform provider, and I recommend that organizations include Databricks in evaluations of data platforms to support data engineering, data science, data governance, data warehousing and generative AI.

Regards,

Matt Aslett