Ventana Research Analyst Perspectives

Databricks Lakehouse Platform Streamlines Big Data Processing

Posted by David Menninger on Oct 26, 2021 3:00:00 AM

Databricks is a data engineering and analytics cloud platform built on Apache Spark. It processes and transforms huge volumes of data and offers data exploration capabilities through machine learning models, enabling data engineers, data scientists, analysts and other workers to process big data and unify analytics through a single interface. The platform supports streaming data, SQL queries, graph processing and machine learning, and offers a collaborative user interface, the workspace, where workers can create data pipelines in multiple languages, including Python, R, Scala and SQL, and train and prototype machine learning models.
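To make that concrete, here is a minimal sketch of the kind of pipeline a worker might build in a notebook, assuming PySpark with Delta Lake available; the paths, table and column names are hypothetical and purely illustrative, not a prescribed Databricks workflow.

```python
# Minimal, illustrative notebook pipeline; paths, table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # provided automatically in a Databricks notebook

# Read raw JSON events from a landing zone in cloud storage
raw_events = spark.read.json("/mnt/landing/events/")

# Light transformation: derive a date column and aggregate
daily_counts = (
    raw_events
    .withColumn("event_date", F.to_date("event_timestamp"))
    .groupBy("event_date", "event_type")
    .count()
)

# Persist the result as a Delta table for downstream SQL and BI workloads
daily_counts.write.format("delta").mode("overwrite").saveAsTable("analytics.daily_event_counts")
```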

Databricks recently closed its series H funding of $1.6 billion, reaching a post-money valuation of $38 billion. With this round of funding, Databricks has raised a total of nearly $3.6 billion. The company intends to use the funds to enter new markets and grow its partner ecosystem.

The Databricks Lakehouse Platform is the company's flagship product, combining aspects of data warehouse and data lake systems in a unified platform. Business workers can store both structured and unstructured data in the platform and use it for analytics workloads and data science. The Lakehouse also includes capabilities such as schema enforcement, auditing, versioning and access controls. Databricks Lakehouse is an example of the emerging data platforms we have written about previously.
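As a brief sketch of two of those capabilities, schema enforcement and versioning, as they are exposed through Delta Lake; the table path and columns here are hypothetical.

```python
# Illustrative only; the path and schema are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Schema enforcement: an append whose columns do not match the target Delta
# table's schema is rejected rather than silently written.
new_rows = spark.createDataFrame([(42, "2021-10-01")], ["customer_id", "signup_date"])
new_rows.write.format("delta").mode("append").save("/mnt/lakehouse/customers")

# Versioning: time travel back to an earlier snapshot of the same table
previous = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/mnt/lakehouse/customers")
)
previous.show()
```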

Organizations are collecting large amounts of data from many different sources. Storing this big data, which can arrive in any form (images, audio files and other unstructured data), is challenging and requires a different architectural approach. We assert that by 2025, three-quarters of organizations will require unstructured data capabilities in their data lakes to maximize the value of audio, video and image data. Databricks enables workers to query data lakes with SQL, build data sets to generate machine learning models, and create automated extract, transform and load (ETL) pipelines and visual dashboards.
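As an illustration of working with unstructured data in a data lake, the sketch below loads image files with Spark's binary file reader and exposes them to SQL; the data lake path is hypothetical.

```python
# Illustrative only; the data lake path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load image files from the data lake using Spark's binary file reader
images = (
    spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.png")
    .load("/mnt/datalake/raw/images/")
)

# Register the files as a view so they can be queried with SQL alongside structured data
images.createOrReplaceTempView("raw_images")
spark.sql("""
    SELECT path, length, modificationTime
    FROM raw_images
    ORDER BY length DESC
    LIMIT 10
""").show(truncate=False)
```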

Databricks also introduced Delta Sharing earlier this year, included within the open-source Delta Lake 1.0 project, which establishes a common standard for sharing all data types – structured and unstructured – through an open protocol that can be used from SQL, visual analytics tools and programming languages such as Python and R. Large-scale datasets can be shared in the Apache Parquet and Delta Lake formats in real time without copying them.

Organizations use a multitude of systems, which introduces complexity and, more importantly, delay, as workers invariably need to move or copy data between them. Teams must grapple with data silos that prevent a single source of truth, the expense of maintaining complicated data pipelines and reduced decision-making speed. Using a unified platform such as Databricks allows traditional analytics, data science and machine learning to coexist in the same system. With Delta Sharing, workers can connect to shared data through pandas, Tableau or other systems that implement the open protocol, without having to deploy a specific platform first. This can reduce access time and effort for data providers.
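A minimal sketch of how a data consumer might reach shared data from pandas using the open-source delta-sharing Python client follows; the profile file path and the share, schema and table names are hypothetical.

```python
# Illustrative only; the profile file and share/schema/table names are hypothetical.
import delta_sharing  # pip install delta-sharing

profile = "/dbfs/FileStore/open-datasets.share"  # credentials file issued by the data provider

# Discover what the provider has shared
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load a shared table straight into pandas, without copying it into another platform first
df = delta_sharing.load_as_pandas(f"{profile}#retail.sales.daily_orders")
print(df.head())
```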

Databricks continues to expand its portfolio of big data software around the Databricks Lakehouse Platform, adding capabilities and integrations to tap the broader market. Databricks can connect to a variety of popular cloud storage offerings, including Amazon S3, Azure Data Lake Storage and Google Cloud Storage. It also offers several built-in tools to support data science, business intelligence reporting and machine learning operations. I recommend that organizations looking to consolidate data and analytics operations into a single platform consider the capabilities of the Databricks platform.
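For illustration, the snippet below reads from each of those storage services using their standard URI schemes; the bucket, container and account names are hypothetical, and credentials are assumed to be configured separately (for example through an instance profile, service principal or credential passthrough).

```python
# Illustrative only; bucket, container and account names are hypothetical, and
# credentials are assumed to be configured outside the code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("s3a://example-bucket/warehouse/orders/")                   # Amazon S3
events = spark.read.json("abfss://landing@exampleacct.dfs.core.windows.net/events/")    # Azure Data Lake Storage
logs = spark.read.text("gs://example-bucket/logs/")                                     # Google Cloud Storage
```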

Regards,

David Menninger

Topics: embedded analytics, Analytics, Business Intelligence, Collaboration, Data Governance, Information Management, Data, data lakes, AI and Machine Learning


Written by David Menninger

David is responsible for the overall research direction of data, information and analytics technologies at Ventana Research, covering major areas including Analytics, Big Data, Business Intelligence and Information Management, along with additional research categories including Information Applications, IT Performance Management, Location Intelligence, Operational Intelligence and IoT, and Data Science. David is also responsible for examining the role of cloud computing, collaboration and mobile technologies as they affect these areas. David brings to Ventana Research over 25 years of experience, through which he has brought to market some of the leading-edge technologies for helping organizations analyze data to support a range of action-taking and decision-making processes. Prior to joining Ventana Research, David was Head of Business Development & Strategy at Pivotal, a division of EMC, and held VP of Marketing and Product Management roles at Vertica Systems, Oracle, Applix, InforSense and IRI Software. David earned his MS in Business from Bentley University and a BS in Economics from the University of Pennsylvania.