IBM Provides Architectural Choice for Data Platforms

Written by Matt Aslett | Aug 27, 2024 10:00:00 AM

Enterprises face a bewildering level of choice in relation to data platforms, as evidenced by the number of software providers and products assessed in our recent Data Platforms Buyers Guide. There are not only numerous data platform providers and products to choose from, but also a diverse array of functional and architectural options. Is the workload primarily operational or analytic? Will it be deployed on-premises or in the cloud? Should it be distributed or centralized? Data warehouse or data lakehouse? Data fabric or data grid? These are often presented as binary choices, but they very rarely are. Many enterprises are looking for data platforms that offer the flexibility to address a combination of functional and architectural options. Multi-product providers, such as IBM, are arguably at an advantage if they can clearly articulate how and when their different options work together. At its recent Think 2024 customer event, IBM announced several enhancements to its watsonx.data lakehouse offering and articulated how it can coexist with IBM Cloud Pak for Data as part of a data fabric architecture.

IBM unveiled its watsonx brand at Think 2023, delivering an artificial intelligence (AI) and data platform designed to address the AI development life cycle, data storage and processing, and AI governance. The watsonx platform consists of three components. watsonx.ai is a development environment running on the Red Hat OpenShift Container Platform for data scientists to train, validate, tune and deploy generative AI (GenAI) and machine learning (ML) models. watsonx.data is a data lakehouse environment that enables users to store data in object storage and process it using a variety of query engines. Our research indicates that this approach is quickly becoming mainstream, with more than one-half of enterprises using object stores for analytics in production. watsonx.governance provides an environment to collaboratively manage, catalog and monitor AI models in the context of ethical concerns and regulatory requirements. Together, these provide a comprehensive platform designed to support an enterprise’s strategic adoption of ML and foundation models, either standalone or in combination with IBM’s AI consulting services. IBM made numerous announcements at Think 2024 related to watsonx, including the preview of the watsonx BI Assistant natural language interface. Since it has become clear that data processing and management are essential to delivering trust in data for AI, I will focus the remainder of this perspective on recent enhancements at the data layer.

The concept of the data lake emerged over a decade ago in response to the demand for analytic data platforms that could economically store and process large volumes of raw data, either in the cloud or on premises. The concept has rapidly evolved into the data lakehouse in recent years, with the addition of query processing, data cataloging, transaction processing guarantees, and table format functionality normally associated with data warehouse environments. The data lakehouse is now accepted as a standard analytic data platform architecture, especially for storing and processing the large volumes of data required for strategic AI initiatives. I assert that by 2026, 9 in 10 current data lake adopters will be investing in data lakehouse architecture to improve the business value generated from the accumulated data.

IBM watsonx.data is available on IBM Cloud, Amazon Web Services and Microsoft Azure, as well as on premises, and enables users to store large volumes of data in object storage using the Apache Iceberg table format for transactional consistency. The data can be processed using a choice of query engines, including Apache Spark and Presto, as well as IBM’s own Db2 and Netezza. At Think 2024, IBM announced that it had updated watsonx.data with support for Presto 2.0, including Presto C++, co-developed by IBM employees as part of the open-source Linux Foundation project, to run Presto with the Velox open-source C++ acceleration library for improved performance. The company also announced the addition of IBM Data Gate for watsonx to enable access to data in IBM zSystems environments, facilitating the development of AI models using transactional mainframe data. In addition, IBM announced the integration of a semantic layer into IBM Knowledge Catalog, embeddable within watsonx.data, to accelerate data discovery and enrichment via semantic search capabilities. Previously, IBM had announced the addition of vector database capabilities based on the open-source Milvus database, enabling the use of vector search to augment GenAI with context from enterprise content and data via retrieval-augmented generation.

As I previously noted, while some of the functionality delivered in watsonx is new, some is also available via other IBM products, such as IBM Cloud Pak for Data. It has subsequently become clearer how these products relate to each other, with the IBM watsonx.data license including entitlements to IBM Cloud Pak for Data platform software, as well as other prerequisites such as the Red Hat OpenShift Container Platform. IBM recently announced Cloud Pak for Data 5.0, including a new Immersive Experience feature that enables administrators to easily toggle between the dedicated user experiences for Cloud Pak for Data and watsonx. This facilitates the use of watsonx.data as a data lakehouse within a larger data fabric architecture supported by Cloud Pak for Data and its combination of data integration, data governance, data observability, master data management and data lineage functionality. The Immersive Experience feature also provides access to IBM Data Product Hub, which was unveiled at Think 2024, to facilitate the development and sharing of data products. Included as part of IBM Cloud Pak for Data 5.0 and integrated with IBM watsonx.data, IBM Data Product Hub is built on IBM Knowledge Catalog to enable and control data access based on metadata and governance rules, with additional functionality for defining and enforcing data contracts between data producers and consumers. As such, IBM Data Product Hub has the potential to support the holistic view of data production and consumption that we see as critical to data intelligence.

IBM has also boosted its data fabric capabilities with the recently closed acquisition of the StreamSets real-time data integration capabilities from Software AG, along with the webMethods Integration Platform as a Service. These investments complement IBM’s previous acquisitions of Databand.ai for data observability in 2022 and Manta for data lineage in 2023, as well as the company’s ongoing internal development of data processing and management capabilities. The breadth of functionality available from IBM has the potential to be overwhelming for would-be customers, but the product positioning is becoming clearer with watsonx.data and Cloud Pak for Data as the delivery vehicles for data lakehouse and data fabric environments, respectively, and Cloud Pak for Data’s Immersive Experience feature facilitating co-existence. I recommend that any enterprises considering their data platform options include IBM in their evaluations. The company was rated Exemplary in our recent Data Platforms Buyers Guides for Operational Data Platforms, Analytic Data Platforms and overall Data Platforms.

Regards,

Matt Aslett
