Explore Wide-Column Stores for Database Flexibility

Posted by Matt Aslett, Feb 7, 2024 3:00:00 AM, 6 minutes to read

I have previously written about the functional evolution and emerging use cases for NoSQL databases, a category of non-relational databases that first emerged 15 or so years ago and are now well established as potential alternatives to relational databases. NoSQL is a term used to describe a variety of databases that fall into four primary functional categories: key-value stores, wide-column stores, document-oriented databases and graph databases. Each is worthy of further exploration, which is why I have examined them over a series of Analyst Perspectives. Following a closer look at graph databases, document-oriented databases and key-value stores, I now conclude this series of Perspectives with wide-column stores.

The development and adoption of wide-column stores was initially driven by digital-native enterprises in areas such as social media, the internet and gaming. Perhaps the best known, Apache Cassandra, was created at Facebook (now Meta), and early adopters included Reddit, Twitter (now X) and Rackspace. Other prominent open-source wide-column stores include Apache HBase and ScyllaDB, while commercial products include DataStax Enterprise and Astra DB from DataStax (both based on Apache Cassandra), ScyllaDB Enterprise and ScyllaDB Cloud from Scylla, and Google Cloud Bigtable.

Managed cloud services are also available from Amazon Web Services, Microsoft Azure and Aiven, amongst others, and are typically based on, or compatible with, Apache Cassandra. Tech companies continue to dominate adoption of wide-column stores, alongside organizations in industries such as retail and communications. Known Apache Cassandra users include Apple, Best Buy, eBay, Home Depot, Hulu, Macy’s, Netflix, Spotify, Target, Uber and Walmart, while users of ScyllaDB include Comcast, Discord, Epic Games, Expedia, GE Digital, Grab, Rakuten and Strava.

Wide-column stores should not be confused with columnar relational databases. Although they have similar names, they have very different data models and use cases. Columnar databases are relational databases in which data is stored as columns rather than rows to improve query performance in analytic data processing use cases. Wide-column stores are non-relational and are typically used in operational use cases. The two have different approaches to storing data as columns: While each column in a columnar database is stored separately to disk, groups of related columns (called “column families”) can be stored to disk together in a wide-column store. More significantly, the two have different data models: Columnar databases are a type of relational database, while the wide-column data model is non-relational and provides a flexible approach that does not have the strict schema requirements associated with the relational model.
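To make the storage distinction concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not how any particular product lays out data on disk; the sample records, the field names and the "profile" and "activity" family names are all hypothetical.

```python
# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": "u1", "name": "Ada", "email": "ada@example.com", "last_login": "2024-02-01"},
    {"id": "u2", "name": "Bob", "email": "bob@example.com", "last_login": "2024-02-03"},
]


def columnar_layout(rows):
    """Columnar style: one storage unit per column, holding that column's values."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


# Hypothetical grouping of related columns into column families.
FAMILIES = {"profile": ["name", "email"], "activity": ["last_login"]}


def column_family_layout(rows, families, key="id"):
    """Wide-column style: one storage unit per family, keyed by row key."""
    layout = {family: {} for family in families}
    for row in rows:
        for family, fields in families.items():
            layout[family][row[key]] = {f: row[f] for f in fields if f in row}
    return layout


columnar = columnar_layout(records)
by_family = column_family_layout(records, FAMILIES)
```

In the columnar layout, scanning a single column across every row touches one unit, which suits analytic queries. In the column-family layout, fetching the related columns for one row key touches one unit, which suits operational lookups.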

I previously explained that the key-value model that underpins key-value stores also forms the basis of wide-column stores. Key-value databases store data as simple pairs of keys and associated values, but wide-column stores extend the key-value model by enabling multiple values to be associated with an individual key. Each additional value is added as a new column, which results in a combination of rows and columns similar to a table in a relational database.
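The extension from key-value pairs to rows of columns can be sketched with a nested mapping. The `WideColumnTable` class below is a hypothetical toy written for this illustration, not the API of any wide-column product.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy model: each row key maps to its own set of column/value pairs."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Each additional value for a key is stored as a new column in its row.
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        # Return a copy of the whole row, or a single column's value.
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)


table = WideColumnTable()
table.put("user:42", "name", "Ada")
table.put("user:42", "email", "ada@example.com")
```

A plain key-value store would hold a single value for "user:42". Here the same key addresses a row that grows by one column with each additional value.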

Unlike relational databases, however, wide-column stores do not require a strictly defined schema for all rows and columns in a table. Adding a new column to a row in a relational database requires all other rows in the table to have values in the same column, necessitating the use of null or default values if no data exists. Storing and indexing null values can carry performance and complexity costs. Wide-column stores avoid these costs, as there is no requirement for each row in a wide-column store to have the same set of columns.

Another important characteristic of wide-column stores is that the storage of data can be distributed across multiple database nodes. Like Distributed SQL databases, wide-column stores can therefore be used to provide scalability, resiliency and availability by replicating data across multiple servers in a single data center, multiple servers across multiple data centers or even multiple servers across multiple cloud providers in multiple geographic regions. I assert that by 2027, more than one-third of enterprises will adopt data platforms that span distributed architecture, supporting applications that require data processing across geographic and availability zones.

Unlike Distributed SQL databases, which by default provide strong data consistency across a distributed architecture, wide-column stores can be configured to deliver strong or relaxed (eventual) consistency, depending on the requirements of the associated application. It should be noted, however, that many wide-column stores do not currently fully support atomic, consistent, isolated and durable transactions, although support has been slated for inclusion in the forthcoming version 5.0 of Apache Cassandra.
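One common way to reason about configurable consistency in replicated stores is the quorum overlap rule: if the number of replicas that must acknowledge a write plus the number consulted on a read exceeds the number of replicas, every read overlaps at least one replica holding the latest acknowledged write. The Python sketch below illustrates that arithmetic only. The function names are hypothetical, and the rule is a general heuristic rather than a description of any specific product's guarantees.

```python
def quorum(replication_factor):
    """Smallest majority of replicas, e.g. 2 of 3 or 3 of 5."""
    return replication_factor // 2 + 1


def reads_overlap_writes(replication_factor, write_acks, read_acks):
    """True when any read set must share at least one replica with any write set."""
    return write_acks + read_acks > replication_factor


# Hypothetical cluster with three replicas per row.
RF = 3
strong = reads_overlap_writes(RF, quorum(RF), quorum(RF))  # quorum writes, quorum reads
relaxed = reads_overlap_writes(RF, 1, 1)                   # single-replica writes and reads
```

With three replicas, quorum writes and quorum reads (2 + 2 > 3) overlap, which corresponds to the strong setting. Single-replica writes and reads (1 + 1 ≤ 3) do not, so a read may briefly return stale data until replication catches up, which is the trade-off an application accepts when it opts for eventual consistency.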

The flexible data model makes wide-column stores well suited to write-intensive, sparse and diverse datasets, while the distributed architecture is well aligned to the needs of storing and processing very large datasets, particularly those with high-performance and localized data sovereignty requirements. Primary use cases for wide-column stores include the storage and processing of application and infrastructure log data, sensor data from internet of things devices, time-series data and user preferences data. The latter can be used to drive personalization and recommendations as well as fraud detection and authentication. Wide-column stores are also well suited to intelligent operational applications driven by artificial intelligence and machine learning models that depend on the processing of large and diverse datasets. Wide-column stores are not suitable for all use cases, but I recommend that all enterprises considering options for databases evaluate the most appropriate data model to fulfill the task at hand and consider the potential suitability of wide-column stores where appropriate.

Regards,

Matt Aslett

Data, data operations

Authors:

Matt Aslett

Director of Research, Analytics and Data

Matt Aslett leads the software research and advisory for Analytics and Data at Ventana Research, now part of ISG, covering software that improves the utilization and value of information. His focus areas of expertise and market coverage include analytics, data intelligence, data operations, data platforms, and streaming and events.