Orchestrating Data Pipelines Facilitates Data-Driven Analytics

Written by Matt Aslett | Oct 25, 2022 10:00:00 AM

I have written a few times in recent months about vendors offering functionality that addresses data orchestration. This is a concept that has been growing in popularity in the past five years amid the rise of Data Operations (DataOps), which describes more agile approaches to data integration and data management. In a nutshell, data orchestration is the process of combining data from multiple operational data sources and preparing and transforming it for analysis. To those unfamiliar with the term, this may sound very much like the tasks that data management practitioners have been undertaking for decades. As such, it is fair to ask what separates data orchestration from traditional approaches to data management. Is it really something new that can deliver innovation and business value, or just the rebranding of existing practices designed to drive demand for products and services?

Key to understanding why data orchestration is different, and necessary, is viewing data management challenges through the lens of modern data-processing requirements and challenges. As I have noted, data-driven organizations stand to gain competitive advantage, responding faster to worker and customer demands for more innovative, data-rich applications and personalized experiences. Being data-driven requires a combination of people, processes, information and technology improvements involving data culture, data literacy, data democracy, and data curiosity. Encouraging employees to discover and experiment with data is a key aspect of being data-driven that requires new, agile approaches to data management. Meanwhile, the increasing reliance on real-time data processing is driving requirements for more agile, continuous data processing. Additionally, the rapid adoption of cloud computing has fragmented where data is accessed or consolidated, with data increasingly spread across multiple data centers and cloud providers.

Traditional approaches to data management are rooted in point-to-point batch data processing, whereby data is extracted from its source, transformed for a specific purpose, and loaded into a target environment for analysis. These approaches are unsuitable for the demands of modern analytics environments, which instead require agile data pipelines that can traverse multiple data-processing locations and can evolve in response to changing data sources and business requirements. I assert that by 2024, six in 10 organizations will adopt data-engineering processes that span data integration, transformation and preparation, producing repeatable data pipelines that create more agile information architectures. Given the increasing complexity of evolving data sources and requirements, there is a need to enable the flow of data across the organization through new approaches to the creation, scheduling, automation, and monitoring of workflows. This is the realm of data orchestration.

In the context of DataOps, which provides an overall approach to automate data monitoring and the continuous delivery of data into operational and analytical processes, data orchestration provides the capabilities to automate and accelerate the flow of data from multiple sources to support analytics initiatives and drive business value. At the highest level of abstraction, data orchestration covers three key capabilities: collection (including data ingestion, preparation and cleansing); transformation (additionally including integration and enrichment); and activation (making the results available to compute engines, analytics and data science tools, or operational applications). These capabilities will be familiar to existing data practitioners. Specific tasks related to these capabilities have traditionally been addressed with a variety of tools as well as manual effort and expertise. In comparison, data orchestration tools are designed to automate and coordinate the sequential or parallel execution of a complete set of tasks via data pipelines based on directed acyclic graphs (DAGs) that represent the relationships and dependencies between the tasks.
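
To make the DAG idea concrete, here is a minimal sketch in Python that uses only the standard library's graphlib module. It is not how any particular orchestration product works. The task names, the sample records, and the run_pipeline helper are all hypothetical, chosen only to mirror the collection, transformation, and activation capabilities described above.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each receives the results of earlier tasks and returns its own.
def collect_orders(results):
    # Collection: ingest order records (illustrative sample data).
    return [{"customer": "a", "amount": 10}, {"customer": "b", "amount": 5}]

def collect_customers(results):
    # Collection: ingest a customer lookup (illustrative sample data).
    return {"a": "Alice", "b": "Bob"}

def transform_enrich(results):
    # Transformation: enrich each order with the customer's name.
    names = results["collect_customers"]
    return [{**row, "name": names[row["customer"]]} for row in results["collect_orders"]]

def activate_report(results):
    # Activation: expose the result in a shape a downstream tool could consume.
    return {row["name"]: row["amount"] for row in results["transform_enrich"]}

TASKS = {
    "collect_orders": collect_orders,
    "collect_customers": collect_customers,
    "transform_enrich": transform_enrich,
    "activate_report": activate_report,
}

# The DAG: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "collect_orders": set(),
    "collect_customers": set(),
    "transform_enrich": {"collect_orders", "collect_customers"},
    "activate_report": {"transform_enrich"},
}

def run_pipeline(tasks, dependencies):
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    results, stages = {}, []
    while sorter.is_active():
        # Every task returned here has all of its dependencies completed,
        # so the tasks within one stage could safely run in parallel.
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        for name in ready:  # executed sequentially in this sketch
            results[name] = tasks[name](results)
            sorter.done(name)
    return stages, results

stages, results = run_pipeline(TASKS, DEPENDENCIES)
print(stages)
print(results["activate_report"])
```

The two collection tasks have no dependencies, so they land in the same stage and could run in parallel, while the transformation and activation tasks wait for their upstream results. Production orchestration tools layer scheduling, retries, and monitoring on top of this same dependency-driven ordering.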

As is often the case with new approaches to data and analytics, the requirements for data orchestration were first experienced by digital-native brands at the forefront of data-driven business strategies. One of the most prominent data orchestration tools, Apache Airflow, began as an internal development project within Airbnb, becoming an Apache Software Foundation project in 2016; workflow automation platform Flyte was originally created and subsequently open-sourced by Lyft; and Metaflow was developed and open-sourced by Netflix. Data orchestration is not just for digital natives, however, and a variety of vendors have sprung up with offerings based around these open-source projects, as well as other development initiatives, to bring the benefits of data orchestration to the masses. In addition to stand-alone data orchestration software products and cloud services, data orchestration capabilities are also being built into larger data-engineering platforms addressing broader data management requirements, including data observability, often in the context of data fabric and data mesh. Whether stand-alone or embedded in larger data-engineering platforms, data orchestration has the potential to drive improved efficiency and agility in data and analytics projects. Participants in Ventana Research’s Analytics and Data Benchmark Research cite preparing data for analysis and reviewing data for quality and consistency issues as the two most time-consuming tasks in analyzing data.

Adoption of data orchestration is still in the early stages and is closely linked to larger data transformation efforts that introduce greater agility and flexibility. If an organization’s data processes and skills remain rooted in traditional products and manual intervention, then data orchestration is not likely to be a quick fix. However, alongside the cultural and organizational changes involved in people, processes, and information improvements, data orchestration has the potential to play a key role in the technological improvement involved in becoming more data-driven. As such, I recommend that all organizations investigate the potential advantages of data orchestration with a view to improving their use of data and analytics.

Regards,

Matt Aslett
