The Evolving Data Infrastructure Landscape: A Year in Review

The data infrastructure industry has experienced phenomenal growth over the past year, defying any notions of slowing down. Every key metric reached record highs, and entirely new product categories appeared faster than most data teams could keep up with them. The industry even witnessed a resurgence of the “benchmark wars” and “billboard battles” of yesteryear.

To help data teams navigate this ever-changing landscape, a comprehensive analysis was conducted. It identifies the current best-in-class stacks across both analytic and operational systems, drawing on interviews with numerous data practitioners. Each architectural blueprint highlights the key changes from the previous version, along with explanations for the observed trends.

One of the most intriguing observations was the persistence of core data processing systems. While supporting tools and applications multiplied rapidly, the foundational systems remained remarkably stable. This stability extends to the debate surrounding architectural patterns. The question of convergence between analytic and operational ecosystems, a hot topic last year, appears to have been settled – for now – with both segments thriving independently. Cloud data warehouses like Snowflake continue their meteoric rise, catering primarily to SQL users and business intelligence needs. Meanwhile, data lakehouses like Databricks are experiencing explosive customer adoption, demonstrating the continued embrace of heterogeneous data stacks by many data teams.

Similar resilience is evident in core data systems for ingestion and transformation. The modern business intelligence pattern exemplifies this, with the combination of Fivetran and dbt (or similar technologies) becoming near-ubiquitous. Operational systems also exhibit this trend, with de facto standards like Databricks/Spark, Confluent/Kafka, and Astronomer/Airflow solidifying their positions.
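The ingestion-plus-transformation core of this pattern can be sketched in miniature. The toy below uses Python's sqlite3 as a stand-in for a cloud warehouse; the table, column, and "model" names are hypothetical, and the point is only to illustrate the ELT flow — raw data is loaded first, then transformed in place by dbt-style SQL.

```python
import sqlite3

# Toy ELT sketch: sqlite3 stands in for a cloud warehouse (e.g. Snowflake).
# All table/column names are hypothetical. A replication tool (Fivetran-style)
# lands raw data; a dbt-style SQL model then transforms it inside the warehouse.
conn = sqlite3.connect(":memory:")

# "Extract + Load": raw source rows land in the warehouse untransformed.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "complete"), (2, 400, "refunded"), (3, 999, "complete")],
)

# "Transform": a dbt-style model expressed as SQL run where the data lives.
conn.execute("""
    CREATE VIEW orders AS
    SELECT id, amount_cents / 100.0 AS amount_usd
    FROM raw_orders
    WHERE status = 'complete'
""")

rows = conn.execute("SELECT id, amount_usd FROM orders ORDER BY id").fetchall()
print(rows)  # [(1, 12.5), (3, 9.99)]
```

The design point is that transformation happens after loading, expressed as SQL against the warehouse itself, which is what makes the Fivetran-plus-dbt pairing so composable.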

However, the stable core is surrounded by a rapidly evolving data stack periphery. This past year witnessed a surge in activity within two primary areas:

  1. New Tools for Key Data Processes and Workflows: These tools address various data workflows, including data discovery, observability, and even ML model auditing. Examples include solutions for data lineage tracking, data quality monitoring, and bias detection in machine learning models.
  2. New Applications for Value Generation: Emerging applications empower data teams and business users to unlock greater value from their data. This includes data workspaces for collaborative data exploration, tools for “reverse ETL” (updating operational systems with insights from the data warehouse), and frameworks for building and deploying ML applications.
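The "reverse ETL" idea in item 2 can be sketched as follows: a sync computes an aggregate inside the warehouse and writes it back into an operational system. In this hedged toy, sqlite3 stands in for the warehouse and a plain dict stands in for a CRM; the table, column, and field names are all invented for illustration.

```python
import sqlite3

# Toy reverse-ETL sketch. sqlite3 stands in for the warehouse; the CRM is a
# plain dict stub. Names (fct_orders, lifetime_value_usd) are hypothetical.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fct_orders (account_id TEXT, amount_usd REAL)")
warehouse.executemany(
    "INSERT INTO fct_orders VALUES (?, ?)",
    [("acme", 120.0), ("acme", 80.0), ("globex", 40.0)],
)

# Stand-in for a CRM record store keyed by account id.
crm = {"acme": {}, "globex": {}}

def sync_lifetime_value(conn, crm_store):
    """Compute a per-account aggregate in the warehouse and write it back
    into the operational system -- the core of the reverse-ETL pattern."""
    query = "SELECT account_id, SUM(amount_usd) FROM fct_orders GROUP BY account_id"
    for account_id, total in conn.execute(query):
        crm_store[account_id]["lifetime_value_usd"] = total

sync_lifetime_value(warehouse, crm)
print(crm)  # {'acme': {'lifetime_value_usd': 200.0}, 'globex': {'lifetime_value_usd': 40.0}}
```

A production tool would handle authentication, field mapping, and incremental syncs, but the direction of data flow — warehouse out to operational systems — is the defining feature.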

Furthermore, the past year saw the introduction of innovative technologies aimed at enhancing core data processing systems. Notable areas of advancement include the metrics layer within the analytical ecosystem and the lakehouse pattern for operational systems. Both these areas are converging towards standardized definitions and architectures.
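To make the metrics-layer idea concrete, here is a minimal sketch: metrics are declared once in a central registry and compiled to SQL on demand, so every downstream consumer shares one definition. The registry format, metric names, and schema below are hypothetical and do not follow any vendor's specification.

```python
import sqlite3

# Toy metrics-layer sketch: metrics are defined once, centrally, and compiled
# to SQL on demand so dashboards and notebooks agree on one definition.
# The schema and metric names are hypothetical.
METRICS = {
    "revenue": {"expr": "SUM(amount_usd)", "table": "fct_orders"},
    "order_count": {"expr": "COUNT(*)", "table": "fct_orders"},
}

def compile_metric(name, group_by=None):
    """Translate a named metric (plus an optional dimension) into SQL."""
    m = METRICS[name]
    if group_by:
        return (f"SELECT {group_by}, {m['expr']} AS {name} "
                f"FROM {m['table']} GROUP BY {group_by} ORDER BY {group_by}")
    return f"SELECT {m['expr']} AS {name} FROM {m['table']}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fct_orders (region TEXT, amount_usd REAL)")
conn.executemany("INSERT INTO fct_orders VALUES (?, ?)",
                 [("us", 100.0), ("us", 50.0), ("eu", 30.0)])

total = conn.execute(compile_metric("revenue")).fetchone()[0]
by_region = conn.execute(compile_metric("revenue", group_by="region")).fetchall()
print(total)      # 180.0
print(by_region)  # [('eu', 30.0), ('us', 150.0)]
```

The open questions in the metrics-layer debate — where the registry lives, who owns it, and what the specification looks like — are precisely about standardizing a structure like `METRICS` above.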

With this expanded context, let’s delve deeper into the major data infrastructure blueprints, examining the updated diagrams and analyzing the key changes within each:

  • Modern Business Intelligence Pattern: This pattern continues to center around the combination of data replication tools (like Fivetran), cloud data warehouses (like Snowflake), and SQL-based data modeling tools (like dbt). These technologies have all witnessed significant adoption growth, driving funding and early-stage competition (e.g., Airbyte and Firebolt). Traditional dashboards remain the most common output layer application, with established players like Looker, Tableau, and Power BI joined by newer entrants like Superset. However, a surge of interest in the metrics layer has emerged, focusing on providing a standard set of definitions on top of the data warehouse. This area is still under debate, encompassing discussions about capabilities, ownership, and specification standards. Several promising pure-play products (like Transform and Supergrain) have materialized, alongside expansion efforts from existing players like dbt. Additionally, reverse ETL vendors like Hightouch and Census have gained significant traction, enabling teams to leverage insights from the data warehouse to update operational systems (CRM, ERP). Finally, data teams are displaying a growing interest in new applications that augment standard dashboards, particularly data workspaces like Hex. This trend likely stems from the increasing standardization in cloud data warehouses, as readily accessible and well-structured data naturally prompts a desire for more advanced data manipulation capabilities.
  • Data Engineering & Advanced Analytics Pattern: This pattern prioritizes core data processing systems (Databricks, Starburst, Dremio), data transport and orchestration solutions (Confluent, Airflow), and storage options (AWS) for robust data handling. These core systems continue to witness rapid growth and form the backbone of this architectural model. The essence of this pattern lies in its multi-modal nature, allowing companies to select the systems best suited to their specific analytical and operational data application needs. Clarity and recognition surrounding the lakehouse architecture have significantly improved over the past year. This approach is supported by a wide range of vendors, including major cloud providers.
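As a rough illustration of the lakehouse idea — open data files on inexpensive storage plus a transaction log that defines table snapshots — here is a toy sketch. The file layout is invented for illustration only and does not follow the actual Delta Lake or Apache Iceberg formats.

```python
import json
import os
import tempfile

# Toy lakehouse sketch: open data files on cheap storage plus a small
# transaction log that determines which files make up the current table
# snapshot. This layout is invented; real formats (Delta Lake, Iceberg)
# are far more sophisticated.
root = tempfile.mkdtemp()
log_path = os.path.join(root, "_txn_log.json")

def _read_log():
    """Return the list of committed data files (empty before any commit)."""
    if not os.path.exists(log_path):
        return []
    with open(log_path) as f:
        return json.load(f)

def commit(rows):
    """Write a new data file, then record it in the log; the log, not the
    directory listing, defines what belongs to the table."""
    log = _read_log()
    name = f"part-{len(log)}.json"
    with open(os.path.join(root, name), "w") as f:
        json.dump(rows, f)
    with open(log_path, "w") as f:
        json.dump(log + [name], f)

def snapshot():
    """Read the table as of the latest commit: only logged files are visible."""
    rows = []
    for name in _read_log():
        with open(os.path.join(root, name)) as f:
            rows.extend(json.load(f))
    return rows

commit([{"id": 1}, {"id": 2}])
commit([{"id": 3}])
print(snapshot())  # [{'id': 1}, {'id': 2}, {'id': 3}]
```

Because the table is just files plus a log, SQL engines, Spark jobs, and ML pipelines can all read the same storage — the multi-modal property this pattern emphasizes.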