What Is Data Vault?

Data vault is a data modeling technique for enterprise data warehousing, designed to handle large-scale, complex data from disparate sources in a way that provides flexibility, scalability and adaptability. The concept was first introduced by Dan Linstedt in the early 2000s to overcome the challenges of traditional data warehousing models like third normal form (3NF) and star schema.

The 3NF approach splits data into multiple tables, with each table representing a unique entity. Relationships between these tables are maintained using foreign keys that reference the table holding the authoritative copy of that data. While 3NF is good for maintaining data integrity and eliminating redundancy, the approach leads to an explosion of tables. Its design creates a complex schema where even simple queries require multiple joins to retrieve data, and each join adds computational cost for the database engine. For an enterprise with a large, ever-growing dataset, this structure is poorly suited to running many complex analytical queries, because performance degrades as the number of joins grows.

Another technique is star schema, a multidimensional data modeling approach in which data is structured into a central fact table connected to multiple dimension tables. The central fact table holds transactional or measurable data, such as revenue or sales. Dimension tables, on the other hand, store descriptive attributes, such as customer details and product categories. This technique denormalizes each dimension by storing all of its related data in one table, which improves query performance by reducing the number of joins needed to retrieve data. However, this approach also stores redundant data across multiple rows in a dimension table, which increases storage requirements and the risk of update anomalies. This redundancy can also cause data integrity issues: if one record is updated and others are not, different reports or queries might return different values for the same entity. Another challenge with star schema is its rigid structure. Any change in business requirements, such as defining new relationships or dimensions, requires remodeling and reloading the data.

Data vault addresses these challenges with a hybrid approach that combines the strengths of 3NF and star schema. It structures data by breaking it into three components: hubs, links and satellites.

  • Hubs represent core business entities and store only unique business keys, such as customer IDs, product SKUs or order numbers. They act as a central reference point for storing entities’ business keys without descriptive information, such as the customer’s name or product details, to avoid redundancy and ensure data consistency. This makes hubs a reliable point of reference for linking data. New business keys can also be added easily without impacting the existing model, which improves scalability.
  • Links are responsible for establishing relationships between different business entities, and they act as a bridge between multiple hubs. Each link contains foreign keys from the connected hubs. Since links are independent of hubs, new relationships can be added without restructuring the entire schema.
  • Satellites contain all the descriptive attributes and historical changes related to either a hub or a link. They capture the details of entities, such as names, addresses, product descriptions and transaction amounts. Satellites record every change as a new row instead of updating existing records, which helps businesses track how data has evolved over time. Multiple satellites can be attached to the same hub or link to integrate data from different sources without modifying the core structure.
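
The three components can be sketched as tables. The example below is a minimal illustration using Python's built-in sqlite3 module, not a reference implementation. The table and column names, the `load_ts` and `record_source` metadata columns, and the use of a hashed surrogate key are all assumptions made for illustration; the text above does not prescribe them.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone


def hash_key(*business_keys: str) -> str:
    """Derive a deterministic surrogate key from one or more business keys.

    Hashing the normalized business key is an assumption of this sketch.
    """
    joined = "||".join(key.strip().upper() for key in business_keys)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces REFERENCES when enabled
conn.executescript("""
-- Hubs: unique business keys only, no descriptive attributes.
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,
    customer_id   TEXT NOT NULL UNIQUE,
    load_ts       TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_product (
    product_hk    TEXT PRIMARY KEY,
    product_sku   TEXT NOT NULL UNIQUE,
    load_ts       TEXT NOT NULL,
    record_source TEXT NOT NULL
);
-- Link: a relationship between hubs, holding one foreign key per hub.
CREATE TABLE link_customer_product (
    link_hk       TEXT PRIMARY KEY,
    customer_hk   TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    product_hk    TEXT NOT NULL REFERENCES hub_product(product_hk),
    load_ts       TEXT NOT NULL,
    record_source TEXT NOT NULL
);
-- Satellite: descriptive attributes, one row per observed version.
CREATE TABLE sat_customer_details (
    customer_hk   TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    load_ts       TEXT NOT NULL,
    name          TEXT,
    address       TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_ts)
);
""")

now = datetime.now(timezone.utc).isoformat()
customer_hk = hash_key("C-1001")
product_hk = hash_key("SKU-42")

conn.execute("INSERT INTO hub_customer VALUES (?, ?, ?, ?)",
             (customer_hk, "C-1001", now, "crm"))
conn.execute("INSERT INTO hub_product VALUES (?, ?, ?, ?)",
             (product_hk, "SKU-42", now, "erp"))
conn.execute("INSERT INTO link_customer_product VALUES (?, ?, ?, ?, ?)",
             (hash_key("C-1001", "SKU-42"), customer_hk, product_hk, now, "orders"))
conn.execute("INSERT INTO sat_customer_details VALUES (?, ?, ?, ?, ?)",
             (customer_hk, now, "Ada Example", "1 Main St", "crm"))
```

Note that the customer's name and address live only in the satellite; the hub row carries nothing but the business key and load metadata.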

The modular structure of these three components makes data vault a suitable choice for modern analytics environments. It ensures that all changes made to data are recorded accurately, relationships between entities are properly maintained, and integrity is preserved without redundancy or manual intervention. In this way, data vault helps organizations manage large, complex datasets efficiently.

How Does Data Vault Architecture Work?

Data vault offers a scalable approach to data warehousing that builds upon the traditional three-layer architecture. In the diagram below, data flows from left to right. On the left are the enterprise’s source systems. Data generated in the source systems passes through the staging layer and then moves to the enterprise data warehouse layer. Finally, it lands in the information mart, which end users rely on to interact with their data.

[Figure: Data Vault Architecture]

Staging Layer

In this layer, raw data is collected and imported from multiple sources (such as databases, applications, APIs or external files) into a system where it can be processed. The staging layer is only a temporary stop, where raw data is brought into a structured, standardized format before it moves deeper into the data vault.

The content of the source data is not changed here; only hard business rules are applied. Examples include checking that numerical fields contain only numbers (no text or special characters) and that critical fields like customer ID, order number or timestamps are not missing. These rules help ensure that data remains accurate, consistent and reliable over its lifecycle.
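
The hard rules described above can be sketched as a validation function that reports problems without altering the record. This is a minimal illustration; the field names (`customer_id`, `order_number`, `order_ts`, `amount`) and the choice to return a list of violation messages are hypothetical, not taken from any particular staging tool.

```python
from typing import Any

# Hypothetical field lists for an "orders" feed; adjust per source system.
REQUIRED_FIELDS = ("customer_id", "order_number", "order_ts")
NUMERIC_FIELDS = ("amount",)


def hard_rule_violations(record: dict[str, Any]) -> list[str]:
    """Return the hard-rule violations for one staged record.

    The record itself is never modified; it is only inspected.
    """
    problems: list[str] = []

    # Critical fields must be present and non-empty.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            problems.append(f"missing {field}")

    # Numeric fields must parse as numbers (no stray text or symbols).
    for field in NUMERIC_FIELDS:
        value = record.get(field)
        if value is None:
            continue
        try:
            float(value)
        except (TypeError, ValueError):
            problems.append(f"non-numeric {field}")

    return problems


good = {"customer_id": "C-1001", "order_number": "O-9",
        "order_ts": "2024-01-01T00:00:00Z", "amount": "19.99"}
bad = {"customer_id": "", "order_number": "O-10",
       "order_ts": "2024-01-02T00:00:00Z", "amount": "12a"}

print(hard_rule_violations(good))  # []
print(hard_rule_violations(bad))   # ['missing customer_id', 'non-numeric amount']
```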

Enterprise Data Warehouse

This is the data vault layer that contains multiple components: raw vault, business vault, matrix vault and operational vault.

  • Raw Vault works as a repository that stores everything coming from the source systems. It doesn’t transform this data but stores it as-is. Every single record is kept in this vault for auditability.
  • Business Vault is where soft business rules are applied to data. New computed fields can be added based on business logic. For instance, if a business wants to categorize customers as creditworthy based on some specific conditions and wants to create a true/false indicator, this can be done in the business vault.
  • Matrix Vault is where businesses can track runtime metadata, such as query performance and any anomalies that occur. This information can help fine-tune performance by identifying inefficiencies in data processing.
  • Operational Vault is designed for streaming data and for cases where real-time access to raw data is important. As soon as data is generated or updated, it can be ingested, stored and made accessible almost instantaneously. This is particularly useful for monitoring live system performance, tracking inventory in real time or any situation where up-to-date information is needed with minimal latency.
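
The creditworthiness example from the business vault can be sketched as a soft rule that derives a true/false indicator from raw values. This is an illustration only: the input fields (`credit_score`, `missed_payments`) and the thresholds are hypothetical stand-ins for whatever conditions a business actually defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawCustomer:
    """A record as it might sit in the raw vault (fields are hypothetical)."""
    customer_id: str
    credit_score: int
    missed_payments: int


@dataclass(frozen=True)
class BusinessCustomer:
    """A derived record for the business vault, carrying the computed flag."""
    customer_id: str
    is_creditworthy: bool


def apply_creditworthy_rule(raw: RawCustomer,
                            min_score: int = 650,
                            max_missed: int = 1) -> BusinessCustomer:
    """Apply a soft business rule; the thresholds are illustrative assumptions."""
    flag = raw.credit_score >= min_score and raw.missed_payments <= max_missed
    return BusinessCustomer(customer_id=raw.customer_id, is_creditworthy=flag)


raw_rows = [
    RawCustomer("C-1001", credit_score=720, missed_payments=0),
    RawCustomer("C-1002", credit_score=580, missed_payments=3),
]
business_rows = [apply_creditworthy_rule(row) for row in raw_rows]
```

Because the rule produces new records and leaves the raw ones untouched, the thresholds can change later and the flag can be recomputed from the same raw vault data.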

Information Mart

At the far-right end of the data vault architecture sits the information mart. This is where data from the vault is transformed into meaningful reports, dashboards and analytical views, and where business users, analysts and decision-makers finally access the data they need. The previous layers focus on collecting, storing and processing data; this layer delivers insights in a user-friendly format. To build an effective information mart, data teams must work closely with business users and tailor the mart to their actual needs.
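
One way to picture an information mart is as a view over the vault that hides its history and shows each entity's current state. The sketch below uses Python's built-in sqlite3 module; the table names, columns and sample rows are hypothetical, and a view is only one of several ways a mart could be built.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (
    customer_hk TEXT PRIMARY KEY,
    customer_id TEXT NOT NULL
);
CREATE TABLE sat_customer_details (
    customer_hk TEXT NOT NULL,
    load_ts     TEXT NOT NULL,   -- ISO 8601 strings sort chronologically
    name        TEXT,
    city        TEXT,
    PRIMARY KEY (customer_hk, load_ts)
);

INSERT INTO hub_customer VALUES ('hk1', 'C-1001');
INSERT INTO sat_customer_details VALUES
    ('hk1', '2024-01-01T00:00:00Z', 'Ada Example', 'Berlin'),
    ('hk1', '2024-06-01T00:00:00Z', 'Ada Example', 'Munich');

-- Mart view: one row per customer, using the latest satellite version.
CREATE VIEW mart_customer_current AS
SELECT h.customer_id, s.name, s.city, s.load_ts AS valid_from
FROM hub_customer AS h
JOIN sat_customer_details AS s ON s.customer_hk = h.customer_hk
WHERE s.load_ts = (
    SELECT MAX(s2.load_ts)
    FROM sat_customer_details AS s2
    WHERE s2.customer_hk = s.customer_hk
);
""")

current_rows = conn.execute(
    "SELECT customer_id, city FROM mart_customer_current"
).fetchall()
print(current_rows)  # [('C-1001', 'Munich')]
```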

Data vault architecture is designed to ensure data integrity, scalability, flexibility and auditability. But how can this architecture be used to get the most out of modern data storage systems?

As data volumes and complexity continue to grow, companies are transitioning from data lakes and warehouses to lakehouses to combine the benefits of both storage systems. However, without a robust data modeling approach like data vault, managing data consistency, historical tracking and governance in a lakehouse can be challenging. This makes data vault modeling a natural fit for lakehouse architecture.

Why Does Data Vault Fit into Lakehouse Architecture?

A lakehouse architecture provides a unified platform where raw and structured data co-exist. However, with growing data volumes, it becomes difficult for a lakehouse to manage governance and lineage tracking alongside performance and scalability. This is where data vault modeling comes into the picture.

With data vault in the lakehouse, businesses can capture every change while maintaining a clear lineage of their data. It tracks all historical changes by storing raw, unmodified data along with metadata to keep a record of when and how the data was updated. Since all the changes made to data are stored as new records, businesses can trace them back to the original source for full transparency and auditability.
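
The insert-only history described above can be sketched in a few lines. This is a simplified illustration rather than a production loader: the `SatelliteRow` fields, the `hash_diff` change-detection column and the `load_satellite` helper are assumptions introduced for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SatelliteRow:
    """One immutable version of an entity's attributes (fields are illustrative)."""
    business_key: str
    load_ts: str          # when this version was recorded
    record_source: str    # where it came from, for lineage
    hash_diff: str        # fingerprint of the attributes, for change detection
    attributes: tuple     # sorted (name, value) pairs


def _hash_diff(attributes: dict[str, str]) -> str:
    payload = "|".join(f"{name}={attributes[name]}" for name in sorted(attributes))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_satellite(history: list[SatelliteRow], business_key: str,
                   attributes: dict[str, str], load_ts: str,
                   record_source: str) -> bool:
    """Append a new version only if the attributes changed; never update in place.

    Returns True when a row was appended.
    """
    new_hash = _hash_diff(attributes)
    latest = None
    for row in history:
        if row.business_key == business_key and (
            latest is None or row.load_ts > latest.load_ts
        ):
            latest = row
    if latest is not None and latest.hash_diff == new_hash:
        return False  # nothing changed, so no new version is recorded
    history.append(SatelliteRow(business_key, load_ts, record_source,
                                new_hash, tuple(sorted(attributes.items()))))
    return True


history: list[SatelliteRow] = []
first = load_satellite(history, "C-1001", {"city": "Berlin"}, "2024-01-01T00:00:00Z", "crm")
repeat = load_satellite(history, "C-1001", {"city": "Berlin"}, "2024-03-01T00:00:00Z", "crm")
changed = load_satellite(history, "C-1001", {"city": "Munich"}, "2024-06-01T00:00:00Z", "crm")
```

After these three loads, `history` holds two rows: the original Berlin version and the later Munich version, each with its own timestamp and source.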

Data vault modeling can also improve query performance. Because data is organized into hubs, links and satellites, a query only needs to scan the tables that hold the keys, relationships or attributes it asks for, so results return quickly. The modular, three-component structure also lets businesses accommodate growing data volumes and user counts while maintaining performance. To add a new data source, they can introduce additional hubs without altering the existing structure, and links can then establish relationships between the new and existing hubs. Scaling the system therefore does not require reworking the entire model.
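
The claim that new hubs and links can be added without altering the existing structure can be illustrated with a small sketch. It uses Python's built-in sqlite3 module; the `hub_supplier` and `link_product_supplier` tables and the `table_definitions` helper are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_product (
    product_hk  TEXT PRIMARY KEY,
    product_sku TEXT NOT NULL UNIQUE
);
""")


def table_definitions(connection: sqlite3.Connection) -> dict[str, str]:
    """Return each table's CREATE statement as stored by SQLite."""
    rows = connection.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return dict(rows)


before = table_definitions(conn)

# A new source system arrives: add a hub for its entity and a link to relate it.
# No ALTER TABLE is issued against the existing hub.
conn.executescript("""
CREATE TABLE hub_supplier (
    supplier_hk TEXT PRIMARY KEY,
    supplier_id TEXT NOT NULL UNIQUE
);
CREATE TABLE link_product_supplier (
    link_hk     TEXT PRIMARY KEY,
    product_hk  TEXT NOT NULL REFERENCES hub_product(product_hk),
    supplier_hk TEXT NOT NULL REFERENCES hub_supplier(supplier_hk)
);
""")

after = table_definitions(conn)
existing_unchanged = all(after[name] == sql for name, sql in before.items())
new_tables = sorted(set(after) - set(before))
```

Here `existing_unchanged` confirms that the definition of `hub_product` is identical before and after the extension, while `new_tables` lists only the added hub and link.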

This combination of data vault modeling and lakehouse architecture helps organizations efficiently manage historical data, optimize performance and maintain compliance while harnessing the full potential of their data.