Compressing a datastream with a data vault

Imagine you would have a SAP BW system as your datawarehouse. Your datawarehouse is filled with all kind of customer data, risk related data and time series based data. Let’s assume that SAP BW is getting it’s customer crm data (quotes and so on) from Siebel, customer billing and erp data (contract and more) from SAP and pricing time series data from an Oracle database. The Oracle database is located at another European location then the central data warehouse, bandwidth between this Oracle database and the central datawarehouse is limited due to location constraints.

 

Datawarehouse architecture
Datawarehouse architecture with several data sources

 

One of those time series data in the Oracle database is the price of natural gas in the western hemisphere over the last 20 years, this time series data changes every month. Time series data has the potential to become big data, especially if many time series are combined. In this case the problem is not the volume of data, but the fact that this time series data is stored not in the central datawarehouse but elsewhere in Europe in another country.

Gas Price Time Series

Time series data in the shape of gas prices

 

With limited bandwidth between the central datawarehouse and the storage for time series data available, it’s clear that simply copying all data over from one site to another every month is not an option. Some way of bandwidth conservation is needed, and compressing the data won’t do the job. The best way to preserve bandwidth is not to copy any data that hasn’t changed, but only the data that has changed or new data. An excellent way to keep track of data changes is the data vault model, part of the model is that it keeps track of data changes.

 

DataVault Model

Data Vault model

 

The satellite connected to the link contains the gas price data, country / price & date, the link itself contains all keys of the connected hub’s. The customer hub key could then be the country, while the product hub key could be the price. The customer satellite could contain all of the US states and European countries, while the product satellite could contain the gas calorific values.

Every entry in a satellite has an corresponding begindate and enddate in the hub. By this way the hub is enabled to track history. Using the hub’s history, one will be able to select just the data that has changed or is new the past month since last time one retrieved the data. The amount of data that needs to be copied is reduced significantly thus preserving bandwidth. Data vaults can be used for many more purposes, including history recording if that isn’t kept anywhere else.

About

Leave a Reply

Your email address will not be published. Required fields are marked *

Spam Control * Time limit is exhausted. Please reload the CAPTCHA.