Copying Big Data… and verifying it…

 

Imagine a large multinational that consists of several business units and is being split into separate companies. Having been a single company, the units still share IT systems and data after the split. The data warehouse is terabytes in size, and after several years of painstaking work to separate the shared IT systems, the data split is next. Splitting shared data is very complex, since data domains often belong to multiple business units which now belong to different companies.

The first step in such a data migration is to define the data domains, such as customers and contracts. For every data domain the data inside must be mapped: where does it come from and where should it go? The next step is the ETL process: Extracting, Transforming (if necessary) and Loading. The easiest way to perform the ETL is an Oracle database export and import, which is usually an almost foolproof process. Let’s imagine an export into text files is chosen and executed. The result is 2.5 terabytes of text files, which have to be transported and loaded into the destination Oracle database. So far, so good…
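Purely as an illustration, here is a minimal sketch of such a text export for a single table using SQL*Plus spool; the table name customers, its columns and the file path are assumptions, not the actual migration scripts:

-- Minimal text-export sketch for one table (table, columns and path are examples)
SET HEADING OFF
SET FEEDBACK OFF
SET PAGESIZE 0
SET LINESIZE 4000
SPOOL /export/customers.csv
SELECT customer_id || ',' || customer_name || ',' || country FROM customers;
SPOOL OFF
-- The resulting file is transported and loaded on the destination side, for example with SQL*Loader or an external table.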

Since this whole process is fragile, the question arises whether all data has been exported, transported and imported correctly, so that the source and destination databases are identical. Several methods are available to verify this, and in business one prefers the most time and cost efficient procedure. Calculating an SQL SUM over the primary key of each copied table might be the answer. Let’s investigate this method.
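In its simplest form the check is the query below, run once on the source and once on the destination database; the table and column names are placeholders:

-- Run on both source and destination; if the two sums differ, something is definitely wrong.
SELECT SUM(pk_column) FROM copied_table;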

Although an SQL SUM may appear to be a cheap and solid solution, it is not. An SQL SUM over the primary key of each copied table is not mathematical proof that the source and destination tables are identical. Why not? Because the primary key of row 2 in the destination table could be one less than in the source table while, at the same time, the primary key of row 3 is one more; the errors cancel out and the sums still match. This kind of data error will not be detected by the SQL SUM method.
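A tiny, self-contained counterexample (hypothetical keys, built with Oracle’s dual table) shows how such offsetting errors slip through:

-- Source keys are 1, 2, 3; in the destination the second key dropped to 1 and the third grew to 4.
WITH src AS (SELECT 1 AS pk FROM dual UNION ALL SELECT 2 FROM dual UNION ALL SELECT 3 FROM dual),
     dst AS (SELECT 1 AS pk FROM dual UNION ALL SELECT 1 FROM dual UNION ALL SELECT 4 FROM dual)
SELECT (SELECT SUM(pk) FROM src) AS src_sum,
       (SELECT SUM(pk) FROM dst) AS dst_sum
FROM dual;
-- Both sums are 6, so the SUM check reports no problem although the tables clearly differ.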


Verifying a data copy using the SQL SUM method, but does it detect every data error?

 

It is better to use the Oracle hash function to compute a hash value for each row, and SUM those values into a single number for the entire table. Something like the following gives far stronger evidence that the destination table is identical to the source table:

-- Hash every row (concatenate the relevant columns) and sum the hashes into one table-level value.
SELECT SUM(DBMS_UTILITY.get_hash_value(column1 || column2, 1, POWER(2, 16) - 1))
FROM   table1;
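Assuming a database link to the source database exists (called source_db here purely for illustration), the two table-level hash sums can even be compared in a single statement:

-- Compare the hash sum of the local (destination) table with that of the remote (source) table.
-- Note that the remote rows travel over the link and are hashed locally.
SELECT CASE
         WHEN (SELECT SUM(DBMS_UTILITY.get_hash_value(column1 || column2, 1, POWER(2, 16) - 1)) FROM table1)
            = (SELECT SUM(DBMS_UTILITY.get_hash_value(column1 || column2, 1, POWER(2, 16) - 1)) FROM table1@source_db)
         THEN 'IDENTICAL'
         ELSE 'DIFFERENT'
       END AS verification_result
FROM dual;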

The picture below demonstrates this principle on the Oracle example table emp.


Using the Oracle get_hash_value function to calculate a hash value for each row and sum those values into a single hash value for the entire table.
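For readers without the picture, this is roughly what it showed on emp; the chosen columns empno, ename, job and sal are just an example, and in practice every copied column belongs in the concatenation:

-- Hash value per row of emp
SELECT empno,
       DBMS_UTILITY.get_hash_value(empno || ename || job || sal, 1, POWER(2, 16) - 1) AS row_hash
FROM   emp;

-- Summed into a single hash value for the whole table
SELECT SUM(DBMS_UTILITY.get_hash_value(empno || ename || job || sal, 1, POWER(2, 16) - 1)) AS table_hash_sum
FROM   emp;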

 

One can now verify the data integrity of each row, each table, or all copied tables at once, with high confidence that the copy is identical to the source. When the hash sums differ, the hash values can be compared row by row to detect which copied row contains a data error compared to the source row.
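A sketch of such a row-level comparison, again assuming a database link source_db and a primary key column pk_column (both placeholders), could look like this:

-- List the keys of destination rows whose hash differs from the corresponding source row.
SELECT d.pk_column
FROM   table1 d
JOIN   table1@source_db s ON s.pk_column = d.pk_column
WHERE  DBMS_UTILITY.get_hash_value(d.column1 || d.column2, 1, POWER(2, 16) - 1)
    <> DBMS_UTILITY.get_hash_value(s.column1 || s.column2, 1, POWER(2, 16) - 1);
-- Rows missing on either side can be found with a MINUS on the key columns.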

 
