Compare data after migration
I have done many different migrations and data transfers in my career. After every one of them, the final task has been to compare the data at the source and at the destination, to confirm that everything was migrated. There are multiple ways to do such a comparison. Here are mine.
Spreadsheet
For smaller amounts of data, up to 100 rows or so, a spreadsheet may be a good choice. Most people know how to use one; many companies would have gone out of business if spreadsheet software disappeared.
To compare data, we need three columns: Source, Destination, Compare. In the Source column, put the data from the source, one value per row (import it, paste it, whatever works). In the Destination column, put the data from the destination. Sort both columns separately in ascending order. In the Compare column, put the following formula and fill it down: =A1=B1.
This compares the data in both columns row by row. If the values are the same, the formula returns TRUE; otherwise, FALSE.
If the Source and Destination columns do not have the same number of rows, the values will stop matching at some point. Just insert a new cell after the last row that shows TRUE, shift the remaining data down to realign the columns, and carry on.
This method is not the best one, but it is good enough for smaller tasks: no fancy tools to learn, no complex queries to write.
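The same row-by-row comparison, including the realignment trick for columns of unequal length, can be sketched in a few lines of Python (the file names here are made up for illustration):

```python
# Sort both lists, pad the shorter one, and compare row by row --
# the same idea as the spreadsheet's Compare column.
source = sorted(["a.txt", "b.txt", "c.txt"])
destination = sorted(["a.txt", "c.txt"])

length = max(len(source), len(destination))
source += [None] * (length - len(source))
destination += [None] * (length - len(destination))

compare = [s == d for s, d in zip(source, destination)]
print(compare)  # [True, False, False]
```

The first False marks the row where the two sides start to diverge, just like the first FALSE cell in the spreadsheet.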
Database
Many of my migrations have been done within a database engine, such as Oracle or PostgreSQL. If the migration happens inside the same database, it is just a matter of writing a simple join to see what is missing between the two datasets.
select s.filename as "missing at destination"
from source_table s
full join destination_table d on s.filename = d.filename
where d.filename is null;
The query above lists all the files that are present at the source but missing at the destination.
This method requires more knowledge from the user, and access to a database engine, but it can process huge amounts of data in a fairly short time (depending on the database engine, available compute power, and throughput).
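The join can be tried out end to end with Python's built-in sqlite3 module. One caveat: SQLite only gained FULL JOIN support in version 3.39, so this sketch uses a LEFT JOIN, which answers the same question for the "missing at destination" direction (the table contents are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table source_table (filename text)")
con.execute("create table destination_table (filename text)")
con.executemany("insert into source_table values (?)",
                [("a.txt",), ("b.txt",), ("c.txt",)])
con.executemany("insert into destination_table values (?)",
                [("a.txt",), ("c.txt",)])

# the left join keeps every source row; rows without a match at the
# destination come back with NULL on the destination side
missing = con.execute("""
    select s.filename
    from source_table s
    left join destination_table d on s.filename = d.filename
    where d.filename is null
""").fetchall()
print(missing)  # [('b.txt',)]
```

To find files present at the destination but missing at the source, swap the two tables in the join.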
My recent migration case
Recently, I had to check a migration of files between two different systems. The total number of rows to compare ran into the billions. Since I didn't have any database at hand, I decided to try SQLite to import and compare the data.
At first it seemed like a good idea: SQLite is quite fast and easy to use. I was able to import both sets of data (CSV files with the lists of files at the source and at the destination).
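Such a CSV import can be sketched with Python's csv and sqlite3 modules (the in-memory string stands in for the real CSV file, and the semicolon delimiter is an assumption):

```python
import csv
import io
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table source (filename text)")

# stand-in for open("source.csv"); a real file would be read the same way
source_csv = io.StringIO("a.txt\nb.txt\nc.txt\n")
rows = [(r[0],) for r in csv.reader(source_csv, delimiter=";")]
con.executemany("insert into source values (?)", rows)

print(con.execute("select count(*) from source").fetchone()[0])  # 3
```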
After the import, it was just a matter of executing the SQL query and waiting for the results. Waiting... and waiting... and waiting. The query consumed all of my laptop's resources, so after 15 or 20 minutes of waiting, I had to kill it.
DuckDB
Here is where DuckDB shines: the same query took only 15 seconds. And the whole process can be easily scripted with the duckdb -c option:
#!/bin/sh
DBFILE=data.db
# start from a clean database file
rm -f "$DBFILE"
# create tables for the source and destination data
duckdb "$DBFILE" -c "create table source (filename varchar);"
duckdb "$DBFILE" -c "create table destination (filename varchar);"
# copy data from the csv files
duckdb "$DBFILE" -c "copy source from 'source.csv' (format csv, delimiter ';');"
duckdb "$DBFILE" -c "copy destination from 'destination.csv' (format csv, delimiter ';');"
# query both tables and save the result in the miss table
duckdb "$DBFILE" -c "create table miss as select s.filename from source s full join destination d on s.filename = d.filename where d.filename is null;"
# export the miss table to csv
duckdb "$DBFILE" -c "copy miss to 'missing.csv' (format csv, delimiter ';');"
Really, it took only 15 seconds to execute all the commands and process all the data.