Speeding up PostgreSQL ETL pipeline with the help of GODS
Problem to solve When working on the new Federated Reporting feature for CFEngine we had to solve the problem of collecting data from multiple CFEngine hubs (feeders) on a single hub (superhub). CFEngine hubs are using PostgreSQL to store data, so, more specifically, the problem was how to collect data from multiple PostgreSQL databases in one PostgreSQL database. And because we are talking about ~1 GiB of SQL data per feeder hub and for example 10 feeders connected to a superhub here, the initial and trivial solution using basically this ETL (Extract Transform Load) pipeline - pg_dump | gz | ssh | gunzip | psql - provided really poor performance. The problem was in the last part of the pipeline - importing data using psql. Reading and writing 10 GiB of data of course takes a while, but we soon realized that I/O speed was not the bottleneck in this case.