Scaling ETL: From 5 Hours to Under 1 Hour

Financial planning systems are only as reliable as the data pipelines behind them. Dashboards, variance analysis, headcount planning, statement reporting, and AI-assisted insights all depend on the same foundation: clean, normalized, timely financial data.

At Precanto, the ETL layer had to ingest messy customer data from spreadsheets and third-party systems, normalize it into a common model, map it to master data, and generate analytics-ready aggregates. As data volumes grew, the original pipeline started taking more than 5 hours to complete. That delay affected customer onboarding, internal iteration, and the freshness of every downstream report.

The goal was not simply to make one slow job faster. The real goal was to redesign the pipeline so it could handle larger tenants, reduce database contention, remain predictable during peak processing, and support future incremental processing.

The Core Problem

The application deals with financial data that is both large and highly dimensional. A single reporting view may need to combine accounting periods, departments, locations, GL hierarchy levels, vendors, customers, employees, plan versions, aggregation types, and forecast categories.

The challenge was not only the number of rows. The harder problem was that each customer represented the same business concepts differently. Department names, location names, GL accounts, plan versions, and source system fields often varied across systems and tenants. That meant the pipeline had to do more than load files. It had to interpret, normalize, map, validate, and aggregate data into a structure the product could query consistently.

Why the Original Pipeline Did Not Scale

The earlier implementation worked when data volume was smaller, but it became expensive as tenants grew into tens of millions of rows. Several patterns contributed to the runtime increase:

Too much transformation logic was pushed into complex SQL
Large tables were repeatedly scanned and rewritten
Intermediate states temporarily increased table size
Active and inactive rows coexisted during processing
Indexes had to absorb heavy write activity during ETL
Application queries competed with ETL writes on the same tables
Pipeline stages were difficult to isolate, profile, and optimize

One particularly expensive pattern was updating production aggregate tables in place. New rows were inserted, old active rows were removed or marked inactive, and then the new rows became active. During peak ETL, this temporarily increased data volume and created contention on the exact tables used by reporting screens.

Design Principle: Separate Build Time from Read Time

The most important architectural shift was to stop treating reporting tables as both the build target and the live query surface. Instead, the pipeline was redesigned around a simple principle: build the next version of the data separately, validate it, and expose it only when it is ready.

This reduced the amount of time live reporting tables spent in a partially updated state. It also made ETL runs easier to reason about: either the old version was still serving users, or the new version had been fully built and swapped in.

Step 1: Moving Transformation Logic to Golang

The first major improvement came from moving a large part of the transformation workload out of the database and into a Golang processing layer. SQL remained important for persistence, joins, and final query surfaces, but business-heavy transformation logic became explicit application code.

This gave the pipeline much better control over execution:

Independent workloads could be processed in parallel
Transformation steps could be profiled separately
Intermediate data structures could be optimized in memory
Business rules became easier to test outside the database
Database writes could be batched instead of performed row by row

This was not about replacing SQL everywhere. The database was still the system of record and the serving layer for analytics. The change was about using the database for what it is best at, while moving procedural transformation logic into a runtime that offered clearer control over concurrency, memory, batching, and error handling.

Step 2: Making Pipeline Stages Explicit

The pipeline was broken into smaller stages with clear boundaries. This made the system easier to debug and improved the ability to optimize one stage without destabilizing the rest of the pipeline.

Raw ingestion: load files and third-party extracts into staging structures without losing source fidelity.
Normalization: convert inconsistent source fields into the application’s common financial model.
Mapping: align departments, locations, GL accounts, vendors, customers, employees, and plan versions to master data.
Transformation: apply business rules, derive reporting fields, and prepare analytics-ready records.
Aggregation: generate summary tables optimized for dashboards, statements, drill-downs, and conversational queries.
Validation: check row counts, totals, reconciliation rules, and expected dimensional coverage before publishing results.

Once each stage had a clear input and output, performance problems became easier to locate. A slow run was no longer just “the ETL is slow.” It could be narrowed down to mapping, aggregation, persistence, indexing, or validation.

Step 3: Using a Table Swap Strategy

The largest structural improvement came from replacing in-place updates with a table swap strategy. Instead of constantly mutating the live aggregate table, the pipeline created and populated a separate next version of the table.

The flow looked like this:

Create a new table with the same structure as the live table
Load and transform the next version of the data into that table
Create or rebuild the required indexes on the new table
Run validation checks before exposing the new table
Atomically rename or swap the new table into place
Keep the old table temporarily available for rollback if needed

This approach reduced contention because the heaviest writes happened away from the table serving user-facing queries. It also made the final publication step much smaller and more predictable.

Why Table Swapping Helped

Table swapping changed the performance profile of the pipeline. Instead of mixing read traffic, deletes, inserts, updates, index maintenance, and active-flag transitions on the same table, the system built a clean copy and exposed it only at the end.

This improved the system in several ways:

Reduced lock contention on live aggregate tables
Avoided temporary doubling of active and inactive production rows
Made failed ETL runs safer because live data was not half-updated
Allowed indexes to be built against the next table before publish
Created a cleaner rollback path when validation failed

For reporting workloads, predictability matters as much as raw speed. A pipeline that sometimes blocks dashboards or leaves tables in a heavy transitional state is difficult to operate. The table swap model made the system easier to reason about operationally.

Step 4: Optimizing Database Interaction

Moving work to Golang did not eliminate the need for database optimization. The pipeline still had to write large volumes of data and support fast analytical reads. The optimization work included:

Batching inserts instead of writing rows one at a time
Reducing unnecessary full-table scans
Designing indexes around the most common filter patterns
Avoiding expressions in filters that prevented index usage
Separating heavy ETL writes from live read paths where possible
Keeping aggregate tables shaped around product query patterns

The most common query paths filtered by accounting period, department, location, plan version, aggregation type, and GL hierarchy. Indexing was therefore treated as part of the product design, not as a late database tuning task.

Step 5: Designing Aggregates for the Product

The output of the ETL pipeline was not just normalized data. It was a set of product-ready aggregates designed for interactive screens.

For example, headcount and financial reporting screens needed to support combinations of:

Actuals, budget, forecast, committed, and predicted values
Month, quarter, and year-level accounting periods
Department and location filters
GL account and rollup-level filters
Person, vendor, customer, and employee-level drill-downs
Plan version comparisons

Precomputing the right summaries allowed the UI to answer common financial questions quickly without recomputing the same expensive joins and rollups on every request.

Reliability and Validation

Speed alone was not enough. A faster pipeline that produced incorrect numbers would be worse than a slow one. Validation became a core part of the redesigned flow.

Compare source and target row counts where applicable
Validate financial totals across key aggregation levels
Check that required dimensions were mapped successfully
Detect missing or unmapped departments, locations, and GL accounts
Fail safely before publishing the new table version

This was especially important because customer data was often messy. A pipeline for financial systems must be built with the assumption that source data will be incomplete, inconsistent, or surprising.

Current Direction: Change Detection

After the major runtime reduction, the next opportunity was to avoid unnecessary recomputation. Even a much faster batch pipeline still does more work than needed if only a subset of source data changed.

The next design direction is change detection:

Identify changed source files, records, mappings, and dimensions
Determine which accounting periods and aggregates are affected
Recompute only the impacted slices of data
Preserve full rebuild capability for recovery and major changes

This shifts the pipeline from a purely batch-heavy model toward an incremental model. The important design constraint is to keep correctness first: incremental processing must produce the same output as a full rebuild for the affected data range.

Results

Reduced ETL runtime from more than 5 hours to under 1 hour
Reduced contention on live reporting tables
Improved predictability of large tenant processing
Made transformation logic easier to test and evolve
Created clearer pipeline stages for profiling and debugging
Laid the foundation for incremental processing

What This Enabled

The redesign improved more than backend runtime. It directly affected product velocity. Faster ETL made it easier to onboard customers, validate new data models, refresh reports, and support larger datasets without degrading the user experience.

It also created a stronger foundation for AI-assisted analytics. A conversational interface can only answer financial questions reliably if the underlying data model is consistent, fresh, and queryable at interactive speeds.

Key Takeaways

The biggest performance gains came from changing the structure of the system, not from isolated query tuning. Query optimization mattered, but the larger win came from rethinking how data moved through the pipeline.

Use SQL where it is strongest, but avoid turning the database into a procedural runtime
Build new data separately before exposing it to live users
Design aggregate tables around real product query patterns
Make pipeline stages explicit and measurable
Validate before publishing, especially when source data is messy
Use incremental processing only when correctness can be preserved

ETL performance is not just a backend concern. In an analytics product, it controls how quickly customers can trust the system, how often data can be refreshed, and how confidently the product can scale.

This design also ties closely with how the system is structured at a tenant level. In our case, we used a database-per-tenant architecture to isolate workloads and improve predictability.