Scaling ETL: From 5 Hours to Under 1 Hour
Financial planning systems are only as reliable as the data pipelines behind them. Dashboards, variance analysis, headcount planning, statement reporting, and AI-assisted insights all depend on the same foundation: clean, normalized, timely financial data.
At Precanto, the ETL layer had to ingest messy customer data from spreadsheets and third-party systems, normalize it into a common model, map it to master data, and generate analytics-ready aggregates. As data volumes grew, the original pipeline started taking more than 5 hours to complete. That delay affected customer onboarding, internal iteration, and the freshness of every downstream report.
The goal was not simply to make one slow job faster. The real goal was to redesign the pipeline so it could handle larger tenants, reduce database contention, remain predictable during peak processing, and support future incremental processing.
The Core Problem
The application deals with financial data that is both large and highly dimensional. A single reporting view may need to combine accounting periods, departments, locations, GL hierarchy levels, vendors, customers, employees, plan versions, aggregation types, and forecast categories.
The challenge was not only the number of rows. The harder problem was that each customer represented the same business concepts differently. Department names, location names, GL accounts, plan versions, and source system fields often varied across systems and tenants. That meant the pipeline had to do more than load files. It had to interpret, normalize, map, validate, and aggregate data into a structure the product could query consistently.
Why the Original Pipeline Did Not Scale
The earlier implementation worked when data volume was smaller, but it became expensive as tenants grew into tens of millions of rows. Several patterns contributed to the runtime increase:
- Too much transformation logic was pushed into complex SQL
- Large tables were repeatedly scanned and rewritten
- Intermediate states temporarily increased table size
- Active and inactive rows coexisted during processing
- Indexes had to absorb heavy write activity during ETL
- Application queries competed with ETL writes on the same tables
- Pipeline stages were difficult to isolate, profile, and optimize
One particularly expensive pattern was updating production aggregate tables in place. New rows were inserted, old active rows were removed or marked inactive, and then the new rows became active. During peak ETL, this temporarily increased data volume and created contention on the exact tables used by reporting screens.
Design Principle: Separate Build Time from Read Time
The most important architectural shift was to stop treating reporting tables as both the build target and the live query surface. Instead, the pipeline was redesigned around a simple principle: build the next version of the data separately, validate it, and expose it only when it is ready.
This reduced the amount of time live reporting tables spent in a partially updated state. It also made ETL runs easier to reason about: either the old version was still serving users, or the new version had been fully built and swapped in.
Step 1: Moving Transformation Logic to Golang
The first major improvement came from moving a large part of the transformation workload out of the database and into a Golang processing layer. SQL remained important for persistence, joins, and final query surfaces, but business-heavy transformation logic became explicit application code.
This gave the pipeline much better control over execution:
- Independent workloads could be processed in parallel
- Transformation steps could be profiled separately
- Intermediate data structures could be optimized in memory
- Business rules became easier to test outside the database
- Database writes could be batched instead of performed row by row
This was not about replacing SQL everywhere. The database was still the system of record and the serving layer for analytics. The change was about using the database for what it is best at, while moving procedural transformation logic into a runtime that offered clearer control over concurrency, memory, batching, and error handling.
Step 2: Making Pipeline Stages Explicit
The pipeline was broken into smaller stages with clear boundaries. This made the system easier to debug and improved the ability to optimize one stage without destabilizing the rest of the pipeline.
- Raw ingestion: load files and third-party extracts into staging structures without losing source fidelity.
- Normalization: convert inconsistent source fields into the application’s common financial model.
- Mapping: align departments, locations, GL accounts, vendors, customers, employees, and plan versions to master data.
- Transformation: apply business rules, derive reporting fields, and prepare analytics-ready records.
- Aggregation: generate summary tables optimized for dashboards, statements, drill-downs, and conversational queries.
- Validation: check row counts, totals, reconciliation rules, and expected dimensional coverage before publishing results.
Once each stage had a clear input and output, performance problems became easier to locate. A slow run was no longer just “the ETL is slow.” It could be narrowed down to mapping, aggregation, persistence, indexing, or validation.
Step 3: Using a Table Swap Strategy
The largest structural improvement came from replacing in-place updates with a table swap strategy. Instead of constantly mutating the live aggregate table, the pipeline created and populated a separate next version of the table.
The flow looked like this:
- Create a new table with the same structure as the live table
- Load and transform the next version of the data into that table
- Create or rebuild the required indexes on the new table
- Run validation checks before exposing the new table
- Atomically rename or swap the new table into place
- Keep the old table temporarily available for rollback if needed
This approach reduced contention because the heaviest writes happened away from the table serving user-facing queries. It also made the final publication step much smaller and more predictable.
Why Table Swapping Helped
Table swapping changed the performance profile of the pipeline. Instead of mixing read traffic, deletes, inserts, updates, index maintenance, and active-flag transitions on the same table, the system built a clean copy and exposed it only at the end.
This improved the system in several ways:
- Reduced lock contention on live aggregate tables
- Avoided temporary doubling of active and inactive production rows
- Made failed ETL runs safer because live data was not half-updated
- Allowed indexes to be built against the next table before publish
- Created a cleaner rollback path when validation failed
For reporting workloads, predictability matters as much as raw speed. A pipeline that sometimes blocks dashboards or leaves tables in a heavy transitional state is difficult to operate. The table swap model made the system easier to reason about operationally.
Step 4: Optimizing Database Interaction
Moving work to Golang did not eliminate the need for database optimization. The pipeline still had to write large volumes of data and support fast analytical reads. The optimization work included:
- Batching inserts instead of writing rows one at a time
- Reducing unnecessary full-table scans
- Designing indexes around the most common filter patterns
- Avoiding expressions in filters that prevented index usage
- Separating heavy ETL writes from live read paths where possible
- Keeping aggregate tables shaped around product query patterns
The most common query paths filtered by accounting period, department, location, plan version, aggregation type, and GL hierarchy. Indexing was therefore treated as part of the product design, not as a late database tuning task.
Step 5: Designing Aggregates for the Product
The output of the ETL pipeline was not just normalized data. It was a set of product-ready aggregates designed for interactive screens.
For example, headcount and financial reporting screens needed to support combinations of:
- Actuals, budget, forecast, committed, and predicted values
- Month, quarter, and year-level accounting periods
- Department and location filters
- GL account and rollup-level filters
- Person, vendor, customer, and employee-level drill-downs
- Plan version comparisons
Precomputing the right summaries allowed the UI to answer common financial questions quickly without recomputing the same expensive joins and rollups on every request.
Reliability and Validation
Speed alone was not enough. A faster pipeline that produced incorrect numbers would be worse than a slow one. Validation became a core part of the redesigned flow.
- Compare source and target row counts where applicable
- Validate financial totals across key aggregation levels
- Check that required dimensions were mapped successfully
- Detect missing or unmapped departments, locations, and GL accounts
- Fail safely before publishing the new table version
This was especially important because customer data was often messy. A pipeline for financial systems must be built with the assumption that source data will be incomplete, inconsistent, or surprising.
Current Direction: Change Detection
After the major runtime reduction, the next opportunity was to avoid unnecessary recomputation. Even a much faster batch pipeline still does more work than needed if only a subset of source data changed.
The next design direction is change detection:
- Identify changed source files, records, mappings, and dimensions
- Determine which accounting periods and aggregates are affected
- Recompute only the impacted slices of data
- Preserve full rebuild capability for recovery and major changes
This shifts the pipeline from a purely batch-heavy model toward an incremental model. The important design constraint is to keep correctness first: incremental processing must produce the same output as a full rebuild for the affected data range.
Results
- Reduced ETL runtime from more than 5 hours to under 1 hour
- Reduced contention on live reporting tables
- Improved predictability of large tenant processing
- Made transformation logic easier to test and evolve
- Created clearer pipeline stages for profiling and debugging
- Laid the foundation for incremental processing
What This Enabled
The redesign improved more than backend runtime. It directly affected product velocity. Faster ETL made it easier to onboard customers, validate new data models, refresh reports, and support larger datasets without degrading the user experience.
It also created a stronger foundation for AI-assisted analytics. A conversational interface can only answer financial questions reliably if the underlying data model is consistent, fresh, and queryable at interactive speeds.
Key Takeaways
The biggest performance gains came from changing the structure of the system, not from isolated query tuning. Query optimization mattered, but the larger win came from rethinking how data moved through the pipeline.
- Use SQL where it is strongest, but avoid turning the database into a procedural runtime
- Build new data separately before exposing it to live users
- Design aggregate tables around real product query patterns
- Make pipeline stages explicit and measurable
- Validate before publishing, especially when source data is messy
- Use incremental processing only when correctness can be preserved
ETL performance is not just a backend concern. In an analytics product, it controls how quickly customers can trust the system, how often data can be refreshed, and how confidently the product can scale.
This design also ties closely with how the system is structured at a tenant level. In our case, we used a database-per-tenant architecture to isolate workloads and improve predictability.