From raw data to insight: How to choose the best data transformation solution
The journey from raw data to insight is a vital process in any data-driven organisation. Raw data, in its initial state, is often a chaotic mix of incomplete, inconsistent, and unstructured information, making it unsuitable for direct analysis. The prepared layer serves as the critical bridge, transforming this raw material into a curated, reliable, and accessible asset. Its primary purpose is to create a "single source of truth", a clean, well-structured, and trustworthy dataset that the entire business can depend on. This not only makes data more usable for analysts and decision-makers but also has a profound impact on the business by enabling confident, data-backed strategies and preventing the costly errors that arise from flawed analytics.
In a modern data system, raw data is first loaded into a cloud data warehouse (CDW). In the next step, that raw data is transformed into a trustworthy data asset that is fit for analysis. This involves cleaning the data, checking its quality, and organising it into standard structures, which we typically refer to as “prepared layers”. These curated tables are then stored in so-called data marts. The choice of solution for this transformation determines how reliable, efficient, and costly your entire data platform will be in the long run.
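As a rough illustration of this layering, the sketch below walks a small, made-up orders table through the raw, prepared, and mart stages using pandas. In a real platform this logic would run inside the CDW; the table shape and column names are purely illustrative assumptions.

```python
# A minimal, illustrative sketch of the raw -> prepared -> mart flow.
# Column names and the pandas-based approach are assumptions for this
# example; in practice the same logic runs inside the CDW.
import pandas as pd

# Raw layer: incomplete, inconsistent records as they arrive.
raw_orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "amount":   ["10.5", "10.5", None, "7.0"],
    "country":  ["uk", "uk", "UK", "de"],
})

# Prepared layer: deduplicated, typed, and standardised.
prepared_orders = (
    raw_orders
    .drop_duplicates(subset="order_id")
    .dropna(subset=["amount"])
    .assign(
        amount=lambda df: df["amount"].astype(float),
        country=lambda df: df["country"].str.upper(),
    )
)

# Data mart: a curated table shaped for a specific analytical question.
revenue_by_country = (
    prepared_orders.groupby("country", as_index=False)["amount"].sum()
)
```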
The solutions used for transformation fall into three broad categories:
- Database-centric solutions: These use the native tools of the CDW, such as plain SQL scripts and views. They are quick to start but often create long-term problems.
- Dedicated frameworks (e.g. dbt and Dataform): These tools treat data transformation like software code. They enforce good practices like testing and version control, making them the standard choice for professional data teams.
- Custom code pipelines (e.g. Spark and Cloud Dataflow): These are specialised external systems needed for extremely large data volumes or data that arrives in real time (streaming data). They require dedicated engineering expertise; a minimal sketch of such a pipeline follows this list.
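To make the third category concrete, here is a hedged sketch of what a custom code pipeline might look like in PySpark. The table and column names are assumptions for illustration; note that the scheduling, testing, and deployment around code like this all have to be engineered separately, which is where much of the cost comes from.

```python
# Minimal PySpark sketch of a "custom code pipeline" transformation.
# Assumes a raw events table already exists in the warehouse/lake;
# table and column names are illustrative, not from the article.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw_to_prepared").getOrCreate()

raw_events = spark.read.table("raw.events")            # raw landing data

prepared_events = (
    raw_events
    .dropDuplicates(["event_id"])                       # basic cleaning
    .withColumn("event_ts", F.to_timestamp("event_ts")) # enforce types
    .filter(F.col("event_id").isNotNull())              # simple quality gate
)

# Write into the prepared layer; versioning, scheduling, and testing
# around this job must be built and maintained by engineers themselves.
prepared_events.write.mode("overwrite").saveAsTable("prepared.events")
```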
When choosing an approach, organisations must balance initial simplicity against long-term stability and cost. The major trade-offs discussed below are version control, data lineage, and total cost of ownership.
Version Control
Firstly, version control is the ability to track, manage, and revert changes to data transformation logic over time, and it is fundamental to maintaining trust and auditability in the data platform. For database-centric solutions, native SQL scripts and views are extremely difficult to manage: changes are typically made manually within the CDW environment and are not automatically linked to external version control systems like Git, resulting in a manual, high-risk auditing process. Dedicated frameworks such as dbt and Dataform are architected around the idea of transformation as software code, so the logic is automatically stored and tracked in Git. This gives teams a clear, immutable, and auditable history of every modification, which is essential for collaborative development and legal compliance. For custom code pipelines like Spark and Cloud Dataflow, version control must be custom-engineered; while the code is typically stored in Git, the high total cost of ownership reflects the need for specialised engineers to build and maintain this versioning and deployment infrastructure externally.
Data Lineage
Secondly, data lineage is the ability to trace the flow of data from its raw source through every transformation step to its final report or dashboard. For database-centric solutions, the data flow is often implicit and hidden inside complex stored procedures or long chains of simple views. This lack of visibility frequently leads to systems that resemble "spaghetti code": if an upstream source table changes, working out exactly which downstream reports or dashboards will break becomes a difficult, manual task. Dedicated frameworks automatically build a detailed lineage graph that visualises all dependencies between data models. This inherent clarity greatly simplifies maintenance, allowing analysts to quickly assess the precise impact of any change and fix errors with minimal effort. For custom code pipelines, establishing reliable lineage requires specialised engineering effort to build and maintain custom metadata tracking, adding significant complexity and cost to the overall system.
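The sketch below is a simplified, hypothetical illustration of how a framework can derive lineage from declared dependencies: each model names its parents with a dbt-style ref(), and a dependency graph built from those references answers the "what breaks downstream?" question automatically. The model names and graph code are illustrative, not dbt's internals.

```python
# Hedged sketch: derive lineage from declared ref() dependencies and
# compute the downstream impact of changing one model.
import re
from collections import defaultdict

# Hypothetical models keyed by name; each declares its parents via ref().
models = {
    "stg_orders":         "select * from {{ ref('raw_orders') }}",
    "stg_customers":      "select * from {{ ref('raw_customers') }}",
    "orders_enriched":    "select * from {{ ref('stg_orders') }} "
                          "join {{ ref('stg_customers') }} using (customer_id)",
    "revenue_by_country": "select country, sum(amount) "
                          "from {{ ref('orders_enriched') }} group by country",
}

# Build an upstream -> downstream edge map from the ref() calls.
downstream = defaultdict(set)
for model, sql in models.items():
    for parent in re.findall(r"ref\('([^']+)'\)", sql):
        downstream[parent].add(model)

def impacted(node):
    """Return every model that (transitively) depends on `node`."""
    seen, stack = set(), [node]
    while stack:
        for child in downstream[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

# Changing stg_orders impacts orders_enriched and revenue_by_country.
print(impacted("stg_orders"))
```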
Total Cost of Ownership (TCO)
Finally, the total cost of ownership (TCO) is more than just licensing fees; it is driven primarily by labour and compute expenses over the lifespan of the platform. The custom code pipelines approach carries a high TCO because it requires expensive, specialised engineers to set up and maintain dedicated infrastructure for transformation, scheduling, and testing. Conversely, database-centric solutions appear cheap initially, but their lack of governance inevitably leads to high labour costs later on: undocumented, redundant tables cause "data sprawl", forcing analysts to waste significant time maintaining tangled logic or resolving data accuracy concerns. Dedicated frameworks offer the most sustainable TCO because they reduce both primary cost drivers, labour and compute. Their governance features (lineage, version control) drastically cut the hours spent on maintenance and debugging, while built-in capabilities such as optimised SQL compilation minimise the expensive compute required from the CDW by generating efficient queries. This dual efficiency makes them the most economical choice in the long run.
The Recommended Architecture
For most production environments that run batch processing (e.g. producing daily or hourly updates), the most sustainable and trustworthy architecture involves a combination of tools:
- Dedicated frameworks (e.g. dbt and Dataform): These tools handle the detailed transformation logic and ensure data quality inside the data warehouse.
- General orchestration (e.g. Airflow): A workflow manager like Airflow manages the big picture, co-ordinating the overall flow of data, including ingestion, running the framework job, and triggering reporting tools afterwards (a minimal DAG sketch follows this list).
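As a minimal sketch of this combination, the hypothetical Airflow DAG below ingests raw data, runs the dbt job, and then refreshes reporting. The script names and the dbt project path are assumptions for illustration, not part of the original article.

```python
# Minimal Airflow DAG sketch for the recommended batch architecture:
# ingest raw data, run the dbt transformation job, refresh reporting.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_transformations",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",          # batch cadence: daily updates
    catchup=False,
) as dag:

    ingest_raw = BashOperator(
        task_id="ingest_raw",
        bash_command="python ingest.py",                  # hypothetical ingestion script
    )

    run_dbt = BashOperator(
        task_id="run_dbt",
        bash_command="dbt build --project-dir /opt/dbt",  # runs models and their tests
    )

    refresh_dashboards = BashOperator(
        task_id="refresh_dashboards",
        bash_command="python refresh_reports.py",         # hypothetical reporting trigger
    )

    ingest_raw >> run_dbt >> refresh_dashboards
```

Keeping the orchestrator to the "big picture" and the framework to in-warehouse transformations is what keeps the concerns separated, as described below.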
This modular setup separates concerns, automating crucial governance tasks and preventing your data environment from becoming overly complex or reliant on manual processes.
Choosing your data transformation architecture isn't just a technical decision; it's a strategic one that impacts your organisation's ability to use data effectively. While simpler, database-centric approaches can be tempting for short-term projects, investing in a framework-based, code-driven methodology pays long-term dividends in reliability, scalability, and trust. By prioritising governance and clarity from the start, you build a data platform that empowers analysts and drives confident decision-making across the organisation.
