ETL vs. ELT
If you’re learning about data engineering for the first time, you might encounter two very similar acronyms: ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform). At first you might think it’s dumb that there are two acronyms for basically the same thing. Then, if you see someone distinguish between the two, you might think they’re being pedantic. However, there are important differences between the two, and understanding them requires a grasp of the history of data warehousing.
The earlier of the two paradigms is ETL. This is the term people tend to use by default when they don’t care to distinguish between the two. It was coined when the most common way to do data warehousing was to own and maintain your own servers, which hosted the database system for your data warehouse. During this period, you had to know roughly how much storage and compute you needed ahead of time so your IT team could buy and set up the right servers. If you changed your mind, getting more compute or storage could require months of lead time, and a significant change in needs might even require hiring more IT people to run the servers. For this reason, people tended to be thrifty with their limited resources. You had to plan new tables carefully and make sure the ETL process that populated them didn’t waste space through unnecessary data duplication.
Then came the cloud. All of a sudden, scaling could be done much more easily. At first, this just meant that people ran their formerly on-prem databases in the cloud. Products like Amazon Redshift made this easy: you could provision whatever size servers you needed today and change your mind later without too much overhead. However, this approach was already outdated from the start. For one thing, it still required an upfront decision about capacity rather than scaling on demand. For another, the world was already moving on to new kinds of data systems. Systems like HDFS stored data by distributing it across many nodes, and MapReduce-style compute on Hadoop could be scaled independently of HDFS storage. If you wanted to change how much compute or storage your cluster had, you could spin up new worker or storage nodes on demand. In fact, you didn’t even need to decide how much compute to use: Google would run its MapReduce jobs on whatever servers happened to be sitting idle at the time.
The reign of Hadoop didn’t last long, but it had an impact. A new generation of data warehouses was developed that distributed compute and separated it from storage: Google BigQuery, Databricks, and Snowflake are the big players of this generation. Databricks and Snowflake both rely on an external storage layer from one of the major cloud vendors. They use object storage rather than a distributed file system like HDFS, but the idea is the same. Snowflake and Databricks are also designed so that you can scale your compute both horizontally and vertically on demand: you can choose the type of compute you want for a query and change your mind on a query-by-query basis. BigQuery is designed similarly but manages more of the details for you. From the outside it’s serverless and doesn’t appear to rely on an external storage layer, but on the inside it uses Dremel (a distributed query engine) for compute and a distributed file system (originally GFS, later its successor Colossus) for storage.
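To make the on-demand compute point concrete, here is a minimal sketch of what per-query compute selection looks like in Snowflake. The warehouse and table names (transform_wh, reporting_wh, orders) are made up for the example; the point is that you can resize or switch compute without touching the stored data.

    -- Resize the heavy-duty warehouse just for a big transformation job
    -- (warehouse and table names are hypothetical).
    ALTER WAREHOUSE transform_wh SET WAREHOUSE_SIZE = 'XLARGE';
    USE WAREHOUSE transform_wh;

    CREATE OR REPLACE TABLE daily_order_summary AS
    SELECT order_date, SUM(order_total) AS total_revenue
    FROM orders
    GROUP BY order_date;

    -- Switch back to a small, cheap warehouse for interactive queries.
    USE WAREHOUSE reporting_wh;
    SELECT * FROM daily_order_summary ORDER BY order_date DESC LIMIT 7;

Because compute is decoupled from storage, changing the warehouse only changes what runs the query; the data itself stays where it is.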
So what does all this mean for data warehousing? It means you can afford to have a different attitude toward the resources of your data warehouse, especially storage. Cloud storage layers like S3 are dirt cheap and don’t require any setup or maintenance. At the same time, data has gotten more complex to transform. That means it now makes more sense than ever to trade storage for a better development workflow. This is exactly the premise of ELT: load data into your data warehouse first, so that transformations can be run and re-run against the raw data as many times as it takes to get them right. Bugs in the transformation layer can be fixed easily after the fact, which makes developers’ lives easier and improves data quality. It also means you can modularize your transformation logic and bring the benefits of modular software to complicated SQL queries, which is what tools like dbt facilitate. However, managing your transformations as modular SQL only works if you load your data into the warehouse before you begin transforming it. As a result, you will have at least two copies of your data: the raw, untransformed data that initially gets loaded and the final, transformed data to be consumed by analysts. In complex pipelines, there may well be more than two copies.
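As a rough illustration of what that workflow looks like, here is a minimal dbt-style staging model. It assumes a raw orders table has already been loaded into the warehouse and registered as a dbt source; the source, table, and column names are made up for the example.

    -- models/staging/stg_orders.sql (hypothetical model; source and column names assumed)
    with raw_orders as (
        -- Reads directly from the raw, untransformed table produced by the load step
        select * from {{ source('raw', 'orders') }}
    )
    select
        id                                  as order_id,
        customer_id,
        cast(order_total as numeric(12, 2)) as order_total_usd,
        cast(created_at as timestamp)       as ordered_at
    from raw_orders
    where id is not null  -- basic cleanup; can be re-run against the raw data at any time

Downstream models can then build on this one with ref('stg_orders'), which is what makes the transformation layer modular: when the logic changes, you simply rerun the models against the raw data that is still sitting in the warehouse.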
So that’s the real story behind the difference between ETL and ELT. It’s not just about where you do your transformation, but also about the changing landscape of the tech industry as a whole. In practice, you often end up with something like ETLT, whether you set out to do “ETL” or “ELT”: some transformation happens before the load and some after, and the difference between the two paradigms becomes more about which of those transformations is lightweight and which is heavyweight. Either way, it’s important to understand how the constraints have changed over the years to support new kinds of workflows.