Tristan Handy, Founder and President of Fishtown Analytics
I find myself regularly having conversations with analytics leaders who are structuring the role of their team’s data engineers according to an outdated mental model. This mistake can significantly hinder your entire data team, and I’d like to see more companies avoid that outcome.
This post represents my beliefs about when, how, and why you should hire data engineers as a part of your team. It’s based on my experience at Fishtown Analytics working with over 100 VC-backed startups to build their data teams, and on conversations with hundreds of companies in the wider data community.
If you run a data team at a VC-backed startup, this post was written for you.
The Role of the Data Engineer is Changing.
Software is increasingly automating the boring parts of data engineering.
In 2012, if you wanted to have a sophisticated analytics practice at your VC-backed startup, you needed one or more data engineers. These engineers were responsible for extracting data from your operational systems and piping it somewhere that analysts and business users could get at it. Often they would do some transformation work to make the data easier to analyze. Without the data engineers, analysts and scientists didn’t have any data to work with, so frequently engineers were the very first members of a new data team.
Coming into 2019, you can buy technologies off-the-shelf to do most of that work. In most scenarios, you and your data analysts and scientists could build the entire pipeline without the need for anyone with hardcore data eng experience. And you wouldn’t be building some second-rate, shitty pipeline: off-the-shelf tools are actually the best-in-class way to solve these problems today.
This ability for data analysts and scientists to build self-service pipelines is new—about 2–3 years old at this point. It has been driven by three specific products: Stitch, Fivetran, and dbt. These products were initially launched in the wake of the release of Amazon Redshift, when startup data teams discovered a tremendous latent hunger to build data warehouses. It took several years for the products to get good, though—back in 2016 we were still in early-adopter land.
At this point a pipeline built on top of Stitch / Fivetran / dbt is far more reliable than one built on top of custom-built Airflow tasks. This is an empirical statement, not a theoretical one: I’m not saying it’s impossible to build a reliable Airflow infrastructure, just that most startups don’t. At Fishtown Analytics, we’ve worked with 100+ VC-backed data teams and have seen this play out over and over again. We’re consistently migrating people from custom-built pipelines onto off-the-shelf infrastructure, and in literally every single case the impact has been tremendously positive.
Engineers Shouldn’t Write ETL.
In 2016, Jeff Magnusson wrote a foundational blog post called Engineers Shouldn’t Write ETL. It was the first post I’m aware of where someone called out this change. Here’s my favorite part:
Data processing tools and technologies have evolved massively over the last five years. Unless you need to process over many petabytes of data, or you’re ingesting hundreds of billions of events a day, most technologies have evolved to a point where they can trivially scale to your needs. Unless you need to push the boundaries of what these technologies are capable of, you probably don’t need a highly specialized team of dedicated engineers to build solutions on top of them. If you manage to hire them, they will be bored. If they are bored, they will leave you for Google, Facebook, LinkedIn, Twitter, … — places where their expertise is actually needed. If they are not bored, chances are they are pretty mediocre. Mediocre engineers really excel at building enormously over complicated, awful-to-work-with messes they call “solutions”.
I love this section so much because it not only highlights why you don’t need data engineers to solve most ETL problems today, it also explains why you’re better off not asking them to solve these problems at all.
If you hire a data engineer and ask them to build pipelines, they will think their job is to build pipelines. This will mean that tools like Stitch and Fivetran and dbt will seem like threats to their existence instead of tremendous force multipliers. They’ll find reasons why off-the-shelf pipelines won’t actually suit your very custom data needs, and reasons why analysts shouldn’t actually be building their own data transformations. They’ll write code that is fragile, hard to maintain, and non-performant. And you’ll come to rely on this code because it’s underneath everything else your team does.
Avoid this situation like the plague. The pace of innovation on your data team will plummet, and you’ll spend all of your time thinking about infrastructure issues that aren’t actually revenue-generating for the business.
If Not ETL, Then…What?
So, do you still need data engineers on your startup data team? You do.
Even with the availability of new tools that empower data analysts and scientists to build self-service pipelines, data engineers are still a critical part of any high-functioning data team. However, the tasks they should focus on have changed, as has the sequencing in which you hire them. I’ll discuss the “when” question in a later section; for now, let’s talk about what data engineers are responsible for on modern startup data teams.
Instead of building ingestion pipelines that are available off-the-shelf and implementing SQL-based data transformations, here’s what your data engineers should be focused on:
- Managing and optimizing core data infrastructure
- Building and maintaining custom ingestion pipelines
- Supporting data team resources with design and performance optimization
- Building non-SQL transformation pipelines
Managing and Optimizing Core Data Infrastructure
While data engineers no longer need to manage Hadoop clusters or scale hardware for Vertica at VC-backed startups, there is still real engineering to do in this area. Making sure that your data technology is operating at its peak yields massive improvements in performance, reductions in cost, or both. That typically involves:
- Building monitoring infrastructure to give visibility into the pipeline’s status
- Monitoring all jobs for impact on cluster performance
- Running maintenance routines regularly
- Tuning table schemas (e.g. partitions, compression, distribution) to minimize costs and maximize performance
- Developing custom data infrastructure not available off-the-shelf
These types of efforts are often overlooked at earlier stages of a data team’s maturity, but become incredibly important as that team and the dataset grow. In one project we were able to cut BigQuery costs for building a table incrementally from $500/day to $1/day by optimizing table partitions. This stuff is important.
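To make the partitioning point concrete, here’s a minimal sketch of what that kind of tuning can look like on BigQuery, using the google-cloud-bigquery Python client. The project, dataset, table, and column names are all hypothetical, and your schema and incremental logic will differ; the idea is simply that partitioning the target table and constraining each rebuild to a single day keeps the scan (and the bill) small.

```python
# Hedged sketch: build a date-partitioned rollup table incrementally so each
# run only scans and rewrites one day of data. All names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="acme-analytics")  # hypothetical project

# Define the target table as partitioned by event_date and clustered by
# user_id so queries and incremental builds only touch the data they need.
table = bigquery.Table("acme-analytics.analytics.daily_event_counts")
table.schema = [
    bigquery.SchemaField("event_date", "DATE"),
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("event_count", "INT64"),
]
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_date"
)
table.clustering_fields = ["user_id"]
client.create_table(table, exists_ok=True)

# Incremental build: clear and reload only yesterday's partition. The range
# filter on the source table's partition column is what limits the bytes
# scanned, which is where the cost savings come from.
yesterday_filter = """
    event_timestamp >= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
    AND event_timestamp < TIMESTAMP(CURRENT_DATE())
"""

client.query("""
    DELETE FROM `acme-analytics.analytics.daily_event_counts`
    WHERE event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
""").result()

client.query(f"""
    INSERT INTO `acme-analytics.analytics.daily_event_counts`
    SELECT DATE(event_timestamp) AS event_date, user_id, COUNT(*) AS event_count
    FROM `acme-analytics.analytics.events`
    WHERE {yesterday_filter}
    GROUP BY 1, 2
""").result()
```

Tools like dbt will generate this kind of incremental logic for you from a model config; the sketch above just shows what’s happening underneath.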
One company that has gone far down this path is Uber. Data engineers at Uber built a tool called Queryparser that automatically monitors all queries run against their data infrastructure and gathers statistics about resource usage and utilization patterns. Uber’s data engineers then use that metadata to tune the infrastructure accordingly.
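Queryparser itself is an internal Uber tool built for their stack, but the underlying idea translates to any modern warehouse. Here’s a hedged, much more modest sketch of the same pattern on BigQuery, pulling per-user resource usage out of the jobs metadata exposed in INFORMATION_SCHEMA (the project name is hypothetical):

```python
# Sketch: find which users (or service accounts) scanned the most data over
# the last week, so you know where to focus tuning work. Assumes BigQuery's
# INFORMATION_SCHEMA.JOBS_BY_PROJECT views; the project name is hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="acme-analytics")

rows = client.query("""
    SELECT
      user_email,
      COUNT(*) AS query_count,
      ROUND(SUM(total_bytes_processed) / POW(10, 12), 2) AS tb_scanned
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE job_type = 'QUERY'
      AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY 1
    ORDER BY tb_scanned DESC
    LIMIT 20
""").result()

for row in rows:
    print(f"{row.user_email}: {row.query_count} queries, {row.tb_scanned} TB scanned")
```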
Data engineers are also often responsible for building and maintaining the CI/CD pipeline that runs the data infrastructure. While many data teams had extremely poor version control, environment management, and testing infrastructure in 2012, that’s changing, and it’s data engineers who are leading the charge.
Finally, data engineers at leading companies are often also involved in building tooling that doesn’t exist off-the-shelf. For instance, data engineers at Airbnb built Airflow because they didn’t have a way to effectively build and schedule DAGs. And data engineers at Netflix are responsible for building and maintaining a sophisticated infrastructure for developing and running tens of thousands of Jupyter notebooks.
You can get most of your core infrastructure off-the-shelf today, but someone still needs to monitor it and make sure it’s performing. And if you’re truly a cutting-edge data organization, you’ll likely want to push the boundaries on existing tooling. Data engineers can help with both.
Build and Maintain Ingestion Pipelines
While data engineers no longer need to hand-roll Postgres or Salesforce data transport, there are “only” about 100 integrations available off-the-shelf from the modern data integration vendors. Most of the companies we work with find that off-the-shelf connectors cover between 75% and 90% of their data sources.
In practice, integrations are implemented in waves. Typically, the first phase covers the core application database and event tracking, and the second phase covers marketing systems like an ESP (email service provider) and advertising platforms. These first two phases are available completely off the shelf today. Once you go deeper into your more domain-specific SaaS vendors, you’ll need data engineers to build and maintain these more niche data ingestion pipelines.
For example, ecommerce companies end up dealing with a ton of different products in the ERP / logistics / shipping domain. Many of these products are very specific to particular verticals, and almost none of them are available off the shelf. Expect your data engineers to build these for the foreseeable future.
Building and maintaining reliable ingestion pipelines is hard. If you decide to expend the resources to build one out, expect it to take longer than you initially budgeted for, and expect it to require more maintenance than you’d like. Getting to V1 is easy, but getting a pipeline to consistently deliver data to your warehouse is hard. Don’t make the commitment to supporting a custom data ingestion pipeline until you’re sure the business case is there. Once you do, invest the time and build it to be robust. Consider using Stitch’s open source Singer framework — we’ve built ~20 custom integrations using it.
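To give a sense of the shape of that work, here’s a minimal sketch of a custom Singer tap. It assumes the singer-python package and a hypothetical vertical-specific shipping API; a production tap would add config handling, pagination, incremental bookmarks, and error handling. The core pattern is just: emit a schema, emit records, emit state.

```python
# Hedged sketch of a Singer tap for a hypothetical shipping API. Requires
# `pip install singer-python requests`; endpoint and field names are made up.
import requests
import singer

SCHEMA = {
    "properties": {
        "shipment_id": {"type": "string"},
        "status": {"type": "string"},
        "updated_at": {"type": "string", "format": "date-time"},
    }
}

def main():
    # Declare the stream and its primary key so the downstream target knows
    # how to load and de-duplicate records.
    singer.write_schema("shipments", SCHEMA, key_properties=["shipment_id"])

    # Hypothetical niche API that no off-the-shelf vendor covers.
    resp = requests.get("https://api.example-shipping.com/v1/shipments")
    resp.raise_for_status()

    for shipment in resp.json()["shipments"]:
        singer.write_record("shipments", {
            "shipment_id": shipment["id"],
            "status": shipment["status"],
            "updated_at": shipment["updated_at"],
        })

    # Persist a bookmark so the next run can resume where this one left off.
    singer.write_state({"shipments": {"updated_at": "2019-01-01T00:00:00Z"}})

if __name__ == "__main__":
    main()
```

Because the tap just writes JSON to stdout, you can pipe it into any Singer target (for example `python tap_shipments.py | target-postgres`), which is part of what makes the framework such a good fit for these long-tail sources.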
Supporting Data Team Resources with Design and Performance Optimization for SQL Transformations
One of the shifts we’ve seen in data engineering in the past five years is the rise of ELT: the new flavor of ETL that transforms the data after it’s been loaded into the warehouse instead of before. The what and the why of this change are well-covered elsewhere; the reason I mention it here is that this shift has a tremendous impact on who builds these pipelines.
If you’re writing Scalding code to scan terabytes of event data in S3 and aggregate it to the session level so that it can be loaded into Vertica, you’re probably going to need a data engineer to write that job. But if your events data is already in BigQuery (loaded by Google Analytics 360), then it’s already fully addressable in a performant, scalable environment. The difference is that this environment speaks SQL. This means that data analysts can now build their own data transformation pipelines.
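As a concrete illustration, here’s the kind of sessionization transform an analyst can now own end-to-end. In a real team this SQL would typically live as a dbt model; it’s wrapped in a small Python script here only to keep the example self-contained, and all table names are hypothetical.

```python
# Hedged sketch: sessionize raw events entirely inside the warehouse (ELT).
# The transformation is plain SQL; Python only submits the job. Names are
# hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="acme-analytics")

SESSIONIZE_SQL = """
WITH flagged AS (
  SELECT
    user_id,
    event_timestamp,
    -- Start a new session after 30 minutes of inactivity (or on the first event).
    IF(
      LAG(event_timestamp) OVER (PARTITION BY user_id ORDER BY event_timestamp) IS NULL
      OR TIMESTAMP_DIFF(
           event_timestamp,
           LAG(event_timestamp) OVER (PARTITION BY user_id ORDER BY event_timestamp),
           MINUTE
         ) >= 30,
      1, 0
    ) AS is_new_session
  FROM `acme-analytics.analytics.events`
)
SELECT
  user_id,
  event_timestamp,
  -- Running count of session starts gives each event a session number.
  SUM(is_new_session) OVER (
    PARTITION BY user_id ORDER BY event_timestamp
  ) AS session_number
FROM flagged
"""

job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string(
        "acme-analytics.analytics.sessionized_events"
    ),
    write_disposition="WRITE_TRUNCATE",
)
client.query(SESSIONIZE_SQL, job_config=job_config).result()
```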
This trend started in earnest with Looker’s PDT feature release in 2014, and it has ramped up aggressively over the past two years, with entire data teams using dbt to build DAGs of 500+ nodes and process many-TB datasets. At this point, the pattern is deeply entrenched in modern data teams, and it has enabled analysts to self-serve in a way they never could before.
This shift to ELT means that data engineers don’t have to build most data transformation jobs. It also means that data teams without any data engineers can still get a long way with data transformation tools built for analysts. Data engineers still have a meaningful role to play in building these transformation pipelines, however. There are two key areas where data engineers should get involved:
- When performance is critical. Sometimes business logic requires some particularly heavyweight transformation, and it’s helpful to have a data engineer involved to assess the performance implications of a particular approach to building a table. Many analysts aren’t deeply experienced with performance optimization within MPP analytic databases, and this is a great opportunity for collaboration with someone more technical.