Let’s talk about compilers. Python files are human-readable, but not directly executable by the machine. To become executable, the files are transformed into bytecode, which looks something like the example below. Machine-readable does not mean human-readable, and it doesn’t have to be.
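For instance, here’s roughly what the standard library’s dis module shows for a trivial function. This is a sketch; exact opcodes and layout vary by CPython version:

```python
import dis

def add(a, b):
    return a + b

# Disassemble the function's compiled bytecode
dis.dis(add)
# Typical output (CPython 3.x; opcode names vary by version):
#   2           0 LOAD_FAST      0 (a)
#               2 LOAD_FAST      1 (b)
#               4 BINARY_ADD
#               6 RETURN_VALUE
```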
The template .gitignore file GitHub suggests for Python projects always has *.pyc in it. Why? Because .pyc files don’t really matter to the average developer. Sure, if you’re chasing every last bit of computational efficiency, maybe they matter, but at that point why not move to a compiled language like C or C++? For many developers, and even more analytics professionals, the juice isn’t worth the squeeze. Python rose in popularity because of its ease of use and readability, and it’s still among the five most popular languages today.
I’d consider the Python language an abstraction over the compiler, and I appreciate it. Compilers interest some people, but no part of me wants to think about them in my day-to-day.
When should we care about what happens under the hood?
Abstraction: Efficient or Harmful?
I haven’t met anyone who thinks Python’s existence is harmful or makes for worse programmers (and if you’re reading this and are that person, I’m really curious about your point of view, so please reach out). Some abstractions make people’s jobs easier and more efficient; others lead users down a misleading path.
Example 1: AWS Lambda
AWS Lambda is a serverless compute service. What that really means is that Lambda, as a service, lets users run short bursts of work without having to worry about the infrastructure underneath. However, using Lambda functions for absolutely everything can spiral out of control if not managed properly; the user still has to think about security, cost, and interdependency. (This is more a comment on AWS Lambda specifically than on serverless compute as a concept.)
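To give a sense of the abstraction, a Lambda function can be as small as a single handler; the servers, scaling, and patching underneath are AWS’s problem. A hypothetical minimal sketch:

```python
# Hypothetical minimal AWS Lambda handler (Python runtime).
# AWS invokes this function once per event; there are no servers to manage.
def handler(event, context):
    # event carries the trigger payload, e.g., an API Gateway request body
    name = event.get("name", "world")
    return {"statusCode": 200, "body": f"hello, {name}"}
```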
Verdict: Mostly efficient but can be harmful if not managed.
Example 2: Fivetran
The group of people interested in writing ETL pipelines all day is maybe slightly larger than the group genuinely interested in compilers, but it isn’t a tremendously large one. Fivetran sells ETL as a service: moving data from cloud software to a warehouse is as easy as providing a credit card and credentials for the source and destination. The existence of such a service lets analytics teams focus on what they’re uniquely good at: analysis.
Verdict: Efficient.
Example 3: Domo’s DataFlow product
The epitome of a UI on top of SQL without reviews, versioning, tests, or decent alerting. I’m being a little harsh: Domo DataFlows can be certified after creation, which serves as a review process, but the real issue is the lack of pipeline testing. Joining two datasets without thinking analytically about the relationship between the join keys can easily inflate numbers without anyone realizing it. I’m all for education on how to think analytically, but letting anyone and everyone join data out in the wild is a recipe for disaster.
Verdict: Harmful.
So I’ve drawn a line in the sand about which abstractions are “efficient” and which are “harmful,” though harm can mean a lot of different things in this context.
The cost of too much DIY
Abstraction, in a technical sense, practically means that more people can do the abstracted thing. Abstracting over ETL? Anyone who understands the data but not APIs can run an extraction. Abstracting over SQL? Anyone who understands a UI but not SQL can merge data together.
I’d argue you don’t really need to understand how the API works to fully understand the data it produces. Instead, you need to understand how the source tool works, and the object representations within it, to make proper use of the data. Interacting with an API is just a means to an end; if done wrong, data can be missing. However, handing extraction to a third party means trusting that all the data will in fact be extracted correctly (an assumption in itself, but I’ll gloss over that for now).
The fundamental difference between ETL tools and SQL GUIs is that SQL GUIs (like Domo DataFlows) aren’t just a means to an end: they make aggregating data faster by forgoing steps that are typical in data aggregation work, like testing. Consider one dataset that represents orders in a store and another that represents purchase orders (POs) for shipments. It would be easy to assume that each order has exactly one PO; however, this is often not the case. How often have you received an email from Zappos or Amazon saying that items in one order were split into two shipments because it’s faster? Personally, many times. Joining these two datasets to get ship date while also summing order margin won’t be accurate: orders with multiple POs will have their margin counted multiple times.
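To make the fan-out concrete, here’s a toy sketch with hypothetical data: one order, two POs, and the order’s margin gets counted twice after the join.

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1], "margin": [40.0]})
pos = pd.DataFrame({
    "order_id": [1, 1],  # one order shipped as two POs
    "po_id": ["A", "B"],
    "ship_date": ["2021-06-01", "2021-06-03"],
})

# The join fans out: the single order row becomes two rows
joined = orders.merge(pos, on="order_id")

print(joined["margin"].sum())  # 80.0 -- double the true margin of 40.0
```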
Usually in data transformation, we have data tests that check uniqueness on the primary key. In this join, that would be uniqueness on the order ID, and that test would fail, as the sketch below shows.
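Continuing the toy example, here’s a minimal sketch of such a test (a hypothetical helper, similar in spirit to dbt’s unique test):

```python
import pandas as pd

def assert_unique(df, key):
    """Fail if any value of `key` appears more than once."""
    dupes = df[df.duplicated(subset=key, keep=False)]
    assert dupes.empty, f"{key} is not unique:\n{dupes}"

orders = pd.DataFrame({"order_id": [1], "margin": [40.0]})
pos = pd.DataFrame({"order_id": [1, 1], "po_id": ["A", "B"]})
joined = orders.merge(pos, on="order_id")

assert_unique(orders, "order_id")  # passes: one row per order
assert_unique(joined, "order_id")  # raises: order 1 appears twice
```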
The primary issue with SQL GUIs is that neither testing nor manual QA happens, because the user is optimizing for speed to results.
A quick but wrong result leads to incorrect decisions. In this case, that means misjudging which orders or vendors perform better, and potentially severing relationships with vendors that aren’t actually performing poorly.
The cost of democratizing data aggregation is wrong decisions, and in my book, that’s too high a cost.
What next?
It really boils down to deciding when to use a no-code tool (call it an abstraction) and when not to. In this decision-making process, I would encourage asking the following questions:
Does the no-code tool remove layers of QA that are necessary to ensure accuracy? If so, don’t go with it unless you replace those layers elsewhere.
Does the no-code tool accomplish a task that would otherwise consume a team’s time without that team adding any unique value to the solution? If so, it’s full steam ahead.
Thanks for reading! I love talking data stacks so please reach out. I’d love to hear from you.