Repairing the Plumbing of Data-Science


Would you hire a pipefitter rather than a plumber to install or repair a pipe? Possibly. After all, the area of expertise is similar in both instances, right?

A pipefitter and a plumber are both experts in dealing with pipes. While that is an accurate overview of each, the two are very different in terms of skills and toolset. The same analogy applies to the hiring process used by some companies within the data science sector.

With a lack of clear understanding regarding the purpose of a data scientist, some companies automatically assume a data scientist is a one-size-fits-all solution and a more efficient option than investing in a data engineer as well.

This misunderstanding can be detrimental to both the efficiency of a company and the perceived competency of a data scientist.

Let’s go back to the simple analogy of hiring a pipefitter to install or repair a pipe. Yes, a pipefitter, one can assume, is very familiar with pipes. However, their knowledge concerns a very particular type of pipe, primarily those made of materials designed to handle high pressure and contain chemicals or acids.

The point I’m getting at here is that while the basic concept is similar, the actual materials and tools being used are completely different. Simply put, data engineers are “plumbers” required to build a data pipeline for a company, whereas data scientists are the “painters” or “artists” giving meaning to an otherwise static entity.

Without one, the other’s job cannot be wholly fulfilled. However, when combined effectively, they have the potential to transform raw, useless data in ways that could give their company a competitive edge.

This is not to suggest there is no overlap between the two; both share similar skills and responsibilities. The difference is where the focus of each lies. For a data engineer, the focus is building, maintaining and ensuring optimal performance of infrastructure and architecture for data generation.

Data scientists use this infrastructure to focus on advanced mathematics and statistical analysis to conduct high-level market and business operational research to identify trends and relationships.

In order for a company to run at maximum efficiency, investment is needed to recruit experts in each of these fields. However, in today’s world, where shareholders are concerned with saving money, we are seeing companies hiring data scientists expecting them to be able to do the work of a data engineer.

The result is a data scientist performing below their ability because the infrastructure they need to work effectively is missing, forcing them to build it themselves. Some companies only work at 20-30% efficiency as a result of this malpractice.

While data scientists do learn some of the skills associated with data engineering, they do so out of necessity rather than desire. Therefore, while they are equipped with the basic toolset of a data engineer, they lack the knowledge of which tool is best for each job, something a data engineer is specifically trained in.

A less common case is the data engineer being required to do the work of a data scientist. In this instance, an upward push is required as the data engineer needs to improve their statistical knowledge.

However, as Noah Gift discusses in his article, data science is becoming more standardised. As software such as Google’s AutoML and DataRobot takes over areas where data scientists were previously required, this upward push is becoming more common. It is resulting in the emergence of a new breed of engineer: the machine learning engineer.

Data scientists generally come from an academic background and therefore, they would usually prefer to write a paper on a specific problem, rather than get something into production.

In terms of programming abilities, a data scientist is usually limited to creating something in R – which is an issue unto itself. Data scientists rarely think about building a system the way a data engineer does.

The role of a machine learning engineer is to take the work done by the data scientist and recode it to make it run more efficiently, before passing it on to the data engineer who will be able to incorporate it into their infrastructure.

This results in a far smoother-running machine overall. Without the machine learning engineer, the data would reach the data engineer in a format that works but is far from optimal.

As machine learning engineers generally come from a data engineering background, they will have the tools and know-how to complement both the data scientist’s and data engineer’s work.

In simple language, the machine learning engineer serves as the cream filling of the Oreo between the data scientist and data engineer biscuits.

In order for companies to get the most out of their data science division, they need to take a step back and determine if they have indeed hired the right people. They must ask themselves: Do their hires have the correct skillset to perform the job they were hired to do? Do they have the correct toolset to help maximise the company’s efficiency?

While saving money may be important, hiring a skilled data scientist and expecting them to perform to the best of their ability both in their own field and in another they are not sufficiently trained in is damaging to both the company and the individual.

It’s time to hire a plumber to repair the pipeline, and an interior designer to decorate the office.

Liam McNamara
