Why SQL is the base knowledge for data science


What is SQL?

SQL (Structured Query Language) is the standard language for creating and maintaining relational databases and for retrieving data from them. Developed in the 1970s, SQL has become a very important tool in a data scientist’s toolbox because it is how you access, insert, update and otherwise manipulate data. It lets you communicate with a relational database so that you can understand the dataset and use it appropriately.
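To make those operations concrete, here is a minimal sketch that creates a table, inserts rows, updates one and reads the result back. It assumes an in-memory SQLite database driven through Python's built-in `sqlite3` module; the `customers` table and its contents are hypothetical, invented purely for illustration.

```python
import sqlite3

# In-memory SQLite database; the "customers" table is a made-up example.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create: define a table.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Insert: add rows, using placeholders instead of string formatting.
cur.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Grace", "New York")],
)

# Update: modify an existing row.
cur.execute("UPDATE customers SET city = ? WHERE id = ?", ("Paris", 1))

# Retrieve: read the data back in a defined order.
cur.execute("SELECT id, name, city FROM customers ORDER BY id")
rows = cur.fetchall()
print(rows)

conn.close()
```

The SQL statements themselves are standard; only the connection code is specific to SQLite and Python.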

Here are some practical examples of why SQL is the base knowledge for data science

Whenever I get caught up in the debate over which language to learn for data science, the showdown always seems to be Python vs R. However, on the first day of your first job in data science, you will likely be introduced to the data warehouse. This is where the data you will analyze lives, and you get at it by writing SQL queries.

A lot of companies use SQL. Besides, even if you don't use it in your current or first data job, you will very likely use it at some stage in your data science career.

Even the major cloud providers now offer relational databases as a service.


As an example: you work with unstructured data in the big data environment and find a variable that is worth reusing. Typically, the next step is to have the big data team make it available in the data warehouse, so that before long it can be queried from one of the tables in the database.

Now you could just spend more time in the big data environment, but queries on structured data typically run much faster in the relational database.

Now you are probably thinking: well, I can just have the big data team pull that for me and everything will be great! But you and I both know how important it is to understand how the tables are related, the logic behind the data, and all of the intricacies and nuances of the fields. Fully understanding the potential bias and the caveats that go along with your model is what allows you to communicate those caveats to the business.

If you're putting your name on a query that feeds a model, then you need to be able to investigate the data when something goes wrong. You will need to find answers quickly, and that is difficult when you are relying on data that someone else has pulled for you.

Additionally, when pulling data from the database into Python or R, particularly with complex queries that join multiple tables, it makes sense to write your query in SQL first. Errors such as a misspelled name are much easier to catch and track down directly in SQL than when you write the query inside Python and then find that it doesn’t run for some reason.
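One way to follow that workflow is to keep the query as a single block of SQL that was developed and checked in a SQL client, and have Python do nothing more than run it. The sketch below assumes an in-memory SQLite database and two hypothetical tables, `customers` and `orders`; in practice the connection would point at your own warehouse and the tables would already exist.

```python
import sqlite3

# The query, written and tested in a SQL client first, then pasted in unchanged.
# Table and column names are hypothetical.
QUERY = """
SELECT c.name,
       COUNT(o.id)   AS n_orders,
       SUM(o.amount) AS total_spent
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY c.name
"""

# Stand-in for the data warehouse: a small in-memory database with sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
""")

# Python only executes the query and collects the rows.
result = conn.execute(QUERY).fetchall()
print(result)

conn.close()
```

Because the SQL lives in one place and is not assembled from Python fragments, the same text can be pasted back into a SQL client whenever it needs debugging.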

Run from Python, a failed query tends to surface as an exception at the end of a traceback, whereas a SQL client can give you more direct hints as to where the problem lies.
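How much detail reaches you in Python depends on the database driver. As a small sketch, assuming SQLite through the standard `sqlite3` module and a hypothetical `customers` table, the database's own message arrives as the text of the exception, so it can be captured and read without digging through the traceback:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

try:
    # "nmae" is a deliberate misspelling of the "name" column.
    conn.execute("SELECT nmae FROM customers")
    error_message = None
except sqlite3.OperationalError as exc:
    # The exception text is the message produced by the database engine.
    error_message = str(exc)
finally:
    conn.close()

print(error_message)
```

With SQLite the message names the misspelled column. Other databases and drivers word their errors differently, so treat this as an illustration rather than a rule.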

For most data science jobs, you will need to be proficient in SQL. Data science involves dealing with large datasets stored in databases, and solving the problems in your project takes real fluency in SQL.

As programming languages go, it is a simple one to learn but a very valuable one: a walk in the park compared to Java or R.
