SQL is Not Dead and How to Make it Part of Your Toolbox
Even as SQL draws ever closer to being half a century old, it continues to be relevant for the majority of the business world. One reason for SQL’s continued reign as the main data query language is its ease of use. Back when SQL was being built at IBM, their goal was to create a language that mirrored English language. For new users, SQL becomes easily readable once you learn the order and meaning of each keyword.
SQL allows faster and easier access to data rather than creating a Python script or an Excel spreadsheet. An often overlooked benefit is that once you know one iteration of SQL, picking up other iterations becomes trivial, which opens a door to a wide variety of technology for data management. I will break the benefits and reasons down below by job title: Product Managers, Data Analysts, Data Scientists, and Data Engineers.
Product managers are responsible for the success of their products, and are interested in understanding how users engage with all areas of the product. Being able to answer these questions is often difficult, as you often must rely on data analysts to get you these answers. While there are analytics tools like Mixpanel and Google Analytics, these don’t capture all relevant details about the product you’re interested in exploring. Learning SQL gives you less overhead to gain insights about your product.
Many data analysts initially use Excel for data management and analysis. There’s absolutely nothing wrong with Excel. Excel is more flexible for iterative analysis than SQL. However, Excel fails miserably in one area: scalability. Eventually, all users of Excel will run into scalability issues. Larger CSV files can crash Excel easily. This is where SQL comes in. Now you have the ability to analyze much larger datasets than before. Essentially, SQL becomes your scalable pre-processing database system that sits in front of Excel, which continues to become that “last mile” of analysis.
“Data Scientists spend 90% of their time cleaning data and 10% analyzing it.” The biggest showstopper for data scientists isn’t an algorithm or lack of domain knowledge but quick access to clean data. New data scientists, particularly those with a computer science background, tend to use a variety of scripting languages to gain access and manipulate data. This approach tends to be more tedious,time-consuming, and brittle than using a tool built specifically for data access and manipulation. Learning SQL makes you a more self-reliant data scientist and allows you to expand the range of accessible data sources and do it more easily and iteratively.
Data Engineers are the backbone of every data pipeline. They are the architects, the builders, and the maintainers of each piece of the data pipeline, from collection, ingestion, storage and processing. Data Engineers do all the heavy lifting so that everyone else in the organization can access data safely and efficiently. For Data Engineers, knowing SQL is a necessity because relational and analytics databases with a SQL interface continue to be the most popular. MySQL, PostgreSQL, Redshift, BigQuery, DashDB, and Hive all fall under this category. Unlike the other three groups, Data Engineers not only need to master writing queries, they also need to know how to manage the administration piece of databases through SQL. Since Data Engineers tend to be looked up to as the experts of the system, knowing how to optimize for query performance is also needed.
So where can beginners start learning SQL? Luckily, there are plenty of online resources to learn from. Here are some places to get you started:
- Learn SQL by Calculating Customer Lifetime Value Part 1: Setup, Counting and Filtering
- Learn SQL by Calculating Customer Lifetime Value Part 2: GROUP BY and JOIN
- Pivot your Excel skills into SQL
- Postgres Guide
- Amazon Redshift COPY Command Guide (for Data Scientists and Engineers)
- Periscope Data‘s Redshift Guide for Analysts