From Business Analyst to Data Scientist: the 4 S’s
I used to work in finance as a quantitative analyst, a role best described as “data scientist in finance.” Because of that, my friends with a business background sometimes ask me how they can prepare themselves to become data scientists. “Do I need to know machine learning?” “R or Python?” “What’s the best way to learn about Hadoop?”
Here are the spoilers: it’s neither sufficient nor necessary that you know any particular technology, technique or tool. Instead, focus on honing your skills in the following areas: Statistics, Scripting, SQL and Skepticism.
Statistics is the common language to reason intelligently about data, and it’s really important that you have a solid understanding of it. The emphasis here is on the basics: what’s the difference between sample mean and population mean? What are the assumptions behind linear regression (ex: what do you need the normality assumption of error terms for)? It’s okay if you do not know what random forest or SVM are when you are staring out. However, do make sure you get into the habit of deeply understanding how different statistical techniques work and what their limitations are.
Also, you need to be familiar with statistical software to carry out your computation and communicate your findings. Python (Pandas) and R are popular among data scientists, but Excel is surprisingly capable (for example, read John Foreman’s excellent blog series and book).
A big part of a data scientist’s job is collecting and cleaning data, which means you need to know how to program. Good news is that you do not need to be a prodigy hacker or master architect. You just need to know enough to get your job done, whether that is downloading data via API or writing a data transformation script. Python seems to be a popular choice among data scientists (thanks to NumPy/SciPy/Pandas), but any language suffices. Make sure you know at least one programming language well so that you can work with data on your own. If you are constantly bugging software engineers (and DBAs, see the next section), you aren’t a very good data scientist.
(SQL authoring interface for Treasure Data)
It’s absolutely essential that you know your way around SQL. Whether appropriate or not, much of interesting data is stored in storage systems that speak SQL, and without a good handle of SQL, you can’t access your data effectively. Ideally, you shouldn’t need to bother DBAs more than once, and that’s to get read-only database access. From that point on, you should be able to write your own queries to get the data you want.
There are many good tutorials to learn the basics of SQL. SQL has different dialects, but the general idea is more or less the same, so don’t worry too much about which dialect you learn first.
Data science in the trenches is quite messy. Data is hardly ever clean on day one, and a lot of real world data doesn’t form a nice bell-shaped curve. Big data is the phrase du jour, but sometimes you do not have enough data points to yield sufficient statistical power. In other words, there are a lot of ways in which your analysis can lead you astray.
One of your key roles as a data scientist is to prevent your organization from abusing/misusing data. And that starts with having a healthy level of skepticism about each step of data analysis. Does the data look correct? Does the data satisfy all the assumptions required by the technique I am about to use? What level of uncertainty do we have about our conclusion?
Unlike the preceding three S’s, skepticism is not something you can learn by reading books or doing tutorials. The best way to acquire this skill is through experience: by working with a lot of data, you form a habit of questioning various basic assumptions and catching issues before they propagate.