Ever since I started writing about my transition from academia to industry (both my reasons for leaving and what I think about the transition in retrospect), I’ve been receiving a lot of requests for advice on making that transition. These requests come from former peers, former professors looking to advise students, or just people who read one of my blog posts online.
I’m always happy to reply to these requests as best I can given my experiences, but since so many people seem interested, I figured it might be useful to put my thoughts in a blog post.
What do I need to know to be a data scientist?
My main message to most of the people who have been asking me for advice is not to worry too much. People seem to think you need to know everything to be a data scientist. If you have a PhD in the sciences or any quantitative discipline, you’re likely most of the way to having the skills you need, so don’t be scared to take that next step (by just looking at and applying to jobs). Put yourself out there and let potential employers decide whether you have the skills to work for them or not; don’t do their job for them.
If you really want to know what I think are the essentials to get a first data science position, the main classes of skills you need to have should come as no surprise:
- Be a decent programmer: Python and R are the most common languages. Being comfortable in both opens more doors, but you can make do with one or the other. SQL is also used frequently, so at least do a short tutorial on it (basic queries are simple enough that it’s not super difficult to pick up the necessities).
- Know some machine-learning: You should understand basic machine-learning concepts and have at least some experience with some of the most common algorithms (like tree-based methods, logistic regression, dimensionality reduction, etc.).
- Know some traditional stats: You should know about p-values, regressions, t-tests, and statistical significance.
These are the bare essentials. You’ll need to be more than just ‘decent’ in at least one of these categories (which one depends on the job, read on), but the point is, these should be skills you already have or could gain in a short period of time through some additional reading/practice. The skills involved in data science are so broad that no one is an expert in all of them. As long as you know the basics and have some areas where you’re particularly strong, you should be able to find somewhere that your skills could be useful.
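To give a rough sense of what ‘basic queries’ in SQL look like, here’s a minimal sketch that runs a couple of them from Python using the built-in sqlite3 module. The table name, its columns, the companies, and the rows are all made up for illustration; real jobs will use whatever database and schema the team already has.

```python
import sqlite3

# Hypothetical toy data, invented purely for this example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE applications (company TEXT, role TEXT, got_interview INTEGER)"
)
rows = [
    ("Acme", "Data Scientist", 1),
    ("Acme", "Data Analyst", 0),
    ("Globex", "Data Scientist", 1),
    ("Initech", "Data Scientist", 0),
    ("Initech", "ML Engineer", 0),
]
conn.executemany("INSERT INTO applications VALUES (?, ?, ?)", rows)


def interview_rate_by_role(connection):
    """Aggregate with GROUP BY: applications and interview rate per role."""
    query = """
        SELECT role,
               COUNT(*) AS n_applied,
               AVG(got_interview) AS interview_rate
        FROM applications
        GROUP BY role
        ORDER BY n_applied DESC, role ASC
    """
    return connection.execute(query).fetchall()


def interviews_at(connection, company):
    """Filter with WHERE: roles at one company that led to an interview."""
    query = """
        SELECT role
        FROM applications
        WHERE company = ? AND got_interview = 1
        ORDER BY role
    """
    return [r[0] for r in connection.execute(query, (company,))]


for role, n_applied, rate in interview_rate_by_role(conn):
    print(f"{role}: {n_applied} applied, interview rate {rate:.2f}")
print(interviews_at(conn, "Acme"))
```

If SELECT, WHERE, GROUP BY, and ORDER BY in a snippet like this make sense to you, you probably have most of the necessities covered; joins are the natural next thing to practice.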
Other skills that are useful
Some people seem convinced that they need an understanding of every big data technology under the sun to be a data scientist. To the uninitiated, the bizarre names you might hear (Hadoop, Spark, MongoDB, etc.) just sound intimidating. The truth is, they’re not that complicated, and you don’t need to be an expert in them to get a data science job.
Different data science teams use different tools. Some may use some big data tools for everything and really need someone who can jump into that workflow right away. But in my experience, the majority of data science jobs have these skills listed as ‘Preferred but not required’, if they’re listed at all.
Knowing specific packages and algorithms might be required by specific jobs. The best way to figure out what you might need to know for the kinds of jobs you’re interested in is to just start looking. Go to Indeed or any other job posting site and browse the listings. Find some that interest you, figure out what skills they want, and get to work on those skills.
Where do I learn these skills?
Once you get a sense of what skills you might need to gain or strengthen, where do you go to get them? For almost any package under the sun, it’s easy to just google a tutorial or free online course to learn the basics.
Some people opt to go to boot camps. My overall opinion is that they are good for getting a general overview if you can afford to take a couple of months off full-time, but I don’t think they are necessary.
Start applying/interviewing as early as possible
I find I learn best by being forced to solve problems. I think what taught me the most about being a data scientist was applying and interviewing. Interviewers will often give you a coding challenge or small project to work on at home before they bring you in for an in-person interview. These can be anything from a quick one-hour coding exercise to a multi-day project where you have to implement a recommendation engine. These are great for learning, because they’re nice small projects that are, by definition, what interviewers want you to know. Treating them as a learning exercise helps in two ways. First, they’re an efficient way to learn skills. Second, it takes the edge off when, inevitably, you get rejected repeatedly after interviewing. Interviewing is hard, and it can really suck to get rejected after you’ve put in a lot of time and effort and fallen in love with the idea of working at a particular company. A better mentality is to treat an interview as a success as long as you’ve learned something.
There are plenty of other good resources for learning some of the essential skills for being a data scientist. Here’s a woefully incomplete guide:
- To brush up on programming, try the Cracking the Coding Interview resources at HackerRank.
- For machine-learning, An Introduction to Statistical Learning is excellent. The Elements of Statistical Learning is more comprehensive but denser.
- To just learn some basic packages and data skills, try working your way through some of the Titanic tutorials at Kaggle.
Conclusion
I think my main piece of advice is this: You’re probably closer than you know to getting a data science interview, and you should take the next steps (looking at job listings, or applying to some of them) as early as possible to better guide your search. Good luck!