Welcome, Data School students! If you’re interested in the exciting world of data science, but don’t know where to start, Data School is here to help.
Step 0: Figure out what you need to learn
Data science can be an overwhelming field. Many people will tell you that you can’t become a data scientist until you master the following: statistics, linear algebra, calculus, programming, databases, distributed computing, machine learning, visualization, experimental design, clustering, deep learning, natural language processing, and more. That’s simply not true.
So, what exactly is data science? It’s the process of asking interesting questions, and then answering those questions using data. Generally speaking, the data science workflow looks like this:
- Ask a question
- Gather data that might help you to answer that question
- Clean the data
- Explore, analyze, and visualize the data
- Build and evaluate a machine learning model
- Communicate results
This workflow doesn’t necessarily require advanced mathematics, a mastery of deep learning, or many of the other skills listed above. But it does require knowledge of a programming language and the ability to work with data in that language. And although you need mathematical fluency to become really good at data science, you only need a basic understanding of mathematics to get started.
It’s true that the other specialized skills listed above may one day help you to solve data science problems. However, you don’t need to master all of those skills to begin your career in data science. You can begin today, and I’m here to help you!
Step 1: Get comfortable with Python
Python and R are both great choices as programming languages for data science. R tends to be more popular in academia, and Python tends to be more popular in industry, but both languages have a wealth of packages that support the data science workflow. I’ve taught data science in both languages, and generally prefer Python. (Here’s why.)
You don’t need to learn both Python and R to get started. Instead, you should focus on learning one language and its ecosystem of data science packages. If you’ve chosen Python (my recommendation), you may want to consider installing the Anaconda distribution because it simplifies the process of package installation and management on Windows, macOS, and Linux.
You also don’t need to become a Python expert to move on to step 2. Instead, you should focus on mastering the following: data types, data structures, imports, functions, conditional statements, comparisons, loops, and comprehensions. Everything else can wait until later!
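To make that list concrete, here is a minimal sketch that touches each topic once. The movie records and the `label` function are made up purely for illustration; they are not taken from any Data School material.

```python
import statistics  # an import from the standard library

# Data types and data structures: a list of dictionaries holding
# strings, integers, and floats. (Made-up records for illustration.)
movies = [
    {"title": "Alpha", "year": 1999, "rating": 8.1},
    {"title": "Beta", "year": 2005, "rating": 6.4},
    {"title": "Gamma", "year": 2012, "rating": 7.3},
]


def label(rating, threshold=7.0):
    """A function that uses a comparison inside a conditional statement."""
    if rating >= threshold:
        return "good"
    return "okay"


# A loop that builds up a list one item at a time.
labels = []
for movie in movies:
    labels.append(label(movie["rating"]))

# A comprehension that filters and transforms in a single expression.
recent_titles = [m["title"] for m in movies if m["year"] > 2000]

# Using the imported module.
mean_rating = statistics.mean(m["rating"] for m in movies)

print(labels)
print(recent_titles)
print(round(mean_rating, 2))
```

If you can read this snippet and predict what each `print` call shows, you probably know enough Python for the next step.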
If you’re not sure whether you know “enough” Python, scan through my Python Quick Reference. If most of that material is familiar to you, you can move on to step 2!
If you’re looking for a course to help you learn Python, here are a few recommendations:
- DataCamp (affiliate) offers a short, interactive course in beginning Python.
- Introduction to Python is a more substantial course in beginning Python that feels like an interactive textbook.
- Google’s Python Class is best for people with some programming experience, and includes lecture videos and downloadable exercises.
- Python Jumpstart by Building 10 Apps is an excellent video course taught by Michael Kennedy (host of the “Talk Python To Me” podcast).
Step 2: Learn data analysis, manipulation, and visualization with pandas
For working with data in Python, you should learn how to use the pandas library.
pandas provides a high-performance data structure (called a “DataFrame”) that is suitable for tabular data with columns of different types, similar to an Excel spreadsheet or SQL table. It includes tools for reading and writing data, handling missing data, filtering data, cleaning messy data, merging datasets, visualizing data, and so much more. In short, learning pandas will significantly increase your efficiency when working with data.
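As a rough illustration of a few of those tools, the sketch below builds two small DataFrames, fills a missing value, filters rows, merges the tables, and summarizes the result. The `orders` and `customers` tables and all of their values are invented for this example, and filling with the median is just one possible way to handle missing data.

```python
import numpy as np
import pandas as pd

# Two small, made-up tables (hypothetical data for illustration only).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 11, 10, 12],
    "amount": [25.0, np.nan, 40.0, 15.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "city": ["Austin", "Boston", "Austin"],
})

# Handling missing data: fill the missing amount with the median
# of the observed amounts.
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Filtering data: keep only orders of at least 25.
large_orders = orders[orders["amount"] >= 25]

# Merging datasets: attach each customer's city to their orders.
merged = orders.merge(customers, on="customer_id", how="left")

# Summarizing: total order amount per city.
totals = merged.groupby("city")["amount"].sum()

print(large_orders)
print(totals)
```

Real projects usually start with `pd.read_csv` or a similar reader rather than hand-built DataFrames, but the same operations apply once the data is loaded.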
However, pandas includes an overwhelming amount of functionality, and (arguably) provides too many ways to accomplish the same task. Those characteristics can make it challenging to learn pandas and to discover best practices.
That’s why I created a pandas video series (30 videos, 6 hours) that teaches the pandas library from the ground up. Each video answers a question using a real dataset, and the datasets are posted online so you can follow along at home. (I also created a well-commented Jupyter notebook that includes the code from every video.)
“Your videos are extremely helpful. I like that you use actual data sets and try a lot of different applications of the concept being discussed rather than just overly simplistic examples. Your content has helped me immensely!” - Sean Montague
If you would prefer a non-video resource for learning pandas, here are my recommended resources.
Step 3: Learn machine learning with scikit-learn
For machine learning in Python, you should learn how to use the scikit-learn library.
Building “machine learning models” to predict the future or automatically extract insights from data is the sexy part of data science. scikit-learn is the most popular library for machine learning in Python, and for good reason:
- It provides a clean and consistent interface to tons of different models.
- It offers many tuning parameters for each model, but also chooses sensible defaults.
- Its documentation is exceptional, and it helps you to understand the models as well as how to use them properly.
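The consistent interface is easiest to see in code. In this minimal sketch, two different models are trained and scored with exactly the same method calls. The choice of the built-in iris dataset, the two models, the 25% test split, and the `max_iter` value are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small built-in dataset and hold out 25% of it for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Two different models: one with default settings, and one with a
# single tuning parameter changed.
models = {
    "knn": KNeighborsClassifier(),
    "logreg": LogisticRegression(max_iter=1000),
}

# Every model is trained with fit() and evaluated with score().
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out data

print(scores)
```

Swapping in another classifier generally means changing only the line that creates the model.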
However, machine learning is still a highly complex and rapidly evolving field, and scikit-learn has a steep learning curve. That’s why I created a scikit-learn video series (9 videos, 4 hours), which will help you to gain a thorough grasp of both machine learning fundamentals and the scikit-learn workflow. The series doesn’t presume any familiarity with machine learning or advanced mathematics. (You can find all of the code from the series on GitHub.)
“Your videos are absolutely incredible. I have just completed the course on Machine Learning with Python and I can say I understood every single thing thanks to your excellent teaching style and skills.” - Guillaume B
If you would prefer a non-video resource for learning scikit-learn, I recommend either Python Machine Learning (Amazon / GitHub) or Introduction to Machine Learning with Python (Amazon / GitHub).
Step 4: Understand machine learning in more depth
Machine learning is a complex field. Although scikit-learn provides the tools you need to do effective machine learning, it doesn’t directly answer many important questions:
- How do I know which machine learning model will work “best” with my dataset?
- How do I interpret the results of my model?
- How do I evaluate whether my model will generalize to future data?
- How do I select which features should be included in my model?
- And so on…
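As one small illustration of the generalization question, cross-validation estimates how a model might perform on data it has not seen, and it can be used to compare candidate settings. This is only a sketch: the iris dataset, the K-nearest neighbors model, the candidate values of `n_neighbors`, and the choice of 5 folds are assumptions picked for the example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Compare a few candidate values of one tuning parameter.
results = {}
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: train on 4 folds, test on the 5th, and
    # rotate so that each fold serves as the test set once.
    fold_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[k] = fold_scores.mean()

# Pick the candidate with the highest mean accuracy across folds.
best_k = max(results, key=results.get)

print(results)
print(best_k)
```

A higher cross-validated score does not settle which model is “best”, but it gives a more honest comparison than accuracy measured on the training data.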
If you want to become great at machine learning, you need to be able to answer those questions, which requires both experience and further study. Here are some resources to help you along that path:
Step 5: Keep learning and practicing
Here is my best advice for improving your data science skills: Find “the thing” that motivates you to practice what you learned and to learn more, and then do that thing. That could be personal data science projects, Kaggle competitions, online courses, reading books, reading blogs, attending meetups or conferences, or something else!
- Kaggle competitions are a great way to practice data science without coming up with the problem yourself. Don’t worry about how high you place, just focus on learning something new with every competition. (Keep in mind that you won’t be practicing important parts of the data science workflow: asking questions, gathering data, and communicating results.)
- If you create your own data science projects, you should share them on GitHub and include writeups. That will help to show others that you know how to do reproducible data science. (If you don’t know how to use Git and GitHub, I have a short video series that will help you to master the basics.)
- There are an overwhelming number of data science blogs, but DataTau will help you to find the latest and greatest content.
- If you like email newsletters, my favorites are Data Elixir, Data Science Weekly, and Python Weekly.
- If you want to truly experience the Python community, I highly recommend attending PyCon US. (There are also smaller PyCon conferences elsewhere.) As a data scientist, you should also consider attending SciPy and the nearest PyData conference.
Your data science journey has only begun! There is so much to learn in the field of data science that it would take more than a lifetime to master. Just remember: You don’t have to master it all to launch your data science career; you just have to get started!
Join Data School (for free!)
My name is Kevin Markham, and I’m the founder of Data School. I’d be honored if you would join the Data School community by subscribing to the email newsletter:
- Fill out your name and email address in the left sidebar, and click “Join the Newsletter.”
- Find the confirmation email from Data School in your inbox, and click the link to confirm your email address.
As a subscriber, you’ll receive priority access to my online courses and live webcasts, and you’ll get notified about new Data School tutorials and videos.
Have a question? Please let me know in the comments section below!
Thank you so much for reading!