When faced with a new classification problem, machine learning practitioners have a dizzying array of algorithms from which to choose: Naive Bayes, decision trees, Random Forests, Support Vector Machines, and many others. Where do you start? For many practitioners, the first algorithm they reach for is one of the oldest in the field: logistic regression.
Here are just a few of the attributes of logistic regression that make it incredibly popular: it’s fast, it’s highly interpretable, it doesn’t require input features to be scaled, it doesn’t require any tuning, it’s easy to regularize, and it outputs well-calibrated predicted probabilities.
But despite its popularity, it is often misunderstood. Here are a few common questions about logistic regression:
-
Why is it called “logistic regression” if it’s used for classification?
-
Why is it considered a linear model?
-
How do you interpret the model coefficients?
As a teacher, I’ve found that my best lessons are the ones in which I explain a topic step-by-step in the way that I wish it had been taught to me. I struggled when I was learning logistic regression, which is why I’m so pleased to have written a lesson that may help you to grasp this challenging topic.
In order to give you additional context for the lesson, I created this guide that includes suggested prerequisites, a practical exercise, and a lengthy set of additional resources to allow you to go deeper into this topic.
Please note that the lesson code is written in Python, and so you will get the most out of it if you are a user of Python and scikit-learn. However, most components of this guide cover conceptual or mathematical material, and should be useful to all readers regardless of programming background.
I’d love to hear from you in the comments below! What questions do you have about logistic regression? Is this kind of guide helpful to you for learning a new topic? Are there other guides you would like me to create?
Prerequisite Knowledge
Mathematical terminology:
Machine learning:
Linear regression:
scikit-learn (optional):
- For a walkthrough of the classification process using Python’s scikit-learn library, watch videos 3 and 4 (35 minutes) from my scikit-learn video series. (Here are the associated notebooks.)
Logistic Regression Lesson
My logistic regression lesson notebook covers the following topics using the glass identification dataset:
-
Refresh your memory on how to do linear regression in scikit-learn
-
Attempt to use linear regression for classification
-
Show you why logistic regression is a better alternative for classification
-
Brief overview of probability, odds, e, log, and log-odds
-
Explain the form of logistic regression
-
Explain how to interpret logistic regression coefficients
-
Demonstrate how logistic regression works with categorical features
-
Compare logistic regression with other models
Practical Exercise
As a way to practice applying what you’ve learned, participate in Kaggle’s introductory Titanic competition and use logistic regression to predict passenger survival. Kaggle links to helpful tutorials for Python, R, and Excel, and their Scripts feature lets you run Python and R code on the Titanic dataset from within your browser.
Further Reading
Further Reading (for scikit-learn users)
-
If you’re a scikit-learn user, it’s worth reading the user guide and class documentation for logistic regression to understand the particulars of its implementation.
-
If you’d like to improve your logistic regression model through regularization, read part 5 of my regularization lesson notebook.