I'm all about ML, but let's talk about OR

You’ve studied machine learning, you’re a dataframe master for massaging data, and you can easily pipe that data through a bunch of machine learning libraries.

You go for a job interview at a SAAS company, you’re given some raw data and labels and asked to predict churn, and come on - are these guys even trying? You generate the shit out of some features, you overfit the hell out of that multidimensional manifold just so you can back off and show off your knowledge of regularization, and then you put the icing on the cake by cross validating towards a better metric for the business problem than simple accuracy.

Next.

You roll up on an ecommerce company, and they trick you by basically giving you no features. Just users, products, and clicks? Ha! Nice try, but you know that’s a classic recommender system. Boom! Out of nowhere, the ratings matrix is factorized, and you’re serving up top-notch recommendations a-plenty.

Next.

Your uncle calls you and asks for a “family favor”. He manufactures artisinal dog collars in a Brooklyn warehouse and sells directly to the consumer. He doesn’t know what the word churn means. He needs to decide how many of his 5 types of dog collars he should build next month with his limited supplies. Each collar in the catlog has a different profit margin, a different minimum build quantity, and there’s only a 10K budget. Uncle asks you, “How many of each collar should we build to maximize profit?”

Uhhhhh.

You panic and try to apply ML to everything. Hmmm, profit is a not a label but more a continuous variable. Okay, not a classification problem. Maybe this is a regression problem? Do I even want to predict the profit? Wait, what are my features? Shit, there’s like no data…

And so you write a program to enumerate all possible combinations of dog collar quantities along with the associated profit, and you find the maximum of the list of answers. Your uncle’s happy, but you feel like an idiot because you know that this isn’t scalable and god help you if you’re ever asked this in an interview and there’s just got to be a better way, right?

Well, there’s a better way.

There’s a better way! And I feel like nobody talks about it because the Data’s not Big, you’re not Learning Deep things, and there’s nary a chatbot in sight. It’s boring, old operations research which was something that I guess your university offered but nobody really knew what it meant. Full disclosure: I still don’t really know what it means. I do know that the job of the data scientist is to bring value to the company, and having some operations and optimization in your toolbelt is quite valuable!

Problem formulation¶

Let’s try to solve Uncle’s problem to get a feel for how we think about these things. I am not going to in any way cover how programs actually solve these problems (because I still don’t know) but instead how one sets up this problem and asks a solver to solve this in Python.

What was the original question? “How many of each collar should we build to maximize profit?” Maximize profit is the key phrase here. The way to think of optimization problems like this are in terms of the objective function and the constraints. We want to maximize profit, so our objective function will be the total profit from all collars produced (assuming all get sold). If we say that we have a variable, $c_{j}$, which says how many of collar $j$ we will produce, and $p_{j}$ is a constant denoting the profit that we will make on that collar, then our objective function is simply

We want to maximize this function subject to specific constraints. One constraint is that we only have a 10K budget. The simplest method of solving these types of problems is to keep all constraints linear. If we have a known constant $w_{j}$ which is the cost in dollars of producing collar $j$, then the constraint could be expressed in mathematical form as

Isn’t this fun? All of these optimization problems consist of figuring out how to define your constraints in terms of math which is basically like the ultimate nerd-puzzle. The constraint on figuring out your constraints (see what I did there?) is that you should keep your constraints linear. This can get fairly tricky (read: fun).

Alright, let’s figure out the last two constraints, and then we can start coding this thing up. If $r_{j}$ is the minimimum run size for each collar, then this is relatively simple:

By the way, that means “for all $j$, $c_{j}$ is greater than or equal to $r_{j}$”. We can lastly deal with the limited “supplies” for building collars by assuming that $m_{i}$ is the max quantity of supply $i$. If each collar $j$ requires $s_{ji}$ quantity of supply $i$, then the limited supplies constraint is written

Math $\Rightarrow$ Code¶

At Birchbox, I programmed and solved a number of optimization problems, and I had the luxury to use Gurobi, a state of the art mixed-integer programming solver with API’s in many languages. What does that mean? That means you can build your problem in the language of your choice, the API’s will translate your code into whatever Gurobi likes as input (C?), and then the Gurobi solver will solve the problem. Gurobi is most definitely not free, so I won’t use it in this blog post on the off chance that you want to program along. Instead, we’ll use GLPK as our solver and pulp as the Python API. I’ve included instructions for installing both libraries on Linux at the bottom of this post.

We’ll start by importing our libraries and placing some made up data into two dataframes to get an idea of how the data’s organized. The columns with material names are assumed to denote the “quantity” that the collar requires. For the sake of just trying to learn something, let’s ignore what “0.70 metal” might mean and move along.