Python vs R: Head to Head Data Analysis

Which is better for data analysis?

There have been dozens of articles written comparing Python and R from a subjective standpoint. We'll add our own views at some point, but this article aims to look at the languages more objectively. We'll analyze a dataset side by side in Python and R, and show what code is needed in both languages to achieve the same result. This will let us understand the strengths and weaknesses of each language without conjecture. At Dataquest, we teach both languages, and we think both have a place in a data science toolkit.

We’ll be analyzing a dataset of NBA players and their performance in the 2013-2014 season. You can download the file here. For each step in the analysis, we’ll show the Python and R code, along with some explanation and discussion of the different approaches. Without further ado, let’s get this head to head Python vs R matchup started!

Importing a CSV

R

library(readr)
nba <- read_csv("nba_2013.csv")

Python

import pandas
nba = pandas.read_csv("nba_2013.csv")

The above code will load the CSV file nba_2013.csv, which contains data on NBA players from the 2013-2014 season, into the variable nba in both languages. The only real difference is that in Python, we need to import the pandas library to get access to dataframes. In R, while we could import the data using the base function read.csv(), the readr function read_csv() has the advantages of greater speed and consistent interpretation of data types. Dataframes are available in both R and Python, and are two-dimensional, matrix-like structures where each column can have a different data type. At the end of this step, the CSV file has been loaded by both languages into a dataframe.
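
As a quick check (our addition, not part of the original walkthrough), you can inspect the column types each parser inferred. In Python:

# Show the data type pandas inferred for each column
print(nba.dtypes)

The R equivalent would be str(nba), or readr's spec(nba).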

Finding the number of rows

R

dim(nba)

Python

nba.shape

Looking at the first row of the data

R

head(nba, 1)

Python

nba.head(1)

Finding the average of each statistic

R

library(purrr)
library(dplyr)

nba %>%
  select_if(is.numeric) %>%
  map_dbl(mean, na.rm = TRUE)

Python

nba.mean(numeric_only=True)

Make pairwise scatterplots

One common way to explore a dataset is to see how different columns correlate. Here we'll compare the ast, fg, and trb columns, which record assists, field goals made, and total rebounds per player.

R

library(GGally)

nba %>%
  select(ast, fg, trb) %>%
  ggpairs()

Python

import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(nba[["ast", "fg", "trb"]])
plt.show()

We get very similar plots in the end, but this shows how the R data science ecosystem has many smaller packages (GGally is a helper package for ggplot2, the most-used R plotting package), and many more visualization packages in general. In Python, matplotlib is the primary plotting package, and seaborn is a widely used layer over matplotlib. With visualization in Python, there is usually one main way to do something, whereas in R, there are many packages supporting different methods of doing things (there are at least a half dozen packages to make pair plots, for instance).

Make clusters of the players

One good way to explore this kind of data is to generate cluster plots. These will show which players are most similar.

R

library(cluster)
set.seed(1)
isGoodCol <- function(col) {
  sum(is.na(col)) == 0 && is.numeric(col)
}
goodCols <- sapply(nba, isGoodCol)
clusters <- kmeans(nba[,goodCols], centers=5)
labels <- clusters$cluster

Python

from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters=5, random_state=1)
good_columns = nba._get_numeric_data().dropna(axis=1)
kmeans_model.fit(good_columns)
labels = kmeans_model.labels_

In order to cluster properly, we remove any non-numeric columns and any columns with missing values (NA, NaN, etc.). In R, we do this by applying a function across each column and dropping the column if it has any missing values or isn't numeric. We then use the cluster package to perform k-means and find 5 clusters in our data. We set a random seed using set.seed so we can reproduce our results.

In Python, we use the main Python machine learning package, scikit-learn, to fit a k-means clustering model and get our cluster labels. We prepare the data in much the same way as in R, except we use the _get_numeric_data and dropna methods to remove non-numeric columns and columns with missing values.
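
As a quick sanity check (our addition), we can count how many players fell into each cluster. In Python:

import pandas

# Count how many players received each of the 5 cluster labels
print(pandas.Series(labels).value_counts())

In R, table(labels) would do the same thing.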

Plot players by cluster

We can now plot out the players by cluster to discover patterns. One way to do this is to first use PCA to make our data 2-dimensional, then plot it, and shade each point according to cluster association.

R

nba2d <- prcomp(nba[,goodCols], center=TRUE)
twoColumns <- nba2d$x[,1:2]
clusplot(twoColumns, labels)

Python

from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(good_columns)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=labels)
plt.show()

We made a scatter plot of our data and shaded each point according to cluster. In R, we used the clusplot function, which is part of the cluster library, and performed PCA via the prcomp function that is built into R.

With Python, we used the PCA class in the scikit-learn library. We used matplotlib to create the plot.
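
One thing neither snippet shows is how much information the 2-dimensional projection retains. In scikit-learn, the fitted PCA object exposes this directly (our addition, not part of the original walkthrough):

# Fraction of the total variance captured by each of the two components
print(pca_2.explained_variance_ratio_)

In R, summary(nba2d) reports the same proportions.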

Split into training and testing sets

If we want to do supervised machine learning, it’s a good idea to split the data into training and testing sets so we don’t overfit.

R

trainRowCount <- floor(0.8 * nrow(nba))
set.seed(1)
trainIndex <- sample(1:nrow(nba), trainRowCount)
train <- nba[trainIndex,]
test <- nba[-trainIndex,]

Python

train = nba.sample(frac=0.8, random_state=1)
test = nba.loc[~nba.index.isin(train.index)]

You’ll notice that R has many more data-analysis-focused builtins, like floor, sample, and set.seed, whereas Python reaches these through standard-library modules (math.floor, random.sample, random.seed), as the sketch below shows. In Python, recent versions of pandas come with a sample method that returns a given proportion of rows randomly sampled from a source dataframe, which makes the code much more concise. In R, there are packages that simplify sampling, but they aren't much more concise than the built-in sample function. In both cases, we set a random seed to make the results reproducible.
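
To make that point concrete, here's a rough sketch (ours, not from the original analysis) of the same split in Python using those standard-library functions instead of pandas' sample method:

import math
import random

random.seed(1)  # analogous to R's set.seed
# Same logic as the R code: take a random 80% of row positions for training
train_row_count = math.floor(0.8 * len(nba))
train_indices = random.sample(range(len(nba)), train_row_count)
train = nba.iloc[train_indices]
test = nba.drop(nba.index[train_indices])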

Univariate linear regression

Let’s say we want to predict the number of assists per player from the number of field goals made per player.

R

fit <- lm(ast ~ fg, data=train)
predictions <- predict(fit, test)

Python

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train[["fg"]], train["ast"])
predictions = lr.predict(test[["fg"]])

Scikit-learn has a linear regression model that we can fit and generate predictions from. R relies on the built-in lm and predict functions. predict will behave differently depending on the kind of fitted model that is passed into it – it can be used with a variety of fitted models.
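
Neither snippet prints the fitted coefficients, but both models expose them if you want a peek (our addition). In scikit-learn:

# Slope and intercept of the fitted line: ast is roughly coef * fg + intercept
print(lr.coef_, lr.intercept_)

The R equivalent is coef(fit).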

Calculate summary statistics for the model

R

summary(fit)

Python

import statsmodels.formula.api as sm
model = sm.ols(formula='ast ~ fga', data=train)
fitted = model.fit()
fitted.summary()

Dep. Variable:    ast    R-squared:      0.568
Model:            OLS    Adj. R-squared: 0.567
[output truncated]