Introduction
One of the great things about the R world is the tidyverse, a collection of R packages that are easy for beginners to learn and that provide a consistent space for data manipulation and visualisation. These tools have proved so valuable that many of them have been ported to Python, which is why we wrote this introduction to the tidyverse for Python.
What is tidyverse?
Tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. The core R tidyverse packages are: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr and forcats.
Python implementation of dplyr
The tidyverse package dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. Here are some of the functions dplyr provides that are commonly used:
- mutate() - adds new variables that are functions of existing variables.
- select() - picks variables based on their names.
- filter() - picks cases based on their values.
- summarise() - reduces multiple values down to a single summary.
- arrange() - changes the ordering of the rows.
Dplython is a Python implementation of dplyr which can be installed using pip and the following command:
pip install dplython
Instructions on how to use pip to install python packages can be found here.
The Dplython README provides some clear examples of how the package can be used. Below is a summary of the common functions:
- select() - used to get specific columns of the data-frame.
- sift() - used to filter rows based on the values of the variables in each row.
- sample_n() and sample_frac() - used to provide a random sample of rows from the data-frame.
- arrange() - used to sort results.
- mutate() - used to create new columns based on existing columns.
For more functions and example code visit the Dplython README page.
The bottom of the README also compares Dplython with pandas-ply, another Python implementation of dplyr.
Dplython comes with a sample data-set called ‘diamonds’. Here are some basic examples of how to use Dplython.
Import Python packages and the ‘diamonds’ data-frame:
Create a new data-frame by selecting columns of the ‘diamonds’ data-frame:
Display the top 4 rows of the ‘diamondsSmall’ data-frame:
Filter the data-frame for rows where the price is above 18,000 and the carat is below 1.2, then sort them by depth:
Provide a random sample of 5 rows from the data-frame:
Add a column to the data-frame containing the rounded value of ‘carat’:
Python implementation of ggplot2
The tidyverse package ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
A Python port of ggplot2 has long been requested and there are now a few Python implementations of it; Plotnine is the one we will explore here. Plotting with a grammar is powerful: it makes custom (and otherwise complex) plots easy to think about and create, while simple plots remain simple.
Plotnine can be installed using pip:
pip install plotnine
Plotnine splits plotting into three distinct parts: data, aesthetics and layers. The data step adds the data to the graph, the aesthetics (aes) step maps variables to visual attributes, and the layers step creates the objects on the plot. Multiple aesthetics and layers can be added to a Plotnine graph.
If you are a Python user accustomed to Matplotlib, a Grammar of Graphics plotting tool can take some getting used to, partly because of the difference in philosophy. Plotnine provides some tutorials to help you get to grips with the package, and there is also the Plotnine README. However, if you are new to Grammar of Graphics plotting, this highly recommended Kaggle notebook for Plotnine is probably the best place to start.
Here are some examples of how to use plotnine to visualize data from the ‘diamonds’ data-frame that comes with Dplython.
Import Python packages, the ‘diamonds’ data-frame and create a sample data-frame:
Create a scatter plot of ‘carat’ vs ‘price’:
Add further layers, e.g. a line of best fit:
Add another aesthetic; here the points are coloured by the ‘cut’ variable:
Add a layer which separates the data into separate graphs, one for each value of the ‘color’ variable:
This article compares a variety of alternative plotting packages for Python.
Next steps
- Read the documents that are linked in this blog post.
- Learn the basics of Pandas.
- Use Dplython and Plotnine to practice data manipulation & visualization. For example, complete some of the exercises at Kaggle.
Do you know of other good Python implementations of tidyverse? If so, let us know about them!