Building a Data Science Portfolio: Storytelling with Data (Part 2: Data Exploration)

The following post (Part 2 of two parts) by Vik Paruchuri, founder of data science learning platform Dataquest, offers some detailed and instructive insight about data science workflow (regardless of the tech stack involved, but in this case, using Python). We re-publish it here for your convenience.

Before we dive into exploring the data [see Part 1 for steps relating to data preparation], we’ll want to set the context, both for ourselves, and anyone else that reads our analysis. One good way to do this is with exploratory charts or maps. In this case, we’ll map out the positions of the schools, which will help readers understand the problem we’re exploring.

In the below code, we:

  • Setup a map centered on New York City.

  • Add a marker to the map for each high school in the city.

  • Display the map.

  In [82]: import folium         from folium import plugins          schools_map = folium.Map(location=[full[‘lat’].mean(), full[‘lon’].mean()], zoom_start=10)         marker_cluster = folium.MarkerCluster().add_to(schools_map)         for name, row in full.iterrows():             folium.Marker([row[“lat”], row[“lon”]], popup=”{0}: {1}”.format(row[“DBN”], row[“school_name”])).add_to(marker_cluster)         schools_map.create_map(‘schools.html’)         schools_map Out[82]:

         from folium import plugins

         schools_map = folium.Map(location=[full[‘lat’].mean(), full[‘lon’].mean()], zoom_start=10)

         for name, row in full.iterrows():

         schools_map.create_map(‘schools.html’)

 

This map is helpful, but it’s hard to see where the most schools are in NYC. Instead, we’ll make a heatmap:

  In [84]: schools_heatmap = folium.Map(location=[full[‘lat’].mean(), full[‘lon’].mean()], zoom_start=10)         schools_heatmap.add_children(plugins.HeatMap([[row[“lat”], row[“lon”]] for name, row in full.iterrows()]))         schools_heatmap.save(“heatmap.html”)         schools_heatmap Out[84]:

         schools_heatmap.add_children(plugins.HeatMap([[row[“lat”], row[“lon”]] for name, row in full.iterrows()]))

         schools_heatmap

Out[84]:

District-Level Mapping

Heatmaps are good for mapping out gradients, but we’ll want something with more structure to plot out differences in SAT score across the city. School districts are a good way to visualize this information, as each district has its own administration. New York City has several dozen school districts, and each district is a small geographic area.

We can compute SAT score by school district, then plot this out on a map. In the below code, we’ll:

  • Group full by school district.

  • Compute the average of each column for each school district.

  • Convert the school_dist field to remove leading 0s, so we can match our geograpghic district data.

  In [ ]: district_data = full.groupby(“school_dist”).agg(np.mean)        district_data.reset_index(inplace=True)        district_data[“school_dist”] = district_data[“school_dist”].apply(lambda x: str(int(x)))

        district_data.reset_index(inplace=True)

We’ll now we able to plot the average SAT score in each school district. In order to do this, we’ll read in data in GeoJSON format to get the shapes of each district, then match each district shape with the SAT score using the school_dist column, then finally create the plot:

  In [85]: def show_district_map(col):         geo_path = ‘schools/districts.geojson’         districts = folium.Map(location=[full[‘lat’].mean(), full[‘lon’].mean()], zoom_start=10)         districts.geo_json(             geo_path=geo_path,             data=district_data,             columns=[‘school_dist’, col],             key_on=’feature.properties.school_dist’,             fill_color=’YlGn’,             fill_opacity=0.7,             line_opacity=0.2,         )         districts.save(“districts.html”)         return districts show_district_map(“sat_score”)

         geo_path = ‘schools/districts.geojson’

         districts.geo_json(

             data=district_data,

             key_on=’feature.properties.school_dist’,

             fill_opacity=0.7,

         )

         return districts

show_district_map(“sat_score”)

Exploring Enrollment and SAT Scores

Now that we’ve set the context by plotting out where the schools are, and SAT score by district, people viewing our analysis have a better idea of the context behind the dataset. Now that we’ve set the stage, we can move into exploring the angles we identified earlier, when we were finding correlations. The first angle to explore is the relationship between the number of students enrolled in a school and SAT score.

We can explore this with a scatter plot that compares total enrollment across all schools to SAT scores across all schools.

  In [87]: %matplotlib inline          full.plot.scatter(x=’total_enrollment’, y=’sat_score’) Out[87]:

 

As you can see, there’s a cluster at the bottom left with low total enrollment and low SAT scores. Other than this cluster, there appears to only be a slight positive correlation between SAT scores and total enrollment. Graphing out correlations can reveal unexpected patterns.

We can explore this further by getting the names of the schools with low enrollment and low SAT scores:

  In [88]: full[(full[“total_enrollment”] < 1000) & (full[“sat_score”] < 1000)][“School Name”] Out[88]: 34     INTERNATIONAL SCHOOL FOR LIBERAL ARTS         143                                      NaN         148    KINGSBRIDGE INTERNATIONAL HIGH SCHOOL         203                MULTICULTURAL HIGH SCHOOL         294      INTERNATIONAL COMMUNITY HIGH SCHOOL         304          BRONX INTERNATIONAL HIGH SCHOOL         314                                      NaN         317            HIGH SCHOOL OF WORLD CULTURES         320       BROOKLYN INTERNATIONAL HIGH SCHOOL         329    INTERNATIONAL HIGH SCHOOL AT PROSPECT         331               IT TAKES A VILLAGE ACADEMY         351    PAN AMERICAN INTERNATIONAL HIGH SCHOO         Name: School Name, dtype: object

 

         143                                      NaN

         203                MULTICULTURAL HIGH SCHOOL

         304          BRONX INTERNATIONAL HIGH SCHOOL

         317            HIGH SCHOOL OF WORLD CULTURES

         329    INTERNATIONAL HIGH SCHOOL AT PROSPECT

         351    PAN AMERICAN INTERNATIONAL HIGH SCHOO

Some searching on Google shows that most of these schools are for students who are learning English, and are low enrollment as a result. This exploration showed us that it’s not total enrollment that’s correlated to SAT score – it’s whether or not students in the school are learning English as a second language.

Exploring English Language Learners and SAT Scores

Now that we know the percentage of English language learners in a school is correlated with lower SAT scores, we can explore the relationship. The ell_percent column is the percentage of students in each school who are learning English. We can make a scatterplot of this relationship:

  In [89]: full.plot.scatter(x=’ell_percent’, y=’sat_score’) Out[89]: <matplotlib.axes._subplots.AxesSubplot at 0x10fe824e0>

 

It looks like there are a group of schools with a high ell_percentage that also have low average SAT scores. We can investigate this at the district level, by figuring out the percentage of English language learners in each district, and seeing it if matches our map of SAT scores by district:

  In [90]: show_district_map(“ell_percent”) Out[90]:

 

As we can see by looking at the two district level maps, districts with a low proportion of ELL learners tend to have high SAT scores, and vice versa.

Correlating Survey Scores and SAT Scores

It would be fair to assume that the results of student, parent, and teacher surveys would have a large correlation with SAT scores. It makes sense that schools with high academic expectations, for instance, would tend to have higher SAT scores. To test this theory, lets plot out SAT scores and the various survey metrics:

  In [91]: full.corr()[“sat_score”][[“rr_s”, “rr_t”, “rr_p”, “N_s”, “N_t”, “N_p”, “saf_tot_11”, “com_tot_11”, “aca_tot_11”, “eng_tot_11”]].plot.bar() Out[91]: <matplotlib.axes._subplots.AxesSubplot at 0x114652400>

 

Surprisingly, the two factors that correlate the most are N_p and N_s, which are the counts of parents and students who responded to the surveys. Both strongly correlate with total enrollment, so are likely biased by the ell_learners. The other metric that correlates most is saf_t_11. That is how safe students, parents, and teachers perceived the school to be. It makes sense that the safer the school, the more comfortable students feel learning in the environment. However, none of the other factors, like engagement, communication, and academic expectations, correlated with SAT scores. This may indicate that NYC is asking the wrong questions in surveys, or thinking about the wrong factors (if their goal is to improve SAT scores, it may not be).

Exploring Race and SAT Scores

One of the other angles to investigate involves race and SAT scores. There was a large correlation differential, and plotting it out will help us understand what’s happening:

  In [92]: full.corr()[“sat_score”][[“white_per”, “asian_per”, “black_per”, “hispanic_per”]].plot.bar() Out[92]: <matplotlib.axes._subplots.AxesSubplot at 0x108166ba8>

 

It looks like the higher percentages of white and Asian students correlate with higher SAT scores, but higher percentages of black and Hispanic students correlate with lower SAT scores. For Hispanic students, this may be due to the fact that there are more recent immigrants who are ELL learners. We can map the hispanic percentage by district to eyeball the correlation:

  In [93]: show_district_map(“hispanic_per”) Out[93]:

 

It looks like there is some correlation with ELL percentage, but it will be necessary to do some more digging into this and other racial differences in SAT scores.

Gender differences in SAT scores

The final angle to explore is the relationship between gender and SAT score. We noted that a higher percentage of females in a school tends to correlate with higher SAT scores. We can visualize this with a bar graph:

  In [94]: full.corr()[“sat_score”][[“male_per”, “female_per”]].plot.bar() Out[94]: <matplotlib.axes._subplots.AxesSubplot at 0x10774d0f0>

 

To dig more into the correlation, we can make a scatterplot of female_per and sat_score:

  In [95]: full.plot.scatter(x=’female_per’, y=’sat_score’) Out[95]: <matplotlib.axes._subplots.AxesSubplot at 0x104715160>

 

It looks like there’s a cluster of schools with a high percentage of females, and very high SAT scores (in the top right). We can get the names of the schools in this cluster:

  In [96]: full[(full[“female_per”] > 65) & (full[“sat_score”] > 1400)][“School Name”] Out[96]: 3             PROFESSIONAL PERFORMING ARTS HIGH SCH         92                    ELEANOR ROOSEVELT HIGH SCHOOL         100                    TALENT UNLIMITED HIGH SCHOOL         111            FIORELLO H. LAGUARDIA HIGH SCHOOL OF         229                     TOWNSEND HARRIS HIGH SCHOOL         250    FRANK SINATRA SCHOOL OF THE ARTS HIGH SCHOOL         265                  BARD HIGH SCHOOL EARLY COLLEGE         Name: School Name, dtype: object

 

         92                    ELEANOR ROOSEVELT HIGH SCHOOL

         111            FIORELLO H. LAGUARDIA HIGH SCHOOL OF

         250    FRANK SINATRA SCHOOL OF THE ARTS HIGH SCHOOL

         Name: School Name, dtype: object

Searching Google reveals that these are elite schools that focus on the performing arts. These schools tend to have higher percentages of females, and higher SAT scores. This likely accounts for the correlation between higher female percentages and SAT scores, and the inverse correlation between higher male percentages and lower SAT scores.

AP Scores

So far, we’ve looked at demographic angles. One angle that we have the data to look at is the relationship between more students taking Advanced Placement exams and higher SAT scores. It makes sense that they would be correlated, since students who are high academic achievers tend to do better on the SAT.

  In [98]: full[“ap_avg”] = full[“AP Test Takers “] / full[“total_enrollment”]          full.plot.scatter(x=’ap_avg’, y=’sat_score’) Out[98]: <matplotlib.axes._subplots.AxesSubplot at 0x11463a908>

 

It looks like there is indeed a strong correlation between the two. An interesting cluster of schools is the one at the top right, which has high SAT scores and a high proportion of students that take the AP exams:

  In [99]: full[(full[“ap_avg”] > .3) & (full[“sat_score”] > 1700)][“School Name”] Out[99]: 92             ELEANOR ROOSEVELT HIGH SCHOOL         98                    STUYVESANT HIGH SCHOOL         157             BRONX HIGH SCHOOL OF SCIENCE         161    HIGH SCHOOL OF AMERICAN STUDIES AT LE         176           BROOKLYN TECHNICAL HIGH SCHOOL         229              TOWNSEND HARRIS HIGH SCHOOL         243    QUEENS HIGH SCHOOL FOR THE SCIENCES A         260      STATEN ISLAND TECHNICAL HIGH SCHOOL         Name: School Name, dtype: object

 

         98                    STUYVESANT HIGH SCHOOL

         161    HIGH SCHOOL OF AMERICAN STUDIES AT LE

         229              TOWNSEND HARRIS HIGH SCHOOL

         260      STATEN ISLAND TECHNICAL HIGH SCHOOL

Some Google searching reveals that these are mostly highly selective schools where you need to take a test to get in. It makes sense that these schools would have high proportions of AP test takers.

Wrapping Up the Story

With data science, the story is never truly finished. By releasing analysis to others, you enable them to extend and shape your analysis in whatever direction interests them. For example, in this post, there are quite a few angles that we explored inmcompletely, and could have dived into more.

One of the best ways to get started with telling stories using data is to try to extend or replicate the analysis someone else has done. If you decide to take this route, you’re welcome to extend the analysis in this post and see what you can find. If you do this, make sure to comment below so I can take a look.

Next Steps

If you’ve made it this far, you hopefully have a good understanding of how to tell a story with data, and how to build your first data science portfolio piece. Once you’re done with your data science project, it’s a good idea to post it on Github so others can collaborate with you on it.

Vik Paruchuri is the founder of Dataquest, a platform that teaches data science interactively in your browser. Dataquest’s unique approach to learning blends theory and practice, then helps you build your portfolio with projects.