Another year, another big soccer/football tournament! This time it’s the top international competition in Asia, the Asian Cup, hosted in the U.A.E. In this blog post I’ll be covering (responsible) web-scraping, data wrangling (tidyverse FTW!), and of course, data visualization with ggplot2.
Let’s get started!
Packages
Top Goalscorers of the Asian Cup
The first thing I looked at was, “Who are the top goalscorers in the history of the Asian Cup?”
Here I use the polite package to take a look at the robots.txt for the web page and see if it is OK to web scrape from it. First you pass the URL to the bow() function, check that you are indeed allowed to scrape, then use scrape() to retrieve the data, and the rest is the usual rvest web-scraping workflow.
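As a rough sketch of that workflow (not the original code from this post), the steps could be wrapped in a small helper. The function name scrape_tables and its css argument are hypothetical, and the example URL in the comment is only an assumption about which page to scrape.

```r
library(polite)
library(rvest)

# Hypothetical helper: politely fetch a page and return its HTML tables.
scrape_tables <- function(url, css = "table") {
  session <- bow(url)      # reads robots.txt and sets up a rate-limited session
  print(session)           # the printout reports whether the path is scrapable
  page <- scrape(session)  # retrieves the page through the polite session
  # From here on it is the usual rvest workflow
  tables <- html_table(html_nodes(page, css))
  tables
}

# Example call (needs internet access, so it is left commented out):
# tables <- scrape_tables("https://en.wikipedia.org/wiki/AFC_Asian_Cup")
```

html_table() on a set of nodes returns a list of data frames, one per table, so you still need to pick out the table you want afterwards.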
For brevity, let’s only take a look at the top 5 goal scorers. I’ll also mutate() in a nice image of a soccer ball for the data points on the plot.
I made something slightly different from your standard bar graph: I use the geom_isotype_col() function from ggtextures to create a bar of soccer ball images. Compared to other functions in ggtextures, geom_isotype_col() allows each image to correspond to the value of the variable you are plotting, in this case 1 ball = 1 goal!
OK, not bad. However, wouldn’t it be nice to add a bit more context, specifically which country these players came from? Let’s add some flags along the y-axis!
There are lots of different ways to do this (like geom_flag() from the ggimage package) but I ended up doing it the cowplot way. I had to tweak the scales a bit as the flags came in different sizes. When you plot, you just insert the image strip into the bar plot with axis_canvas() and combine all the parts together with ggdraw()!
Ideally I wanted the soccer balls to be the official balls from the tournament that the player scored in. However, I couldn’t find a nice emoji-fied/icon-ized version and there was also the “small” problem that there was no “official” Asian Cup ball until the 2004 tournament in China! You can take a look at the official Asian Cup balls here.
Winners of the Asian Cup
We saw that the top goal scorers came from Iran, South Korea, Japan, Iraq, and Kuwait, but did their goal scoring exploits lead their nations to glory? Let’s find out!
When web-scraping I really like using flatten_df() after html_table() as I don’t have to use the awkward looking .[[1]] within my piped workflow.
Now I can use the clean_names() function to quickly clean up my column names (mainly when I can’t be bothered to set_names() them myself…).
The next step is to use separate() to split each placing column into the number of times a team finished in that position (1st to 3rd) and the years in which it happened. Then variants of mutate() are used to convert the string columns of the data to numeric type. I use gather() so each team will have a row for each of the rank positions (1st-3rd). Finally, I arrange the data so that the facets will be ordered in the way that I want.
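The steps above can be sketched on a toy table. This is only an illustration under assumptions: the teams, counts, and years are made up, and the intermediate column names (n_*, yrs_*) are hypothetical rather than the ones used for the actual plot.

```r
library(dplyr)
library(tidyr)
library(janitor)

# Toy data in the shape of a scraped table; teams and values are made up.
raw <- tibble::tibble(
  Team          = c("Team A", "Team B"),
  Champions     = c("2 (1992, 2000)", "1 (1996)"),
  `Runners-up`  = c("1 (1988)", "0"),
  `Third place` = c("0", "2 (1984, 2004)")
)

placings <- raw %>%
  clean_names() %>%  # -> team, champions, runners_up, third_place
  # split "count (years)" into a count column and a years column
  separate(champions, into = c("n_champions", "yrs_champions"),
           sep = " \\(", fill = "right") %>%
  separate(runners_up, into = c("n_runners_up", "yrs_runners_up"),
           sep = " \\(", fill = "right") %>%
  separate(third_place, into = c("n_third_place", "yrs_third_place"),
           sep = " \\(", fill = "right") %>%
  select(-starts_with("yrs_")) %>%
  mutate(across(starts_with("n_"), as.numeric)) %>%
  # one row per team per rank position
  gather(key = "key", value = "value", starts_with("n_")) %>%
  # a factor fixes the order in which facets would be drawn
  mutate(key = factor(
    key,
    levels = c("n_champions", "n_runners_up", "n_third_place"),
    labels = c("Champions", "Runners-up", "Third Place")
  )) %>%
  arrange(key, desc(value))

placings
```

Turning the key column into a factor is one way to control facet order; the original post may have ordered the facets differently.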
I plot using facets on the “key” variable (containing the rank data) so that we can see how many times each team placed as Champions to Third Place. I also use the glue() function here to format the multi-line captions and titles in a neat way.
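For the glue() part, a minimal sketch might look like the following. The variable names and the wording of the title are placeholders, not the text used in the actual plot.

```r
library(glue)

# Hypothetical values to interpolate into the plot text.
tournament <- "AFC Asian Cup"
n_teams <- 5

# glue() drops the empty first line and the common indentation,
# so the string can be laid out readably in the script.
plot_title <- glue("
  Winners of the {tournament}
  Top {n_teams} teams by number of placings")

plot_title
```

The result can then be passed straight to labs(title = plot_title) in a ggplot2 call.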
Goals per Game
One new thing I learned very recently, while working on this viz in fact, was using magrittr aliases! In this workflow I always wind up having to use .[x] or .[[x]] but now I can just use extract() or extract2() respectively to do the same thing!
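A small illustration of the aliases on toy data (the list and vector below are made up):

```r
library(magrittr)

# A toy list of tables, like the output of html_table() on several nodes.
tables <- list(
  data.frame(team = c("Team A", "Team B"), goals = c(3, 1)),
  data.frame(team = "Team C", goals = 2)
)

# .[[1]] and extract2(1) pick out the same element
first_bracket <- tables %>% .[[1]]
first_alias   <- tables %>% magrittr::extract2(1)
identical(first_bracket, first_alias)

# .[2:3] and extract(2:3) subset in the same way
goals_bracket <- c(3, 1, 2) %>% .[2:3]
goals_alias   <- c(3, 1, 2) %>% magrittr::extract(2:3)
identical(goals_bracket, goals_alias)
```

The calls are namespaced as magrittr::extract() here because tidyr also exports a function called extract(), which would mask the alias if tidyr is loaded afterwards.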
Another cool thing I found while scraping this data was the jump_to() function that allows you to navigate to a new URL. This makes map()-ing over multiple URL links from a base URL very easy! Here, the base URL is the AFC Asian Cup Wikipedia page and the function iterates over each of the URL links of the respective tournament pages. Another way that I could’ve done this was to map() over the different dates of the tournaments, as the Wikipedia URLs for each edition of the Asian Cup only differ in the “year” at the start of the page name.
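The second approach, building the URLs from the tournament years, could be sketched like this. Only a few years are used for illustration, the URL pattern is my assumption about how the Wikipedia pages are named, and the scraping step is left as a comment because it needs internet access.

```r
library(purrr)

# A few tournament years for illustration (not the full list of editions).
years <- c(2011, 2015, 2019)

# Assumed URL pattern: the year comes first in the page name.
tournament_urls <- map_chr(
  years,
  ~ paste0("https://en.wikipedia.org/wiki/", .x, "_AFC_Asian_Cup")
)

tournament_urls

# Each URL could then be scraped politely, for example:
# pages <- map(tournament_urls, ~ polite::scrape(polite::bow(.x)))
```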
Next, I clean it up a bit and add in the number of teams that participated in each tournament.
Now we make a line graph but with lots of annotate() code to add in comments, labels, and segments for the labels. At the end I use geom_emoji() to add a soccer ball to the plot for each of the data points.
However, I’m not finished yet! I wanted to try to make this look a bit more “official” so I attempted to add the Asian Cup logo on the top right corner. There are probably other ways to do this, especially by using grobs, but I was reminded of this blog post by Daniel Hadley, who used the magick package to add a footer with a logo onto a ggplot object. I’ve used magick before for animations and this was a good chance to try it out for image editing. Compared to Daniel Hadley’s example I needed to have the logo on the right corner, so I had to create a blank canvas with image_blank() and then place everything on top of that with image_composite() and image_append().
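A minimal sketch of that idea, assuming made-up image sizes. Blank images stand in for the saved plot and the logo so that the example runs offline; in practice both would be loaded with image_read().

```r
library(magick)

# Stand-ins for the real images (sizes and colors are arbitrary).
plot_img <- image_blank(width = 800, height = 600, color = "white")
logo     <- image_blank(width = 100, height = 100, color = "red")

# Blank strip as wide as the plot, with the logo placed in its right corner.
# The offset is "+x+y" in pixels from the top-left corner of the canvas.
header <- image_composite(
  image_blank(width = 800, height = 100, color = "white"),
  logo,
  offset = "+700+0"
)

# Stack the header strip on top of the plot.
final <- image_append(c(header, plot_img), stack = TRUE)

image_info(final)
```

The x offset is simply canvas width minus logo width, which is the kind of value that needs tweaking by hand when the logo or plot size changes.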
All in all it took a while to tweak the positions of the text and logo image but for my first try it worked well. There is definitely room for improvement with regard to sizing and scaling though.
Ultimately, I couldn’t find much information on why those tournaments in the 80s in particular were such low scoring affairs. I wasn’t alive to watch those games on TV nor could I find any illuminating articles or blog posts on the style of Asian football back then… This was also before Japan really got into soccer so there wasn’t anything I could find in Japanese either.
Japan’s Record vs. Historical Rivals and Group D Opponents
Japan is the most successful team in the competition with 4 championships, but who are their opponents in the group stages and how have they fared against them in the past? While I’m at it I will also check Japan’s record against long-time continental rivals such as Iran, South Korea, Saudi Arabia and, more recently, Australia.
The data I’m going to use comes from Kaggle, which has all international football results from 1872 to the World Cup final last year. To add in the federation affiliation (UEFA, AFC, etc.) for each of the countries I slightly modified some code from one of the kernels, “A Journey Through The History of Soccer” by PH Julien.
Now to load the results data and then join it with the affiliations data.
Next I need to edit some of the continents for teams that didn’t have a match in the federation affiliation data set, for example, “South Korea” is “Korea Republic” in the Kaggle data set.
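One way to patch such mismatches after the join is a case_when() inside mutate(). This is a sketch on toy tables: the column names team and confederation are assumptions, not necessarily the names in the real data sets.

```r
library(dplyr)

# Toy stand-ins for the two data sets.
teams <- tibble::tibble(team = c("Japan", "Korea Republic"))

affiliations <- tibble::tibble(
  team          = c("Japan", "South Korea"),
  confederation = c("AFC", "AFC")
)

teams_conf <- teams %>%
  # "Korea Republic" has no match, so its confederation is NA after the join
  left_join(affiliations, by = "team") %>%
  # fill in the teams whose names did not match by hand
  mutate(confederation = case_when(
    team == "Korea Republic" ~ "AFC",
    TRUE                     ~ confederation
  ))

teams_conf
```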
Now that it’s nice and cleaned up I can reshape it so that the data is set from Japan’s perspective.
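A sketch of what “Japan’s perspective” could look like on a toy results table. The opponents and scores are made up, and the column names (home_team, away_team, home_score, away_score) are an assumption about the layout of the results data.

```r
library(dplyr)

# Toy match results; opponents and scores are made up.
results <- tibble::tribble(
  ~home_team, ~away_team, ~home_score, ~away_score,
  "Japan",    "Team A",   2,           0,
  "Team B",   "Japan",    1,           1,
  "Team C",   "Japan",    3,           1
)

results_japan <- results %>%
  filter(home_team == "Japan" | away_team == "Japan") %>%
  mutate(
    # whichever side is not Japan is the opponent
    opponent    = if_else(home_team == "Japan", away_team, home_team),
    japan_goals = if_else(home_team == "Japan", home_score, away_score),
    opp_goals   = if_else(home_team == "Japan", away_score, home_score),
    result = case_when(
      japan_goals > opp_goals ~ "Win",
      japan_goals < opp_goals ~ "Loss",
      TRUE                    ~ "Draw"
    )
  ) %>%
  select(opponent, japan_goals, opp_goals, result)

results_japan
```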
With all that done we can take a look at how Japan have done against certain opponents by using filter().
Unfortunately, this data set doesn’t record extra-time or penalty shoot-out results. For example, Japan’s Quarter-Final meeting with Jordan in 2004 ended with Japan securing a route to the semis, 4-3 on penalties!
I can create a function that’ll filter for certain opponents and tournaments and aggregate the results. With the second argument being ..., tidyeval allows me to input any kind of filter condition for an opponent, tournament, etc. The if else statement protects against cases where Japan never had that type of result against an opponent and makes sure that a column populated by 0s is created.
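A function along those lines could be sketched as follows. This is not the original function: the name japan_versus, the toy data, and the column names are all hypothetical, and a simple loop with an if check stands in for the if else logic described above.

```r
library(dplyr)
library(tidyr)

# Toy data from Japan's perspective; opponents and results are made up.
results_japan <- tibble::tribble(
  ~opponent, ~tournament,     ~result,
  "Team A",  "Friendly",      "Win",
  "Team A",  "AFC Asian Cup", "Win",
  "Team B",  "AFC Asian Cup", "Loss",
  "Team B",  "Friendly",      "Draw"
)

# Hypothetical helper: filter with any conditions passed via ...,
# then count wins, draws and losses per opponent.
japan_versus <- function(data, ...) {
  filter_vars <- enquos(...)  # capture the filter conditions unevaluated

  tally <- data %>%
    filter(!!!filter_vars) %>%
    count(opponent, result) %>%
    spread(result, n, fill = 0L)

  # Add a column of 0s for any result type that never occurred
  for (res in c("Win", "Draw", "Loss")) {
    if (!res %in% names(tally)) {
      tally[[res]] <- 0L
    }
  }

  tally %>% select(opponent, Win, Draw, Loss)
}

japan_versus(results_japan, opponent == "Team A")
japan_versus(results_japan, opponent == "Team B", tournament == "AFC Asian Cup")
```

In current dplyr, passing the dots straight through as filter(data, ...) would work just as well; capturing them with enquos() is shown only to make the tidyeval step explicit.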
Now let’s try it out a bit.
I can put in multiple filter conditions if needed as well.
As you can see, Japan has never lost or drawn against India, Palestine, or Vietnam, so in the data there wouldn’t have been any rows with “Loss” or “Draw” in the results column. With the function I created I was able to impute results that didn’t exist and fill them in with 0s!
Let’s check Japan’s performance against our main rivals in the Asian Cup. Here I make the tables look a lot nicer with the options in the kable and kableExtra packages.
Result |——