Temperatures
Personally, I thrive in cooler weather. In fact, any temperature above 20°C is summer for me, and anything above 25°C is hot. Most people don’t understand me, but I prefer cold over warm. Fact: if it’s cold, you can put on another piece of clothing until you’re feeling warm again. If, however, it’s warm… Well, there’s only a limited amount of clothes you can take off until you’re accused of crimes of moral turpitude. Plus, if it’s above 25°C, even going full monty won’t get you to cool off.
I recently had an discussion if this year’s March was warmer than the average. Generally speaking, with global average temperatures steadily increasing since the 1970s, I think it’s safe to say that for a given random month, you’re better off stating that it was warmer than the average compared to colder. Which is exactly what I did. Turns out, I was wrong… But I learnt how to change the colour of a ggplot()
line mid-way between two points, based on an arbitrary cutoff value!
Data
As a data scientist, I not only blurt out statements, but I am also in the (judging the current political discussions, unique) opportunity to gather facts to check my ejaculations (teehee, he said “ejaculate”). The official source for German weather data is the Deutscher Wetterdienst, the DWD (translated as the German Meteorological Service). Since it runs the national climate archive, it also provides data on temperatures, amongst many other data like precipitation and historical vine ripeness data (now, there’s another blogpost just waiting to be written, I’d say!).
We’re only interested in the monthly regional air temperature averages. Let’s grab the monthly data for all German regions from the DWD’s FTP:
1 |
|
Pulling the 12 files (one for each month) is pretty straightforward with a for
loop. I found read_csv()
from library(readr)
to choke on non-locale decimals and negative numbers. A lesson learnt: I spent half a day trying to figure out what went wrong with my download script. To “fix” the problem, I read the file as a text file with read_lines()
(since readr
functions nicely default to downloading files if it’s a web resource), and then feeding this text into standard R’s read.csv()
function, specifying the non-standard decimal. I skip the first line (since it’s not data, but an explanation), and the last row (since it’s garbage). Skipping the last row needs some checking of the length of the file, done in f.get.length()
.
Once you’ve set up the function, it’s easy to just grab the data and save it for later use:
1 |
|
All in all, relatively easy to get the data (modulo the readr
problem).
Graphing
Right, let’s have a look at the data. I live in North Rhine-Westphalia currently, so let’s just look at that data, for March only and generate a rolling 10-year average.
1 |
|
The usual Brownian motion around an average. The average which has been increasing since around the 1970s, matching the world-wise NASA data.
First problem, and the most painful one: I was wrong. March 2016 was in fact slightly cooler than the 10-year average.
Second problem: grouping the data by a cutoff point (the 10-year average temperature) works, but since it’s point-wise data, ggplot colours the line continuously from one dot to the next, even when it crosses the average temperature line and should change colour exactly at that point.
Is there a way for ggplot to handle this? With some quick googleing, I found the answer on stackoverflow: not directly, but you can generate a new “high resolution” dataset which would allow ggplot to set the cutoff point exactly at the intersection between the two lines.
I can claim no ownership on the solution, the credit goes fully to the stackoverflow user jbaums. I took his solution and “updated” it to newer R (dplyr
and pipes, specifically; the answer was written four years ago, at which time my solution setup wasn’t available).
1 |
|
Let’s go through the code step by step (which is made easy since it’s piped code).
-
start a new dataset for testing, and set the high-resolution scale factor (0.01, or factor 100 in this case)
-
generate point coordinates for the line end point by using
lead()
-
remove last row from each group
-
remove any
x
values for whichz
haveNA
s. These result from the rolling average, and will choke theseq()
calls later
for each row, we now have the start point and end point of each “section”: (x1, x2)
, (x2, x3)
, …
-
we now want to generate a “high resolution” version of these point, essentially filling 100 data points linearly between
(x1, x2)
, and(x2, x3)
, and so on -
to do this, we use
rowwise() %>% do()
-
each row pulls the start points and the end points of each data section, and then using these as start and end points for a sequence of length
yres
, the high-resolution scale factor -
all of this is done as a
data.frame()
, and by the magic ofrowwise-do()
the result is a single, large dataframe with everything inside!
With the high-res data now set up, we can graph the chart in a much nicer way:
1 |
|
Ooh, it’s so pretty! ☺
Conclusion
One can clearly see the underlying trend of temperature increases. I believe I’ll have to find a job in Hammerfest in the coming years…
What could have been answered with a calculator and 2 minutes on the website turned out to be two day project to pull and clean data, fix and report bugs, and learn even more about ggplot()
. Time well spent, I’d say!
Code and data for this analysis is availabe on github, as always.