January 4, 2021
I was surprised to see Jürgen Klopp win the FIFA 2020 best coach award ahead of Hans-Dieter Flick. I therefore wanted to study where the votes in favour of Klopp came from. On FIFA’s website I found a PDF with all the votes. FIFA obviously don’t provide a nice csv, but only this PDF. However, a quick search for an R function that would allow to extract data from a table in a PDF was successful. The answer is the extract_tables()
function from the tabulizer library. Let’s see it in action.
# Load library
library("tabulizer")
# Load file
file <- if (file.exists("votes.pdf")) {
"votes.pdf"
} else {
"https://resources.fifa.com/image/upload/men-s-coach.pdf?cloudid=fewk8jzbkqxgjdthofsb"
}
# Extract the data from the PDF
data_from_pdf<- extract_tables(file)
head(data_from_pdf[[1]])
## [,1] [,2] [,3] [,4]
## [1,] "Vote" "Country" "Name" "First (5 points)"
## [2,] "Captain" "Afghanistan" "Farshad Noor" "Jürgen Klopp"
## [3,] "Captain" "Albania" "Etrit Berisha" "Jürgen Klopp"
## [4,] "Captain" "Algeria" "Riyad Mahrez" "Marcelo Bielsa"
## [5,] "Captain" "American Samoa" "Nicky Salapu" "Jürgen Klopp"
## [6,] "Captain" "Andorra" "Marc Pujol Pons" "Hans-Dieter Flick"
## [,5] [,6]
## [1,] "Second (3 points)" "Third (1 point)"
## [2,] "Hans-Dieter Flick" "Marcelo Bielsa"
## [3,] "Hans-Dieter Flick" "Marcelo Bielsa"
## [4,] "Jürgen Klopp" "Zinedine Zidane"
## [5,] "Zinedine Zidane" "Marcelo Bielsa"
## [6,] "Jürgen Klopp" "Marcelo Bielsa"
This looks promising. With extract_tables(file)
we were able to extract the tables from the PDF. The tables are stored in a list of character matrices. Let’s now convert these list elements into one big data frame.
# Append list of matrices to one data frame
df<-data.frame(do.call("rbind", data_from_pdf) )
head(df)
## X1 X2 X3 X4 X5
## 1 Vote Country Name First (5 points) Second (3 points)
## 2 Captain Afghanistan Farshad Noor Jürgen Klopp Hans-Dieter Flick
## 3 Captain Albania Etrit Berisha Jürgen Klopp Hans-Dieter Flick
## 4 Captain Algeria Riyad Mahrez Marcelo Bielsa Jürgen Klopp
## 5 Captain American Samoa Nicky Salapu Jürgen Klopp Zinedine Zidane
## 6 Captain Andorra Marc Pujol Pons Hans-Dieter Flick Jürgen Klopp
## X6
## 1 Third (1 point)
## 2 Marcelo Bielsa
## 3 Marcelo Bielsa
## 4 Zinedine Zidane
## 5 Marcelo Bielsa
## 6 Marcelo Bielsa
Very good! Let’s do some cleaning.
# Load libraries
library("dplyr")
library("janitor")
# Remove first frame and assign names
df<-df%>%
# Convert first row to column names
row_to_names(row_number = 1)%>%
# Rename columns
rename(region=Country,p5=contains("First"),p3=contains("Second"),p1=contains("Third"))%>%
# Create a new column with Klopp and Flick voets
mutate(Klopp=case_when(p5=="Jürgen Klopp"~5 ,p3=="Jürgen Klopp"~3 ,p1=="Jürgen Klopp"~1,TRUE ~0),
Flick=case_when(p5=="Hans-Dieter Flick"~5,p3=="Hans-Dieter Flick"~3,p1=="Hans-Dieter Flick"~1,TRUE ~0))%>%
# Select variables that we need
select(-p1,-p3,-p5,-Name)
head(df)
## Vote region Klopp Flick
## 1 Captain Afghanistan 5 3
## 2 Captain Albania 5 3
## 3 Captain Algeria 3 0
## 4 Captain American Samoa 5 0
## 5 Captain Andorra 3 5
## 6 Captain Angola 1 5
Good! This looks reasonable. I have not create a map for a while, so let us create a map. We need to adjust some of the region names before we can create a map. We also need to collapse Northern Ireland, England, Wales, and Scotland to United Kingdom.
# Collapse UK votes to one
uk<-df%>%
filter(region%in%c("England","Wales", "Northern Ireland", "Scotland"))%>%
group_by(Vote)%>%
summarise(Klopp=sum(Klopp),Flick=sum(Flick))%>%
mutate(region="UK")
# Append UK to original dataset
df<-rbind(df,uk)
# Change country names (this is a bit ad hoc)
df<-df%>%
mutate(region=case_when(region=="Chinese Taipei"~"Taiwan",
region=="China PR"~"China",
region=="Congo"~"Republic of Congo",
region=="Congo DR"~"Democratic Republic of the Congo",
region=="IR Iran"~"Iran",
region=="Antigua and Barbuda"~"Antigua",
region=="Côte d'Ivoire"~"Ivory Coast",
region=="Korea Republic"~"South Korea",
region=="Korea DPR"~"North korea",
region=="Trinidad and Tobago"~"Trinidad",
region=="Republic of Ireland"~"Ireland",
region=="Mozambique"~"Mayotte",
region=="Eswatini"~"Swaziland",
TRUE~region))
In the data we have votes from the captain and coach of the national team and from a media representative. We are just going to focus on the sum of the difference in votes for Klopp and the votes for Flick across these three.
# Sum difference in votes
md<-df%>%
mutate(dif=Klopp-Flick)%>%
group_by(region)%>%
summarise(difference=sum(dif))
head(md)
## # A tibble: 6 x 2
## region difference
## <chr> <dbl>
## 1 Afghanistan 1
## 2 Albania 9
## 3 Algeria -1
## 4 American Samoa 10
## 5 Andorra -12
## 6 Angola -8
Excellent! We have a dataset with two columns. The first column contains the country name. The second column the vote difference.
Getting non-standard fonts into my charts in R has always been a pain for me. Especially on Windows systems. It seems to work a lot easier on Linux systems. Now I am on a Windows system and it was time to check whether there was a new convenient solution. It turns out that showtext has the potential of becoming a game changer. Let’s just follow the default example and add a google font:
# Load library
library("showtext")
## Add google font
font_add_google("Schoolbell", "bell")
showtext_auto()
We have the dataset with the fonts. We have the fonts. Now we need just need a dataset with the values for the polygons. We use map_data()
from ggplot2 for that.
# Load library
library("ggplot2")
# Extract Map data
world_map <- map_data("world")
# Merge map data to data on votes
mapdata <- merge(md, world_map, by = "region",all=TRUE)
# Cleaning (remove Antarcatica and sort the data)
mapdata<-mapdata%>%
filter(!(region %in% c("Antarctica")))%>%
arrange(region,order)
And finally it is time for ggplot to do its work:
ggplot(mapdata, aes(x = long, y = lat,group = group)) +
geom_polygon(aes(fill=difference))+
theme_void()+ coord_quickmap()+
scale_fill_gradientn(breaks=seq(-16,13,4),labels=seq(-16,13,4),
colours=c("#bd2626", "grey", "#1a633a"))+
theme(legend.position = "bottom",
legend.text = element_text(family = "bell"),
legend.title = element_text(family = "bell",size=15),
plot.caption= element_text(family = "bell",size=11),
plot.title = element_text(hjust=0.5,size=25, family = "bell"))+
guides(fill = guide_colourbar(ticks = FALSE,
title.position = "top", label.hjust = .5,
title.hjust = 0.5,nbin=8, raster=F, barwidth=20,
label.position = "bottom"))+
labs(fill="Points for Klopp minus points for Flick",
title="FIFA Best Coach of 2020 Votes",
caption="The points are the total across media, coach, and captain.")
Voila.