The structure of this final exercise for the course is a bit different. You will create and work with R Markdown documents (with HTML output) to revisit some of the things we covered in the previous sessions. There are no coding tasks in this document.

We have created an R Markdown document (which we have also already knitted to create an HTML output file) that demonstrates some of the things you can do/create with R Markdown and repeats a few of the topics and steps we went through in the sessions before this one.

This document uses Gapminder data. You can find it in the exercises folder; it is called explore_gapminder.Rmd.

1

The first thing we want you to do is to open the explore_gapminder.Rmd file in RStudio and explore it a bit to see what it contains. You can also open explore_gapminder.html (in your browser) to see the .Rmd and the resulting output document side by side.
You can open the .Rmd file via the Files tab or the menu (File -> Open File) in RStudio.

You might notice that there are quite a few things specified in the YAML header. Let’s briefly go through them:

toc: true -> The document will contain a table of contents (ToC)

toc_depth: 3 -> The ToC will contain header levels 1 to 3

number_sections: true -> The sections divided by headers will be numbered

toc_float: true -> The ToC is floating, meaning that it stays visible next to the content while you scroll

code_folding: hide -> By default, the code chunks are hidden, but you can display them by clicking the Code buttons in the HTML document

theme: flatly -> The Bootswatch theme flatly is used to style the document

highlight: tango -> The document uses the Pandoc code highlighting style tango

code_download: true -> The document includes a button allowing you to download the full code

df_print: paged -> When data frames are printed in the document they are printed in paged tables
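
Put together, a YAML header with these options could look roughly like the sketch below. This is an illustration assembled from the options listed above, assuming they all sit under the html_document output format; the title is a made-up placeholder, and the actual header in explore_gapminder.Rmd may contain further fields or a different order.

```yaml
---
title: "Exploring Gapminder data"   # hypothetical title, not taken from the file
output:
  html_document:
    toc: true               # include a table of contents
    toc_depth: 3            # header levels 1 to 3
    number_sections: true   # number the sections
    toc_float: true         # keep the ToC visible while scrolling
    code_folding: hide      # hide code chunks by default
    theme: flatly           # Bootswatch theme
    highlight: tango        # Pandoc highlighting style
    code_download: true     # add a button to download the code
    df_print: paged         # print data frames as paged tables
---
```

Changing one of these values and knitting again is a quick way to see what each option does.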

NB: To knit the document you need to have the packages it uses installed. These are the following ones: rmarkdown, knitr, tidyverse, visdat, janitor, pander, patchwork, correlation, GGally, broom, sjPlot, scales.

An easy way to check whether these packages are installed, install any that are missing, and load them all in one go is the packages() function from the easypackages package. To use it for this purpose, you can run the following code:

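# Install easypackages if it is missing, then load it
if (!requireNamespace("easypackages", quietly = TRUE)) install.packages("easypackages")
library(easypackages)

# Install (if needed) and load all required packages without prompting
packages("rmarkdown", "knitr", "tidyverse", "visdat", "janitor", "pander", "patchwork", "correlation", "GGally", "broom", "sjPlot", "scales", prompt = FALSE)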

Feel free to play around a bit with the explore_gapminder.Rmd and its output (you can also change parts of the YAML header to see how that influences the output).

2


Use data transformations and visualizations to answer the following questions.

a) Which five countries had the largest population size in 2018, and how has their population size changed since 1960?
  

3



b) Rank the fertility rate (`fert`) in 2018 for countries such as Vietnam, Poland, Afghanistan, Armenia, and Denmark.
c) What is the relationship between GDP and fertility? Visualize it.
  

4



d) If you look at the continent-level relationship between per-capita GDP and life expectancy (`life_exp`), what are the trends for the continents in 2018? Visualize it.
  

5

The big task for this exercise is to apply the same workflow to the Titanic data set (see data/titanic.csv and the codebook titanic_codebook.csv). Using explore_gapminder.Rmd as a starting point and guide, do the following in this document:

  1. Load and wrangle the data the same way as before to create a subset for the exploratory data analysis (e.g., select variables such as demographic variables, handle missing values, recode values, and calculate aggregate measures).

     Select key demographic and survival-related variables: Survived, Pclass, Sex, Age, Fare, Embarked, SibSp, Parch.

  2. Get an overview of the missing data using the vis_miss() function from the visdat package.

  3. Explore variable distributions. Look at the relative frequencies for the variables Sex, Survived, and Pclass using a function from the janitor package.

  4. Create bar plots with ggplot2 to visualize the relative frequencies (percentages) for the variables Pclass and Sex.

  5. Using the pander package, include a table with the output of the base R function summary() for variables such as Sex and Fare.

  6. Create ggplot2 boxplots to show differences in Age by Survived, and differences in Fare by Pclass and Sex.

  7. Calculate correlations for the continuous numeric variables (Age, Fare).

  8. Create a plot with the GGally package to visualize these correlations.

  9. Fit a logistic regression model using glm() with Survived as the outcome variable. Include predictors such as Pclass, Sex, Age, Fare, SibSp, and Parch.

  10. Visualize the regression results using the sjPlot package.

This is a lot, but you can find template code (explore_gapminder.Rmd) for most of this in the solutions folder (and the rest in the slides for the previous sessions).
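
If the logistic regression step is unfamiliar, the following minimal sketch shows the general shape of a glm() call with a binary outcome. It uses only base R and a small simulated data frame, so the values are invented and the results say nothing about the real Titanic data. The object names toy, fit, and odds_ratios are placeholders chosen for this illustration, and only a subset of the suggested predictors is included.

```r
# Toy stand-in for data/titanic.csv: all values are simulated for illustration.
set.seed(42)
n <- 200
toy <- data.frame(
  Pclass = factor(sample(1:3, n, replace = TRUE)),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age    = round(runif(n, min = 1, max = 70)),
  Fare   = round(rexp(n, rate = 1 / 30), 2)
)

# Simulate a binary outcome that depends on class, sex, and age
linear_part <- 1 -
  0.8 * (as.integer(toy$Pclass) - 1) -
  1.5 * (toy$Sex == "male") -
  0.02 * toy$Age
toy$Survived <- rbinom(n, size = 1, prob = plogis(linear_part))

# Logistic regression: family = binomial tells glm() the outcome is binary
fit <- glm(Survived ~ Pclass + Sex + Age + Fare,
           data = toy,
           family = binomial(link = "logit"))

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios <- exp(coef(fit))

summary(fit)
```

With the real data, you would replace toy with your wrangled Titanic subset and add the remaining predictors to the formula.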