The structure of this final exercise for the course is a bit
different. You will create and work with R Markdown documents
(producing HTML output) to revisit some of the things we
covered in the previous sessions. There are no coding tasks in
this document.
We have created an R Markdown document (which we have
also already knitted to create an HTML output file) that
demonstrates some of the things you can do/create with
R Markdown and repeats a few of the topics and steps we
went through in the sessions before this one.
This document uses Gapminder data. You can find it in the
exercises folder, where it is called
explore_gapminder.Rmd.
Open the explore_gapminder.Rmd file in RStudio and explore
it a bit to see what it contains. You can also open
explore_gapminder.html (in your browser) to see the
.Rmd and the resulting output document side-by-side.
You can open the .Rmd file via the Files tab or
the menu (File -> Open File) in
RStudio.
You might notice that there are quite a few things specified in the
YAML header. Let’s briefly go through them:
toc: true -> The document will contain a table of contents (ToC)
toc_depth: 3 -> The ToC will contain header levels 1 to 3
number_sections: true -> The sections divided by headers will be numbered
toc_float: true -> The ToC is floating, meaning that it moves when you scroll
code_folding: hide -> By default, the code chunks are hidden, but
you can display them by clicking the Code buttons in the
HTML document
theme: flatly -> The Bootswatch theme flatly is used to style the document
highlight: tango -> The document uses the Pandoc code highlighting style tango
code_download: true -> The document includes a button allowing you to download the full code
df_print: paged -> When data frames are printed in the document they are printed in paged tables
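Put together, the options above would sit in the YAML header roughly as follows. This is a minimal sketch, assuming they are all nested under the html_document output format; the title is a hypothetical placeholder, and the real header in explore_gapminder.Rmd may contain further fields.

```yaml
---
title: "Exploring Gapminder data"   # hypothetical title, not taken from the file
output:
  html_document:
    toc: true               # add a table of contents
    toc_depth: 3            # include header levels 1 to 3
    number_sections: true   # number the sections
    toc_float: true         # floating table of contents
    code_folding: hide      # hide code chunks behind Code buttons
    theme: flatly           # Bootswatch theme
    highlight: tango        # Pandoc highlighting style
    code_download: true     # add a button to download the code
    df_print: paged         # print data frames as paged tables
---
```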
NB: To knit the document you need to have the
packages it uses installed. These are the following ones:
rmarkdown, knitr, tidyverse,
visdat, janitor, pander,
patchwork, correlation, GGally,
broom, sjPlot, scales.
An easy way to check whether these packages are installed,
install any that are missing, and load them all in one go
is the packages() function from the
easypackages package. To use it for this purpose, you can
run the following code:
if (!requireNamespace("easypackages", quietly = TRUE)) install.packages("easypackages")
library(easypackages)
# Install any missing packages without prompting, then load them all
packages("rmarkdown", "knitr", "tidyverse", "visdat", "janitor", "pander",
         "patchwork", "correlation", "GGally", "broom", "sjPlot", "scales",
         prompt = FALSE)
Feel free to play around a bit with the
explore_gapminder.Rmd and its output (you can also change
parts of the YAML header to see how that influences the
output).
Use data transformations and visualizations to answer the following questions.
a) Which five countries had the largest population size in 2018, and how has their population size changed since 1960?
b) Rank the fertility rate (`fert`) in 2018 for these countries: Vietnam, Poland, Afghanistan, Armenia, Denmark.
c) What is the relationship between GDP and fertility? Visualize it.
d) At the continent level, what is the relationship between per-capita GDP and life expectancy (`life_exp`) in 2018? Visualize it.
The big task for this exercise is to apply the same workflow to the
Titanic data set (see data/titanic.csv and the
codebook titanic_codebook.csv). Using
explore_gapminder.Rmd as a starting point and guide, do
the following in this document:
Select key demographic and survival-related variables:
Survived, Pclass, Sex,
Age, Fare, Embarked,
SibSp, Parch
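As a minimal sketch of the selection step, the code below uses dplyr's select() on a simulated stand-in data frame. The object names titanic_toy and titanic_sel, the two extra columns, and all values are assumptions for illustration; in the exercise you would apply the same select() call to the data read from data/titanic.csv.

```r
library(dplyr)

# Simulated stand-in with the exercise's column names plus two extra columns.
# The values are random and are NOT the real Titanic data.
set.seed(42)
n <- 50
titanic_toy <- data.frame(
  PassengerId = seq_len(n),                      # extra column (assumed)
  Name        = paste("Passenger", seq_len(n)),  # extra column (assumed)
  Survived    = rbinom(n, 1, 0.4),
  Pclass      = sample(1:3, n, replace = TRUE),
  Sex         = sample(c("male", "female"), n, replace = TRUE),
  Age         = round(runif(n, 1, 80)),
  Fare        = round(rexp(n, 1 / 30), 2),
  Embarked    = sample(c("C", "Q", "S"), n, replace = TRUE),
  SibSp       = rpois(n, 0.5),
  Parch       = rpois(n, 0.4)
)

# Keep only the demographic and survival-related variables
titanic_sel <- titanic_toy %>%
  select(Survived, Pclass, Sex, Age, Fare, Embarked, SibSp, Parch)

glimpse(titanic_sel)
```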
Get an overview of the missing data using the
vis_miss() function from the visdat
package.
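A small sketch of what the missing-data overview could look like. The data frame is simulated (titanic_toy is a hypothetical name), and some Age values are blanked out at random purely so that the plot has something to show.

```r
library(visdat)

# Simulated stand-in data; values are random, NOT the real Titanic data
set.seed(42)
n <- 50
titanic_toy <- data.frame(
  Survived = rbinom(n, 1, 0.4),
  Age      = round(runif(n, 1, 80)),
  Fare     = round(rexp(n, 1 / 30), 2)
)

# Set 10 randomly chosen ages to missing for illustration
titanic_toy$Age[sample(n, 10)] <- NA

# One column per variable, one row per observation, missing cells highlighted
p_miss <- vis_miss(titanic_toy)
p_miss
```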
Explore variable distributions. Use a function from the
janitor package to look at the relative frequencies
of Sex, Survived, and
Pclass.
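One function from janitor that fits this step is tabyl(), which returns counts (n) and proportions (percent) for a variable. The sketch below assumes that this is the intended function and runs it on simulated stand-in data (titanic_toy is a hypothetical name).

```r
library(janitor)

# Simulated stand-in data; values are random, NOT the real Titanic data
set.seed(42)
n <- 50
titanic_toy <- data.frame(
  Sex      = sample(c("male", "female"), n, replace = TRUE),
  Survived = rbinom(n, 1, 0.4),
  Pclass   = sample(1:3, n, replace = TRUE)
)

# Frequency table with counts and proportions
sex_tab <- tabyl(titanic_toy, Sex)

# Show the proportions as percentages with one decimal place
adorn_pct_formatting(sex_tab, digits = 1)

# The same call works for the other two variables
adorn_pct_formatting(tabyl(titanic_toy, Survived), digits = 1)
adorn_pct_formatting(tabyl(titanic_toy, Pclass), digits = 1)
```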
Create bar plots with ggplot2 to visualize the
relative frequencies (percentages) for the variables Pclass
and Sex.
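One possible way to get percentages on the y-axis is to let geom_bar() count the rows and rescale the counts with after_stat(). This is a sketch on simulated stand-in data, not necessarily the approach used in explore_gapminder.Rmd; titanic_toy and the plot object names are hypothetical.

```r
library(ggplot2)

# Simulated stand-in data; values are random, NOT the real Titanic data
set.seed(42)
n <- 50
titanic_toy <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = sample(c("male", "female"), n, replace = TRUE)
)

# Bar heights are counts divided by the total, shown as percentages
p_pclass <- ggplot(titanic_toy,
                   aes(x = factor(Pclass), y = after_stat(count / sum(count)))) +
  geom_bar() +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = "Pclass", y = "Relative frequency")

p_sex <- ggplot(titanic_toy,
                aes(x = Sex, y = after_stat(count / sum(count)))) +
  geom_bar() +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = "Sex", y = "Relative frequency")

p_pclass
p_sex
```

Pclass is wrapped in factor() so that ggplot2 treats it as a category rather than a continuous number.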
Using the pander package, include a table with the
output of the base R function summary() for
variables such as Sex and Fare.
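A minimal sketch of this step on simulated stand-in data (titanic_toy is a hypothetical name). Sex is stored as a factor here, on the assumption that you want summary() to report category counts; for a character column it would only report length, class, and mode.

```r
library(pander)

# Simulated stand-in data; values are random, NOT the real Titanic data
set.seed(42)
n <- 50
titanic_toy <- data.frame(
  Sex  = factor(sample(c("male", "female"), n, replace = TRUE)),
  Fare = round(rexp(n, 1 / 30), 2)
)

# summary() on a data frame returns a table; pander() renders it as Markdown
sum_tab <- summary(titanic_toy[, c("Sex", "Fare")])
pander(sum_tab)
```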
Create ggplot2 boxplots to show
differences in Age by Survived, and differences in Fare by Pclass and
Sex.
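The two boxplots could be sketched as follows. The data are simulated (titanic_toy is a hypothetical name), and mapping Sex to the fill colour is just one of several ways to show Fare by both Pclass and Sex.

```r
library(ggplot2)

# Simulated stand-in data; values are random, NOT the real Titanic data
set.seed(42)
n <- 100
titanic_toy <- data.frame(
  Survived = rbinom(n, 1, 0.4),
  Pclass   = sample(1:3, n, replace = TRUE),
  Sex      = sample(c("male", "female"), n, replace = TRUE),
  Age      = round(runif(n, 1, 80)),
  Fare     = round(rexp(n, 1 / 30), 2)
)

# Age by survival status (Survived is coded 0/1, so treat it as a category)
p_age <- ggplot(titanic_toy, aes(x = factor(Survived), y = Age)) +
  geom_boxplot() +
  labs(x = "Survived", y = "Age")

# Fare by class, with one box per sex within each class
p_fare <- ggplot(titanic_toy, aes(x = factor(Pclass), y = Fare, fill = Sex)) +
  geom_boxplot() +
  labs(x = "Pclass", y = "Fare")

p_age
p_fare
```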
Calculate correlations for continuous numeric variables
(Age, Fare).
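As a sketch, assuming the correlation package from the package list is the intended tool, correlation() can be called on a data frame that holds only the numeric columns. Base R's cor() is shown as a cross-check. The data are simulated and titanic_toy is a hypothetical name.

```r
library(correlation)

# Simulated stand-in data; values are random, NOT the real Titanic data
set.seed(42)
n <- 100
titanic_toy <- data.frame(
  Age  = round(runif(n, 1, 80)),
  Fare = round(rexp(n, 1 / 30), 2)
)

# Pearson correlation (the default) with confidence interval and p-value
cor_res <- correlation(titanic_toy[, c("Age", "Fare")])
cor_res

# Base R cross-check; use = "complete.obs" drops rows with missing values
r_base <- cor(titanic_toy$Age, titanic_toy$Fare, use = "complete.obs")
r_base
```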
Create a plot with the GGally package to visualize
these correlations.
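One GGally function that suits this step is ggpairs(), which by default combines scatter plots, density curves, and the correlation coefficient in a single plot matrix. The sketch below assumes that this is the kind of plot intended and uses simulated stand-in data (titanic_toy is a hypothetical name).

```r
library(GGally)

# Simulated stand-in data; values are random, NOT the real Titanic data
set.seed(42)
n <- 100
titanic_toy <- data.frame(
  Age  = round(runif(n, 1, 80)),
  Fare = round(rexp(n, 1 / 30), 2)
)

# Plot matrix for the two continuous variables
p_pairs <- ggpairs(titanic_toy[, c("Age", "Fare")])
p_pairs
```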
Fit a logistic regression model using glm()
with Survived as the outcome variable. Include predictors
such as Pclass, Sex, Age, Fare, SibSp, and Parch.
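A minimal sketch of the model-fitting step. The data are simulated (the outcome is drawn from an arbitrary logistic model chosen only so that the fit behaves well), titanic_toy is a hypothetical name, and treating Pclass as a factor is a modelling choice rather than a requirement of the exercise.

```r
library(broom)

# Simulated stand-in data; values are random, NOT the real Titanic data
set.seed(42)
n <- 300
titanic_toy <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = sample(c("male", "female"), n, replace = TRUE),
  Age    = round(runif(n, 1, 80)),
  Fare   = round(rexp(n, 1 / 30), 2),
  SibSp  = rpois(n, 0.5),
  Parch  = rpois(n, 0.4)
)
# Arbitrary made-up relationship used only to generate an outcome
lin_pred <- 1 - 0.5 * titanic_toy$Pclass +
  1.5 * (titanic_toy$Sex == "female") - 0.02 * titanic_toy$Age
titanic_toy$Survived <- rbinom(n, 1, plogis(lin_pred))

# Logistic regression: binomial family with logit link
fit <- glm(
  Survived ~ factor(Pclass) + Sex + Age + Fare + SibSp + Parch,
  data   = titanic_toy,
  family = binomial(link = "logit")
)

# Coefficients as odds ratios with confidence intervals
tidy(fit, exponentiate = TRUE, conf.int = TRUE)
```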
Visualize the regression results using the sjPlot package.
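One way to do this with sjPlot is plot_model(), which for a logistic model draws a forest plot of odds ratios by default. The sketch below refits a model on simulated stand-in data so that it runs on its own; titanic_toy and the made-up relationship behind the outcome are assumptions for illustration.

```r
library(sjPlot)

# Simulated stand-in data; values are random, NOT the real Titanic data
set.seed(42)
n <- 300
titanic_toy <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = sample(c("male", "female"), n, replace = TRUE),
  Age    = round(runif(n, 1, 80)),
  Fare   = round(rexp(n, 1 / 30), 2)
)
# Arbitrary made-up relationship used only to generate an outcome
lin_pred <- 1 - 0.5 * titanic_toy$Pclass +
  1.5 * (titanic_toy$Sex == "female") - 0.02 * titanic_toy$Age
titanic_toy$Survived <- rbinom(n, 1, plogis(lin_pred))

fit <- glm(
  Survived ~ factor(Pclass) + Sex + Age + Fare,
  data   = titanic_toy,
  family = binomial(link = "logit")
)

# Forest plot of the estimates (odds ratios for a logistic model)
p_or <- plot_model(fit)
p_or
```

sjPlot also offers tab_model() if you prefer an HTML table of the same results.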
You can find code (see explore_gapminder.Rmd) for most of this in the
solutions folder (and the rest in the slides for the
previous sessions).