Seminar Learning Goals

  1. Analyse covid-19 data trends

  2. Run and extend some simple R code to compute book income

  3. Exercise answers

0. Worksheet Introduction

Pre-requisites

You should:

  1. have completed the Week 1 Joining Seminar
  2. have (re-)acquainted yourself with the Week 1 Seminar worksheet
  3. be familiar with (i.e., have listened to/read) the Week 2 Lecture “The Richness of Data”
  4. be able to write, edit, save and re-open your own RMarkdown files

If this is proving a challenge, see “Prerequisites for Week 2 Seminars and Labs (CS5701/02)” for advice.

This seminar worksheet is organised as an RMarkdown file. You can read it. You can run the embedded R and you can add your own R. I suggest you save it as another file so, if necessary, you can revert to the original.

Whenever you click Preview in the RStudio menu, the file will be rendered into nicely formatted HTML, which you can look at in the Viewing Window in RStudio or in any web browser. You may find this easier to read. However, if you want to make any changes you must edit the .rmd file, i.e., the RMarkdown in the Edit Pane.

Remember, you are encouraged to explore and experiment. Change something and see what happens!

As per last week we will cover a lot of new ground but don’t be discouraged. We will revisit these concepts over the following weeks to help you consolidate your understanding.

1. Visualising covid-19 data trends

This example shows how we can fetch and visualise covid-19 data from Johns Hopkins University via GitHub.

Initialisation

We need some packages over and above base R. Since we may not be sure whether they are already installed we test for their presence. Most packages come from CRAN and are easy to install using install.packages() but the package {tidycovid19} is on GitHub (joachim-gassen) so we also need {devtools} in order to install packages that aren’t on CRAN.

This R code may appear daunting but don’t worry. We will revisit it in detail in Week 3. For the time being, see it as a way to install and load necessary extra functionality beyond base R.

# If a package is already installed it is simply loaded. Any missing
# package is first installed from CRAN and then loaded.

# The packages we need are:
 
packages = c("tidyverse", "devtools")

# Load the package or install and load it
package.check <- lapply(
  packages,
  FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
      install.packages(x, dependencies = TRUE)
      library(x, character.only = TRUE)
    }
  }
)

install_github("joachim-gassen/tidycovid19")
Skipping install of 'tidycovid19' from a github remote, the SHA1 (78254dca) has not changed since last install.
  Use `force = TRUE` to force installation
library(tidycovid19)
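
If you would like to see which packages are missing before anything is installed, a minimal sketch is given below. The helper name missing_packages() is made up for this illustration and is not part of the worksheet; requireNamespace() is a base R function that reports whether a package is available without attaching it.

```r
# Hypothetical helper (not part of the worksheet): return the names of
# the packages in pkgs that are NOT currently installed.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE if the package can be loaded
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "stats" and "utils" ship with R, so nothing should be reported as missing
missing_packages(c("stats", "utils"))
```

Calling missing_packages(packages) with the vector defined above would list whichever of those packages still needs to be installed.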

Download the data (cached on GitHub rather than directly from Johns Hopkins University). This is live data updated within the last 24 hours.

#Download the data into a data frame called cv.df using the 
#download_jhu_csse_covid19_data() function from the {tidycovid19} package.
#
cv.df <- download_jhu_csse_covid19_data(cached = TRUE)
Start downloading JHU CSSE Covid-19 data

Downloading cached version of JHU CSSE Covid 19 data...done. Timestamp is 2020-10-06 03:00:10

Data Info:

The COVID-19 Data Repository by the Johns Hopkins University Center for Systems
Science and Engineering (JHU CSSE) relies upon publicly available data from
multiple sources that do not always agree. It is updated daily. The data comes
in three data frames that you can select by the 'type' parameter. The 'country'
data frame contains the global country-level data reported by JHU CSSE by
aggregating over the regional data for countries that have regional data
available. The 'country_region' data frame provides regional data for the
countries that have regional data available (mosty Australia, Canada and China).
The 'us_county' data frame reports the data for the U.S. at the county level.
The column 'timestamp' reports the time the data was downloaded from its
authoritative source.

For further information refer to: https://github.com/CSSEGISandData/COVID-19.

Exercise 2.1: The dataframe, which comprises all international covid-19 data recorded by Johns Hopkins since January 22, 2020, has 47545 observations of 7 variables (see the Environment Pane). Is this large? How much larger a dataframe can R handle?

Explore the data

Let’s focus on the UK and then “eyeball” the data again.

# select only the UK data
cv.uk.df <- subset(cv.df, iso3c=="GBR")

head(cv.uk.df)
tail(cv.uk.df)
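
Besides head() and tail(), a few other base R functions are handy for eyeballing a dataframe. The sketch below uses a tiny made-up dataframe (toy.df, with illustrative values only) in place of cv.df so that it runs without downloading anything.

```r
# A tiny made-up data frame standing in for cv.df (values are illustrative only)
toy.df <- data.frame(
  iso3c     = c("GBR", "FRA", "GBR", "DEU"),
  confirmed = c(10, 20, 30, 40)
)

# Keep only the rows where iso3c is "GBR"
toy.uk.df <- subset(toy.df, iso3c == "GBR")

dim(toy.uk.df)                # number of rows and columns
str(toy.uk.df)                # the type of each variable
summary(toy.uk.df$confirmed)  # minimum, quartiles, mean and maximum
```

The same three calls can be applied to cv.uk.df once it has been created.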

2. Exercise answers

2.1: Although ~47.5K observations might seem large, in reality this only occupies 2.6 MB, which is <0.1% of the capacity of R on a fairly standard laptop or PC.
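
One way to check memory use for yourself is the base R function object.size(). The sketch below builds a made-up dataframe (fake.df) with the same number of rows and columns as cv.df; because the real data mixes text, dates and numbers, its size will differ somewhat from this all-numeric stand-in.

```r
# Made-up data frame with the same shape as cv.df: 47545 rows, 7 numeric columns
n <- 47545
fake.df <- as.data.frame(matrix(0, nrow = n, ncol = 7))

dim(fake.df)                                # rows and columns
format(object.size(fake.df), units = "Mb")  # approximate memory used
```

Running object.size(cv.df) in your own session gives the figure for the real data.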

2.2: Given the wide range of values for daily infections a log10 scale makes the plot easier to view, particularly for the smaller values.
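
To see why a log10 scale helps, here is a small sketch with made-up counts (not taken from the covid-19 data) that span several orders of magnitude.

```r
# Made-up daily counts spanning several orders of magnitude
counts <- c(1, 10, 100, 1000, 10000)

# On a log10 scale each tenfold increase becomes one equal step
log10(counts)   # 0 1 2 3 4
```

On the original scale the first three values would be squashed together near zero; on the log10 scale they are evenly spaced.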

2.3: There is a clear weekly cycle (or we can say the periodicity is 7-days). This is true of many countries. Why do you think this might be?
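
If you want to check for a weekly cycle numerically, one possible approach is to average the counts for each day of the week. The sketch below uses made-up dates and counts (dates, new.cases), not the real UK data.

```r
# Made-up data: 14 consecutive days with lower counts on two days of each week
dates <- seq(as.Date("2020-09-07"), by = "day", length.out = 14)
new.cases <- rep(c(100, 110, 105, 120, 115, 60, 50), times = 2)

# "%u" gives the day of the week as a number (1 = Monday, ..., 7 = Sunday)
day.of.week <- format(dates, "%u")

# Mean count for each day of the week
weekday.means <- tapply(new.cases, day.of.week, mean)
weekday.means
```

A clear dip on the same days every week in the resulting means is what a 7-day periodicity looks like in numbers.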

2.4: This shows the 95% confidence limit since there is an element of uncertainty as to exactly where the trend line should be. The broader the confidence limit band (shaded pale orange) the less confident we are about the exact location of the trend. Where the confidence limit potentially goes negative (which would be meaningless) we do not plot it. This principally occurs for the daily new death rate trend since many values are (mercifully) close to zero.

2.5: max(cv.uk.df$new.i)
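
If the vector happens to contain missing values (NA), max() on its own returns NA. The sketch below uses a made-up vector (toy.new.i) to show how this can be handled; whether new.i actually contains any NA values is something to check in your own data.

```r
# Made-up daily counts; NA stands for a day with no reported value
toy.new.i <- c(5, NA, 12, 7)

max(toy.new.i)                # NA, because one value is missing
max(toy.new.i, na.rm = TRUE)  # 12, ignoring the missing value
which.max(toy.new.i)          # 3, the position of the largest value
```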

2.6: Good luck!

2.7: No, new.i is not stored as an integer. One way to find out is to use the built-in function is.integer().

is.integer(cv.uk.df$new.i)
[1] FALSE

If you wanted new.i to be an integer you would need to code something like

cv.uk.df$new.i <- as.integer(cv.uk.df$new.i)

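or create integer values directly with the L suffix, e.g., x <- 10L, where L is short for Long, which is a long story! (Note that an assignment like cv.uk.df$new.i <- 10L would overwrite every value of new.i with 10.)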
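
The difference between doubles and integers can be seen in a short sketch; the variable names x and y are just for illustration.

```r
x <- 10     # a plain number is stored as a double
y <- 10L    # the L suffix makes it an integer

is.integer(x)     # FALSE
is.integer(y)     # TRUE
typeof(x)         # "double"
as.integer(3.9)   # 3: the fractional part is dropped, not rounded
```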

2.8: cv.uk.df$recovered[236]. Remember, you need to use the $ operator to reference a particular vector (variable) in the dataframe cv.uk.df, and square brackets to pick out a single element of that vector.

2.9:

cv.uk.df$recovered[236] <- 0
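
Reading and overwriting a single element can be tried out safely on a small made-up dataframe first. In the sketch below, toy.df and its values are illustrative only.

```r
# Made-up data frame standing in for cv.uk.df
toy.df <- data.frame(recovered = c(3, 8, NA, 15))

toy.df$recovered[3]        # read the 3rd value of the recovered vector
toy.df$recovered[3] <- 0   # overwrite just that one value
toy.df$recovered           # the other values are unchanged
```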
