library(leaflet)
leaflet() %>% addTiles() %>% setView(-1.264, 51.752, zoom = 17)
R is free statistical programming software. You can download R here.
It's free!
It's open-source!
It's versatile! (In fact, all materials in this class have been created in R!)
This course is designed to introduce you to the basics of R programming.
By the end, you will know how to:
Generate and transform numeric, logical, and character vectors
Deal with missing values
Load and inspect data
Generate descriptives
library(coronavirus)
corona_de <- coronavirus[coronavirus$country == "Germany" & coronavirus$type == "death", ]
corona_de$death_7 <- zoo::rollmean(corona_de$cases, k = 7, fill = NA)
plot(x = corona_de$date, y = corona_de$cases, cex = 0.3)
lines(x = corona_de$date, y = corona_de$death_7, type = "l", cex = 1.5, col = "red")
This course only scratches the surface of what you can do in R.
For more elaborate introductions and more advanced guides, see the following (free!) books:
YaRrr! The Pirate’s Guide to R by Nathaniel D. Phillips
R for Data Science by Garrett Grolemund and Hadley Wickham, which introduces the tidyverse and tidy R, a different way of writing R code than base R.
The R community is very welcoming and inclusive. If you are feeling stuck on how R operates, chances are someone has had the same issue before.
Here are some helpful resources and great groups to join:
R and RStudio
Download and install R from here: https://cloud.r-project.org
For Macs, you may have to download different versions:
Intel chip (R-4.1.1.pkg)
Apple Silicon M1 chip (R-4.1.1-arm64.pkg)
Download RStudio here: https://www.rstudio.com/products/rstudio/download/
In this course we will be using R exclusively through RStudio.
When you open RStudio it should look something like this:
RStudio is an integrated development environment specifically developed for R that lets you write code, run scripts, and view the results all in one place.
Source: This is the code editor, where you write and save your code.
Console: This is where the output of your code will be printed.
Environment/History: This is where any objects, such as vectors, matrices, or dataframes, will be stored.
Viewer: This viewer previews any plots you create. You can also check your folder files and call for help here.
Always write code into the source code file, except for small checks and tests.
To execute the line of source code where the cursor currently resides, you can press Ctrl + Enter (Cmd + Enter on a Mac) rather than manually pressing the Run toolbar button.
Annotate your code using #
# Introduction to R
# 07.09.2021
# Use head(x, n = 2) to see the first two rows of a dataframe
head(mtcars, n = 2)
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
R comes with many built-in functions, but often you will want to use functions written not by the original creators of R, but by other people.
Functions written by other people are distributed as packages.
To use one, we first have to install the package once, and then load it whenever we would like to use it.
# install the package (you only need to do this once)
install.packages('praise')
# load the package to use its functions
library(praise)
praise()
## [1] "You are awesome!"
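Because a package only needs to be installed once, it can be handy to check whether it is already there before installing. The following is a minimal sketch of that idea, not part of the original slides; it uses the base R function requireNamespace(), and the package name "stats" is chosen only because it ships with every R installation.

```r
# Check whether a package is installed before trying to install it.
# "stats" ships with R, so this check returns TRUE;
# swap in any other package name, such as "praise".
pkg <- "stats"
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (!is_installed) {
  install.packages(pkg)  # only runs when the package is missing
}
is_installed
## [1] TRUE
```

With the check in place, running the script a second time skips the installation step.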
Credit: YaRrr! The Pirate's Guide to R
Sometimes you only need to use a very specific function of a package one time, and loading the entire package may seem unnecessary.
You can use package::function to call the function you are after. This tells R to look up the function in that package for this one call, without attaching the whole package with library().
cowsay::say("Welcome to the course!", by = "cow")
## 
##  ----- 
## Welcome to the course! 
##  ------ 
##     \   ^__^ 
##      \  (oo)\ ________ 
##         (__)\         )\ /\ 
##              ||------w|
##              ||      ||
You can use ? whenever you want to read the documentation of a particular command.
# how should I specify the mean and standard deviation of a normal distribution?
?rnorm
# how does a histogram work?
?hist
# how does the mean() function work?
?mean
Object types
Vectors
Missingness
Vector functions
Dataframes
Loading data
Almost everything in R is either an object or a function.
Object: number, vector, dataset, summary statistic, regression model, etc.
Function: takes objects as arguments, does something, and returns an object.
# Create a vector object called height
height <- c(189, 178, 166, 178, 190)
# apply the mean() function to the object height
mean(height)
## [1] 180.2
→ The function mean() takes the object height, calculates the average, and returns a single number.
When you use R, you will mostly:
Define objects
Apply functions to those objects
Repeat!
3 + 5
## [1] 8
10 / 2
## [1] 5
sqrt(4)
## [1] 2
"Hello world!"
## [1] "Hello world!"
"1" + "3"
## Error in "1" + "3": non-numeric argument to binary operator
<- operator
You can assign values to variables using the <- operator. You can then use the variable in subsequent operations.
x <- 9 + 11
x
## [1] 20
y <- x / 2
y
## [1] 10
greetings <- "Hello world!"
greetings
## [1] "Hello world!"
Just by looking at the code, what does each of the following lines return?
12 - 2
# A:
x <- 12 - 2
y <- x * 2
y
y / 2
y
z <- "1 + 2"
z
z + 3
We can create longer vectors by using c() (read: concatenate).
w <- 2
y <- c(1, 2, 3)
z <- c(4, 5, 6)
z
## [1] 4 5 6
welcome <- c("Welcome", "to", "this", "course!")
welcome
## [1] "Welcome" "to"      "this"    "course!"
For longer vectors, writing out each element can be tedious. In addition to c(), there are other options.
Function | Example | Result |
---|---|---|
c(a, b, ...) | c(1, 5, 9) | 1, 5, 9 |
a:b | 1:5 | 1, 2, 3, 4, 5 |
seq(from, to, by, length.out) | seq(from = 0, to = 6, by = 2) | 0, 2, 4, 6 |
rep(x, times, each, length.out) | rep(c(7, 8), times = 2, each = 2) | 7, 7, 8, 8, 7, 7, 8, 8 |
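The functions in the table take further arguments that are easy to overlook. As a small illustrative sketch (the object names grid, groups, and countdown are made up for this example), length.out fixes the number of elements instead of the step size, each repeats elements without repeating the whole vector, and a:b can also count downwards.

```r
# Five evenly spaced values between 0 and 1
grid <- seq(from = 0, to = 1, length.out = 5)
grid
## [1] 0.00 0.25 0.50 0.75 1.00

# Repeat each element twice, without repeating the whole vector
groups <- rep(c("a", "b"), each = 2)
groups
## [1] "a" "a" "b" "b"

# Count down from 5 to 1
countdown <- 5:1
countdown
## [1] 5 4 3 2 1
```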
While numeric vectors can include any number and character vectors any character string, logical vectors can only take the values TRUE or FALSE.
Logical vectors are therefore often used to distinguish between two groups, or to select a certain subset of values.
In the example below, we create a logical vector that distinguishes between ages below 18 and ages of 18 or above.
age <- c(14, 19, 23, 13, 16, 19, 18)
is_18 <- age >= 18
is_18
## [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
The vector is_18 is TRUE when age is 18 or higher, and FALSE otherwise.
In the previous example we used >= to distinguish between ages below 18 and ages of 18 or above.
Some logical operators include:
Operator | Description |
---|---|
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
== | exactly equal to |
!= | not equal to |
!x | not x |
x & y | x AND y |
isTRUE(x) | test if x is TRUE |
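Operators from the table can be combined in a single expression. The sketch below reuses the age vector from the earlier example; the name is_16_to_19 is made up for illustration.

```r
age <- c(14, 19, 23, 13, 16, 19, 18)

# TRUE for ages from 16 up to (but not including) 20
is_16_to_19 <- age >= 16 & age < 20
is_16_to_19
## [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE

# Negate the result with !
!is_16_to_19
## [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE

# Compare every element against a single value with ==
age == 19
## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
```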
When we deal with data in the real world, there are often lots of missing values.
Missing values are denoted with NA.
NAs behave differently from other values.
num_vec <- c(5, NA, 15, 20, 25, NA)
num_vec / 5
## [1]  1 NA  3  4  5 NA
NA + 3
## [1] NA
c("hello", "my", "name", "is", NA)
## [1] "hello" "my"    "name"  "is"    NA
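One consequence of this behaviour is that comparisons involving NA also return NA, so == cannot be used to find missing values. A short sketch, using the base R function is.na() and a made-up object name (missing):

```r
# Comparing with NA gives NA, not TRUE or FALSE
NA == 5
## [1] NA
NA == NA
## [1] NA

# is.na() is the reliable way to test for missing values
num_vec <- c(5, NA, 15, 20, 25, NA)
missing <- is.na(num_vec)
missing
## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE
```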
Length: Checks the length of a vector
x <- 2
y <- c(0.5, 45, 7, 45, 0.5)
z1 <- c("1 2 3 4 5 ", "6 7 8")
length(x)
## [1] 1
length(y)
## [1] 5
# z1 holds two strings, so its length is 2
length(z1)
## [1] 2
Sorting/unique: sorts or displays the unique values of a vector
sort(y)
## [1]  0.5  0.5  7.0 45.0 45.0
unique(y)
## [1]  0.5 45.0  7.0
Function | Example | Result |
---|---|---|
sum(x), prod(x) | sum(1:10) | 55 |
min(x), max(x) | min(1:10) | 1 |
mean(x), median(x) | mean(1:10) | 5.5 |
sd(x), var(x), range(x) | sd(1:10) | 3.0276504 |
summary(x) | summary(1:10) | Min. = 1.00, 1st Qu. = 3.25, Median = 5.50, Mean = 5.50, 3rd Qu. = 7.75, Max. = 10.00 |
Copy the following two vectors:
age <- c(22, 24, 25, 25, 22, 21, 28, 23, 24, 27)
welcome <- c("Welcome", "to", "this", "course!")
Use R to generate:
The unique values in age.
The length of welcome.
The mean of age.
Round the mean of age to 0 decimals. Hint: You can use the round function, and see how it works using ?round.
Can you compute the mean of welcome? Why/why not?
If you look at the output of welcome, each word is enclosed in its own quotation marks.
welcome
## [1] "Welcome" "to"      "this"    "course!"
This is because R treats each of these words as a separate element in a vector.
There are times when we might want to tell R to collapse the vector into a single string. We can do this using the paste() function, setting the collapse argument to the separator we want between the elements, for example collapse = " ".
welcome2 <- paste(welcome, collapse = " ")
welcome2
## [1] "Welcome to this course!"
welcome3 <- paste(welcome, collapse = "")
welcome3
## [1] "Welcometothiscourse!"
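paste() also has a sep argument, which is easy to confuse with collapse. The sketch below is meant only as an illustration of the difference; the vectors first and last and their contents are made up for this example.

```r
first <- c("Ada", "Grace")
last <- c("Lovelace", "Hopper")

# sep joins the vectors element by element: the result still has two elements
full <- paste(first, last, sep = " ")
full
## [1] "Ada Lovelace" "Grace Hopper"

# collapse then merges those elements into a single string
one_string <- paste(full, collapse = ", ")
one_string
## [1] "Ada Lovelace, Grace Hopper"
```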
A lot of descriptive functions will return NA when there are missing values.
num_vec <- c(5, NA, 15, 20, 25, NA)
sum(num_vec)
## [1] NA
mean(num_vec)
## [1] NA
Many descriptive functions accept the argument na.rm = TRUE, which explicitly tells R to ignore missing values.
sum(num_vec, na.rm = TRUE)
## [1] 65
mean(num_vec, na.rm = TRUE)
## [1] 16.25
is.na() is a function that returns a logical vector, which allows us to identify missing values.
is.na(num_vec)
## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE
sum(num_vec, na.rm = TRUE)
## [1] 65
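The logical vector returned by is.na() can be passed on to other base R functions. A brief sketch, with the object names na_positions and n_missing made up for illustration:

```r
num_vec <- c(5, NA, 15, 20, 25, NA)

# Is there any missing value at all?
anyNA(num_vec)
## [1] TRUE

# At which positions are the missing values?
na_positions <- which(is.na(num_vec))
na_positions
## [1] 2 6

# How many values are missing?
n_missing <- sum(is.na(num_vec))
n_missing
## [1] 2
```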
Vectors can include either character values or numeric values, not both!
x <- rep(c(5, "a"), times = 2)
x
## [1] "5" "a" "5" "a"
x / 2
## Error in x/2: non-numeric argument to binary operator
With x, R automatically treats the vector as a character vector, because it includes some characters.
If we force R to treat x as a numeric vector, it will replace all non-numeric elements with NA.
as.numeric(x)
## Warning: NAs introduced by coercion
## [1]  5 NA  5 NA
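To see which type R has settled on, you can ask with class(). The following sketch is only an illustration of the coercion described above; the object name mixed is made up.

```r
# class() reports how R stores a vector
class(c(1, 2, 3))
## [1] "numeric"
class(c(1, "a"))
## [1] "character"

# Logical values mixed with numbers become numbers (TRUE = 1, FALSE = 0)
mixed <- c(TRUE, FALSE, 10)
mixed
## [1]  1  0 10
class(mixed)
## [1] "numeric"

# Number-like strings convert without a warning
as.numeric(c("1", "2.5"))
## [1] 1.0 2.5
```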
[]
Often we don't want to retrieve the whole vector, but only a specific element.
We can do this using [].
a[index], where a is the vector, and index is a vector of index values.
colors <- colors()
# What is the first color?
colors[1]
## [1] "white"
# What are the first 3 colors?
colors[1:3]
## [1] "white"        "aliceblue"    "antiquewhite"
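The index does not have to be a consecutive range. As a small sketch (the name last_year is made up for illustration), you can pick arbitrary positions, drop elements with a negative index, or compute the index from the length of the vector.

```r
years <- c(2010, 2005, 2012, 2013, 2001)

# Pick the 1st and 4th element
years[c(1, 4)]
## [1] 2010 2013

# A negative index drops elements instead of selecting them
years[-1]
## [1] 2005 2012 2013 2001

# Select the last element, whatever the length of the vector
last_year <- years[length(years)]
last_year
## [1] 2001
```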
When indexing a vector with a logical index, R will only return values for which the index is TRUE.
years <- c(2010, 2005, 2012, 2013, 2001)
# select all years above 2010
years[years > 2010]
## [1] 2012 2013
# select all years larger than 2002 and smaller than 2013
years[years > 2002 & years < 2013]
## [1] 2010 2005 2012
R actually interprets TRUE values as 1 and FALSE values as 0.
This allows us to quickly answer questions like:
# How many observations in years are greater than 2005?
sum(years > 2005)
## [1] 3
# What's the proportion of observations in years greater than 2005?
mean(years > 2005)
## [1] 0.6
This is a really useful feature for quick calculations!
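The same trick works for any logical condition. Here is a short sketch using the age vector from the logical-vector example; the names n_adult and share_adult are made up for illustration.

```r
age <- c(14, 19, 23, 13, 16, 19, 18)

# Number of people aged 18 or older: each TRUE counts as 1
n_adult <- sum(age >= 18)
n_adult
## [1] 4

# Share of people aged 18 or older: the mean of 0s and 1s
share_adult <- mean(age >= 18)
round(share_adult, 2)
## [1] 0.57
```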
# Generate 1000 random draws from a standard normal distribution
x_norm <- rnorm(1000, mean = 0, sd = 1)
Get the 10th and 20th observation.
Save all observations below 0 in a new variable called x_norm_neg.
How many observations are below 0?
Get the proportion of values that are below -2 or above 2.
x_norm[c(10, 20)]
## [1] 0.6157467 0.6601145
x_norm_neg <- x_norm[x_norm < 0]
length(x_norm_neg)
## [1] 508
mean(x_norm > 2 | x_norm < -2)
## [1] 0.038
In the example below, you know that the 5th value should have been 23, but was wrongly coded as NA.
age <- c(17, 21, 22, 25, NA)
age[5] <- 23
age
## [1] 17 21 22 25 23
# assigning a character value turns the whole vector into a character vector
age[age >= 18] <- "18+"
age
## [1] "17"  "18+" "18+" "18+" "18+"
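Overwriting values in place changes the original vector. One alternative, sketched here as an option rather than the course's prescribed approach, is to leave age untouched and store the labels in a new vector with the base R function ifelse(); the name age_group and the label "under 18" are made up for illustration.

```r
age <- c(17, 21, 22, 25, 23)

# Keep the numeric vector intact and store the labels separately
age_group <- ifelse(age >= 18, "18+", "under 18")
age_group
## [1] "under 18" "18+"      "18+"      "18+"      "18+"

# age is still numeric, so calculations keep working
mean(age)
## [1] 21.6
```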
x <- c(5, 15, NA, 25, 30)
x[1]
x[c(3, 4)]
x[!is.na(x)]
What is a quick way to calculate the share of missing values in x?
Replace the missing values in x using the is.na() function.
Most of the work we do as sociologists will involve working with rectangular data, or dataframes.
While vectors are one-dimensional, dataframes have two dimensions: rows and columns.
columns: variables
rows: observations
name | height | mass | hair_color | skin_color |
---|---|---|---|---|
Luke Skywalker | 172 | 77 | blond | fair |
C-3PO | 167 | 75 | NA | gold |
R2-D2 | 96 | 32 | NA | white, blue |
Darth Vader | 202 | 136 | none | white |
Leia Organa | 150 | 49 | brown | light |
Owen Lars | 178 | 120 | brown, grey | light |
You can turn multiple vectors into a dataframe using the data.frame command.
# vectors
country_name <- c("Nigeria", "Gambia", "Finland", "Brazil")
country_year <- 2013
country_pop_1m <- c(173.6, 1.8, 5.4, 200.4)
# combine into dataframe
pop_df <- data.frame("country" = country_name, "year" = country_year, "pop_1m" = country_pop_1m, stringsAsFactors = FALSE)
pop_df
##   country year pop_1m
## 1 Nigeria 2013  173.6
## 2  Gambia 2013    1.8
## 3 Finland 2013    5.4
## 4  Brazil 2013  200.4
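Once the dataframe exists, its columns behave like the vectors covered earlier. The sketch below rebuilds pop_df so that it runs on its own; the object name large and the new column pop are made up for illustration.

```r
pop_df <- data.frame(
  country = c("Nigeria", "Gambia", "Finland", "Brazil"),
  year = 2013,
  pop_1m = c(173.6, 1.8, 5.4, 200.4),
  stringsAsFactors = FALSE
)

# Extract a single column as a vector with $
pop_df$country
## [1] "Nigeria" "Gambia"  "Finland" "Brazil"

# Select rows with a logical index and columns by name: df[rows, columns]
large <- pop_df[pop_df$pop_1m > 100, c("country", "pop_1m")]
large
##   country pop_1m
## 1 Nigeria  173.6
## 4  Brazil  200.4

# Add a new column computed from an existing one
pop_df$pop <- pop_df$pop_1m * 1e6
```

The row-and-column form df[rows, columns] is the same pattern used in the coronavirus example at the start of the course.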
Rather than constructing your own dataframe in R, you will often work with data that already comes in a specific format, such as a CSV, Excel, .txt, Stata (.dta), or other file type.
R can load all these different file types.
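# csv file ("data.csv" is a placeholder for the path to your own file)
data <- read.csv("data.csv")
# To load Stata files, you can use the foreign package
# Note that you might need to install the package
# ("data.dta" is a placeholder for the path to your own file)
data <- foreign::read.dta("data.dta")
# load an R data file
load("data.Rda")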
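If you do not have a data file at hand, you can try the loading step with a file you create yourself. The following sketch writes a small dataframe to a temporary CSV file and reads it back; the object names csv_path and pop_loaded are made up, and tempfile() is used only so that the example does not depend on any file on your computer.

```r
# Write a small dataframe to a temporary CSV file ...
pop_df <- data.frame(country = c("Gambia", "Finland"), pop_1m = c(1.8, 5.4))
csv_path <- tempfile(fileext = ".csv")
write.csv(pop_df, csv_path, row.names = FALSE)

# ... and read it back in
pop_loaded <- read.csv(csv_path, stringsAsFactors = FALSE)
pop_loaded
##   country pop_1m
## 1  Gambia    1.8
## 2 Finland    5.4
```

Setting row.names = FALSE keeps write.csv() from adding an extra column of row numbers to the file.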
Here are some of the most important ways to inspect a dataframe.
Function | Description |
---|---|
head(x), tail(x) | Print the first few rows (or last few rows) |
View(x) | Open the entire object in a new window |
nrow(x), ncol(x), dim(x) | Count the number of rows and columns |
names(x) | Show the column (variable) names |
summary(x) | Show the summary statistics of a dataframe |
# get the dimensions (rows and columns) of your data
dim(pop_df)
## [1] 4 3
# view the first rows of data (head() shows up to 6 rows by default)
head(pop_df)
##   country year pop_1m
## 1 Nigeria 2013  173.6
## 2  Gambia 2013    1.8
## 3 Finland 2013    5.4
## 4  Brazil 2013  200.4
# view the last 2 rows of data
tail(pop_df, 2)
##   country year pop_1m
## 3 Finland 2013    5.4
## 4  Brazil 2013  200.4
# inspect the variable names in your dataframe
names(pop_df)
## [1] "country" "year"    "pop_1m"
# generate summary statistics for each variable in your dataframe
summary(pop_df)
##    country               year          pop_1m     
##  Length:4           Min.   :2013   Min.   :  1.8  
##  Class :character   1st Qu.:2013   1st Qu.:  4.5  
##  Mode  :character   Median :2013   Median : 89.5  
##                     Mean   :2013   Mean   : 95.3  
##                     3rd Qu.:2013   3rd Qu.:180.3  
##                     Max.   :2013   Max.   :200.4  
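Two further base R tools can help when inspecting a dataframe: str() for a compact overview and class() applied to every column. This is a sketch of one possible workflow, not part of the original slides; pop_df is rebuilt so the example runs on its own, and col_types is a made-up name.

```r
pop_df <- data.frame(
  country = c("Nigeria", "Gambia", "Finland", "Brazil"),
  year = 2013,
  pop_1m = c(173.6, 1.8, 5.4, 200.4),
  stringsAsFactors = FALSE
)

# Compact overview of the structure: column types and first values
str(pop_df)

# Storage type of each column
col_types <- sapply(pop_df, class)
col_types
##     country        year      pop_1m 
## "character"   "numeric"   "numeric" 

# Summary statistics for a single column
summary(pop_df$pop_1m)
```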