R
Thus far, we've already learned what R and RStudio are. There's one essential prerequisite:
We need data!
R's internal data types
?data()
How your data are stored (data types)
Where your data are stored (data formats)
https://www.stat.berkeley.edu/~nolan/stat133/Fall05/lectures/DataTypes4.pdf
Integers are values without a decimal part. To make R treat a number explicitly as an integer, place an L directly after the value.
1L
## [1] 1
By contrast, doubles are values with a decimal value.
1.1
## [1] 1.1
We can check data types by using the typeof() function.
typeof(1L)
## [1] "integer"
typeof(1.1)
## [1] "double"
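A common surprise follows from the two definitions above: a number typed without the L suffix is stored as a double, even if it has no decimal part. The following is a minimal sketch; the variable names are made up for illustration.

```r
# Hypothetical variable names, for illustration only
whole_number <- 1    # no L suffix: stored as a double
explicit_int <- 1L   # L suffix: stored as an integer

typeof(whole_number)
typeof(explicit_int)

# Mixing both types in arithmetic yields a double
mixed_sum <- explicit_int + 1.5
typeof(mixed_sum)
```

If you need to test for a specific type rather than print it, is.integer() and is.double() return TRUE or FALSE.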
At first glance, a character is a single letter between a and z, and a string is a series of characters. However, digits and other symbols can also be part of a character string, which can in turn be part of a text. In R, character strings are wrapped in quotation marks.
"Hi. I am a character string, the 1st of its kind!"
## [1] "Hi. I am a character string, the 1st of its kind!"
Note: There are no values associated with the content of character strings unless we change that, e.g., with factors.
Factors are data types that assume that their values are not continuous, e.g., as in ordinal or nominal data.
factor(1.1)
## [1] 1.1
## Levels: 1.1
factor("Hi. I am a character string, the 1st of its kind!")
## [1] Hi. I am a character string, the 1st of its kind!
## Levels: Hi. I am a character string, the 1st of its kind!
Factors take numeric data or character strings as input as they simply convert them into so-called levels.
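To make the idea of levels more concrete, here is a small sketch with made-up survey answers. It contrasts a nominal factor with an ordinal one; the object names and values are assumptions for illustration only.

```r
# Made-up survey answers, for illustration only
answers <- c("low", "high", "medium", "low", "high")

# Nominal factor: levels are sorted alphabetically by default
answers_nominal <- factor(answers)
levels(answers_nominal)

# Ordinal factor: we state the order of the levels ourselves
answers_ordinal <- factor(
  answers,
  levels = c("low", "medium", "high"),
  ordered = TRUE
)

# Internally, each value is stored as the integer position of its level
level_positions <- as.integer(answers_ordinal)
level_positions

# Comparisons like "smaller than" only make sense for ordered factors
below_high <- answers_ordinal < "high"
below_high
```

Stating the levels explicitly is usually the safer choice for ordinal data, because the alphabetical default rarely matches the intended order.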
Logical values are basically either TRUE or FALSE values. These values are produced by making logical requests on your data.
2 > 1
## [1] TRUE
2 < 1
## [1] FALSE
Logical values are at the heart of creating loops. For this purpose, however, we need more logical operators to request TRUE or FALSE values.
There are quite a few logical operators in R:
Operator     Description
<            less than
<=           less than or equal to
>            greater than
>=           greater than or equal to
==           exactly equal to
!=           not equal to
!x           Not x
x | y        x OR y
x & y        x AND y
isTRUE(x)    test if x is TRUE
isFALSE(x)   test if x is FALSE
https://www.statmethods.net/management/operators.html
Moreover, there are some more is.PROPERTY_ASKED_FOR() functions, such as is.numeric(), which also return TRUE or FALSE values.
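The operators and is.*() functions listed above work element by element on vectors. The sketch below uses a made-up vector of ages to show a few of them in combination.

```r
ages <- c(15, 22, 37, 64)  # made-up values, for illustration only

adults <- ages >= 18                     # element-wise comparison
young_adults <- ages >= 18 & ages < 40   # AND
outside_range <- ages < 18 | ages > 60   # OR
minors <- !adults                        # NOT

adults
young_adults
outside_range
minors

# is.*() functions ask about a property of the whole object
is.numeric(ages)
is.character(ages)

# isTRUE() expects a single value, so we collapse the vector first
all_above_ten <- isTRUE(all(ages > 10))
all_above_ten
```

Collapsing with all() or any() before isTRUE() matters because isTRUE() returns FALSE for any input that is not a single TRUE.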
R's data formats
R's different data types can be put into 'containers'.
https://devopedia.org/r-data-structures
Vectors are built by enclosing your content with c() ("c" for "concatenate")
numeric_vector <- c(1, 2, 3, 4)
character_vector <- c("a", "b", "c", "d")
numeric_vector
## [1] 1 2 3 4
character_vector
## [1] "a" "b" "c" "d"
Vectors are really like vectors in mathematics. Initially, it doesn't matter if you look at them as column or row vectors.
Using the function cbind() or rbind() you can either combine vectors column-wise or row-wise. Thus, they become matrices.
cbind(numeric_vector, character_vector)
##      numeric_vector character_vector
## [1,] "1"            "a"
## [2,] "2"            "b"
## [3,] "3"            "c"
## [4,] "4"            "d"
rbind(numeric_vector, character_vector)
##                  [,1] [,2] [,3] [,4]
## numeric_vector   "1"  "2"  "3"  "4"
## character_vector "a"  "b"  "c"  "d"
Note: The numeric values are coerced into strings here.
Matrices are the basic rectangular data format in R.
fancy_matrix <- matrix(1:16, nrow = 4)
fancy_matrix
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
You cannot store multiple data types, such as strings and numeric values, in the same matrix. If you try, your data will get coerced to a common type, as seen in the previous slide. This already happens within vectors:
c(1, 2, "evil string")
## [1] "1" "2" "evil string"
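Coercion follows a fixed order: logical values become integers, integers become doubles, and everything becomes a character string once a string is involved. The following sketch illustrates this order and what happens when we try to convert back; the object names are made up.

```r
# Each call mixes two types; typeof() shows the common type R settles on
logical_and_integer <- typeof(c(TRUE, 2L))
integer_and_double <- typeof(c(2L, 3.5))
double_and_character <- typeof(c(3.5, "text"))

logical_and_integer
integer_and_double
double_and_character

# Converting back only works where the text is a valid number
mixed <- c(1, 2, "evil string")
converted <- suppressWarnings(as.numeric(mixed))
converted  # the element that is not a number becomes NA
```

Without suppressWarnings(), R would warn that NAs were introduced by coercion, which is a useful signal when it happens unexpectedly.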
library(randomNames) # a name generator package
fancy_data <- data.frame(
  who = randomNames(n = 10, which.names = "first"),
  age = sample(14:49, 10, replace = TRUE), # you see what we are doing here?
  salary_2018 = sample(15:100, 10, replace = TRUE),
  salary_2019 = sample(15:100, 10, replace = TRUE)
)
fancy_data
##         who age salary_2018 salary_2019
## 1    Joseph  27          30          93
## 2      Asha  37          40          99
## 3     Emily  23          86          18
## 4  Michaela  23          77          68
## 5    Jordan  38          52          66
## 6   Burhaan  40          92          48
## 7  Lasandra  36          45          41
## 8     Caleb  31          28          15
## 9  Angelica  39          91          80
## 10   Alfred  38          97          60
Tibbles are basically just R data.frames but nicer.
dim() and other functions
You can check the tibble vignette for technical details.
library(tibble)
as_tibble(fancy_data)
## # A tibble: 10 × 4
##    who        age salary_2018 salary_2019
##    <chr>    <int>       <int>       <int>
##  1 Joseph      27          30          93
##  2 Asha        37          40          99
##  3 Emily       23          86          18
##  4 Michaela    23          77          68
##  5 Jordan      38          52          66
##  6 Burhaan     40          92          48
##  7 Lasandra    36          45          41
##  8 Caleb       31          28          15
##  9 Angelica    39          91          80
## 10 Alfred      38          97          60
Lists are perfect for storing numerous and potentially diverse pieces of information in one place.
fancy_list <- list(
  numeric_vector,
  character_vector,
  fancy_matrix,
  fancy_data
)
fancy_list
## [[1]]
## [1] 1 2 3 4
## 
## [[2]]
## [1] "a" "b" "c" "d"
## 
## [[3]]
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
## 
## [[4]]
##         who age salary_2018 salary_2019
## 1    Joseph  27          30          93
## 2      Asha  37          40          99
## 3     Emily  23          86          18
## 4  Michaela  23          77          68
## 5    Jordan  38          52          66
## 6   Burhaan  40          92          48
## 7  Lasandra  36          45          41
## 8     Caleb  31          28          15
## 9  Angelica  39          91          80
## 10   Alfred  38          97          60
fancy_nested_list <- list(
  fancy_vectors = list(numeric_vector, character_vector),
  data_stuff = list(fancy_matrix, fancy_data)
)
fancy_nested_list
## $fancy_vectors
## $fancy_vectors[[1]]
## [1] 1 2 3 4
## 
## $fancy_vectors[[2]]
## [1] "a" "b" "c" "d"
## 
## 
## $data_stuff
## $data_stuff[[1]]
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
## 
## $data_stuff[[2]]
##         who age salary_2018 salary_2019
## 1    Joseph  27          30          93
## 2      Asha  37          40          99
## 3     Emily  23          86          18
## 4  Michaela  23          77          68
## 5    Jordan  38          52          66
## 6   Burhaan  40          92          48
## 7  Lasandra  36          45          41
## 8     Caleb  31          28          15
## 9  Angelica  39          91          80
## 10   Alfred  38          97          60
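Once information sits in a (nested) list, we also need a way to get it back out. The sketch below uses a small stand-alone list with hypothetical names and values to show the difference between single brackets, double brackets, and the $ sign.

```r
# A small stand-alone list (hypothetical names and values)
course_info <- list(
  title = "Data import",
  sessions = c(1, 2, 3),
  details = list(room = "A1", online = TRUE)
)

sessions_as_list <- course_info["sessions"]      # single brackets: a list of length 1
sessions_as_vector <- course_info[["sessions"]]  # double brackets: the vector itself

class(sessions_as_list)
class(sessions_as_vector)

course_info$title            # $ works for named elements
course_info$details$online   # nested access by name
course_info[[3]][["room"]]   # the same idea by position and name
```

A useful rule of thumb: single brackets return a smaller container of the same kind, while double brackets and $ return the content itself.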
Generally, the logic of [index_number] is to access only a subset of the information in an object, no matter whether we have vectors or data frames.
Say we want to extract the 2nd element of our character_vector object. We could do that like this:
character_vector[2]
## [1] "b"
Matrices have more than one dimension, so you often want information from a specific row and column.
a_wonderful_matrix[number_of_row, number_of_column]
Note: You can do the same indexing with data.frames.
Identifying rows, columns, or elements using subscripts is similar to matrix notation:
fancy_matrix[, 4] # 4th column of matrix
fancy_matrix[3,] # 3rd row of matrix
fancy_matrix[2:4, 1:3] # rows 2,3,4 of columns 1,2,3
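Two details of bracket indexing are easy to miss: selecting a single column returns a plain vector unless we ask otherwise, and negative indices drop rows or columns. The following sketch shows both on a matrix built the same way as fancy_matrix; the object names are made up.

```r
demo_matrix <- matrix(1:16, nrow = 4)  # filled column by column

single_value <- demo_matrix[3, 4]                    # row 3, column 4
fourth_col <- demo_matrix[, 4]                       # a plain vector
fourth_col_matrix <- demo_matrix[, 4, drop = FALSE]  # stays a 4 x 1 matrix
without_first_row <- demo_matrix[-1, ]               # negative index drops row 1

single_value
dim(fourth_col)          # a plain vector has no dimensions
dim(fourth_col_matrix)
dim(without_first_row)
```

drop = FALSE is mostly relevant inside functions and loops, where code that expects a matrix would otherwise break when only one column is selected.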
A nice feature of data.frames or tibbles is that their columns are named, just like variables in an ordinary data set.
fancy_data$who
##  [1] "Joseph"   "Asha"     "Emily"    "Michaela" "Jordan"   "Burhaan" 
##  [7] "Lasandra" "Caleb"    "Angelica" "Alfred"
Just place a $-sign between the data object and the variable name.
[] in data frames
Sometimes we also have to rely on character strings as input information, e.g., for iterating over data. We can also use [] to access variables by name.
Not only this way:
fancy_data[1]
##         who
## 1    Joseph
## 2      Asha
## 3     Emily
## 4  Michaela
## 5    Jordan
## 6   Burhaan
## 7  Lasandra
## 8     Caleb
## 9  Angelica
## 10   Alfred
But also this way:
fancy_data["who"]
##         who
## 1    Joseph
## 2      Asha
## 3     Emily
## 4  Michaela
## 5    Jordan
## 6   Burhaan
## 7  Lasandra
## 8     Caleb
## 9  Angelica
## 10   Alfred
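Note that single brackets return a data frame with one column, not the column itself. The sketch below uses a small made-up data frame to contrast [], [[]], and $, and shows why access by character string is handy for iterating over columns.

```r
# Made-up data, for illustration only
people <- data.frame(
  who = c("Ada", "Grace", "Linus"),
  age = c(36, 45, 28),
  salary = c(50, 72, 41)
)

class(people["age"])     # single brackets keep the data.frame
class(people[["age"]])   # double brackets return the column vector
identical(people[["age"]], people$age)

# Character strings make it easy to iterate over columns
numeric_columns <- c("age", "salary")
column_means <- sapply(numeric_columns, function(column_name) {
  mean(people[[column_name]])
})
column_means
```

The $ sign cannot be used with a column name stored in a variable, which is why [[]] is the usual choice inside loops and functions.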
The most high-level information you can get is about the object type and its dimensions.
# object type
class(fancy_data)
## [1] "data.frame"
# number of rows and columns
dim(fancy_data)
## [1] 10 4
# number of rows
nrow(fancy_data)
## [1] 10
# number of columns
ncol(fancy_data)
## [1] 4
You can also print the first 6 lines of the data frame with head(). You can easily change the number of lines by providing the number as the second argument to the head() function.
head(fancy_data, 3)
##        who age salary_2018 salary_2019
## 1   Joseph  27          30          93
## 2     Asha  37          40          99
## 3    Emily  23          86          18
If we want some more (detailed) information about the data set or object, we can use the base R function str().
str(fancy_data)
## 'data.frame': 10 obs. of 4 variables:
##  $ who        : chr "Joseph" "Asha" "Emily" "Michaela" ...
##  $ age        : int 27 37 23 23 38 40 36 31 39 38
##  $ salary_2018: int 30 40 86 77 52 92 45 28 91 97
##  $ salary_2019: int 93 99 18 68 66 48 41 15 80 60
If you want to have a look at your full data set, you can use the View() function. In RStudio, this will open a new tab in the source pane through which you can explore the data set (including a search function). You can also click on the small spreadsheet symbol on the right side of the object in the environment tab to open this view.
View(fancy_data)
We can print all names of an object using the names() function...
names(fancy_data)
## [1] "who" "age" "salary_2018" "salary_2019"
...and we can also change names with it.
names(fancy_data) <- c("name", "age", "salary_2018", "salary_2019")
names(fancy_data)
## [1] "name" "age" "salary_2018" "salary_2019"
Data: Stack Overflow Annual Developer Survey 2024.
# Option 1: tidytuesdayR package
## install.packages("tidytuesdayR")
tuesdata <- tidytuesdayR::tt_load('2024-09-03')
qname_levels_single_response_crosswalk <- tuesdata$qname_levels_single_response_crosswalk
stackoverflow_survey_questions <- tuesdata$stackoverflow_survey_questions
stackoverflow_survey_single_response <- tuesdata$stackoverflow_survey_single_response
# Option 2: Read directly from GitHub
qname_levels_single_response_crosswalk <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-09-03/qname_levels_single_response_crosswalk.csv')
stackoverflow_survey_questions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-09-03/stackoverflow_survey_questions.csv')
stackoverflow_survey_single_response <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-09-03/stackoverflow_survey_single_response.csv')
We will also use data from Gapminder. During the course and the exercises, we work with data we have downloaded from their website. There also is an R package that bundles some of the Gapminder data: install.packages("gapminder").
This R package provides "[a]n excerpt of the data available at Gapminder.org. For each of 142 countries, the package provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007."
To code along and be able to do the exercises, you should store the data files for the tuesdata in a folder called ./data in the same folder as the other materials for this course.
R is data-agnostic
What you will learn
R
tidyverse
instead of using base R
You can use the RStudio GUI for importing data via Environment - Import data set - Choose file type.
Browse Button in RStudio
Code preview in RStudio
Basic file formats, such as CSV (comma-separated value file), can directly be imported into R
Other file formats, particularly the proprietary ones, require the use of additional packages
In the following slides, we'll jump right into importing data. We use a lot of different packages for this purpose, and you don't have to remember everything. It's just for making a point of how agnostic R actually is regarding the file type. Later on, we will dive more into the specifics of importing.
base R
titanic <- read.csv("./data/titanic.csv")
titanic
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0           330877  8.4583              Q
readr example: CSV files
library(readr)
titanic <- read_csv("./data/titanic.csv")
titanic
## # A tibble: 891 × 12
##    PassengerId Survived Pclass Name       Sex     Age SibSp Parch Ticket
##          <dbl>    <dbl>  <dbl> <chr>      <chr> <dbl> <dbl> <dbl> <chr> 
##  1           1        0      3 Braund, M… male     22     1     0 A/5 2…
##  2           2        1      1 Cumings, … fema…    38     1     0 PC 17…
##  3           3        1      3 Heikkinen… fema…    26     0     0 STON/…
##  4           4        1      1 Futrelle,… fema…    35     1     0 113803
##  5           5        0      3 Allen, Mr… male     35     0     0 373450
##  6           6        0      3 Moran, Mr… male     NA     0     0 330877
##  7           7        0      1 McCarthy,… male     54     0     0 17463 
##  8           8        0      3 Palsson, … male      2     3     1 349909
##  9           9        1      3 Johnson, … fema…    27     0     2 347742
## 10          10        1      2 Nasser, M… fema…    14     1     0 237736
## # ℹ 881 more rows
## # ℹ 3 more variables: Fare <dbl>, Cabin <chr>, Embarked <chr>
Note the column specifications: readr 'guesses' them based on the first 1000 observations (we will come back to this later).
readxl
library(readxl)
unicorns <- read_xlsx("./data/observations.xlsx")
No output ☹️
unicorns
## # A tibble: 42 × 3
##    countryname  year   pop
##    <chr>       <dbl> <dbl>
##  1 Austria      1670    85
##  2 Austria      1671    83
##  3 Austria      1674    75
##  4 Austria      1675    82
##  5 Austria      1676    79
##  6 Austria      1677    70
##  7 Austria      1678    81
##  8 Austria      1680    80
##  9 France       1673    70
## 10 France       1674    79
## # ℹ 32 more rows
These were just some very first examples of applying functions for data import from the different packages. There are many more...
readr
read_csv()
read_tsv()
read_delim()
read_fwf()
read_table()
read_log()
haven
read_sas()
read_spss()
read_stata()
tibbles
<chr>   col_character()
<int>   col_integer()
<dbl>   col_double()
<fct>   col_factor()
<lgl>   col_logical()
As mentioned before, read_csv 'guesses' the variable types by scanning the first 1000 observations. NB: This can go wrong!
Luckily, we can change the variable type...
read_csv
titanic <- read_csv(
  "./data/titanic.csv",
  col_types = cols(
    PassengerId = col_double(),
    Survived = col_double(),
    Pclass = col_double(),
    Name = col_character(),
    Sex = col_character(),
    Age = col_double(),
    SibSp = col_double(),
    Parch = col_double(),
    Ticket = col_character(),
    Fare = col_double(),
    Cabin = col_character(),
    Embarked = col_character()
  )
)
titanic
## # A tibble: 891 × 12
##    PassengerId Survived Pclass Name       Sex     Age SibSp Parch Ticket
##          <dbl>    <dbl>  <dbl> <chr>      <chr> <dbl> <dbl> <dbl> <chr> 
##  1           1        0      3 Braund, M… male     22     1     0 A/5 2…
##  2           2        1      1 Cumings, … fema…    38     1     0 PC 17…
##  3           3        1      3 Heikkinen… fema…    26     0     0 STON/…
##  4           4        1      1 Futrelle,… fema…    35     1     0 113803
##  5           5        0      3 Allen, Mr… male     35     0     0 373450
##  6           6        0      3 Moran, Mr… male     NA     0     0 330877
##  7           7        0      1 McCarthy,… male     54     0     0 17463 
##  8           8        0      3 Palsson, … male      2     3     1 349909
##  9           9        1      3 Johnson, … fema…    27     0     2 347742
## 10          10        1      2 Nasser, M… fema…    14     1     0 237736
## # ℹ 881 more rows
## # ℹ 3 more variables: Fare <dbl>, Cabin <chr>, Embarked <chr>
read_csv
titanic <- read_csv(
  "./data/titanic.csv",
  col_types = cols(
    PassengerId = col_double(),
    Survived = col_double(),
    Pclass = col_double(),
    Name = col_character(),
    Sex = col_factor(), # This one changed!
    Age = col_double(),
    SibSp = col_double(),
    Parch = col_double(),
    Ticket = col_character(),
    Fare = col_double(),
    Cabin = col_character(),
    Embarked = col_character()
  )
)
titanic
## # A tibble: 891 × 12
##    PassengerId Survived Pclass Name       Sex     Age SibSp Parch Ticket
##          <dbl>    <dbl>  <dbl> <chr>      <fct> <dbl> <dbl> <dbl> <chr> 
##  1           1        0      3 Braund, M… male     22     1     0 A/5 2…
##  2           2        1      1 Cumings, … fema…    38     1     0 PC 17…
##  3           3        1      3 Heikkinen… fema…    26     0     0 STON/…
##  4           4        1      1 Futrelle,… fema…    35     1     0 113803
##  5           5        0      3 Allen, Mr… male     35     0     0 373450
##  6           6        0      3 Moran, Mr… male     NA     0     0 330877
##  7           7        0      1 McCarthy,… male     54     0     0 17463 
##  8           8        0      3 Palsson, … male      2     3     1 349909
##  9           9        1      3 Johnson, … fema…    27     0     2 347742
## 10          10        1      2 Nasser, M… fema…    14     1     0 237736
## # ℹ 881 more rows
## # ℹ 3 more variables: Fare <dbl>, Cabin <chr>, Embarked <chr>
titanic <- type_convert(
  titanic,
  col_types = cols(
    PassengerId = col_double(),
    Survived = col_double(),
    Pclass = col_double(),
    Name = col_character(),
    Sex = col_factor(),
    Age = col_double(),
    SibSp = col_double(),
    Parch = col_double(),
    Ticket = col_character(),
    Fare = col_double(),
    Cabin = col_character(),
    Embarked = col_character()
  )
)
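If you want to try column specifications without the Titanic file at hand, the following sketch writes a tiny made-up CSV file to a temporary location and reads it twice: once with guessed types and once with explicit ones. The file content and object names are assumptions for illustration; only readr functions shown earlier in this session are used.

```r
library(readr)

# Write a tiny made-up CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,group,score", "1,a,2.5", "2,b,3", "3,a,4.5"), csv_path)

# Let readr guess: numbers become doubles, text becomes character
guessed <- read_csv(csv_path)

# State the types explicitly instead
explicit <- read_csv(
  csv_path,
  col_types = cols(
    id = col_integer(),
    group = col_factor(),
    score = col_double()
  )
)

guessed_classes <- sapply(guessed, class)
explicit_classes <- sapply(explicit, class)
guessed_classes
explicit_classes
levels(explicit$group)
```

Spelling out col_types has a second benefit besides correct types: the import fails loudly or warns when a file does not match your expectations, instead of silently guessing something else.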
Sometimes our data have to leave R, for example, if we...
R
For such purposes, we also need a way to export our data.
All of the packages we have discussed in this session also have designated functions for that.
write_csv(titanic, "titanic_own.csv")
R's native file formats
There are 2 native 'file formats' to choose from. The advantage of using them is that they are compressed files, so that they don't occupy unnecessarily large disk space. These two formats are .Rdata/.rda and .rds.
The key difference between them is that .rds can only hold one object, whereas .Rdata/.rda can also be used for storing several objects in one file.
.Rdata/.rda
Saving
save(mydata, file = "mydata.RData")
Loading
load("mydata.RData")
.rds
Saving
saveRDS(mydata, "mydata.rds")
Loading
mydata <- readRDS("mydata.rds")
Note: A nice property of saveRDS() is that it just saves a representation of the object, which means you can name it whatever you want when loading.
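The difference between the two formats is easiest to see in a round trip. The sketch below saves made-up objects to temporary files (the object and file names are assumptions), removes them from the workspace, and loads them again.

```r
scores <- c(1, 2, 3)          # made-up objects, for illustration only
labels <- c("a", "b", "c")

rds_path <- tempfile(fileext = ".rds")
rda_path <- tempfile(fileext = ".RData")

saveRDS(scores, rds_path)              # one object, stored without its name
save(scores, labels, file = rda_path)  # several objects, stored with their names

rm(scores, labels)

renamed_scores <- readRDS(rds_path)    # we pick the name when loading
restored_names <- load(rda_path)       # objects come back under their old names
restored_names
```

Because load() silently overwrites existing objects with the same names, readRDS() is often the more predictable choice for single data sets.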
If you have not changed the General Global Options in RStudio as suggested in the Getting Started session, you may have noticed that, when closing RStudio, the program asks you by default whether you want to save the workspace image.
You can also do that whenever you want using the save.image() function:
save.image()
Note: As we've said before, though, this is not something we'd recommend as a workflow. Instead, you should (explicitly and separately) save your R scripts and data sets (in appropriate formats).
For data import (and export) in general, there are even more options, such as...
data.table or fst for large data sets
jsonlite for .json files
In general, you should avoid using absolute file paths to keep your code reproducible and future-proof. We already talked about this in the introduction, but this is particularly important for importing and exporting data.
As a reminder: Absolute file paths look like this (on different OS):
# Windows
load("C:/Users/cool_user/data/fancy_data.Rdata")
# Mac
load("/Users/cool_user/data/fancy_data.Rdata")
# GNU/Linux
load("/home/cool_user/data/fancy_data.Rdata")
Instead of using absolute paths, it is recommended to use relative file paths. The general principle here is to start from the current working directory (ideally the folder in which your script is located) and navigate to your target location from there. Say we are in the "C:/Users/cool_user/" location on a Windows machine. To load our data, we would use:
load("./data/fancy_data.Rdata")
If we were in a different folder, e.g., "C:/Users/cool_user/cat_pics/mittens/", we would use:
load("../../data/fancy_data.Rdata")
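Relative paths can also be assembled from their parts with the base R function file.path(). The following sketch rebuilds the two relative paths from the examples above; the object names are made up, and the file itself does not need to exist for the paths to be constructed.

```r
# file.path() joins path components with "/", which works on every OS
data_file <- file.path("data", "fancy_data.Rdata")
data_file

# ".." moves one folder up, mirroring the cat_pics/mittens example
from_subfolder <- file.path("..", "..", "data", "fancy_data.Rdata")
from_subfolder

# Relative paths are resolved against the current working directory
getwd()
file.exists(data_file)  # TRUE only if the file exists relative to getwd()
```

Checking getwd() and file.exists() first is a quick way to find out why a relative path does not resolve as expected.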
R
Thus far, we've already learned what R
and RStudio
are. There's one essential prerequisite:
We need data!
R
's internal data types?data()
How your data are stored (data types)
Where your data are stored (data formats)
https://www.stat.berkeley.edu/~nolan/stat133/Fall05/lectures/DataTypes4.pdf
Integers are values without a decimal value. To be explicit in R
in using them, you have to place an L
behind the actual value.
1L
## [1] 1
By contrast, doubles are values with a decimal value.
1.1
## [1] 1.1
We can check data types by using the typeof()
function.
typeof(1L)
## [1] "integer"
typeof(1.1)
## [1] "double"
At first glance, a character is a letter somewhere between a-z. String in this context might mean that we have a series of characters. However, numbers and other symbols can be part of a character string, which can then be, e.g., part of a text. In R
, character strings are wrapped in quotation marks.
"Hi. I am a character string, the 1st of its kind!"
## [1] "Hi. I am a character string, the 1st of its kind!"
Note: There are no values associated with the content of character strings unless we change that, e.g., with factors.
Factors are data types that assume that their values are not continuous, e.g., as in ordinal or nominal data.
factor(1.1)
## [1] 1.1## Levels: 1.1
factor("Hi. I am a character string, the 1st of its kind!")
## [1] Hi. I am a character string, the 1st of its kind!## Levels: Hi. I am a character string, the 1st of its kind!
Factors take numeric data or character strings as input as they simply convert them into so-called levels.
Logical values are basically either TRUE
or FALSE
values. These values are produced by making logical requests on your data.
2 > 1
## [1] TRUE
2 < 1
## [1] FALSE
Logical values are at the heart of creating loops. For this purpose, however, we need more logical operators to request TRUE
or FALSE
values.
There are quite a few logical operators in R
:
<
less than<=
less than or equal to>
greater than>=
greater than or equal to==
exactly equal to!=
not equal to!x
Not xx | y
x OR yx & y
x AND yisTRUE(x)
test if X is TRUE isFALSE(x)
test if X is FALSE https://www.statmethods.net/management/operators.html
Moreover, there are some more is.PROPERTY_ASKED_FOR()
functions, such as is.numeric()
, which also return TRUE
or FALSE
values.
R
's data formatsR
's different data types can be put into 'containers'.
https://devopedia.org/r-data-structures
Vectors are built by enclosing your content with c()
("c" for "concatenate")
numeric_vector <- c(1, 2, 3, 4)character_vector <- c("a", "b", "c", "d")numeric_vector
## [1] 1 2 3 4
character_vector
## [1] "a" "b" "c" "d"
Vectors are really like vectors in mathematics. Initially, it doesn't matter if you look at them as column or row vectors.
Using the function cbind()
or rbind()
you can either combine vectors column-wise or row-wise. Thus, they become matrices.
cbind(numeric_vector, character_vector)
## numeric_vector character_vector## [1,] "1" "a" ## [2,] "2" "b" ## [3,] "3" "c" ## [4,] "4" "d"
rbind(numeric_vector, character_vector)
## [,1] [,2] [,3] [,4]## numeric_vector "1" "2" "3" "4" ## character_vector "a" "b" "c" "d"
Note: The numeric values are coerced into strings here.
Matrices are the basic rectangular data format in R.
fancy_matrix <- matrix(1:16, nrow = 4)fancy_matrix
## [,1] [,2] [,3] [,4]## [1,] 1 5 9 13## [2,] 2 6 10 14## [3,] 3 7 11 15## [4,] 4 8 12 16
You cannot store multiple data types, such as strings and numeric values in the same matrix. Otherwise, your data will get coerced to a common type, as seen in the previous slide. This is something that happens already within vectors:
c(1, 2, "evil string")
## [1] "1" "2" "evil string"
library(randomNames) # a name generator packagefancy_data <- data.frame( who = randomNames(n = 10, which.names = "first"), age = sample(14:49, 10, replace = TRUE), # you see what we are doing here? salary_2018 = sample(15:100, 10, replace = TRUE), salary_2019 = sample(15:100, 10, replace = TRUE) )fancy_data
↪️
## who age salary_2018 salary_2019## 1 Joseph 27 30 93## 2 Asha 37 40 99## 3 Emily 23 86 18## 4 Michaela 23 77 68## 5 Jordan 38 52 66## 6 Burhaan 40 92 48## 7 Lasandra 36 45 41## 8 Caleb 31 28 15## 9 Angelica 39 91 80## 10 Alfred 38 97 60
Tibbles are basically just R data.frames
but nicer.
dim()
and other functionsYou can check the tibble vignette for technical details.
library(tibble)as_tibble(fancy_data)
## # A tibble: 10 × 4## who age salary_2018 salary_2019## <chr> <int> <int> <int>## 1 Joseph 27 30 93## 2 Asha 37 40 99## 3 Emily 23 86 18## 4 Michaela 23 77 68## 5 Jordan 38 52 66## 6 Burhaan 40 92 48## 7 Lasandra 36 45 41## 8 Caleb 31 28 15## 9 Angelica 39 91 80## 10 Alfred 38 97 60
Lists are perfect for storing numerous and potentially diverse pieces of information in one place.
fancy_list <- list( numeric_vector, character_vector, fancy_matrix, fancy_data )fancy_list
↪️
## [[1]]## [1] 1 2 3 4## ## [[2]]## [1] "a" "b" "c" "d"## ## [[3]]## [,1] [,2] [,3] [,4]## [1,] 1 5 9 13## [2,] 2 6 10 14## [3,] 3 7 11 15## [4,] 4 8 12 16## ## [[4]]## who age salary_2018 salary_2019## 1 Joseph 27 30 93## 2 Asha 37 40 99## 3 Emily 23 86 18## 4 Michaela 23 77 68## 5 Jordan 38 52 66## 6 Burhaan 40 92 48## 7 Lasandra 36 45 41## 8 Caleb 31 28 15## 9 Angelica 39 91 80## 10 Alfred 38 97 60
fancy_nested_list <- list( fancy_vectors = list(numeric_vector, character_vector), data_stuff = list(fancy_matrix, fancy_data) )fancy_nested_list
↪️
## $fancy_vectors## $fancy_vectors[[1]]## [1] 1 2 3 4## ## $fancy_vectors[[2]]## [1] "a" "b" "c" "d"## ## ## $data_stuff## $data_stuff[[1]]## [,1] [,2] [,3] [,4]## [1,] 1 5 9 13## [2,] 2 6 10 14## [3,] 3 7 11 15## [4,] 4 8 12 16## ## $data_stuff[[2]]## who age salary_2018 salary_2019## 1 Joseph 27 30 93## 2 Asha 37 40 99## 3 Emily 23 86 18## 4 Michaela 23 77 68## 5 Jordan 38 52 66## 6 Burhaan 40 92 48## 7 Lasandra 36 45 41## 8 Caleb 31 28 15## 9 Angelica 39 91 80## 10 Alfred 38 97 60
Generally, the logic of [index_number]
is to access only a subset of information in an object, no matter if we have vectors or data frames.
Say, we want to extract the 2nd element of our character_vector
object, we could do that like this:
character_vector[2]
## [1] "b"
Matrices can have more dimensions, often you want information from a specific row and column.
a_wonderful_matrix[number_of_row, number_of_column]
Note: You can do the same indexing with data.frame
s.
Identifying rows, columns, or elements using subscripts is similar to matrix notation:
fancy_matrix[, 4] # 4th column of matrixfancy_matrix[3,] # 3rd row of matrixfancy_matrix[2:4, 1:3] # rows 2,3,4 of columns 1,2,3
A nice feature of data.frames
or tibbles
is that their columns are names, just as variable names in ordinary data.
fancy_data$who
## [1] "Joseph" "Asha" "Emily" "Michaela" "Jordan" "Burhaan" ## [7] "Lasandra" "Caleb" "Angelica" "Alfred"
Just place a $
-sign between the data object and the variable name.
[]
in data framesSometimes we also have to rely on character strings as input information, e.g., for iterating over data. We can also use []
to access variables by name.
Not only this way:
fancy_data[1]
## who## 1 Joseph## 2 Asha## 3 Emily## 4 Michaela## 5 Jordan## 6 Burhaan## 7 Lasandra## 8 Caleb## 9 Angelica## 10 Alfred
But also this way:
fancy_data["who"]
## who## 1 Joseph## 2 Asha## 3 Emily## 4 Michaela## 5 Jordan## 6 Burhaan## 7 Lasandra## 8 Caleb## 9 Angelica## 10 Alfred
The most high-level information you can get is about the object type and its dimensions.
# object typeclass(fancy_data)
## [1] "data.frame"
# number of rows and columnsdim(fancy_data)
## [1] 10 4
# number of rowsnrow(fancy_data)
## [1] 10
# number of columnsncol(fancy_data)
## [1] 4
You can also print the first 6 lines of the data frame with head()
. You can easily change the number of lines by providing the number as the second argument to the head()
function.
head(fancy_data, 3)
## who age salary_2018 salary_2019## 1 Joseph 27 30 93## 2 Asha 37 40 99## 3 Emily 23 86 18
If we want some more (detailed) information about the data set or object, we can use the base R
function str()
.
str(fancy_data)
## 'data.frame': 10 obs. of 4 variables:## $ who : chr "Joseph" "Asha" "Emily" "Michaela" ...## $ age : int 27 37 23 23 38 40 36 31 39 38## $ salary_2018: int 30 40 86 77 52 92 45 28 91 97## $ salary_2019: int 93 99 18 68 66 48 41 15 80 60
If you want to have a look at your full data set, you can use the View()
function. In RStudio, this will open a new tab in the source pane through which you can explore the data set (including a search function). You can also click on the small spreadsheet symbol on the right side of the object in the environment tab to open this view.
View(fancy_data)
We can print all names of an object using the names()
function...
names(fancy_data)
## [1] "who" "age" "salary_2018" "salary_2019"
...and we can also change names with it.
names(fancy_data) <- c("name", "age", "salary_2018", "salary_2019")names(fancy_data)
## [1] "name" "age" "salary_2018" "salary_2019"
Data: Stack Overflow Annual Developer Survey 2024.
# Option 1: tidytuesdayR package ## install.packages("tidytuesdayR")tuesdata <- tidytuesdayR::tt_load('2024-09-03')qname_levels_single_response_crosswalk <- tuesdata$qname_levels_single_response_crosswalkstackoverflow_survey_questions <- tuesdata$stackoverflow_survey_questionsstackoverflow_survey_single_response <- tuesdata$stackoverflow_survey_single_response# Option 2: Read directly from GitHubqname_levels_single_response_crosswalk <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-09-03/qname_levels_single_response_crosswalk.csv')stackoverflow_survey_questions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-09-03/stackoverflow_survey_questions.csv')stackoverflow_survey_single_response <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-09-03/stackoverflow_survey_single_response.csv')
We will also use data from Gapminder. During the course and the exercises, we work with data we have downloaded from their website. There also is an R
package that bundles some of the Gapminder data: install.packages("gapminder")
.
This R
package provides "[a]n excerpt of the data available at Gapminder.org. For each of 142 countries, the package provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007."
To code along and be able to do the exercises, you should store the data files for the tuesdata in a folder called ./data
in the same folder as the other materials for this course.
R
is data-agnosticWhat you will learn
R
tidyverse
instead of using base R
You can use the RStudio GUI for importing data via Environment - Import data set - Choose file type
.
Browse Button in RStudio
Code preview in Rstudio
Basic file formats, such as CSV (comma-separated value file), can directly be imported into R
Other file formats, particularly the proprietary ones, require the use of additional packages
In the following slides, we'll jump right into importing data. We use a lot of different packages for this purpose, and you don't have to remember everything. It's just for making a point of how agnostic R
actually is regarding the file type. Later on, we will dive more into the specifics of importing.
base R
titanic <- read.csv("./data/titanic.csv")titanic
## PassengerId Survived Pclass## 1 1 0 3## 2 2 1 1## 3 3 1 3## 4 4 1 1## 5 5 0 3## 6 6 0 3## Name Sex Age SibSp## 1 Braund, Mr. Owen Harris male 22 1## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1## 3 Heikkinen, Miss. Laina female 26 0## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1## 5 Allen, Mr. William Henry male 35 0## 6 Moran, Mr. James male NA 0## Parch Ticket Fare Cabin Embarked## 1 0 A/5 21171 7.2500 S## 2 0 PC 17599 71.2833 C85 C## 3 0 STON/O2. 3101282 7.9250 S## 4 0 113803 53.1000 C123 S## 5 0 373450 8.0500 S## 6 0 330877 8.4583 Q
readr
example: CSV
fileslibrary(readr)titanic <- read_csv("./data/titanic.csv")
titanic
## # A tibble: 891 × 12## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> ## 1 1 0 3 Braund, M… male 22 1 0 A/5 2…## 2 2 1 1 Cumings, … fema… 38 1 0 PC 17…## 3 3 1 3 Heikkinen… fema… 26 0 0 STON/…## 4 4 1 1 Futrelle,… fema… 35 1 0 113803## 5 5 0 3 Allen, Mr… male 35 0 0 373450## 6 6 0 3 Moran, Mr… male NA 0 0 330877## 7 7 0 1 McCarthy,… male 54 0 0 17463 ## 8 8 0 3 Palsson, … male 2 3 1 349909## 9 9 1 3 Johnson, … fema… 27 0 2 347742## 10 10 1 2 Nasser, M… fema… 14 1 0 237736## # ℹ 881 more rows## # ℹ 3 more variables: Fare <dbl>, Cabin <chr>, Embarked <chr>
Note the column specifications: readr 'guesses' them based on the first 1000 observations (we will come back to this later).
readxl
library(readxl)
unicorns <- read_xlsx("./data/observations.xlsx")
No output ☹️
unicorns
## # A tibble: 42 × 3
##    countryname  year   pop
##    <chr>       <dbl> <dbl>
##  1 Austria      1670    85
##  2 Austria      1671    83
##  3 Austria      1674    75
##  4 Austria      1675    82
##  5 Austria      1676    79
##  6 Austria      1677    70
##  7 Austria      1678    81
##  8 Austria      1680    80
##  9 France       1673    70
## 10 France       1674    79
## # ℹ 32 more rows
These were just some very first examples of applying functions for data import from the different packages. There are many more...
readr
read_csv()
read_tsv()
read_delim()
read_fwf()
read_table()
read_log()
haven
read_sas()
read_spss()
read_stata()
tibbles
<chr>: col_character()
<int>: col_integer()
<dbl>: col_double()
<fct>: col_factor()
<lgl>: col_logical()
As mentioned before, read_csv() 'guesses' the variable types by scanning the first 1000 observations. NB: This can go wrong!
Luckily, we can change the variable type...
read_csv
titanic <-
  read_csv(
    "./data/titanic.csv",
    col_types = cols(
      PassengerId = col_double(),
      Survived = col_double(),
      Pclass = col_double(),
      Name = col_character(),
      Sex = col_character(),
      Age = col_double(),
      SibSp = col_double(),
      Parch = col_double(),
      Ticket = col_character(),
      Fare = col_double(),
      Cabin = col_character(),
      Embarked = col_character()
    )
  )

titanic
## # A tibble: 891 × 12
##    PassengerId Survived Pclass Name       Sex     Age SibSp Parch Ticket
##          <dbl>    <dbl>  <dbl> <chr>      <chr> <dbl> <dbl> <dbl> <chr>
##  1           1        0      3 Braund, M… male     22     1     0 A/5 2…
##  2           2        1      1 Cumings, … fema…    38     1     0 PC 17…
##  3           3        1      3 Heikkinen… fema…    26     0     0 STON/…
##  4           4        1      1 Futrelle,… fema…    35     1     0 113803
##  5           5        0      3 Allen, Mr… male     35     0     0 373450
##  6           6        0      3 Moran, Mr… male     NA     0     0 330877
##  7           7        0      1 McCarthy,… male     54     0     0 17463
##  8           8        0      3 Palsson, … male      2     3     1 349909
##  9           9        1      3 Johnson, … fema…    27     0     2 347742
## 10          10        1      2 Nasser, M… fema…    14     1     0 237736
## # ℹ 881 more rows
## # ℹ 3 more variables: Fare <dbl>, Cabin <chr>, Embarked <chr>
read_csv
titanic <-
  read_csv(
    "./data/titanic.csv",
    col_types = cols(
      PassengerId = col_double(),
      Survived = col_double(),
      Pclass = col_double(),
      Name = col_character(),
      Sex = col_factor(), # This one changed!
      Age = col_double(),
      SibSp = col_double(),
      Parch = col_double(),
      Ticket = col_character(),
      Fare = col_double(),
      Cabin = col_character(),
      Embarked = col_character()
    )
  )

titanic
## # A tibble: 891 × 12
##    PassengerId Survived Pclass Name       Sex     Age SibSp Parch Ticket
##          <dbl>    <dbl>  <dbl> <chr>      <fct> <dbl> <dbl> <dbl> <chr>
##  1           1        0      3 Braund, M… male     22     1     0 A/5 2…
##  2           2        1      1 Cumings, … fema…    38     1     0 PC 17…
##  3           3        1      3 Heikkinen… fema…    26     0     0 STON/…
##  4           4        1      1 Futrelle,… fema…    35     1     0 113803
##  5           5        0      3 Allen, Mr… male     35     0     0 373450
##  6           6        0      3 Moran, Mr… male     NA     0     0 330877
##  7           7        0      1 McCarthy,… male     54     0     0 17463
##  8           8        0      3 Palsson, … male      2     3     1 349909
##  9           9        1      3 Johnson, … fema…    27     0     2 347742
## 10          10        1      2 Nasser, M… fema…    14     1     0 237736
## # ℹ 881 more rows
## # ℹ 3 more variables: Fare <dbl>, Cabin <chr>, Embarked <chr>
titanic <-
  type_convert(
    titanic,
    col_types = cols(
      PassengerId = col_double(),
      Survived = col_double(),
      Pclass = col_double(),
      Name = col_character(),
      Sex = col_factor(),
      Age = col_double(),
      SibSp = col_double(),
      Parch = col_double(),
      Ticket = col_character(),
      Fare = col_double(),
      Cabin = col_character(),
      Embarked = col_character()
    )
  )
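To see how type guessing can go wrong, here is a small sketch that swaps in base R's read.csv() and its colClasses argument instead of readr. The postal-code data and the object names (csv_text, guessed, fixed) are made up for illustration; the general idea of declaring the column type explicitly is the same as with col_types above.

```r
# Made-up example: postal codes are identifiers, not numbers
csv_text <- "plz,city
01067,Dresden
68159,Mannheim"

# Let R guess: the column is read as a number and the leading zero is lost
guessed <- read.csv(text = csv_text)
guessed$plz

# Declare the type explicitly for this one column; the others are still guessed
fixed <- read.csv(text = csv_text, colClasses = c(plz = "character"))
fixed$plz
```

Declaring only the problematic column keeps the call short; listing every column, as in the readr examples above, is more verbose but documents the expected structure of the file.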
Sometimes our data have to leave R, for example, if we...
R
For such purposes, we also need a way to export our data.
Most of the packages we have discussed in this session, such as readr and haven, also have designated functions for that (readxl, by contrast, can only read files).
write_csv(titanic, "titanic_own.csv")
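A quick way to check an export is to read the file back in. The sketch below uses base R's write.csv() rather than readr, with a made-up data frame and temporary files (toy, path_default, path_clean are hypothetical names). It also shows one base R pitfall: by default, write.csv() adds the row names as an extra, unnamed column.

```r
# Made-up data for the round trip
toy <- data.frame(id = 1:3, score = c(1.5, 2, 3.25))

path_default <- tempfile(fileext = ".csv")
path_clean <- tempfile(fileext = ".csv")

# Default behaviour: row names are written as an additional first column
write.csv(toy, path_default)
names_default <- names(read.csv(path_default))
names_default

# Suppressing row names gives a file with exactly the original columns
write.csv(toy, path_clean, row.names = FALSE)
toy_back <- read.csv(path_clean)
names(toy_back)
```

readr's write_csv() does not write row names, so this extra step is only needed when staying in base R.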
R's native file formats
There are two native 'file formats' to choose from. The advantage of using them is that they are compressed files, so that they don't occupy unnecessarily large disk space. These two formats are .Rdata/.rda and .rds.
The key difference between them is that .rds can only hold one object, whereas .Rdata/.rda can also be used for storing several objects in one file.
.Rdata/.rda
Saving
save(mydata, file = "mydata.RData")
Loading
load("mydata.RData")
.rds
Saving
saveRDS(mydata, "mydata.rds")
Loading
mydata <- readRDS("mydata.rds")
Note: A nice property of saveRDS() is that it just saves a representation of the object, which means you can name it whatever you want when loading.
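The difference between the two formats can be sketched with temporary files. The objects (mydata, other) and file paths below are made up for illustration; save(), load(), saveRDS(), and readRDS() are the base R functions shown above.

```r
# Two made-up objects
mydata <- data.frame(x = 1:3)
other <- c("a", "b")

rda_path <- tempfile(fileext = ".RData")
rds_path <- tempfile(fileext = ".rds")

# .RData can hold several objects; .rds holds exactly one
save(mydata, other, file = rda_path)
saveRDS(mydata, rds_path)

# Remove the originals to show what loading brings back
rm(mydata, other)

# load() recreates the objects under their original names
# and (invisibly) returns those names
restored_names <- load(rda_path)
restored_names

# readRDS() returns the object, so we choose the name ourselves
renamed <- readRDS(rds_path)
```

Because load() silently overwrites existing objects with the same names, readRDS() tends to be the safer choice when only a single object is needed.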
If you have not changed the General Global Options in RStudio as suggested in the Getting Started session, you may have noticed that, when closing RStudio, the program asks you by default whether you want to save the workspace image.
You can also do that whenever you want using the save.image() function:
save.image()
Note: As we've said before, though, this is not something we'd recommend as a workflow. Instead, you should (explicitly and separately) save your R scripts and data sets (in appropriate formats).
For data import (and export) in general, there are even more options, such as...
data.table or fst for large data sets
jsonlite for .json files
In general, you should avoid using absolute file paths to keep your code reproducible and future-proof. We already talked about this in the introduction, but this is particularly important for importing and exporting data.
As a reminder: Absolute file paths look like this (on different OS):
# Windows
load("C:/Users/cool_user/data/fancy_data.Rdata")
# Mac
load("/Users/cool_user/data/fancy_data.Rdata")
# GNU/Linux
load("/home/cool_user/data/fancy_data.Rdata")
Instead of using absolute paths, it is recommended to use relative file paths. The general principle is to start from your current working directory (typically the project folder in which your script lives) and navigate to your target location. Say we are in the "C:/Users/cool_user/" location on a Windows machine. To load our data, we would use:
load("./data/fancy_data.Rdata")
If we were in a different folder, e.g., "C:/Users/cool_user/cat_pics/mittens/", we would use:
load("../../data/fancy_data.Rdata")
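The two relative paths above can be tried out in a throwaway folder. The sketch below builds a made-up directory layout inside tempdir() (demo_project, data, cat_pics/mittens are hypothetical names) and uses setwd() purely for demonstration; in real projects, an RStudio project usually sets the working directory for you.

```r
# Build a small, made-up project layout in a temporary folder
project_dir <- file.path(tempdir(), "demo_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
saveRDS(1:3, file.path(project_dir, "data", "fancy_data.rds"))

# Relative paths are resolved against the working directory
old_wd <- setwd(project_dir)
from_root <- readRDS("./data/fancy_data.rds")

# From a nested folder, ".." moves one level up per step
dir.create("cat_pics/mittens", recursive = TRUE, showWarnings = FALSE)
setwd("cat_pics/mittens")
from_nested <- readRDS("../../data/fancy_data.rds")

# Restore the previous working directory
setwd(old_wd)
```

Both calls read the same file; only the starting point of the path differs.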