Introduction to R

Introduction
Downloading/Installing R and RStudio
R Basics
Exercises:

Last updated: 2019-04-29

Checks: 6 0

Knit directory: MSTPsummerstatistics/

This reproducible R Markdown analysis was created with workflowr (version 1.3.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20180927)

The command set.seed(20180927) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Repository version: e0e8156

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
html	e0e8156	Anthony Hung	2019-04-29	Build site.
html	e746cf5	Anthony Hung	2019-04-28	Build site.
Rmd	133df4a	Anthony Hung	2019-04-28	introR
html	133df4a	Anthony Hung	2019-04-28	introR
html	22b3720	Anthony Hung	2019-04-26	Build site.
html	ddb3114	Anthony Hung	2019-04-26	Build site.
html	413d065	Anthony Hung	2019-04-26	Build site.
html	6b98d6c	Anthony Hung	2019-04-26	Build site.
Rmd	9f13e70	Anthony Hung	2019-04-25	finish CLT
html	9f13e70	Anthony Hung	2019-04-25	finish CLT

Introduction

Here, we introduce R, a statistical programming language. Doing statistics within a programming language brings many advantages, including allowing one to organize all analyses into program files that can be rerun to replicate analyses. In addition to using R, we will be using RStudio, an integrated development environment (IDE), which assists us in working with R and outputs of our code as we develop it.

Both R and RStudio are freely available online.

Downloading/Installing R and RStudio

Download the appropriate “base” version of R for your operating system from CRAN: https://cran.r-project.org/
Install the software with default settings.
Download the appropriate RStudio version for your operating system: https://www.rstudio.com/products/rstudio/download/#download

R Basics

Follow along in your R console with the code in each of the code chunks as we explore the different aspects of R! Clicking on the github logo on the top right corner of the webpage will take you to the repository for this website, where you can download the R markdown file for this page to load into RStudio to follow along.

Mathematical operations in R

Many familiar operators work in R, allowing you to work with numbers like you would in a calculator. Operators such as inequalities also work, returning “TRUE” if the proposed logical expression is true and “FALSE” otherwise.

2+4 #addition

[1] 6

2-4 #subtraction

[1] -2

2*4 #multiplication

[1] 8

2/4 #division

[1] 0.5

2^4 #exponentiation

[1] 16

log(2) #the default log base is the natural log

[1] 0.6931472

2 < 4

[1] TRUE

2 > 4

[1] FALSE

2 >= 4 #greater than or equal to

[1] FALSE

2 == 2 #is equal to (notice that there are two equal signs, as a single equal sign denotes assignment)

[1] TRUE

2 != 4 #is not equal to

[1] TRUE

2 != 4 | 2 + 2 == 4 #OR

[1] TRUE

2 != 4 & 2 + 2 == 4 #AND

[1] TRUE

"Red" == "Red"

[1] TRUE

Objects

In addition to being able to work with actual numbers, R works in objects, which can represent anything from numbers to strings to vectors to matrices. Everything in R is an object. The best practice for assigning variable names to objects is the “<-” operator. After objects are created, they are stored in in the “Environment” tab in your RStudio console and can be called upon to perform different operations.

R has many data structures, including:

atomic vector
list
matrix
data frame
factors

R has 6 atomic vector types, or classes. Atomic means that a vector only contains elements of one class (i.e. the elements inside the vector do not come from mutliple classes).

character
numeric (real or decimal)
integer
logical (TRUE or FALSE)
complex (containing i)

a <- 2
b <- 3
a + b

[1] 5

class(a) #the "class" function tells you what class of object a is

[1] "numeric"

d <- c(1,2,3,4,5) #the "c" function concatenates the arguments contained within it into a vector
class(d)

[1] "numeric"

d[3] #brackets allow you index vectors or matrices. Here, we call the third value from our d vector.

[1] 3

#The below code stores the stated values in a dataframe which contains employee ids, names, salaries, and start dates for 5 employees
emp.data <- data.frame(
   emp_id = c (1:5), 
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25), 
   
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)

emp.data

  emp_id emp_name salary start_date
1      1     Rick 623.30 2012-01-01
2      2      Dan 515.20 2013-09-23
3      3 Michelle 611.00 2014-11-15
4      4     Ryan 729.00 2014-05-11
5      5     Gary 843.25 2015-03-27

emp.data$emp_id #the $ operator calls on a certain column of a dataframe

[1] 1 2 3 4 5

emp.data$emp_name[emp.data$salary > 620] #You can combine logical operators, brackets, and the $ sign to subset your dataframe in any way you choose! Here, we print out all the employee names for employees who have a salary greater than 620.

[1] "Rick" "Ryan" "Gary"

ls() #ls lists all the variable names that have been assigned to objects in your workspace

[1] "a"        "b"        "d"        "emp.data"

Using Packages in R

In addition to the basic functions provided in R, oftentimes we will be working with packages that contain functions written by other people to perform common tasks or specific analyses. Packages can also contain datasets. We can load these packages into our R environment after installing them in R.

usePackage <- function(p) 
{
  if (!is.element(p, installed.packages()[,1]))
    install.packages(p, dep = TRUE)
  require(p, character.only = TRUE)
}

usePackage("gapminder")#This code installs the gapminder packages, which contains vital statistics data from multiple countries. install.packages() is the function that will install a package for you if you know it's name.

Loading required package: gapminder

library("gapminder") #After installing the package, we need to tell R to load it into our current environment with this function.
head(gapminder) #The package gapminder contains a dataset called gapminder. We can use the "head" function to print out the first 6 rows of this dataset.

# A tibble: 6 x 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.

?gapminder #the ? operator lauches a help page to describe a particular function, including the arguments it takes. Whenever using a new function, it is good practice to first explore it through ?.

Loops

Oftentimes, we may want to perform the same operation or function many many times. Rather than having to explicitly write out each individual operation, we can make use of loops. For example, let’s say that we want to raise the number 2 to the power of each integer from 0 to 20. We could either write out 2^0, 2^1, 2^2 …, or make use of a for loop to condense our code while getting the same result.

2^0

[1] 1

2^1

[1] 2

2^2

[1] 4

2^3

[1] 8

# ...

#This is a for loop. in the parentheses after the for function, we specify over what range of values we want to loop over, and assign a dummy variable name to take on each of those values in sequence. Within the curly braces, we state what operation we want to perform over all the values taken on by the dummy variable.
for(i in 0:20){
  print(2^i)
}

[1] 1
[1] 2
[1] 4
[1] 8
[1] 16
[1] 32
[1] 64
[1] 128
[1] 256
[1] 512
[1] 1024
[1] 2048
[1] 4096
[1] 8192
[1] 16384
[1] 32768
[1] 65536
[1] 131072
[1] 262144
[1] 524288
[1] 1048576

User-defined Functions (UDF)

Another way to avoid writing out or copy-pasting the same exact thing over and over again when working with data is to write a function to contain a certain combination of operations you find yourself running mutliple times. For example, you may find yourself needing to calculate the Hardy-Weinberg Equillibrium genotype frequencies of a population given the allele frequencies. We can wrap up all the code that you would need to calculate this in a function that we can call upon again and again.

calc_HWE_geno <- function(p = 0.5){ 
  q <- 1-p
  
  pp <- p^2
  pq <- 2*p*q
  qq <- q^2
  
  return(c(pp, pq, qq))
}

calc_HWE_geno(p = 0.1)

[1] 0.01 0.18 0.81

#note that in our UDF we assigned a default value to p (p = 0.5). This means that if we do not specify a value for our argument of p, it will default to using that value.

calc_HWE_geno()

[1] 0.25 0.50 0.25

Plots

In addition to mathematical operations, R can help with data visualization. Base R has a few useful plotting functions, but popular packages such as ggplot2 give more customization and control to the user.

hist(gapminder$lifeExp)

Version	Author	Date
133df4a	Anthony Hung	2019-04-28

boxplot(lifeExp ~ continent, data = gapminder) #box plot for the life expectancies of all years per continent

Version	Author	Date
133df4a	Anthony Hung	2019-04-28

Setting a random seed

R has many functions that use a random number generator to generate an output. For example, the r____ functions (e.g. rbinom, runif) pull numbers from a probability distribution of your choice. In order to create reproducible analyses, it is often advantageous to be able to reliably obtain the same “random” number after running the same function over again. In order to do so, we can set a seed for the random number generator.

runif(1,0,1) #runif pulls a number from the uniform distribution with a set of given parameters

[1] 0.1944457

runif(1,0,1) #we can see that running runif twice gives you differnt results

[1] 0.205278

set.seed(1234) #setting a seed allows us to obtain reproducible results from functions that use the random number generator
runif(1,0,1)

[1] 0.1137034

set.seed(1234)
runif(1,0,1)

[1] 0.1137034

Exercises:

Write a function called calc_KE that takes as arguments the mass and velocity of an object and returns the kinetic energy of an object. Use it to find the KE of a 0.5 kg rock moving at 1.2 m/s.
Working with the gapminder dataset, find the country with the highest life expectancy in 1962.

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gapminder_0.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18     knitr_1.20       whisker_0.3-2    magrittr_1.5    
 [5] workflowr_1.3.0  rlang_0.2.2      fansi_0.3.0      stringr_1.3.1   
 [9] tools_3.5.1      utf8_1.1.4       cli_1.0.0        git2r_0.23.0    
[13] htmltools_0.3.6  yaml_2.2.0       rprojroot_1.3-2  digest_0.6.16   
[17] assertthat_0.2.1 tibble_1.4.2     crayon_1.3.4     fs_1.2.7        
[21] glue_1.3.0       evaluate_0.11    rmarkdown_1.10   stringi_1.2.4   
[25] compiler_3.5.1   pillar_1.3.0     backports_1.1.2