GitHub, Functions and Iteration

Today, we will start by setting up a basic workflow with GitHub. This workflow will make it easier to keep track of your work and collaborate with others. From there, we will learn about functional programming in R. This is one of the most important parts of coding with R (and I cannot stress enough how important it is). Built-in functions and packages can only get you so far. There will be times when it is both easier and more efficient to create your own.

This lab will teach you how to:

  • set up a basic workflow with RStudio and GitHub
  • write your own functions
  • iterate functions over multiple inputs
  • vectorise your functions using the purrr package

1. Git and GitHub (Condensed Version)

This is the condensed version of the version control extra session, which can be found in full here. We will create your first repo on GitHub and see how we can use it to keep a record of all the changes that were ever made to your code. This is an essential component of collaborative coding efforts, but can also be immensely beneficial for your own solo projects.

The structure of this session is as follows:

  1. Create a new repo on GitHub and initialize it
  2. Clone this repo to your local machine
  3. Make some changes to a file
  4. Stage these local changes
  5. Commit them to our Git history with a helpful message
  6. Pull from the GitHub repo just in case anyone else made changes too (not expected here, but good practice).
  7. Push our changes to the GitHub repo.

Before we get into the nitty-gritty of this week’s session, I would like to suggest you all download GitHub Desktop. It is a GUI that lets you interact with GitHub and might be an additional option if you do not want to rely on RStudio or the Command Line alone.

There are a number of different GUIs that you can try out and play around with. Some are better than others. An example of another free GUI like GitHub Desktop is GitKraken.


Suggested Workflow 🏄

This is the recommended workflow that you should employ when you work with Git. You can see it as a sort of recipe that you should follow under most circumstances. At each stage you can find the instructions for working through both the Command Line and through RStudio. Be aware however that these are separate processes that should not be mixed. Either you use the shell for version control or you use RStudio or you use GitHub Desktop.

1. Create a new repo on GitHub

  • Go to your GitHub page and make sure you are logged in.

  • Click the green “New repository” button. Or, if you are on your own profile page, click on “Repositories”, then click the green “New” button.

  • How to fill this in:

    • Repository name: my_first_repo (or whatever you want).
    • Description: “figuring out how this works” (or whatever, but some text is good for the README).
    • Select Public.
    • YES Initialize this repository with a README.
    • For everything else, just accept the default.

Great, you have now created a new repo on GitHub. As a general rule, you should always create the repo before you start your work in RStudio.

2. Clone it to your local machine

Whatever way you plan on cloning this repo, you first need to copy the URL identifying it. Luckily there is another green button “Code” that allows you to do just that. Copy the HTTPS link for now. It will look something like this https://github.com/your-git-username/my_first_repo.git.

Using GitHub Desktop

To clone a repo with GitHub Desktop, choose File > Clone Repository, switch to the URL tab, paste the URL you copied, pick a local path, and click “Clone”.

Using RStudio

In RStudio, go to:

File > New Project > Version Control > Git.

In the “repository URL”-box paste the URL of your new GitHub repository.

Do not just create some random directory for the local copy. Instead think about how you organize your files and folders and make it coherent.

I always suggest that with any new R-project you “Open in new session”.

Finally, click “Create Project” to create a new directory. What you get are three things in one:

  • a directory or “folder” on your computer
  • a Git repository, linked to a remote GitHub repository
  • an RStudio Project

In the absence of other constraints, I suggest that all of your R projects have exactly this set-up.

Using the Command Line

Open the Terminal on your laptop.

Be sure to check what directory you’re in. $ pwd displays the working directory. $ cd is the command to change directory.

Clone a repo into your chosen directory.

cd ~/teaching/2024-ids
git clone https://github.com/your-git-username/my_first_repo.git

Check whether it worked:

cd ~/teaching/2024-ids/my_first_repo
git log
git status

3. Make changes to a file

To showcase how useful Git can be, we first need to add some files to our repo. For now there should only be the .gitignore and the README file. While we do this we might as well review some of the stuff we encountered last week.

  • So let’s create a new R script and save it in the directory that we just cloned.

  • First load/install necessary packages (the tidyverse suffices here)

# Set-up your script ------------------------------------------------------

# install.packages(c("tidyverse", "gapminder", "pacman")) # uncomment if not yet installed
pacman::p_load(tidyverse, gapminder)

  • Then load the data you want to work with into R.

# Load your Data into R ---------------------------------------------------

data(gapminder)
head(gapminder)

  • Finally, start cleaning your data.

# Clean your Data ---------------------------------------------------------

gapminder_clean <- gapminder %>% 
  dplyr::rename(life_exp = lifeExp, gdp_per_cap = gdpPercap) %>% 
  dplyr::mutate(gdp = pop * gdp_per_cap)

4. Stage your Changes and Commit

Before we get on to the next step, it is a good idea to save this newly created script and give it a name. Now that your changes are saved locally we need to let Git know.

Using GitHub Desktop

Changed or added files are automatically staged in GitHub Desktop. The GUI also displays (where possible) the changes that were made.

You only really need to commit the changes with a nice little summary or message.

Using RStudio

In the Environment/History panel a new tab called “Git” should have appeared.

  • Click on it and it should display all the changed and new files in your directory.
  • Now you can select which files you want to stage. Simply tick the box next to your chosen files.
  • Hit the Commit Button and a new window should open up.
  • In this window you can quickly add a helpful message to mark exactly what you did. (Do not neglect this, as Git won’t allow you to commit without it. Plus, a clear message will help both you and your collaborators to understand your changes.)

This method has the clear advantage of being extremely intuitive. However you are limited to selecting and staging individual files.

Using the Command Line

Stage (“add”) a file or group of files. This allows you to stage specific individual files such as the README file for example.

git add NAME-OF-FILE-OR-FOLDER

Alternatively, you could stage all changes in the whole repository (new, modified, and deleted files):

git add -A

Or you could stage updated files only (modified or deleted, but not new):

git add -u

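Finally, you can stage all changes (new, modified, and deleted files) in the current directory and its subdirectories. This is the behaviour in Git 2.0 and later: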

git add .

Having done so, you are now ready to commit these changes!

git commit -m "Helpful message"

As you can imagine, the command shell with its different options (and there are more beyond the staging phase) can be quicker and more flexible than a GUI, especially for experienced users.

5. Pulling and Pushing your commits

This part of the version control workflow should be relatively easy. The most important thing to remember is that you should always pull before you push. The reason for this is that in collaborative projects someone else might have pushed changes to the same file you were working on. This can lead to conflicts that should be avoided, if possible.

Using GitHub Desktop

It is not really straightforward to pull before pushing with GitHub Desktop. Given that there is no prominent pull button, you need to actively remember to do so prior to pressing the blue push button.

Using RStudio

In RStudio you should see a change in the Git tab. It should now read: “Your branch is ahead of ‘origin/main’ by 1 commit”

As long as this information is displayed, you know that you need to pull and push your commit.

To do this simply click on the blue arrow pointing down to pull from your main repo, before clicking on the green arrow pointing upwards to push your commits.

Using the Command Line

There are only two commands you need to remember here and they are pretty intuitive:

To pull from the main repo:

git pull

And to push your commits:

git push

Practice Assignment Setup

We’ll be using GitHub for the assignments. Let’s take a look at the practice assignment to get familiar with the workflow.

2. Functions with R

Quick review: lists 📃

One of the more important objects in R for functional programming is the list. Since we only briefly touched upon lists in Week 1, let’s go over them again in more detail.

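Vectors can only hold a single data type. If you mix types, R silently coerces them to a common one (below, the number 1 becomes the string "1").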

vec <- c(a = "hello", b = 1)

By comparison lists can hold many different data types at the same time.

my_list <- list(a = "hello", b = 1, c = mean) # named my_list so we do not mask base R's list()

When you think about it, data.frames are also lists (or rather a list of columns).

library(gapminder)
head(gapminder) 
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

Lists are very useful when you start writing your own functions and iterating through them using the purrr family of functions.
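
A quick way to see these differences for yourself is sketched below. The object names (mixed_vec, mixed_list, small_df) are made up for this illustration and do not appear elsewhere in the lab.

```r
# c() coerces everything to one common type: here the number becomes a string
mixed_vec <- c(a = "hello", b = 1)
class(mixed_vec)             # "character"

# a list keeps each element's own type, including functions
mixed_list <- list(a = "hello", b = 1, c = mean)
sapply(mixed_list, class)    # "character", "numeric", "function"

# a data frame is a list whose elements are equal-length columns
small_df <- data.frame(x = 1:3, y = c("a", "b", "c"))
is.list(small_df)            # TRUE
length(small_df)             # 2, the number of columns
```

Because a data frame is a list of columns, anything that iterates over a list will visit a data frame column by column.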


Functions 🏭

In any coding language a fundamental principle should be DRY (Don’t Repeat Yourself). A good rule of thumb: once you’ve copy-pasted code twice, it’s time to write a function.

Functions allow you to automate tasks in a more powerful and general way. Writing a function has three big advantages over copy-and-pasting:

  1. As requirements change, you only need to update code in one place, instead of in many.

  2. You eliminate the chance of making incidental mistakes when you copy and paste (e.g. updating a variable name in one place, but not in another).

  3. You can give functions descriptive names that make your code easier to understand.

You can read more on functions in this section of R for Data Science.


Basic Syntax

What kind of code calls for writing a function? Here’s a typical example:

df <- data.frame(
  a = rnorm(100, 5, 2),
  b = rnorm(100, 100, 15),
  c = rnorm(100, 2, 1),
  d = rnorm(100, 36, 7)
)

df$a <- (df$a - mean(df$a, na.rm = TRUE)) / sd(df$a, na.rm = TRUE)
df$b <- (df$b - mean(df$b, na.rm = TRUE)) / sd(df$a, na.rm = TRUE) # can you spot the mistake?
df$c <- (df$c - mean(df$c, na.rm = TRUE)) / sd(df$c, na.rm = TRUE)
df$d <- (df$d - mean(df$d, na.rm = TRUE)) / sd(df$d, na.rm = TRUE)

There are three key steps to creating a new function:

  1. Pick a name for the function. We’ll use zscale because this function re-scales (or “z-transforms”) a vector to have a mean of 0 and a standard deviation of 1.

  2. List the inputs, or arguments, to the function inside the parentheses of function(). Here we have just one argument, x. If we had more, the definition would look like this: function(x, y, z).

  3. Place the code you have developed in the body of the function. The body of the function is represented by a {} block that immediately follows the function(...) call.

The overall structure of a function looks like this:

function_name <- function(input_parameters) {
  # Do what you want to do in the body of the
  # function, just like you would write other code in R.
}

In our example, we could simplify the z-transformation of four variables with this function:

zscale <- function(x) {
  # subtract the mean first, then divide the result by the standard deviation
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

Now instead of repeating that long formula four times (and risking copy-paste errors), we can simply use:

zscale(df$a)
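
As a sanity check, a correctly z-transformed vector should have a mean of (numerically) zero and a standard deviation of one. The sketch below defines its own copy of the function, named zscale_check here only to keep the example self-contained, and verifies this on simulated data. The seed and the simulated values are arbitrary assumptions.

```r
# z-transform: subtract the mean first, then divide by the standard deviation
zscale_check <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

set.seed(123)                        # arbitrary seed, only for reproducibility
x <- rnorm(100, mean = 5, sd = 2)
z <- zscale_check(x)

abs(mean(z)) < 1e-10                 # TRUE: the mean is zero up to floating-point error
abs(sd(z) - 1) < 1e-10               # TRUE: the standard deviation is one
```

Checks like these are cheap to write and catch misplaced parentheses, which are easy to overlook in a formula of this shape.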

A word on function names. Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (e.g. mean), or accesses some property of an object (e.g. coefficients). A good sign that a noun might be a better choice is if you’re using a very broad verb like “get”, “compute”, “calculate”, or “determine”. Where possible, avoid overriding existing functions and variables. This might be a little tricky sometimes, as many good names are already taken by other packages. Nevertheless, avoiding the most common names from base R will prevent confusion.


Conditional functions 🔀

In practice, you’ll frequently need functions that work differently depending on the input or situation. Adding conditions to your custom functions is simple. Here’s the basic syntax:

if (condition_1) {
  # do this
} else if (condition_2) {
  # do something else
} else if (condition_3) {
  # do yet another thing
} else {
  # fallback if none of the conditions is TRUE
}

The conditions in the parentheses are specified using the logical operators of R (!=, ==, <, >, etc.) or a function that returns a logical value. In many ways these conditions follow the same approach we applied to dplyr::filter() during last week’s lab, except that if() expects a single TRUE or FALSE rather than a whole vector. The {} delimit the code that runs for each branch, just as they delimit the body of a function.
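
To make the skeleton concrete, here is a minimal sketch. The helper describe_sign() is hypothetical and only serves to show how the branches are evaluated from top to bottom, with the first TRUE condition winning.

```r
# hypothetical helper: describe the sign of a single value
describe_sign <- function(x) {
  if (!is.numeric(x)) {
    "not a number"
  } else if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

describe_sign(3)      # "positive"
describe_sign(-1.5)   # "negative"
describe_sign(0)      # "zero"
describe_sign("a")    # "not a number"
```

Note that the type check comes first, so the numeric comparisons are never reached for a character input.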

You could, for example, only transform numeric variables and code the function to warn you if you tried to scale a character variable.

zscale <- function(x){
  if (is.numeric(x)) {
    (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  } else {
    return("Not a numeric input!")
  }
}

zscale(df$a)
##   [1]  0.283591172 -1.121317149 -2.466773594 -1.317191055  1.645353316
##   [6]  2.519126056 -0.498042780 -0.529249371 -0.061895104  0.648824757
##  [11] -0.575288049 -0.923718906  1.309040394  0.402784016  0.418776037
##  [16]  0.702965191 -1.689670871 -0.472087965  0.981687615  1.365133619
##  [21] -0.148569634 -0.002188474  1.389136238 -0.836033306 -0.402617331
##  [26] -0.426928631 -0.284871097 -0.200947603 -1.056355399 -0.022386791
##  [31]  0.581152420 -0.877751597  1.428028842  0.186463913  2.130256243
##  [36]  1.462694866 -0.261622164  0.417802935 -1.489427199 -0.954428312
##  [41] -0.304751279 -0.974562604  0.852196988  0.218729001  0.369205529
##  [46] -0.767736295  0.574569107  0.526726190 -0.427461450  0.035317244
##  [51] -1.261206997 -0.787767587 -0.758254347  1.137106288 -0.307256575
##  [56]  0.074713188  2.171740200  1.295958875 -0.677542038 -1.083919817
##  [61]  0.379972525 -0.089032539 -1.890414947 -1.076079960  1.380073869
##  [66]  0.805513655 -0.141233327 -1.111803241  2.317520347  0.677638611
##  [71] -0.153996730 -2.002781677  0.520740668 -0.114857425  1.378550487
##  [76]  1.358585780 -0.164833081 -1.468065727 -0.419303586 -0.311570915
##  [81]  1.514497978  0.205875379 -0.058022255 -0.392766734 -1.261358002
##  [86] -0.543019092  0.059347379  1.412744892  0.408139650 -1.281478942
##  [91]  1.137965797 -1.095915847  0.373046001 -0.404730692  0.053767008
##  [96] -0.394486172 -0.811751679  0.433922705 -0.395736084  0.006077052

Now we can apply our function to any variable that we would like to transform. It will run even if we apply it to a character input, but instead of scaled values it will return a message telling us that the input is not numeric.

df$a <- zscale(df$a)
df$b <- zscale(df$b)
df$c <- zscale(df$c)
df$d <- zscale(df$d)

# you can also use your function with a pipe!
df$d |> zscale()

Note that there is still a lot of repetition in the example above. We can get rid of this repetition using what coders call iteration 👇. We will take a look at it in a second.
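
One aside before moving on: returning a message string from a scaling function means the caller gets a character value where numbers were expected. A possible variation, sketched here under the assumption that you would rather signal a proper R warning and hand the input back unchanged, uses base R's warning(). The name zscale_warn is made up for this example.

```r
# variant that signals a real warning and returns the input unchanged
zscale_warn <- function(x) {
  if (!is.numeric(x)) {
    warning("Not a numeric input! Returning it unchanged.")
    return(x)
  }
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

zscale_warn(c(1, 2, 3))     # -1 0 1
zscale_warn(c("a", "b"))    # emits a warning, then returns "a" "b"
```

If a non-numeric input should never happen at all, stop() could be used in place of warning() to halt execution instead.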


First, let’s do one last exercise with functions. We will work with palmerpenguins again.

library(palmerpenguins)
data(penguins)
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Exercise 1:

Can you write a function to calculate the mode for a variable in the data? Let’s call it get_mode.

# break down the problem into multiple parts
# you might want to 
# a) count how many times each value appears in a vector
# b) find which value(s) appear most frequently

Exercise 2:

What is the mode for the variable flipper_length_mm?


3. Iteration ⚙️

Iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns or on different datasets.

On the one hand, you have for loops and while loops, which are a great place to start because they make iteration very explicit. On the other hand, functional programming (FP) offers tools to extract out duplicated code, so each common for loop pattern gets its own function.

Remember the code above - it violates the rule of thumb that you should not copy-paste code more than twice.

# repetitive code
df$a <- zscale(df$a)
df$b <- zscale(df$b)
df$c <- zscale(df$c)
df$d <- zscale(df$d)

For-loops

To solve problems like this one with a for loop, we need to think again about the following three components:

  1. Output: When you write a loop, you need to decide where the results will go. If you are modifying an existing object, the output is that same object. If you are creating something new, define space for it first, such as an empty vector. Avoid growing the vector inside the loop, because R copies the data each time, which makes the loop much slower (O(n²) behavior). A better way is to store the results in a list during the loop and then combine them into a single vector afterwards. See more on this here.

  2. Sequence: we can think about a data frame as a list of columns, so we can iterate over each column with seq_along(df).

  3. Body: apply zscale() or any other function.

The better solution will look like this:

# repetitive code
df$a <- zscale(df$a)
df$b <- zscale(df$b)
df$c <- zscale(df$c)
df$d <- zscale(df$d)

# equivalent iteration
for (i in seq_along(df)) {       # seq_along(df) gives 1, 2, ..., length(df)
  df[[i]] <- zscale(df[[i]])     # [[]] because we are working on single elements
}

Remember, this only works for for loops that modify existing inputs (e.g. columns in a data frame). If you want to store the output of your function somewhere else, you need to create the object that will hold it before the loop starts. Below are two versions, one that fills a pre-allocated vector and one that fills a list:

###### Vector

# creating an "empty" vector to put the values
output_median <- vector("double", ncol(df))

# running for loop
for (i in seq_along(df)) {            
  output_median[[i]] <- median(df[[i]])
}

# checking result
output_median
## [1] -0.10194498 -0.74812672 -0.09009347 -0.02034463

###### List

# creating an "empty" list with one slot per column
output_median_list <- vector("list", ncol(df))

# running for loop
for (i in seq_along(df)) {            
  output_median_list[[i]] <- median(df[[i]])
}

# checking result
output_median_list
## [[1]]
## [1] -0.101945
## 
## [[2]]
## [1] -0.7481267
## 
## [[3]]
## [1] -0.09009347
## 
## [[4]]
## [1] -0.02034463
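
The advice from the Output step (collect results in a list, then combine them afterwards) can be sketched as follows. The data frame scores and its values are made up for illustration.

```r
# small example data (made up for illustration)
scores <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60), c = c(5, 5, 8))

# pre-allocate a list with one slot per column and carry over the column names
results <- vector("list", ncol(scores))
names(results) <- names(scores)

for (i in seq_along(scores)) {
  results[[i]] <- median(scores[[i]])
}

# combine the list into a single named numeric vector
unlist(results)   # a = 2, b = 20, c = 5
```

Setting the names up front means the combined vector tells you which column each value came from.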

While-loops

You should also be aware that there is a conditional version of for-loops called while loops. Their uses are a little more niche and as such will not be covered in this lab. For those among you who are curious about them, you can find a pretty good tutorial here.


The purrr package 🐱

For-loops are not as important in R as they are in other languages because R is a functional programming language. This means that it’s possible to wrap up for-loops in a function, and call that function instead of using the for-loop directly. 💡

Basic syntax 🖊

The purrr package provides functions that eliminate the need for many common for loops. The apply family of functions in base R (apply(), lapply(), tapply(), etc.) solve a similar problem, but purrr is more consistent and thus is easier to learn. The most useful function will be map(.x, .f), where:

  • .x: is a vector, list, or data frame
  • .f: is a function
  • output: is a list

[Figure: Logic behind vectorised functions (also called functional programming).]

Three ways to pass functions to map():

  1. pass directly to map()

purrr::map(df, mean, na.rm = TRUE)

  2. use an anonymous function \(x)

purrr::map(df, \(x) mean(x, na.rm = TRUE))

  3. use ~

purrr::map(.x = df, ~ mean(.x, na.rm = TRUE))
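
To convince yourself that the three styles are interchangeable, you can compare their results directly. This is a small sketch on a made-up data frame (toy), assuming purrr is installed and R is version 4.1 or later (needed for the \(x) shorthand).

```r
library(purrr)

# small example data with a missing value (made up for illustration)
toy <- data.frame(a = c(1, 2, NA), b = c(4, 5, 6))

direct    <- map(toy, mean, na.rm = TRUE)            # extra arguments passed along
anonymous <- map(toy, \(x) mean(x, na.rm = TRUE))    # anonymous function
formula   <- map(toy, ~ mean(.x, na.rm = TRUE))      # formula shorthand

identical(direct, anonymous)   # TRUE
identical(direct, formula)     # TRUE
direct$a                       # 1.5
```

Which style to use is mostly a matter of readability. The anonymous function form makes it most explicit where each element ends up in the call.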

Let’s look at this in practice. Imagine you want to calculate the mean of each column in your data frame:

# repetitive code
mean(df$a)
mean(df$b)
mean(df$c)
mean(df$d)


# equivalent map function
purrr::map(.x = df, ~ mean(.x, na.rm = TRUE))

# map function in tidyverse style
df |> purrr::map(mean)

The purrr::map*() family of functions 👪

The pattern of looping over a vector, doing something to each element and saving the results is so common that the purrr package provides a family of functions to do it for you. Indeed, their use is so common that several wrapper functions were created to include the final transformation of the list output. There is one function for each type of output:

  • purrr::map() returns a list.
  • purrr::map_lgl() returns a logical vector.
  • purrr::map_int() returns an integer vector.
  • purrr::map_dbl() returns a double vector.
  • purrr::map_chr() returns a character vector.
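
A short sketch of some of the typed variants in action, using a made-up data frame (mixed) and assuming R 4.0 or later, where data.frame() keeps strings as character columns:

```r
library(purrr)

# made-up data frame with mixed column types
mixed <- data.frame(id = 1:3, name = c("x", "y", "z"), score = c(0.5, 1.5, 2.5))

map_lgl(mixed, is.numeric)   # id = TRUE, name = FALSE, score = TRUE
map_chr(mixed, class)        # "integer", "character", "numeric"
map_int(mixed, length)       # 3 for every column
```

The typed variants are strict: if .f returns something that does not fit the requested type, purrr raises an error instead of silently converting it.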

Exercise 3:

Go back to the example above. Since all of the means are numeric, it makes more sense to store them in a vector rather than a list. Which function should we use and how?


purrr::map2 ‼️

You can also iterate over two inputs at the same time using map2(.x, .y, .f).

[Figure: Logic behind map2().]

The function works exactly the same way as the purrr::map* functions for a single input. One caveat is that both inputs need to have the same length, unless one of them has length 1, in which case it is recycled!
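
A minimal sketch of map2 in action is shown below. The vectors means and sds are made up, and the typed variant map2_dbl() is used so that the result is a plain numeric vector. The second call illustrates that purrr also accepts a length-1 input and recycles it.

```r
library(purrr)

means <- c(0, 10, 100)
sds   <- c(1, 2, 3)

# element-wise: pairs means[1] with sds[1], means[2] with sds[2], and so on
map2_dbl(means, sds, \(m, s) m + 2 * s)   # 2 14 106

# a length-1 input is recycled to match the other one
map2_dbl(means, 1, \(m, s) m + 2 * s)     # 2 12 102
```

Inputs of two different lengths greater than one (say 3 and 2) cannot be paired up and produce an error.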


Exercise 4:

Write a function that returns both the mean and the standard deviation for the numeric variables in our palmerpenguins data.


Exercise 5:

Iterate over the relevant columns.


Non standard evaluation (optional)

There is, of course, much more to learn about functions in R and for those of you who want to take it further, you can find more information here. For now, consider this a first exposure to functions (one that can actually already get you pretty far). However, it is important that you apply 🤓 your new skills and practice further on your own.

One such skill is the question of how to integrate tidyverse functions into your own functions. Most dplyr verbs use tidy evaluation in some way. Tidy evaluation is a special type of non-standard evaluation (meaning the way R interprets your written code) used throughout the tidyverse. There are two basic forms found in dplyr:

  • data masking makes it so that you can use data variables as if they were variables in the environment (i.e. you write my_variable instead of df$my_variable).
  • tidy selection allows you to easily choose variables based on their position, name, or type (e.g. starts_with("x") or where(is.numeric)).

Data masking and tidy selection make interactive data exploration fast and fluid, but they add some new challenges when you attempt to use them indirectly such as in a for loop or a function. This vignette shows you how to overcome those challenges.
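
As a first taste, the most common tool for the function case is the embrace operator {{ }}, which passes a bare column name given by the user through to a dplyr verb. The sketch below assumes dplyr is installed. The helper group_mean() and the data frame toy are hypothetical and exist only for this illustration.

```r
library(dplyr)

# hypothetical helper: mean of one column per group
# {{ }} ("embrace") forwards the bare column names to the dplyr verbs
group_mean <- function(data, group_var, value_var) {
  data |>
    group_by({{ group_var }}) |>
    summarise(mean_value = mean({{ value_var }}, na.rm = TRUE))
}

toy <- data.frame(g = c("a", "a", "b"), v = c(1, 3, 10))
group_mean(toy, g, v)   # one row per group: a -> 2, b -> 10
```

Without the embrace, dplyr would look for columns literally called group_var and value_var, which do not exist in the data.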


Acknowledgements

This tutorial is partly based on R for Data Science, section 5.2, Quantitative Politics with R, chapter 3, the Tidyverse Session in the course Data Science for Economists by Grant McDermott, and Teaching the Tidyverse in 2023.

The section on functions and iteration is partly based on R for Data Science, section 5.2, Quantitative Politics with R, chapter 3; as well as the Tidyverse Session and on the excellent slides by Malcolm Barrett in the course Data Science for Economists by Grant McDermott. The data for the exercises was inspired by R for Epidemiology.

This script was drafted by Tom Arendt and Lisa Oswald, with contributions by Steve Kerr, Hiba Ahmad, Carmen Garro, Sebastian Ramirez-Ruiz, Killian Conyngham, and Carol Sobral.