Tutorial 1: Accessing and reading eBird data

Last updated: 2021-11-22


Knit directory: ebird_light_pollution/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20211122)

The command set.seed(20211122) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 783d263

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 783d263. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Untracked files:
    Untracked:  analysis/2_make_a_simple_occurrence_plot.Rmd
    Untracked:  analysis/3_preparing_data_for_density_map.Rmd
    Untracked:  analysis/4_drawing_a_density_map.Rmd
    Untracked:  analysis/5_extracting_light_pollution_data.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/1_reading_data_with_ebird.Rmd) and HTML (docs/1_reading_data_with_ebird.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	783d263	markravinet	2021-11-22	initial tutorial upload

Introduction

In this tutorial, we will learn how to use R to interact with data from the eBird repository. eBird is a database maintained by the Cornell Lab of Ornithology, and a lot of support, information and tutorials for using this sort of data are available online (see the end of the tutorial for more information). We will use the auk package to access the data and get it into a more usable format.

Step 1: Installing the `auk` package

eBird holds a huge amount of data, with over 600 million observations of birds across the world. You can browse it through the website, but if you’d like to use it for data analysis, you need to interact with the database itself.

One way to do this is with auk, which is designed by the Cornell Lab of Ornithology specifically for working with eBird data in R. The name is a pun: an auk is a seabird, but AWK is also a Unix programming utility for processing text, which auk runs in the background.

However, because AWK is a Unix tool, you need to install some other tools first unless you have a Mac or Linux machine.

Pre-installation: Mac and Linux users

You don’t need to do anything so you can proceed to the next step!

Pre-installation: Windows users

If you have a Windows machine, you will need to install Cygwin, which provides a Unix-like environment on Windows.
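
Whichever operating system you use, it can be reassuring to confirm that R can see an AWK executable before going further. The following is a minimal sketch using only base R. It assumes AWK is on your system PATH, which may not hold for a Cygwin installation even when Cygwin is installed correctly, so treat a negative result as a prompt to check rather than as proof that something is wrong.

```r
# Look for an AWK executable on the system PATH (base R only).
# Sys.which() returns an empty string for anything it cannot find.
awk_candidates <- Sys.which(c("awk", "gawk"))
awk_found <- any(nzchar(awk_candidates))

if (awk_found) {
  message("AWK found at: ", awk_candidates[nzchar(awk_candidates)][1])
} else {
  message("No AWK on the PATH. On Windows, check that Cygwin is installed.")
}
```

Once auk itself is installed, its helper auk_get_awk_path() should report the AWK executable that the package will actually use.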

Installing the package via R

Once you are ready to install the package, all you need to do is the following:

install.packages("auk") # you only need to do this once
library(auk)
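
If you rerun your script often, you may prefer not to reinstall the package every time. One possible pattern is sketched below. Note that ensure_package() is a hypothetical helper written for this illustration, not a function from auk or base R.

```r
# Install a package from CRAN only if it is not already available.
# ensure_package() is a hypothetical helper, not part of auk.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded.
  invisible(requireNamespace(pkg, quietly = TRUE))
}
```

You would then call ensure_package("auk") once at the top of your script, followed by library(auk) as usual.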

You are now ready to use auk.

Step 2: Setting the eBird directory

In order to interact with eBird data, you need to download it to your local machine and tell auk where to look for it. First you will need to request the data you are interested in from eBird. You can download the entire dataset, but while you get used to the data, it is best to choose a specific species and location.

The next thing you need to do is place the data somewhere you can access it. The easiest way to do this is to create a project using RStudio and place the directory with the data inside it.

For now we will use a filtered version of the September 2020 eBird release to demonstrate the principles of interacting with the data. The first thing we need to do is to tell auk where to find the eBird data. We do this like so:

auk_set_ebd_path("./ebd_GB_relSep-2020/", overwrite = TRUE)

Note that this is a relative path: the ./ means the data directory sits inside the directory R is currently working in. To see where that is, use getwd(). For more effective control over directories, files and paths, I recommend using the here package.
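
To make the idea of a relative path concrete, here is a small sketch that shows how ./ebd_GB_relSep-2020 relates to the working directory. It uses base R, plus the here package only if it happens to be installed. The variable names (wd, data_dir, abs_dir, dir_ok, here_dir) are chosen for this illustration.

```r
# Where is R currently working? Relative paths are resolved from here.
wd <- getwd()

# The relative path used in this tutorial ...
data_dir <- "./ebd_GB_relSep-2020"

# ... and the absolute path it corresponds to on this machine.
abs_dir <- file.path(wd, "ebd_GB_relSep-2020")

# FALSE means the directory is not where R expects to find it.
dir_ok <- dir.exists(data_dir)

# Optional: the here package builds paths from the project root instead
# of the working directory.
if (requireNamespace("here", quietly = TRUE)) {
  here_dir <- here::here("ebd_GB_relSep-2020")
}
```

If dir_ok is FALSE, compare abs_dir with the location where you actually saved the data before continuing.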

Next up, we need to tell auk where to find the data file. We’ll use a filtered down one here to make life a little easier.

ebd_file <- "./ebd_GB_relSep-2020/ebd_GB_relSep-2020_filtered.txt"
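
Before building any filters, it can help to check that the file really is where you think it is and to peek at its column names. The sketch below assumes the file is tab-separated text with a header line; if your download is laid out differently, the header will not split as shown.

```r
ebd_file <- "./ebd_GB_relSep-2020/ebd_GB_relSep-2020_filtered.txt"

if (file.exists(ebd_file)) {
  # Size in megabytes, to get a feel for how long filtering may take.
  size_mb <- file.size(ebd_file) / 1024^2
  message("Found data file (", round(size_mb, 1), " MB)")

  # Read only the first line and split it on tabs to list the columns.
  header <- strsplit(readLines(ebd_file, n = 1), "\t", fixed = TRUE)[[1]]
  print(header)
} else {
  message("Data file not found: ", ebd_file)
}
```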

Now we’re ready to access data!

Step 3: Accessing data with `auk`

auk accesses data by building up a command: you chain together a series of filters that will later be applied to the main dataset.

In this example, we will filter the dataset for all observations of house sparrows from the UK between 2018 and 2020. Here’s how to achieve this:

my_filter <- auk_ebd(ebd_file) %>% 
  auk_species(c("House sparrow")) %>%
  auk_country("GB") %>%
  auk_date(c("2018-01-01", "2020-09-01"))

Two things to note. Firstly, auk is built to align with the principles of tidyr and the [tidyverse](https://www.tidyverse.org/). If you are not familiar with these, it is worth reading a short introduction to the tidyverse first.

Secondly, the filters are not actually applied yet. We’ve just built the command. You can verify this for yourself by typing my_filter into the R console.
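
The idea of defining filters first and applying them later can be illustrated with a toy example in base R. This is only an analogy, not how auk is implemented: the obs data frame holds made-up records, and filters and apply_filters() are hypothetical names invented for this sketch.

```r
# Toy data standing in for eBird records (values are made up).
obs <- data.frame(
  species = c("House Sparrow", "Blue Tit", "House Sparrow"),
  country = c("GB", "GB", "NO"),
  date    = as.Date(c("2019-05-01", "2019-06-01", "2017-01-01")),
  stringsAsFactors = FALSE
)

# Step 1: describe the filters. Nothing is run against the data yet.
filters <- list(
  species = function(d) d$species == "House Sparrow",
  country = function(d) d$country == "GB",
  date    = function(d) {
    d$date >= as.Date("2018-01-01") & d$date <= as.Date("2020-09-01")
  }
)

# Step 2: apply every filter in one pass and keep rows that pass them all.
apply_filters <- function(d, filters) {
  keep <- Reduce(`&`, lapply(filters, function(f) f(d)), rep(TRUE, nrow(d)))
  d[keep, , drop = FALSE]
}

result <- apply_filters(obs, filters)
```

Only the first record passes all three filters. In the same spirit, my_filter describes what to keep, and the main dataset is only read when the filters are applied in a later step.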

NB you might get a warning about taxonomy. This is because eBird updates its taxonomy every year. Unless you are working with very rare species, this probably won’t be too much of an issue.

Next we need to do a bit of groundwork before applying our filter to the main dataset. Essentially, we should set up the name of our output file.

output <- "house_sparrow_test.txt"

With that done, we can now run the auk filters. This will filter the main data, write the result to the output file we have specified and, because of the overwrite = TRUE argument, overwrite any previously created file with the same name as output. This step might take a while to run, but once it has finished, we will be able to read our data back into R!

auk_filter(my_filter, file = output, overwrite = TRUE)
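
Because filtering can be slow, you may not want to repeat it every time you rerun your script. One option, sketched below, is to skip the step when the output file already exists. This is the opposite behaviour to overwrite = TRUE, so it is only appropriate when your filters have not changed. filter_once() is a hypothetical helper for this illustration and is not part of auk.

```r
# Run the auk filters only if the output file does not exist yet.
# filter_once() is a hypothetical helper, not part of auk.
filter_once <- function(filters, file) {
  if (file.exists(file)) {
    message("Using existing file: ", file)
  } else {
    auk::auk_filter(filters, file = file)
  }
  # Return the path so it can be passed straight on to a reader function.
  invisible(file)
}
```

You would call it as filter_once(my_filter, output). If you change the species, country or dates, delete the old output file first so that the filters are run again.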

Step 4: Reading data back into R

Once we have filtered the eBird data and written the results out to a file, we can read them back in. We can do this like so:

house <- read_ebd("./house_sparrow_test.txt", unique = TRUE, rollup = FALSE)

What have we done here? The rollup argument deals with taxonomic uncertainty, which is not such an issue for common UK birds. When set to TRUE, it rolls records reported below species level (for example subspecies) up to species and drops records that could not be identified to species. Here we set it to FALSE since it takes a while to run. However, we do need unique = TRUE, as this ensures any duplicate records are combined. For example, if a group is birding and they are all recording their data in eBird, they are likely to make multiple records of the same observation.
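
To see why duplicates arise and what combining them means, consider the simplified base R sketch below. It is not auk's actual implementation. The column names are modelled on eBird's sampling event and group identifiers, but the values are made up, and the logic is reduced to keeping one record per shared checklist.

```r
# Three made-up records: S1 and S2 come from two people in the same
# birding group (G1), S3 is from someone birding alone.
records <- data.frame(
  sampling_event_identifier = c("S1", "S2", "S3"),
  group_identifier          = c("G1", "G1", NA),
  observation_count         = c(4, 4, 2),
  stringsAsFactors = FALSE
)

# Use the group ID where there is one, otherwise the checklist's own ID.
checklist_id <- ifelse(
  is.na(records$group_identifier),
  records$sampling_event_identifier,
  records$group_identifier
)

# Keep only the first record for each checklist ID.
unique_records <- records[!duplicated(checklist_id), ]
```

Here unique_records keeps S1 and S3, so the four sparrows seen by the group are counted once rather than twice.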

This is essentially all the data we need for the next step, but do spend the time to have a look at the table you have created. Think about how many observations there are, what the columns show and how you could manipulate it.
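
As a starting point for exploring the table, here are a few base R commands you might try. So that the sketch runs on its own, it uses a tiny made-up data frame called house_toy. The column names are assumed to resemble those in the real table, so check names(house) before relying on them.

```r
# A tiny stand-in for the real table (values are made up).
house_toy <- data.frame(
  common_name       = rep("House Sparrow", 3),
  observation_count = c("2", "X", "5"),
  observation_date  = as.Date(c("2018-03-14", "2019-07-02", "2019-11-30")),
  stringsAsFactors = FALSE
)

# How many observations are there, and which columns do we have?
n_obs <- nrow(house_toy)
col_names <- names(house_toy)

# What period do the observations cover, and how many fall in each year?
date_range <- range(house_toy$observation_date)
obs_per_year <- table(format(house_toy$observation_date, "%Y"))

# Counts are stored as text here because "X" marks a sighting with no
# count recorded; converting to integer turns those entries into NA.
counts <- suppressWarnings(as.integer(house_toy$observation_count))
```

Replacing house_toy with house runs the same checks on the real data.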

Other resources

sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] auk_0.5.1       here_1.0.1      workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       whisker_0.4      knitr_1.36       magrittr_2.0.1  
 [5] R6_2.5.1         rlang_0.4.12     fastmap_1.1.0    fansi_0.5.0     
 [9] stringr_1.4.0    tools_4.1.2      xfun_0.28        utf8_1.2.2      
[13] git2r_0.28.0     jquerylib_0.1.4  htmltools_0.5.2  ellipsis_0.3.2  
[17] rprojroot_2.0.2  yaml_2.2.1       digest_0.6.28    tibble_3.1.5    
[21] lifecycle_1.0.1  crayon_1.4.2     later_1.3.0      vctrs_0.3.8     
[25] promises_1.2.0.1 fs_1.5.0         glue_1.5.0       evaluate_0.14   
[29] rmarkdown_2.11   stringi_1.7.5    compiler_4.1.2   pillar_1.6.4    
[33] httpuv_1.6.3     pkgconfig_2.0.3