Last updated: 2021-11-22

Checks: 7 passed, 0 failed

Knit directory: ebird_light_pollution/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20211122) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 9277989. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store

Untracked files:
    Untracked:  data/.here
    Untracked:  data/house_sparrow_test.txt
    Untracked:  data/house_sparrow_test_2017.txt
    Untracked:  data/house_sparrow_test_sampling_2017.txt

Unstaged changes:
    Modified:   analysis/2_make_a_simple_occurrence_plot.Rmd
    Modified:   analysis/3_preparing_data_for_density_map.Rmd
    Modified:   analysis/4_drawing_a_density_map.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/5_extracting_light_pollution_data.Rmd) and HTML (docs/5_extracting_light_pollution_data.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 9277989 markravinet 2021-11-22 add tutorial 5

Introduction

Now that we know how to draw a density map based on predicted presence/absence, we’re ready to extract the light pollution data. Our aim here is to extract the mean light pollution, as measured by satellite imaging, at the centres of the same grid squares we used for our bird density estimates. This tutorial will show you the code to do this and also help you write it in a way that is easy to extend to different years, should you choose to look at how the presence/absence relationship varies over time.

Step 1: Installing packages and getting the data

For this tutorial, we are only using packages we have already worked with, so we just need to load them like so:

library(rgdal)
library(tidyverse)
library(raster)
library(sf)
library(auk)

Next we need to get hold of our light pollution data. We’re going to use the data from Yu et al (2018), who integrate multiple satellite sources of light pollution data from 1992-2018 into a single dataset, which we can download here. You should download the data for 2017 and 2018 and place it in your project directory. To make things easier, put that data, which is stored as a .tif, inside a sub-directory called light_pollution_data. We will explore what this data is later.

Next, we’re going to set up a variable right at the start of our script. We do this like so.

my_year <- "2017"

Note that this variable is a character, not numeric. This is because we’re going to use it in a paste0 command shortly. paste0 simply combines character vectors together. For example:

paste0("I got my pet dog in ", my_year)
#> [1] "I got my pet dog in 2017"

Why are we doing this? Because using paste0 we can set up our script so that we only ever need to alter the my_year variable in order to read in data from multiple years. With my_year set, we can then make the file names necessary for working with our data:

# make light file path
light_infile <- paste0("./light_pollution_data/Harmonized_DN_NTL_", my_year, "_simVIIRS.tif")
# make output file path
output_file <- paste0("house_sparrow_light_data_", my_year, ".csv")

You’ll remember from the previous tutorial that we also set a _2017.txt suffix on our house sparrow data - i.e. house_sparrow_test_2017.txt. This means we can use paste0 for those file paths too, which will pay off if we have data from multiple years.

# set paths for ebd data
sparrow_ebd <- paste0("house_sparrow_test_", my_year, ".txt")
sparrow_sample <- paste0("house_sparrow_test_sampling_", my_year, ".txt")

Step 2: Taking a look at the light data

Now that all our data paths are set, we are ready to take a look at the data. Let’s start with the light, since we’re not familiar with it. First we read in the light data as a raster file.

light <- raster(light_infile)

Next, we can plot it to make sure that it works:

plot(light)

This might take a few moments, but once it works you’ll see a map resembling the world map with a scale of 0 to 63 (although 60 is the maximum value displayed). This is a good moment to explore what this data actually is. It’s essentially a composite of many different satellite images. For each pixel of that image, a digital number (DN) is calculated. This is equivalent to the brightness of the pixel, such that 0 is no brightness and 63 is maximum brightness.
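If you want to confirm the DN range for yourself, a quick query with cellStats (from the raster package we loaded above) will do it:

```r
# quick sanity check: DN values should fall in the 0-63 range
raster::cellStats(light, stat = "min")
raster::cellStats(light, stat = "max")
```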

Step 3: Sorting out the bird data

Now that we have the light data read in, it’s time to read in the bird data again. A lot of this will be familiar from the previous tutorial, so I will not explain it in detail. See tutorial 4 for more details.
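As a quick reminder of what that workflow does: we zero-fill the house sparrow observations using the sampling events file, so that checklists where the species was not seen become explicit absences. A minimal sketch of that step, assuming the object name house_zf_c used below (see tutorial 4 for the exact arguments):

```r
# zero-fill the EBD observations with the sampling events file,
# then collapse the result into a single data frame
house_zf <- auk_zerofill(sparrow_ebd, sparrow_sample)
house_zf_c <- collapse_zerofill(house_zf)
```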

Step 4: Extracting the light pollution data

At this point, we have our light and our bird data read into the R environment. What we need to do next is take the coordinates of our bird observations and extract the corresponding light pollution measurements for them. This is quite easy with some basic dplyr commands.

We can also use our my_year variable to extract only the observations from our data for that specific year. For example:
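The chunk is missing from this rendering, but it is the same filter step shown in full later in the tutorial:

```r
house_zf_c %>% 
  dplyr::filter(grepl(my_year, observation_date))
```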

Here we used filter to extract all rows where observation_date contains "2017" - i.e. the value of my_year in this example. However, that returns a lot of data which we don’t actually need. All we’re interested in is the latitude and longitude, so we use select to obtain only those:
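Again the chunk itself did not render; given that raster::extract expects coordinates in (x, y) - i.e. (longitude, latitude) - order, it was presumably:

```r
house_zf_c %>% 
  dplyr::filter(grepl(my_year, observation_date)) %>%
  dplyr::select(longitude, latitude)
```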

OK, now you’ve seen how this works, we can assign the output of this command to an object we’ll call points:
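A sketch of that assignment, assuming the longitude/latitude selection just described:

```r
points <- house_zf_c %>% 
  dplyr::filter(grepl(my_year, observation_date)) %>%
  dplyr::select(longitude, latitude)
```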

All we need to do now is use the raster::extract function to extract the light data for those specific points:
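The call was presumably along these lines (raster::extract accepts a two-column set of x/y coordinates and returns the value of the cell each point falls in):

```r
# extract the light DN value at each observation's coordinates
bird_light <- raster::extract(light, points)
```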

Have a look at the bird_light object - it is a vector of the pixel values for each of the coordinate points in our bird data. But this is only the light data, so what we want to do next is construct a tibble data frame with the light and bird observation data combined.

Again, we can do this with dplyr commands. We’ll break it down, step-by-step.

house_zf_c %>% 
  dplyr::filter(grepl(my_year, observation_date)) 

This first step is identical to one we’ve already run. It simply extracts all of the bird data where observation_date matches the my_year object - "2017" in this case. Next we extend the code:

house_zf_c %>% 
  dplyr::filter(grepl(my_year, observation_date)) %>%
  dplyr::select(checklist_id, country, state, county, county_code,
                latitude, longitude, observation_date,
                scientific_name, observation_count, species_observed) 

Here we’re using the select function to extract the columns of interest. Most of these should be pretty self-explanatory. The reason we do this is to make our dataset a bit more manageable - i.e. the principle of tidy data.

The only thing missing now is the light data. We can add this like so using mutate.

house_zf_c %>% 
  dplyr::filter(grepl(my_year, observation_date)) %>%
  dplyr::select(checklist_id, country, state, county, county_code,
                latitude, longitude, observation_date,
                scientific_name, observation_count, species_observed) %>%
  mutate(light = bird_light)

Look closely at the output here and you’ll see we added a column called light. It’s worth spending some time now just looking through the data frame to make sure you’ve understood what was done here. Moving on however, we need to assign the output from this code to a new object, which we’ll call house2 here.

house2 <- house_zf_c %>% 
  dplyr::filter(grepl(my_year, observation_date)) %>%
  dplyr::select(checklist_id, country, state, county, county_code,
                latitude, longitude, observation_date,
                scientific_name, observation_count, species_observed) %>%
  mutate(light = bird_light)

This is the data we’ll be doing downstream statistical analyses on! Finally!! So the last step is to make sure we store it properly - i.e. we write it out here.

write_csv(house2, output_file)

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] auk_0.5.1       sf_1.0-3        raster_3.5-2    forcats_0.5.1  
 [5] stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4     readr_2.0.2    
 [9] tidyr_1.1.4     tibble_3.1.5    ggplot2_3.3.5   tidyverse_1.3.1
[13] rgdal_1.5-27    sp_1.4-5        workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] httr_1.4.2         jsonlite_1.7.2     modelr_0.1.8       assertthat_0.2.1  
 [5] cellranger_1.1.0   yaml_2.2.1         pillar_1.6.4       backports_1.3.0   
 [9] lattice_0.20-45    glue_1.5.0         digest_0.6.28      promises_1.2.0.1  
[13] rvest_1.0.2        colorspace_2.0-2   htmltools_0.5.2    httpuv_1.6.3      
[17] pkgconfig_2.0.3    broom_0.7.10       haven_2.4.3        scales_1.1.1      
[21] whisker_0.4        terra_1.4-11       later_1.3.0        tzdb_0.2.0        
[25] git2r_0.28.0       proxy_0.4-26       generics_0.1.1     ellipsis_0.3.2    
[29] withr_2.4.2        cli_3.1.0          magrittr_2.0.1     crayon_1.4.2      
[33] readxl_1.3.1       evaluate_0.14      fs_1.5.0           fansi_0.5.0       
[37] xml2_1.3.2         class_7.3-19       tools_4.1.2        hms_1.1.1         
[41] lifecycle_1.0.1    munsell_0.5.0      reprex_2.0.1       compiler_4.1.2    
[45] jquerylib_0.1.4    e1071_1.7-9        rlang_0.4.12       classInt_0.4-3    
[49] units_0.7-2        grid_4.1.2         rstudioapi_0.13    rmarkdown_2.11    
[53] gtable_0.3.0       codetools_0.2-18   DBI_1.1.1          R6_2.5.1          
[57] lubridate_1.8.0    knitr_1.36         fastmap_1.1.0      utf8_1.2.2        
[61] rprojroot_2.0.2    KernSmooth_2.23-20 stringi_1.7.5      Rcpp_1.0.7        
[65] vctrs_0.3.8        dbplyr_2.1.1       tidyselect_1.1.1   xfun_0.28