Introduction

Geocoding is the process of determining geographic coordinates for place names, street addresses, and zip codes. From information like addresses, city or building names, for example, R fetches the corresponding coordinates and pins it on a map. In this workshop, we will use adopt the standardized way to handle and process spatial data, the “simple features” (sf) package. At the end of our tutorial, you should be familiar with some common verbs used in sf and how to use them for geospatial analysis.

Basic shapes understanding

Before we delve into sf functions, let’s make sure we know basic geometries and how to create them. Later we can use other functions on top of the geometries such as st_dimensions

par(mfrow = c(2,4)) # sets the layout of the plotting region to a 2x4 grid
par(mar = c(1,1,1.2,1)) # sets the margin size for the plots

# 1
p <- st_point(5:6) # needs to be consecutive numbers
plot(p, pch = 16)
title("point")
box(col = 'grey')

# 2
mp <- st_multipoint(rbind(c(1,1), c(2, 2), c(2, 1), c(2, 3), c(1,4)))
plot(mp, pch = 16)
title("multipoint")
box(col = 'grey')

# 3
ls <- st_linestring(rbind(c(1,1), c(5,5), c(5, 6), c(4, 6), c(3, 4), c(2, 3)))
plot(ls, lwd = 2)
title("linestring")
box(col = 'grey')

# 4
mls <- st_multilinestring(list(
  rbind(c(1,1), c(5,5), c(5, 6), c(4, 6), c(3, 4), c(2, 3)),
  rbind(c(3,0), c(4,1), c(2,1))))
plot(mls, lwd = 2)
title("multilinestring")
box(col = 'grey')

# 5 polygon
po <- st_polygon(list(rbind(c(2,1), c(3,1), c(5,2), c(6,3), c(5,3), c(4,4), c(3,4), c(1,3), c(2,1)),
    rbind(c(2,2), c(3,3), c(4,3), c(4,2), c(2,2))))
plot(po, border = 'black', col = 'lavender', lwd = 2)
title("polygon")
box(col = 'grey')

# 6 multipolygon
mpo <- st_multipolygon(list(
    list(rbind(c(2,1), c(3,1), c(5,2), c(6,3), c(5,3), c(4,4), c(3,4), c(1,3), c(2,1)),
        rbind(c(2,2), c(3,3), c(4,3), c(4,2), c(2,2))),
    list(rbind(c(3,7), c(4,7), c(5,8), c(3,9), c(2,8), c(3,7)))))
plot(mpo, border = 'brown', col = 'orange', lwd = 2)
title("multipolygon")
box(col = 'grey')

# 7 geometrycollection
gc <- st_geometrycollection(list(po, ls + c(0,5), st_point(c(2,5)), st_point(c(5,4))))
plot(gc, border = 'aquamarine', col = 'beige', pch = 16, lwd = 2)
title("geometrycollection")
box(col = 'grey')

opar <- par(mfrow = c(1, 2))
a <- st_polygon(list(cbind(c(0,0,7.5,7.5,0),c(0,-1,-1,0,0))))
b <- st_polygon(list(cbind(c(0,1,2,3,4,5,6,7,7,0),c(1,0,.5,0,0,0.5,-0.5,-0.5,1,1))))
plot(a, ylim = c(-1,1))
title("intersecting two polygons:")
plot(b, add = TRUE, border = 'red')
(i <- st_intersection(a,b))

## GEOMETRYCOLLECTION (POLYGON ((7 0, 7 -0.5, 6 -0.5, 5.5 0, 7 0)), LINESTRING (4 0, 3 0), POINT (1 0))

## GEOMETRYCOLLECTION(POINT(1 0), LINESTRING(4 0, 3 0), POLYGON((5.5 0, 7 0, 7 -0.5, 6 -0.5, 5.5 0)))
plot(a, ylim = c(-1,1))
title("GEOMETRYCOLLECTION")
plot(b, add = TRUE, border = 'red')
plot(i, add = TRUE, col = 'green', lwd = 2)

Read shapefiles from the web using st_read()

If we want to create a map, we need to use the ‘Shapefile’ format, which stores geometry and attribute information for spatial features in a dataset. The geometry is stored as a shape comprising a set of vector coordinates. We will need to read the Shapefile with st_read.

As an example, we read the World Administrative Boundaries dataset (https://www.naturalearthdata.com/downloads/). st_read follows the conventions of base R, similar to how it reads tabular data into data.frames, it reads file or database vector dataset as a sf object

world_boundaries <- st_read("DATA/world-administrative-boundaries.shp")

## Reading layer `world-administrative-boundaries' from data source 
##   `C:\Users\Giulia Maria\Downloads\Utili_Math_Data\07_Workshop_Geospatial_sf\DATA\world-administrative-boundaries.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 256 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -58.49861 xmax: 180 ymax: 83.6236
## Geodetic CRS:  WGS 84

plot(world_boundaries["iso3"])

Transforming point data to a sf object and plotting on a map

st_as_sf() st_crs() and mapview()

Let us get to know two important sf functions, st_as_sf() and st_crs(). They often work together. While st_as_sf() is used to convert data frames or matrices into sf objects, st_crs() is used to manage and retrieve the Coordinate Reference System (CRS) of a spatial object. To be clear, an sf object is a special data structure used for working with spatial data, while the CRS defines how the two-dimensional (or three-dimensional) points relate to actual places on the Earth.

Let’s see them in action in an example.

Hertie’s international environment makes it so we often work with people from all over the world. Specifically, the three of us that worked on this workshop are from Rome, Tianjin and Ottawa. To get a sense of how far these places actually are from each other, let’s plot them on a map. The function mapview() will come in very handy, as it will produces an interactive view of our cities.

cities <- data.frame(
  place = c("Rome", "Tianjin", "Ottawa"),
  long = c(12.496366, 117.190182, -75.690308),  # Longitude of each 
  lat = c(41.902782, 39.084158, 45.421529))      # Latitude of each 
    
cities_dsf <- st_as_sf(cities, coords = c("long", "lat"))  # Makes long and lat into mappable points 
st_crs(cities_dsf) <- 4326  

#The number 4326 is an identifier for a specific Coordinate Reference System (CRS) known as WGS 84. It's like a language code that tells software how to understand and display locations on a map.

mapview(cities_dsf)

geocode_OSM()

Quite cool! But to insert the coordinates feels a bit lenghty, especially when there is a r function that does it for you. I am referring to geocode_OSM()! This function geocodes a location (based on a search query) to coordinates. It uses OpenStreetMap Nominatim; Nominatim ( from the latin “by name”) is a search engine for OpenStreetMap (OSM) data.

cities_c <- c( "Tianjin", "Rome","Ottawa") 
geocoded_cities <- geocode_OSM(cities_c) 
 
geocoded_cities # Showing Dataframe with coordinates

geometry_geocoded_cities <- st_as_sf(geocoded_cities,   # Create new column using st_as_sf to store the geometry in a column 
                      coords = c(x = "lon", y = "lat"),
                      crs = 4326)

geometry_geocoded_cities$CityName <- cities_c # Adding the names we want the map to show, more explanatory than addresses and zipcodes 

# Use mapview and specify the zcol parameter to display city names on the legend
mapview(geometry_geocoded_cities, zcol = "CityName")

Now, let us add the universities we all went to for our Bachelors, University of Ottawa, Maastricht University and Minerva University (SF, US). To show geocode_OSM’s range, we will feed into it as a building name, a zip code and an address respectively. Note how we don’t need to go and fetch any API online but geocode_OSM does all the work for us. Let us overlay the two sets of points together on one single map. This will allow us to see all the places in the same map and not on two different ones. Please feel free to take a minute and interact with the map bleow; on the top left of the map, below the + and - symbols, there is a square icon. Press that and check the different displays of the same map, and how you can deselect and select again the two sets of points.

bachelors <- geocode_OSM(c("University of Ottawa", "6211 KW", "Market Street, Floor 9, San Francisco"))
geometry_bachelors <- st_as_sf(bachelors, coords = c(x = "lon", y = "lat"), crs = 4326) # Create new column using st_as_sf to store the geometry in a column 

university_names <- c("University of Ottawa", "Maastricht University", "Minerva University (SF, US)") # names that will show up on the map
geometry_bachelors$UniName <- university_names 
mapview(geometry_bachelors, zcol = "UniName") # Use mapview and specify the zcol parameter to display university names on the legend

overlay_map <- mapview(geometry_bachelors, zcol = "UniName") + mapview(geometry_geocoded_cities, zcol = "CityName") # Maps overlay

overlay_map

st_distance()

Let’s now explore another function of sf, st_distance(). The function can either be used to compute the distance between pairs of geometries, or even compute the area or the length of a set of geometries. In our example, we are using it to check which pair of places are the closest to each other and which ones are the farthest. In many cases, this information would not be as easy to see as in our case. For us now its quite easy to understand what canada is very far from China, but when dealing with large dataset, it’d be just extremely time consuming to google the distance between each place and check the nearest pair.

colnames(geometry_geocoded_cities) <- colnames(geometry_bachelors)  # setting the col names as the same otherwise it would not rbind 
rbind(geometry_geocoded_cities, geometry_bachelors)

data_sf <- rbind(geometry_geocoded_cities, geometry_bachelors)

# Compute the distance matrix
dist_matrix <- st_distance(data_sf)

# To find the pair with the smallest non-zero distance
min_distance_meters <- min(dist_matrix[upper.tri(dist_matrix)])
min_distance_km <- min_distance_meters / 1000  # Convert to kilometers

closest_pair_indices <- which(dist_matrix == min_distance_meters, arr.ind = TRUE)

# Extract the names of the closest cities
city1 <- data_sf$query[closest_pair_indices[1, 1]]
city2 <- data_sf$query[closest_pair_indices[1, 2]]

# Print the names and distance
cat("The closest places are:", city1, "and", city2, "with a distance of", min_distance_km, "km")

## The closest places are: University of Ottawa and Ottawa with a distance of 0.5556235 km

Ottawa and Ottawa University is the closest pair, expectedly so. Let’s do the same with the furthest distances now.

# To find the pair with the maximum distance
max_distance_meters <- max(dist_matrix)
max_distance_km <- max_distance_meters / 1000  # Convert to kilometers

farthest_pair_indices <- which(dist_matrix == max_distance_meters, arr.ind = TRUE)

# Extract the names of the farthest cities
city1 <- data_sf$query[farthest_pair_indices[1, 1]]
city2 <- data_sf$query[farthest_pair_indices[1, 2]]

# Print the names and distance
cat("The farthest places are:", city1, "and", city2, "with a distance of", max_distance_km, "km")

## The farthest places are: Ottawa and Tianjin with a distance of 10527.2 km

We just learned how to transform information about a location into coordinates and points on a map, and to check the distance between points. Let us no do a different kind of exercise, and see how sf can help is drawing a region’s boundaries and extract information about it.

Working with polygons

st_geometry()

Let’s now take a look at the spatial data of North Carolina counties ‘nc’ in the US. We can use ‘st_geometry’ function to generate the map of geometry composed with many polygons representing the counties.

demo(nc, ask = FALSE, echo = FALSE) # show the nc dataset
plot(st_geometry(nc)) # Plotting the geometry of the 'nc' dataset

We can also check the specific information of one county. For instance, we can check some attributes of the county ‘Alleghany’ and plot it’s polygon respectively.

nc[2,]

plot(st_geometry(nc)[2]) # plot the polygon 'Alleghany'

Overlap and Intersection with st_overlaps() and st_intersects()

St_overlaps() checks if geometry A and B “spatially overlap”. They overlap if A and B have the same dimension (o for points, 1 for lines, 2 for areas) and their interiors intersect in that dimension.

# check if there is overlap between the county (should be no overlap)
st_overlaps(nc, nc)

## although coordinates are longitude/latitude, st_overlaps assumes that they are
## planar

## Sparse geometry binary predicate list of length 100, where the
## predicate was `overlaps'
## first 10 elements:
##  1: (empty)
##  2: (empty)
##  3: (empty)
##  4: (empty)
##  5: (empty)
##  6: (empty)
##  7: (empty)
##  8: (empty)
##  9: (empty)
##  10: (empty)

# check the lengths of all polygons
lengths(st_overlaps(nc, nc))

## although coordinates are longitude/latitude, st_overlaps assumes that they are
## planar

##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The result shows that polygon 1 has 0 overlaps with other polygons and same for the rest of 9 polygons showed here. The lengths returns all 0 so we can be sure that there’s no overlap between any pair of polygons.

Now let’s look at another concept - intersections. ‘st_intersects’ returns a logical matrix indicating whether each geometry pair intersects.

# Determine the spatial relationship between different counties
intersections <- st_intersects(nc, nc)
print(intersections)

## Sparse geometry binary predicate list of length 100, where the
## predicate was `intersects'
## first 10 elements:
##  1: 1, 2, 18, 19
##  2: 1, 2, 3, 18
##  3: 2, 3, 10, 18, 23, 25
##  4: 4, 7, 56
##  5: 5, 6, 9, 16, 28
##  6: 5, 6, 8, 28
##  7: 4, 7, 8, 17
##  8: 6, 7, 8, 17, 20, 21
##  9: 5, 9, 15, 16, 24, 31
##  10: 3, 10, 12, 25, 26

nc_geom = st_geometry(nc)
nc_ints = st_intersection(nc_geom)

## although coordinates are longitude/latitude, st_intersection assumes that they
## are planar

# although coordinates are longitude/latitude, st_intersection
# assumes that they are planar
plot(nc_ints, main = "All intersections")

Color key mapping

Now we can see the color in polygons by applying the ‘SID74’ attribute, which looks at the death counts 1974-1978 in North Carolina counties.

plot(nc["SID74"], graticule = TRUE, axes = TRUE, key.pos = 4) # plot the map, coloring the polygons based on 'SID74' attribute

Now that we get the map with the death counts attribute, we could check the birth count attribute to see if similar pattern appears across the counties.

library(mapview) |> suppressPackageStartupMessages()
mapviewOptions(fgb = FALSE)
nc |> mapview(zcol = "BIR74", legend = TRUE, col.regions = sf.colors)

With the help of this interactive map, we can see ‘Mecklenburg’ has the most populations in North Carolina.

# Calculate the total population for each county in the 'nc' dataset
total_pop <- aggregate(nc[["BIR74"]], by = list(nc[["NAME"]]), FUN = sum)
colnames(total_pop) <- c("County", "Total_Population")
head(total_pop)

# Find the county with the highest population in the 'nc' dataset
county_highest_pop <- total_pop[which.max(total_pop$Total_Population), ]
print(county_highest_pop)

##         County Total_Population
## 60 Mecklenburg            21588

Let’s use ’st_centroid‘ to find out about the centroids in each region in North Carolina. Centroids tend to be the areas of easiest reach. For urban planning, infrastructure development, and resource allocation, knowing the centroids of regions can help authorities decide where to build facilities such as roads, schools, hospitals.

Important: run these two codes together or the plots will not add with each other!

# Plot the geometry of the 'nc' with a grey border
plot(st_geometry(nc), border = 'grey') 

# Plot the centroids of the 'nc' with a red color and add it to the previous plot
plot(st_geometry(st_centroid(nc)), pch = 3, col = 'red', add = TRUE)

## Warning: st_centroid assumes attributes are constant over geometries

References

This is where we got out datasets and pieces of our examples.

For the first example on shapes https://r-spatial.org/book/03-Geometries.html

For the second example on shapefiles https://www.naturalearthdata.com/downloads/

For the third excercise on mapping universities and cities https://www.paulamoraga.com/book-spatial/the-sf-package-for-spatial-vector-data.html

For the fourth excercise on polygons using North Carolina https://r-spatial.org/book/01-hello.html https://r-spatial.github.io/sf/articles/sf5.html

An Introduction to Geocoding with ‘sf’

Giulia Maria Petrilli, Julia Wu and Jennifer Estigene

2023-10-30