This document provides you with a gentle introduction to R as it will be used in the UCL undergraduate modules SECU0013 (PSM2) and SECU0050 (Adv. Crime Analysis) of the BSc in Security and Crime Science.
We assume that you have a basic knowledge of the core concepts that most programming languages share (e.g. if-else statements, variables, for-loops). In the following steps we will show you how these concepts work in R.
If you work through this notebook carefully, you will be able to master all subsequent tutorials and assignments in these modules. Equally, mastering the concepts introduced here is an important precondition for the next lectures, and we encourage you to ask for clarification in the Q&A forum or during the first tutorial if you struggle with some aspects.
We are aware that you did not sign up for a programming module, nor a full BSc in computational statistics, data science or software engineering. Nevertheless, we hope that we can show you in the current module why statistics (PSM2 module) and data science (ACA module) are core competencies not only for researchers but arguably even more so for police and intelligence analysts, policy makers and practically any other related profession.
We adhere to a rather pragmatic approach to programming in this module: it is a vehicle that enables you to solve problems that we would otherwise not be able to solve. A nice analogy for the programming aspects is that of a toolbox. You will learn skills that you can use as tools to solve most problems that you will encounter when making sense of data.
Learning a programming language is hard but also fun. The start is always slow and paved with problems/errors/bugs that are frustrating. Struggling with a programming problem is the norm and we’d be surprised if everybody solved all problems immediately. The most important part is: never shy away from asking a question.
Yes, you can and should use other introductions, too. Here’s a brief list:
To work through the 12 steps, we assume that you have RStudio installed and running on your machine.
You might want to re-use some data in your code, so rather than typing it each time, you can ‘store’ it as a variable.
Suppose you have one single number that you wish to re-use, say, the no. of people living in London (8,173,941):
# You can store that number as a variable by assigning it through the '=' operator
london_population = 8173941
print(london_population)
## [1] 8173941
Note: to run the code that you have typed in R, you can simply place the cursor at the end of the line and press CTRL+RETURN (Windows) or CMD+ENTER (Mac). You can also highlight multiple lines and run these. R will execute the code line by line.
Variable names must start with a letter and are case-sensitive:
#print(London_population)
# --> this would return an error because this variable does not exist
print(london_population) #returns the value assigned to this variable
## [1] 8173941
Note that the # in the R code allows you to comment a section. You will need this when you deliver your assignments that contain R code.
Suppose we have a series of numbers that represent the (made up) number of snatching crimes in Camden in 5 months: 16, 32, 40, 12, 8
# We can store this sequence as a vector:
snatching_crimes_camden = c(16, 32, 40, 12, 8)
print(snatching_crimes_camden)
## [1] 16 32 40 12 8
Note how we use the c() function to combine values into a vector here.
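As a minimal sketch of what you can do with a vector once it is stored (the name burglaries_example and its values are made up for illustration and are not part of the module data), you can count its elements and pick out single values by their position:

```r
# hypothetical example vector with made-up values
burglaries_example = c(5, 9, 2, 7)

# number of elements in the vector
length(burglaries_example)
## [1] 4

# access a single element by its position (R starts counting at 1)
burglaries_example[1]
## [1] 5

# access several elements at once by passing a vector of positions
burglaries_example[c(2, 4)]
## [1] 9 7
```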
Task:
Create a variable that is equal to your age in years and call that variable my_age:
#type+run your code here
It’s important to know how to find help. There are two ways through which you will be able to get help for most of your R problems:
If you type ? followed by the name of a function or package in R, it will bring up the corresponding documentation.
#for the c() function above, you'd use:
?c
#for the 'mean' function to calculate the average, you'd use
?mean
#if you only have a vague idea of the term you're looking for, you'd use the double ??:
??confidence
#or
??barplot
You can use R as a calculator using numbers directly or variables that you assigned previously:
#simple addition
2-3+42
## [1] 41
#multiplication
23*67
## [1] 1541
#fractions
3/4
## [1] 0.75
#exponentiation
2^9 #reads: 2 to the power of 9
## [1] 512
#square roots
sqrt(4) #reads: square root of 4
## [1] 2
Using variables:
a = 42
b = 2
a/b
## [1] 21
#assigning variables to calculated values
c = a/b
c
## [1] 21
Making use of R’s vectorised (= relying on vectors as the core building block) structure, you can divide multiple values by one other value:
#dividing the five counts of snatching crimes by the population count of Camden
camden_population = 253400
snatching_crimes_camden/camden_population
## [1] 6.314128e-05 1.262826e-04 1.578532e-04 4.735596e-05 3.157064e-05
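Such small numbers are hard to read, so rates are often expressed per 100,000 inhabitants. The following is a small sketch with made-up counts and a made-up population size (counts_example and population_example are hypothetical names); it shows that one arithmetic expression is applied to every element of the vector:

```r
# made-up monthly counts and a made-up population size
counts_example = c(10, 20, 30)
population_example = 200000

# the division and the multiplication are applied to each element in turn
rate_per_100k = counts_example / population_example * 100000
rate_per_100k
## [1]  5 10 15
```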
Task:
Use the my_age variable and calculate your age in seconds (assuming each year has 365 days):
#type+run your code here...
For the purpose of this module, we will focus on
#single values can be numeric, characters or boolean
my_numeric = 4
my_character = "this is a character string"
my_boolean = TRUE
numeric_vector = c(0,1,2,3,4,5,6,7,8,9)
character_vector = c("word 1", "word 2", "word 3")
boolean_vector = c(TRUE, FALSE, FALSE, TRUE)
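If you are unsure which type a value has, the class() function tells you. A short sketch (note that R calls boolean values "logical"; the name mixed_vector is made up for illustration):

```r
class(4)
## [1] "numeric"
class("this is a character string")
## [1] "character"
class(TRUE)
## [1] "logical"

# a vector can only hold one type: mixing types converts everything to character
mixed_vector = c(1, "word", TRUE)
class(mixed_vector)
## [1] "character"
```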
The data frame is the most important data structure in R for our module and we will use it throughout.
“[A dataframe] is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.” from: Tidy Data
Thus, a dataframe can be used to represent multiple values. More specifically, it can represent multiple variables for multiple observations.
#this will create a dataframe of 10 observations with three variables:
# - an identifier of financial trading platforms: A, B, C, D, ... J
# - count of fraudulent transactions in 2018: numerical
# - count of all transactions in 2018: numerical
trading_data = data.frame('identifier' = LETTERS[1:10]
, 'cnt_fraudulent' = rpois(n = 10
, lambda = 30
)
, 'cnt_all' = round(rnorm(n = 10
, mean = 100000
, sd = 30000
)
, 0)
)
trading_data
# you do not have to master the functions used here (rnorm and rpois generate random numbers; LETTERS contains all upper-case letters in alphabetical order)
You can see that each row represents an observation that has multiple variables.
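Because rpois and rnorm draw random numbers, the values in your trading_data will most likely differ from the ones printed in this notebook. A small sketch, assuming you want random numbers that are the same on every run, is to call set.seed() right before drawing (the seed value 42 is an arbitrary choice):

```r
# setting the same seed before each draw produces the same random numbers
set.seed(42)
first_draw = rpois(n = 5, lambda = 30)

set.seed(42)
second_draw = rpois(n = 5, lambda = 30)

# both draws are identical because they started from the same seed
identical(first_draw, second_draw)
## [1] TRUE
```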
my_list = list() #this creates an empty list
#now we populate the list manually:
my_list[[1]] = boolean_vector #this places the boolean_vector variable (a boolean vector created above) in the first position of the empty list
my_list[[2]] = character_vector
my_list[[3]] = trading_data
#take a look at the list:
my_list
## [[1]]
## [1] TRUE FALSE FALSE TRUE
##
## [[2]]
## [1] "word 1" "word 2" "word 3"
##
## [[3]]
## identifier cnt_fraudulent cnt_all
## 1 A 30 117278
## 2 B 24 157447
## 3 C 23 61221
## 4 D 25 23100
## 5 E 35 112613
## 6 F 34 124852
## 7 G 32 80302
## 8 H 30 113544
## 9 I 18 129611
## 10 J 28 123071
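To get elements back out of a list, you'd use the same double square brackets. A minimal sketch on a small, made-up list (example_list is a hypothetical name):

```r
# a list holding two vectors of different types
example_list = list(c(TRUE, FALSE), c("a", "b", "c"))

# the second element of the list (a character vector)
example_list[[2]]
## [1] "a" "b" "c"

# the third value inside the second element
example_list[[2]][3]
## [1] "c"

# the number of elements in the list
length(example_list)
## [1] 2
```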
Rather than creating a data.frame manually as we did above, you’d normally load data into R and work with the data once imported.
The data import/export is very important because it allows you to use data from someone else and to share your own data.
We’ll focus on two common types of data import:
A .csv file (comma-separated values file): this file type is often generated by government websites.
In the /data directory, you can find a file called crime_data_july_mps.csv. These data are all crimes recorded by the Metropolitan Police in July 2018, taken from police.uk.
We will read this file into R as a dataframe as follows:
my_first_imported_data = read.csv(file = "./data/crime_data_july_mps.csv" #this points R to the file that you want to read
, header = T #this specifies whether or not there are variable names in the dataframe
)
Note: the file = ... argument assumes you are in the tutorials folder when working through this code. If you are not, R will not find the file. Alternatively, you can create your own folder with this R Notebook file and create a sub-folder named data with the crime_data_july_mps.csv file in it.
#Have a look at the just imported data
head(my_first_imported_data) #the head command displays only the first 6 rows by default
An .RData file: this is one of R’s own data exchange formats that is easy to use if you exchange or save data that is used in R only.
You can load the same dataset with the load command:
load(file = "./data/crime_data_july_mps.RData")
head(crime_data_july)
#note that this automatically creates the dataframe called 'crime_data_july' in your workspace
You can see on your computer that the .csv file is almost 3x as big as the same data in .RData format.
Task:
crime_data_july:
#Hint: take a look at the help file for the head() function to see what you can specify there
.RData file. Load the file called “crime_data_jul_aug_sep_mps.RData”. This dataframe contains the crime data for the months July 2018, August 2018 and September 2018.
#type+run your code here
#Hint: you'd need to modify this code load(file = "./data/crime_data_july_mps.RData")
#Once loaded, take a look at the first ten rows of the new dataframe called crime_data_jul_aug_sep
Remember how dataframes are structured? In essence, it’s all about rows and columns, with each row being an observation and each column being a variable (or attribute) of that observation.
Using this notion, we can exploit R’s [row, column] notation: using square brackets, we can index a dataframe using the two dimensions ROWS and COLUMNS, separated by a comma.
crime_data_july[1, 1] #first row, first column
## [1] "e9fe81ec7a6f5d2a80445f04be3d7e92831dbf3090744ebf94c46f359ca94854"
crime_data_july[2, 4] #second row, fourth column
## [1] 51.89314
crime_data_july[30:50, 3:5] #rows 30 to 50, columns 3 to 5
Note that the : means “to” so that you can indicate a range from one numerical value to the other (e.g. from the first column to the fifth).
#We can also look at all observations of a column by leaving the ROW dimension empty:
head(crime_data_july[, 5]
, n = 50) #all observations of the fifth column
## [1] "Other theft" "Other crime"
## [3] "Violence and sexual offences" "Anti-social behaviour"
## [5] "Anti-social behaviour" "Anti-social behaviour"
## [7] "Anti-social behaviour" "Anti-social behaviour"
## [9] "Criminal damage and arson" "Criminal damage and arson"
## [11] "Drugs" "Drugs"
## [13] "Drugs" "Other theft"
## [15] "Other theft" "Other theft"
## [17] "Possession of weapons" "Possession of weapons"
## [19] "Theft from the person" "Theft from the person"
## [21] "Vehicle crime" "Vehicle crime"
## [23] "Vehicle crime" "Vehicle crime"
## [25] "Violence and sexual offences" "Violence and sexual offences"
## [27] "Violence and sexual offences" "Violence and sexual offences"
## [29] "Violence and sexual offences" "Violence and sexual offences"
## [31] "Violence and sexual offences" "Violence and sexual offences"
## [33] "Violence and sexual offences" "Violence and sexual offences"
## [35] "Violence and sexual offences" "Violence and sexual offences"
## [37] "Violence and sexual offences" "Anti-social behaviour"
## [39] "Anti-social behaviour" "Anti-social behaviour"
## [41] "Criminal damage and arson" "Other theft"
## [43] "Violence and sexual offences" "Violence and sexual offences"
## [45] "Violence and sexual offences" "Violence and sexual offences"
## [47] "Violence and sexual offences" "Anti-social behaviour"
## [49] "Anti-social behaviour" "Anti-social behaviour"
#And we can look at all variables (columns) of a single observation by leaving the COLUMN dimension empty:
crime_data_july[111, ] #all variables of the 111th row
You can also access columns by their variable name using the $ operator:
#to see all column names, we can use the names() function
names(crime_data_july)
## [1] "crime.id" "month" "longitude" "latitude" "crime.type"
#if we want to select only the longitude variable, we can access it directly by its name:
head(crime_data_july$longitude) #we use head() to avoid excessive output printing on the screen
## [1] 0.774271 -1.007293 0.744706 0.134947 0.137065 0.148434
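The different ways of indexing can be combined. The following sketch uses a small, made-up dataframe (toy_crimes is a hypothetical name and not part of the module data) so that you can verify each result by eye:

```r
# a small made-up dataframe with four observations and two variables
toy_crimes = data.frame(crime.type = c("Burglary", "Drugs", "Robbery", "Drugs"),
                        latitude = c(51.50, 51.52, 51.49, 51.55),
                        stringsAsFactors = FALSE)

# second row, first column
toy_crimes[2, 1]
## [1] "Drugs"

# rows 2 to 3 of the column called "latitude" (columns can be indexed by name, too)
toy_crimes[2:3, "latitude"]
## [1] 51.52 51.49

# the $ operator returns a vector, which can be indexed by position
toy_crimes$latitude[4]
## [1] 51.55
```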
Task:
Index the crime_data_july dataframe to display the rows 100 to 300 and the 4th and 5th column:
#type+run your code here
Index the crime_data_july dataframe to display the rows 10 to 40 and 1000 to 2000, and the columns 1, 3 and 5.
#type+run your code here
#Hint: you can specify the ROW and COLUMN dimension as vectors:
#crime_data_july[c(1,2,3), c(2,3,4)] is equal to crime_data_july[1:3, 2:4]
There are several ways in which you may want to explore a dataframe. We will briefly walk through the most useful ones:
nrow(NAME_OF_DATAFRAME)
ncol(NAME_OF_DATAFRAME)
dim(NAME_OF_DATAFRAME)
length(COLUMN)
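The four functions above can be sketched on a small, made-up dataframe (toy_df is a hypothetical name): nrow() returns the number of rows, ncol() the number of columns, dim() both at once, and length() the number of values in a single column:

```r
# a made-up dataframe with 4 rows and 2 columns
toy_df = data.frame(id = 1:4, value = c(2.5, 3.1, 4.8, 1.9))

nrow(toy_df) #number of rows (observations)
## [1] 4
ncol(toy_df) #number of columns (variables)
## [1] 2
dim(toy_df) #rows and columns at once
## [1] 4 2
length(toy_df$value) #number of values in one column
## [1] 4
```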
Other useful ways to get a first glimpse at the data are summary and str:
summary(crime_data_july) #this gives you basic information for each column, with statistical summaries for numerical values
## crime.id month longitude latitude
## Length:95677 Length:95677 Min. :-5.4827 Min. :50.21
## Class :character Class :character 1st Qu.:-0.2021 1st Qu.:51.47
## Mode :character Mode :character Median :-0.1152 Median :51.52
## Mean :-0.1217 Mean :51.51
## 3rd Qu.:-0.0352 3rd Qu.:51.55
## Max. : 1.4236 Max. :54.99
## NA's :1100 NA's :1100
## crime.type
## Length:95677
## Class :character
## Mode :character
##
##
##
##
str(crime_data_july) #shows the structure: the type of each column and its first few values
## 'data.frame': 95677 obs. of 5 variables:
## $ crime.id : chr "e9fe81ec7a6f5d2a80445f04be3d7e92831dbf3090744ebf94c46f359ca94854" "076b796ba1e1ba3f69c9144e2aa7a7bc85b61d51bf7a5966fa1a45fecb1c6aca" "163e996d58995cf87d14f15711fbd87052681919f02029af4739c2eb88be7f5e" "" ...
## $ month : chr "2018-07" "2018-07" "2018-07" "2018-07" ...
## $ longitude : num 0.774 -1.007 0.745 0.135 0.137 ...
## $ latitude : num 51.1 51.9 52 51.6 51.6 ...
## $ crime.type: chr "Other theft" "Other crime" "Violence and sexual offences" "Anti-social behaviour" ...
Often you may want to cross-tabulate the data, for example to count how many occurrences of a specific crime type there are in the current data. In R, you’d use the table() function for this:
#to count the occurrences of each level of a variable, we can use table().
#here we want to count how many occurrences of each crime type we have in the current data
table(crime_data_july$crime.type)
##
## Anti-social behaviour Bicycle theft
## 21197 2294
## Burglary Criminal damage and arson
## 6321 5028
## Drugs Other crime
## 2416 909
## Other theft Possession of weapons
## 10411 597
## Public order Robbery
## 4956 2956
## Shoplifting Theft from the person
## 3547 3750
## Vehicle crime Violence and sexual offences
## 9102 22193
There are cases where you want to count occurrences split by another variable, for example the number of crime types per month. We can look at this using the table command for the crime_data_jul_aug_sep dataframe.
#run the next line if you have not yet loaded the crime_data_jul_aug_sep dataframe
load(file = "./data/crime_data_jul_aug_sep_mps.RData")
#first look at the structure
str(crime_data_jul_aug_sep)
## 'data.frame': 270869 obs. of 5 variables:
## $ crime.id : chr "e9fe81ec7a6f5d2a80445f04be3d7e92831dbf3090744ebf94c46f359ca94854" "076b796ba1e1ba3f69c9144e2aa7a7bc85b61d51bf7a5966fa1a45fecb1c6aca" "163e996d58995cf87d14f15711fbd87052681919f02029af4739c2eb88be7f5e" "" ...
## $ month : chr "2018-07" "2018-07" "2018-07" "2018-07" ...
## $ longitude : num 0.774 -1.007 0.745 0.135 0.137 ...
## $ latitude : num 51.1 51.9 52 51.6 51.6 ...
## $ crime.type: chr "Other theft" "Other crime" "Violence and sexual offences" "Anti-social behaviour" ...
summary(crime_data_jul_aug_sep)
## crime.id month longitude latitude
## Length:270869 Length:270869 Min. :-5.483 Min. :50.21
## Class :character Class :character 1st Qu.:-0.202 1st Qu.:51.47
## Mode :character Mode :character Median :-0.116 Median :51.52
## Mean :-0.121 Mean :51.51
## 3rd Qu.:-0.034 3rd Qu.:51.55
## Max. : 1.737 Max. :54.99
## NA's :3275 NA's :3275
## crime.type
## Length:270869
## Class :character
## Mode :character
##
##
##
##
#let's count how many crimes (in total) there are per month
table(crime_data_jul_aug_sep$month)
##
## 2018-07 2018-08 2018-09
## 95677 88864 86328
#what we often want is to use two variables to count one variable split by another.
#we can simply use multiple arguments in the table() function:
table(crime_data_jul_aug_sep$month, crime_data_jul_aug_sep$crime.type)
##
## Anti-social behaviour Bicycle theft Burglary
## 2018-07 21197 2294 6321
## 2018-08 19881 2093 6519
## 2018-09 17755 2200 6108
##
## Criminal damage and arson Drugs Other crime Other theft
## 2018-07 5028 2416 909 10411
## 2018-08 4584 3135 860 9738
## 2018-09 4471 2779 902 9701
##
## Possession of weapons Public order Robbery Shoplifting
## 2018-07 597 4956 2956 3547
## 2018-08 533 3952 2532 3596
## 2018-09 533 4033 2625 3443
##
## Theft from the person Vehicle crime Violence and sexual offences
## 2018-07 3750 9102 22193
## 2018-08 3317 9260 18864
## 2018-09 2972 9806 19000
#this returns the number of crimes per crime type and month
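The result of table() can be stored and indexed like a dataframe, using the row and column labels. A minimal sketch with made-up values (toy_month, toy_type and toy_table are hypothetical names):

```r
# two made-up variables of equal length: one entry per crime
toy_month = c("2018-07", "2018-07", "2018-08", "2018-08", "2018-08")
toy_type = c("Drugs", "Burglary", "Drugs", "Drugs", "Burglary")

# cross-tabulate month (rows) by crime type (columns)
toy_table = table(toy_month, toy_type)

# the count for one specific combination, accessed by its labels
toy_table["2018-08", "Drugs"]
## [1] 2

# the sum over all cells equals the number of crimes
sum(toy_table)
## [1] 5
```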
Just as you can load data quickly into R, you can also save new or modified dataframes.
Suppose you are interested in “Violence and sexual offences” only. Let’s create a new dataframe that contains only these occurrences. We can use R’s [ROW, COLUMN] structure and simply add a condition, namely that crime_data_jul_aug_sep$crime.type must be equal to Violence and sexual offences. In R terms, we’d say: select all rows where the column crime_data_jul_aug_sep$crime.type equals (expressed as ==) Violence and sexual offences. Let’s call this new dataframe “violence_and_sexual_offences”.
#The correct R notation for this is:
violence_and_sexual_offences = crime_data_jul_aug_sep[crime_data_jul_aug_sep$crime.type == "Violence and sexual offences", ] #note that the COLUMN dimension is empty since we want all columns to be selected
#we can check whether this worked:
table(violence_and_sexual_offences$crime.type)
##
## Violence and sexual offences
## 60057
#you can see that we now only have incidents of violence and sexual offences
Now let’s store this new dataframe:
As a .csv file: similar to read.csv, we can now use write.csv to store the new dataframe
write.csv(x = violence_and_sexual_offences #this is our newly created dataframe
, file = './data/new_dataframe_violence_sexual_offences.csv' #this is the file name of the csv file
)
As an .RData file: the counterpart to load is save
save(violence_and_sexual_offences #our dataframe
, file = './data/new_dataframe_violence_sexual_offences.RData' #our filename
)
You can check in the data folder that you have just created two new files on your computer.
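To convince yourself that saving and loading really return the same data, you can try a full round trip. The sketch below is an illustration only: it uses a made-up dataframe (example_df) and temporary files created with tempfile() rather than the data folder. It also sets row.names = FALSE, because write.csv otherwise stores the row numbers as an additional first column that shows up again when the file is read back in:

```r
# a small made-up dataframe
example_df = data.frame(id = 1:3,
                        crime.type = c("Drugs", "Burglary", "Robbery"),
                        stringsAsFactors = FALSE)

# temporary file paths (these files are removed when the R session ends)
csv_path = tempfile(fileext = ".csv")
rdata_path = tempfile(fileext = ".RData")

# .csv round trip: write the dataframe and read it back in
write.csv(x = example_df, file = csv_path, row.names = FALSE)
from_csv = read.csv(file = csv_path, header = TRUE)
names(from_csv)
## [1] "id"         "crime.type"

# .RData round trip: save the object, remove it, and load it again
save(example_df, file = rdata_path)
rm(example_df)
load(file = rdata_path) #this restores the object under its original name
exists("example_df")
## [1] TRUE
```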
In principle, a function is a small sequence of computations that takes an input (e.g. a column) and returns an output (e.g. the mean of the column).
Let’s use the trading_data dataframe that we created at the beginning to use functions in R.
Some of the most frequently used functions are also the easiest to handle:
#the mean of the count of fraudulent transactions
mean(trading_data$cnt_fraudulent)
## [1] 27.9
#the minimum of the count of fraudulent transactions
min(trading_data$cnt_fraudulent)
## [1] 18
#the max of the count of fraudulent transactions
max(trading_data$cnt_fraudulent)
## [1] 35
#the standard deviation of the count of fraudulent transactions
sd(trading_data$cnt_fraudulent)
## [1] 5.363457
Task:
Calculate the range, mean, and variance for the trading_data$cnt_all column:
#type+run your code here
#Hint: you might need to use R's help or Google to find the function corresponding to each output
You can also apply functions indirectly by using R as a calculator. Suppose you wanted to calculate the rate of fraudulent transactions per 1000 transactions for each trading platform.
Using the dataframe structure, you can directly calculate:
(trading_data$cnt_fraudulent/trading_data$cnt_all)*1000
## [1] 0.2558025 0.1524322 0.3756881 1.0822511 0.3107989 0.2723224 0.3984957
## [8] 0.2642148 0.1388771 0.2275109
Task:
Which platform has the highest rate of fraudulent transactions?
#type+run your code here
#Hint: use a core R function to retrieve the maximum value.
R knows two ways to express the common if-else statement.
or…
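A minimal sketch of the two forms, assuming the two ways meant here are the if(){...} else {...} construct for a single condition and the vectorised ifelse() function for a whole vector of values (all variable names are made up for illustration):

```r
# form 1: if ... else checks ONE condition and runs one of two code blocks
single_value = 5
if(single_value > 3){
  size_label = "large"
} else {
  size_label = "small"
}
size_label
## [1] "large"

# form 2: ifelse() checks the condition for EVERY element of a vector
several_values = c(1, 5, 2, 8)
size_labels = ifelse(several_values > 3, "large", "small")
size_labels
## [1] "small" "large" "small" "large"
```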
Suppose you want to create a new variable that represents whether the crime occurred in the peak of summer or in the late summer. You could say that July and August are peak summer, whereas September is late summer.
Let’s create a new variable called “summer_season” which can take the value peak (for July and August) or late (for Sept.). Expressed in an if…else statement, we would say: IF the month is July or August THEN assign the value peak ELSE assign the value late.
In R we can use the ifelse() function to do this intuitively:
crime_data_jul_aug_sep$summer_season = ifelse(crime_data_jul_aug_sep$month == '2018-07' | crime_data_jul_aug_sep$month == '2018-08' #if
, 'peak' #then
, 'late' #else
)
#There are a few things happening here:
#1. note that we can create new variables on the fly by using the $ notation.
#2. the ifelse statement contains three arguments: IF, THEN and ELSE
# the first argument states the condition (here "crime_data_jul_aug_sep$month == '2018-07' | crime_data_jul_aug_sep$month == '2018-08'")
# the second argument states what the new value should be if the condition is true
# the third argument states what should happen if the condition were false
Note also that we used the OR operator | to say “IF the month is July or August”.
We can check whether the creation of the new variable worked:
table(crime_data_jul_aug_sep$month, crime_data_jul_aug_sep$summer_season)
##
## late peak
## 2018-07 0 95677
## 2018-08 0 88864
## 2018-09 86328 0
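When a condition compares one column against several values, chaining many | operators becomes hard to read. An equivalent way of writing the condition, sketched here on a made-up vector (toy_month is a hypothetical name), is the %in% operator, which checks for each element whether it occurs in a set of values:

```r
# made-up months, one per crime
toy_month = c("2018-07", "2018-08", "2018-09")

# TRUE for every element that is either July or August
toy_month %in% c("2018-07", "2018-08")
## [1]  TRUE  TRUE FALSE

# used as the condition inside ifelse()
toy_season = ifelse(toy_month %in% c("2018-07", "2018-08"), "peak", "late")
toy_season
## [1] "peak" "peak" "late"
```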
Finally, we can also create conditionals across multiple columns. Say we wanted to call all burglaries that happen in the peak summer months “holiday_burglaries” and all other crimes “other”:
crime_data_jul_aug_sep$holiday_indicator = ifelse(crime_data_jul_aug_sep$summer_season == 'peak' & crime_data_jul_aug_sep$crime.type == 'Burglary' #if
, 'holiday_burglaries' #then
, 'other' #else
)
head(crime_data_jul_aug_sep)
Note that we used the AND operator & here to state that two conditions must be true: crime_data_jul_aug_sep$summer_season must be == 'peak' AND crime_data_jul_aug_sep$crime.type must be == 'Burglary'.
Task:
Write an ifelse statement to create a new variable called ‘property_crime’. This variable should be either TRUE or FALSE. It should be true if the crime.type is one where a product/possession/property is taken, and false in all other cases.
#type+run your code here
#Hint: you might need to combine several crime types
Sometimes the core R functions (e.g., mean, sd) might not suffice to solve a particular problem. In these cases, you can write your own custom function.
Let’s suppose you want to calculate the distance between each crime location and UCL’s Department of Security and Crime Science at 35 Tavistock Square. Since the formula for the distance between two long/lat coordinates requires additional background, we’ll stick to the differences on the x-axis (longitude) and y-axis (latitude) separately. That is how far east/west and north/south the location is from 35 Tavistock Square.
The latitude and longitude coordinates of 35 Tavistock Square are 51.525066 and -0.129779 and we’d want a customised function that can perform that action generically (i.e. without rewriting it each time).
Specifically, we’d want a function that lets us determine whether we’re interested in latitude or longitude. The algorithm (= sequence of discrete steps or calculations) within the function should take the correct input column, calculate the difference to the 35 Tavistock Square coordinates and return an output column with that difference.
In R, we’d specify such a function as follows:
get_coordinate_difference = function(which_coordinate, absolute_value){ #here we name the function and assign function arguments
#1. we use the placeholder "which_coordinate" to determine whether it's longitude or latitude; we'd want this argument to be passed as a string ('long' or 'lat')
#2. we use the boolean placeholder absolute_value to set whether we want the non-negative absolute difference or the signed difference
#this is the function body
ucl_lat = 51.525066 #we set the target location coordinates
ucl_long = -0.129779
if(which_coordinate == 'long'){ #here we use an if...else statement to check whether the function parameter which_coordinate is equal to 'long'
if(absolute_value == T){ #now we check whether the user specified the function parameter absolute_value as TRUE
output_variable = abs(ucl_long - crime_data_jul_aug_sep$longitude) #the abs() function returns the absolute value
} else if(absolute_value == F){
output_variable = ucl_long - crime_data_jul_aug_sep$longitude
}
} else if(which_coordinate == 'lat'){
if(absolute_value == T){
output_variable = abs(ucl_lat - crime_data_jul_aug_sep$latitude)
} else if(absolute_value == F){
output_variable = ucl_lat - crime_data_jul_aug_sep$latitude
}
#note that we specified an output variable called output_variable
} else { #we use this to catch errors in the function specification
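stop('The function arguments are not valid, please check them!') #stop() halts the function with an error, so return() below is never reached without an output_variable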
}
#We now want this function to return the output variable.
return(output_variable)
}
You can try this function now:
crime_data_jul_aug_sep$long_difference = get_coordinate_difference(which_coordinate = 'long'
, absolute_value = T)
#we directly created a new variable and used our own function to return the longitude difference
head(crime_data_jul_aug_sep)
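Note that get_coordinate_difference reads the crime_data_jul_aug_sep dataframe directly from your workspace, so it only works with that one dataframe. A possible alternative, sketched here under the assumption that you'd rather pass the coordinates in as an argument (the name get_difference_example and its arguments are hypothetical), could look like this:

```r
# coordinates: a numeric vector of longitudes or latitudes
# reference: the coordinate of the target location on the same axis
# absolute_value: TRUE for the non-negative difference, FALSE for the signed difference
get_difference_example = function(coordinates, reference, absolute_value = TRUE){
  difference = reference - coordinates
  if(absolute_value){
    difference = abs(difference)
  }
  return(difference)
}

# usage with two made-up latitudes and the latitude of 35 Tavistock Square
example_result = get_difference_example(coordinates = c(51.50, 51.55)
                                        , reference = 51.525066
                                        , absolute_value = TRUE)
example_result
```

With this design the same function can be re-used for any column of any dataframe, for example by passing crime_data_jul_aug_sep$latitude as the coordinates argument.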
Task:
Use the function to create a new column that contains the latitude difference.
#type+run your code here
Now find which crime is closest to 35 Tavistock Square in latitude and longitude.
#type+run your code here
#Hint: you can use ?which.min to solve this problem (the closest crime has the smallest absolute difference)
Which type of crime is closest to the department in latitude?
#type+run your code here
#Hint: use the indexing we learned above in Step 6
Advanced task:
Write your own function that returns NW (for North-West), NE (North-East), SW (South-West), SE (South-East) when the crime location is located in the North-West of the department, North-East of the department, etc.
#type+run your code here
Finally, the last core concept we will cover to have you prepared for using R in your quantitative analysis career is loops. Specifically: for-loops.
The concept of the for-loop in its simplest form is to repeat a computation (e.g. printing a name) a given number of times. This is useful and necessary if you want to log (i.e. print) output to the console, for example, or if you want to iterate over a number of columns (e.g. calculating the mean for each of 100 columns).
Ultimately, mastering loops is tricky but a massive time-saver and one of the most important tools for statistical computation and data science.
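As a first, minimal sketch of a for-loop (the vector monthly_counts and the variable running_total are made up for illustration), the following code visits each element of a vector in turn, adds it to a running total and prints the intermediate result:

```r
# made-up monthly crime counts
monthly_counts = c(16, 32, 40)

# start the running total at zero
running_total = 0

# i takes the values 1, 2, 3 one after the other
for(i in 1:length(monthly_counts)){
  running_total = running_total + monthly_counts[i]
  print(paste("month", i, "- total so far:", running_total))
}
## [1] "month 1 - total so far: 16"
## [1] "month 2 - total so far: 48"
## [1] "month 3 - total so far: 88"
```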
Suppose the security management at UCL asks you to determine whether each crime was close or not close to the department. A colleague of yours says that anything that is within 200 metres is close and needs attention.
As an estimation, let’s convert 1.00 degree of latitude to 111,111 metres (see here).
We now want to perform the following steps:
The first part can be done by using the 1.00 degree = 111111m conversion on the absolute difference in latitude:
#create the variable (if not already done above)
crime_data_jul_aug_sep$lat_difference = get_coordinate_difference(which_coordinate = 'lat'
, absolute_value = T)
head(crime_data_jul_aug_sep)
#create new variable that expresses the distance in metres
crime_data_jul_aug_sep$lat_in_m = crime_data_jul_aug_sep$lat_difference * 111111
head(crime_data_jul_aug_sep)
summary(crime_data_jul_aug_sep$lat_in_m)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 1867 4445 5794 8300 384934 3275
#we can now use table() to see how many crimes happened within 200m (latitude)
table(crime_data_jul_aug_sep$lat_in_m < 200)
##
## FALSE TRUE
## 260941 6653
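The conversion and the 200-metre check can be verified by hand on a few made-up latitude differences (lat_differences_example and the values in it are hypothetical and chosen for easy arithmetic):

```r
# the approximate conversion used above: 1 degree of latitude = 111,111 metres
metres_per_degree = 111111

# made-up absolute latitude differences in degrees
lat_differences_example = c(0.0005, 0.001, 0.002, 0.01)

# convert all differences to metres in one step
lat_in_metres_example = lat_differences_example * metres_per_degree

# which of them are closer than 200 metres?
lat_in_metres_example < 200
## [1]  TRUE  TRUE FALSE FALSE

# TRUE counts as 1, so sum() gives the number of close crimes
sum(lat_in_metres_example < 200)
## [1] 2
```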
You may notice that some distance-in-m values seem odd. In fact, the police.uk data are location-anonymised (i.e. the long, lat pairs are not the original ones from the crime). However, there is decent accuracy even for small geographical areas (see Tompson et al., 2014).
Now we want to show the very closest crimes as printed on the screen. We’ll use a for-loop for that:
#since we don't want to print all 6653 crimes, we select the first 200 crimes for illustration purposes
sub_selection = crime_data_jul_aug_sep[crime_data_jul_aug_sep$lat_in_m < 200, ] #first create a new data.frame with the crimes that are closer than 200m
sub_selection = sub_selection[1:200, ] # select only the first 200 rows
dim(sub_selection)
## [1] 200 10
head(sub_selection)
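One caveat when selecting rows with a condition: lat_in_m contains missing values (see the NA's entry in the summary above). For a missing value the comparison returns NA rather than TRUE or FALSE, and square-bracket indexing keeps such rows as rows filled with NA. Depending on where the missing values sit in your data, sub_selection may therefore contain such rows. A sketch on a made-up dataframe (toy_distances is a hypothetical name) shows the effect and one possible way around it, wrapping the condition in which():

```r
# made-up data with one missing distance
toy_distances = data.frame(crime.type = c("Drugs", "Burglary", "Robbery", "Drugs"),
                           lat_in_m = c(150, NA, 90, 400),
                           stringsAsFactors = FALSE)

# the plain condition keeps the row with the missing value as a row of NAs
with_na_rows = toy_distances[toy_distances$lat_in_m < 200, ]
nrow(with_na_rows)
## [1] 3

# which() returns only the positions where the condition is TRUE
without_na_rows = toy_distances[which(toy_distances$lat_in_m < 200), ]
nrow(without_na_rows)
## [1] 2
```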
Let’s print some output to the screen in a for loop. In natural language, we want the code to do the following:
#in R, this would look like this:
for(every_crime in 1:nrow(sub_selection)){ #here we have to state the scope of "every_crime". In this case, we want the code to test all crimes that we have in the data.
#Note that the placeholder "every_crime" can be replaced with anything you want; a common choice is "i"
#we can print to the screen by pasting elements together:
#check: ?paste
print(paste(sub_selection$crime.type[every_crime], "-->", round(sub_selection$lat_in_m[every_crime], 2), "metres", sep=" ")) #this prints the crime type and the distance (rounded off to 2 decimals)
}
## [1] "Burglary --> 0.22 metres"
## [1] "Burglary --> 15.78 metres"
## [1] "Burglary --> 34.78 metres"
## [1] "Criminal damage and arson --> 15.78 metres"
## [1] "Drugs --> 178.56 metres"
## [1] "Public order --> 0.22 metres"
## [1] "Vehicle crime --> 176.33 metres"
## [1] "Vehicle crime --> 0.22 metres"
## [1] "Violence and sexual offences --> 15.78 metres"
## [1] "Anti-social behaviour --> 172.89 metres"
## [1] "Burglary --> 147.78 metres"
## [1] "Other theft --> 147.78 metres"
## [1] "Public order --> 172.89 metres"
## [1] "Public order --> 172.89 metres"
## [1] "Public order --> 172.89 metres"
## [1] "Violence and sexual offences --> 172.89 metres"
## [1] "Violence and sexual offences --> 147.78 metres"
## [1] "Violence and sexual offences --> 192 metres"
## [1] "Violence and sexual offences --> 192 metres"
## [1] "Anti-social behaviour --> 18.44 metres"
## [1] "Anti-social behaviour --> 198.11 metres"
## [1] "Burglary --> 118.67 metres"
## [1] "Drugs --> 71 metres"
## [1] "Drugs --> 71 metres"
## [1] "Other theft --> 138.78 metres"
## [1] "Vehicle crime --> 138.78 metres"
## [1] "Vehicle crime --> 138.78 metres"
## [1] "Violence and sexual offences --> 118.67 metres"
## [1] "Anti-social behaviour --> 18.78 metres"
## [1] "Anti-social behaviour --> 18.78 metres"
## [1] "Bicycle theft --> 124.44 metres"
## [1] "Bicycle theft --> 124.44 metres"
## [1] "Burglary --> 18.78 metres"
## [1] "Other theft --> 18.78 metres"
## [1] "Other theft --> 15.44 metres"
## [1] "Other theft --> 2 metres"
## [1] "Other theft --> 18.78 metres"
## [1] "Other theft --> 2 metres"
## [1] "Other theft --> 1 metres"
## [1] "Other theft --> 18.78 metres"
## [1] "Other theft --> 15.44 metres"
## [1] "Public order --> 2.44 metres"
## [1] "Public order --> 15.44 metres"
## [1] "Shoplifting --> 18.78 metres"
## [1] "Theft from the person --> 124.44 metres"
## [1] "Theft from the person --> 2 metres"
## [1] "Theft from the person --> 18.78 metres"
## [1] "Theft from the person --> 124.44 metres"
## [1] "Vehicle crime --> 18.78 metres"
## [1] "Vehicle crime --> 30.33 metres"
## [1] "Violence and sexual offences --> 18.78 metres"
## [1] "Violence and sexual offences --> 18.78 metres"
## [1] "Violence and sexual offences --> 66.56 metres"
## [1] "Violence and sexual offences --> 124.44 metres"
## [1] "Violence and sexual offences --> 124.44 metres"
## [1] "Anti-social behaviour --> 191.67 metres"
## [1] "Anti-social behaviour --> 191.67 metres"
## [1] "Other theft --> 87.44 metres"
## [1] "Other theft --> 191.67 metres"
## [1] "Other theft --> 87.44 metres"
## [1] "Vehicle crime --> 191.67 metres"
## [1] "Violence and sexual offences --> 169.44 metres"
## [1] "Anti-social behaviour --> 144.78 metres"
## [1] "Anti-social behaviour --> 169 metres"
## [1] "Anti-social behaviour --> 169 metres"
## [1] "Anti-social behaviour --> 169 metres"
## [1] "Anti-social behaviour --> 169 metres"
## [1] "Anti-social behaviour --> 169 metres"
## [1] "Anti-social behaviour --> 169 metres"
## [1] "Anti-social behaviour --> 164.78 metres"
## [1] "Anti-social behaviour --> 182 metres"
## [1] "Bicycle theft --> 169 metres"
## [1] "Burglary --> 174.78 metres"
## [1] "Burglary --> 174.78 metres"
## [1] "Burglary --> 169 metres"
## [1] "Burglary --> 144.78 metres"
## [1] "Other theft --> 144.78 metres"
## [1] "Other theft --> 174.78 metres"
## [1] "Other theft --> 174.78 metres"
## [1] "Shoplifting --> 144.78 metres"
## [1] "Theft from the person --> 130.89 metres"
## [1] "Violence and sexual offences --> 147.44 metres"
## [1] "Violence and sexual offences --> 147.44 metres"
## [1] "Anti-social behaviour --> 84.22 metres"
## [1] "Anti-social behaviour --> 30.44 metres"
## [1] "Bicycle theft --> 68.22 metres"
## [1] "Criminal damage and arson --> 28.78 metres"
## [1] "Theft from the person --> 68.22 metres"
## [1] "Vehicle crime --> 84.22 metres"
## [1] "Vehicle crime --> 29.56 metres"
## [1] "Vehicle crime --> 68.22 metres"
## [1] "Vehicle crime --> 68.22 metres"
## [1] "Violence and sexual offences --> 12.44 metres"
## [1] "Bicycle theft --> 119.44 metres"
## [1] "Violence and sexual offences --> 143.44 metres"
## [1] "Anti-social behaviour --> 130.11 metres"
## [1] "Anti-social behaviour --> 82.78 metres"
## [1] "Anti-social behaviour --> 196.11 metres"
## [1] "Anti-social behaviour --> 66.78 metres"
## [1] "Anti-social behaviour --> 66.78 metres"
## [1] "Bicycle theft --> 49.11 metres"
## [1] "Burglary --> 82.78 metres"
## [1] "Burglary --> 82.78 metres"
## [1] "Burglary --> 82.78 metres"
## [1] "Other theft --> 130.11 metres"
## [1] "Other theft --> 60.56 metres"
## [1] "Other theft --> 66.78 metres"
## [1] "Other theft --> 25.44 metres"
## [1] "Other theft --> 60.56 metres"
## [1] "Other theft --> 25.44 metres"
## [1] "Other theft --> 66.78 metres"
## [1] "Other theft --> 66.78 metres"
## [1] "Other theft --> 49.11 metres"
## [1] "Other theft --> 66.78 metres"
## [1] "Other theft --> 25.44 metres"
## [1] "Other theft --> 82.78 metres"
## [1] "Public order --> 49.11 metres"
## [1] "Public order --> 49.11 metres"
## [1] "Robbery --> 196.11 metres"
## [1] "Robbery --> 66.78 metres"
## [1] "Robbery --> 66.78 metres"
## [1] "Shoplifting --> 25.44 metres"
## [1] "Theft from the person --> 196.11 metres"
## [1] "Vehicle crime --> 196.11 metres"
## [1] "Violence and sexual offences --> 45.89 metres"
## [1] "Violence and sexual offences --> 25.44 metres"
## [1] "Violence and sexual offences --> 45.89 metres"
## [1] "Anti-social behaviour --> 157.89 metres"
## [1] "Anti-social behaviour --> 152.33 metres"
## [1] "Other theft --> 157.89 metres"
## [1] "Other theft --> 157.89 metres"
## [1] "Public order --> 157.89 metres"
## [1] "Public order --> 157.89 metres"
## [1] "Robbery --> 141.67 metres"
## [1] "Robbery --> 141.67 metres"
## [1] "Robbery --> 141.67 metres"
## [1] "Theft from the person --> 157.89 metres"
## [1] "Theft from the person --> 157.89 metres"
## [1] "Violence and sexual offences --> 141.67 metres"
## [1] "Anti-social behaviour --> 167.44 metres"
## [1] "Anti-social behaviour --> 110 metres"
## [1] "Anti-social behaviour --> 110 metres"
## [1] "Bicycle theft --> 110 metres"
## [1] "Bicycle theft --> 110 metres"
## [1] "Bicycle theft --> 20.67 metres"
## [1] "Burglary --> 110 metres"
## [1] "Burglary --> 110 metres"
## [1] "Burglary --> 20.67 metres"
## [1] "Burglary --> 158 metres"
## [1] "Burglary --> 110 metres"
## [1] "Other theft --> 110 metres"
## [1] "Other theft --> 132.56 metres"
## [1] "Other theft --> 110 metres"
## [1] "Theft from the person --> 158 metres"
## [1] "Theft from the person --> 167.44 metres"
## [1] "Theft from the person --> 158 metres"
## [1] "Theft from the person --> 158 metres"
## [1] "Vehicle crime --> 132.56 metres"
## [1] "Violence and sexual offences --> 167.44 metres"
## [1] "Violence and sexual offences --> 132.56 metres"
## [1] "Anti-social behaviour --> 116.67 metres"
## [1] "Anti-social behaviour --> 116.67 metres"
## [1] "Anti-social behaviour --> 116.67 metres"
## [1] "Anti-social behaviour --> 116.67 metres"
## [1] "Bicycle theft --> 116.67 metres"
## [1] "Other theft --> 116.67 metres"
## [1] "Other theft --> 146.78 metres"
## [1] "Other theft --> 116.67 metres"
## [1] "Public order --> 131.44 metres"
## [1] "Robbery --> 116.67 metres"
## [1] "Robbery --> 146.78 metres"
## [1] "Violence and sexual offences --> 116.67 metres"
## [1] "Violence and sexual offences --> 146.78 metres"
## [1] "Violence and sexual offences --> 146.78 metres"
## [1] "Anti-social behaviour --> 169.11 metres"
## [1] "Anti-social behaviour --> 159.56 metres"
## [1] "Anti-social behaviour --> 105.89 metres"
## [1] "Anti-social behaviour --> 39.11 metres"
## [1] "Anti-social behaviour --> 26.11 metres"
## [1] "Anti-social behaviour --> 26.11 metres"
## [1] "Anti-social behaviour --> 26.11 metres"
## [1] "Anti-social behaviour --> 26.11 metres"
## [1] "Anti-social behaviour --> 159.56 metres"
## [1] "Anti-social behaviour --> 26.11 metres"
## [1] "Anti-social behaviour --> 145.33 metres"
## [1] "Anti-social behaviour --> 26.11 metres"
## [1] "Anti-social behaviour --> 26.11 metres"
## [1] "Anti-social behaviour --> 169.11 metres"
## [1] "Anti-social behaviour --> 26.11 metres"
## [1] "Anti-social behaviour --> 159.56 metres"
## [1] "Anti-social behaviour --> 124.78 metres"
## [1] "Anti-social behaviour --> 169.11 metres"
## [1] "Anti-social behaviour --> 26.11 metres"
## [1] "Bicycle theft --> 145.33 metres"
## [1] "Bicycle theft --> 75.33 metres"
## [1] "Bicycle theft --> 100.56 metres"
## [1] "Bicycle theft --> 39.11 metres"
## [1] "Bicycle theft --> 12.33 metres"
## [1] "Bicycle theft --> 169.11 metres"
## [1] "Burglary --> 109.44 metres"