Some datasets are far too large for R to handle by itself. Structured Query Language (“SQL”) is a widely used international standard language for managing data stored in a relational databases management system. A relational database management system itself is an approach to managing data using a structure that can be contrasted against the ‘flat file’ approach we have been using thus far with R. Why use SQL? R doesn’t work very well with really huge datasets. A relational database management system offers a way of storing large amounts of information more efficiently and reducing the size of the dataset that we are working with. There are numerous relational database management systems such as Oracle, Microsoft Access, and MySQL. We are going to use SQLite, which is probably the most widely deployed database system. SQLite is in your phone, car, airplanes, thermostats, and numerous applicances. We are going to hook up SQLite to R so that R can handle large datasets.
These are some basic clauses in a SQL query that we will explore:
SELECT fields or functions of fields INTO results table FROM tables queried WHERE conditions for selecting a record GROUP BY list of fields to group ORDER BY list of fields to sort by
However, before being able to use SQL as a tool in R, it will first be necessary to install the sqldf
package.
library(sqldf)
Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
We will be working with Chicago crime data, which is accessible comma separated value (csv) format. Before we can even being learning SQL, we are going to have to do a fair bit of work to acquire the dataset, format it so that it’s ready for SQLite, and then load it into the SQLite database.
Navigate to the Chicago open data website to get the data. Click the “Export” button and select the “CSV” option, or directly download from here
The Chicago crime data is huge, more than 1.5Gb. It contains about 7 million records on all crimes reported to the Chicago police department since 2001. R does not handle really large datasets well. By using SQL, you will learn how to more efficiently work with large datasets and learn a data language that is used absolutely everywhere.
Let’s use scan()
to just peek at the first three rows of the file.
scan(what="",file="Crimes_-_2001_to_present.csv",nlines=5,sep="\n")
[1] "ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location"
[2] "10000092,HY189866,03/18/2015 07:44:00 PM,047XX W OHIO ST,041A,BATTERY,AGGRAVATED: HANDGUN,STREET,false,false,1111,011,28,25,04B,1144606,1903566,2015,02/10/2018 03:50:01 PM,41.891398861,-87.744384567,\"(41.891398861, -87.744384567)\""
[3] "10000094,HY190059,03/18/2015 11:00:00 PM,066XX S MARSHFIELD AVE,4625,OTHER OFFENSE,PAROLE VIOLATION,STREET,true,false,0725,007,15,67,26,1166468,1860715,2015,02/10/2018 03:50:01 PM,41.773371528,-87.665319468,\"(41.773371528, -87.665319468)\""
[4] "10000095,HY190052,03/18/2015 10:45:00 PM,044XX S LAKE PARK AVE,0486,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,false,true,0222,002,4,39,08B,1185075,1875622,2015,02/10/2018 03:50:01 PM,41.81386068,-87.596642837,\"(41.81386068, -87.596642837)\""
[5] "10000096,HY190054,03/18/2015 10:30:00 PM,051XX S MICHIGAN AVE,0460,BATTERY,SIMPLE,APARTMENT,false,false,0225,002,3,40,08B,1178033,1870804,2015,02/10/2018 03:50:01 PM,41.800802415,-87.622619343,\"(41.800802415, -87.622619343)\""
scan()
is a very basic R function that reads in plain text files. We’ve told it to read in text (what==""
), the name of the file, to only read in 5 lines (nlines=5
), and to start a new row whenever it reaches a line feed character (sep="\n"
). Using scan()
without nlines=5
would cause R to try to read in the whole dataset and that could take a lot of time and you might run out of memory.
You can see that the first row contains the column names. The second row contains the first reported crime in the file. You can see date and time, address, crime descriptions, longitude and latitude of the crime, and other information.
Importantly, SQLite is very particular about the formatting of a file. It can easily read in a csv file, but this dataset has some commas in places that confuse SQLite.
For example, there is a row in this file that looks like this:
[1] "10000153,HY189345,03/18/2015 12:20:00 PM,091XX S UNIVERSITY AVE,0483,BATTERY,AGG PRO.EMP: OTHER DANG WEAPON,\"SCHOOL, PUBLIC, BUILDING\",true,false,0413,004,8,47,04B,1185475,1844606,2015,02/10/2018 03:50:01 PM,41.728740563,-87.596150779,\"(41.728740563, -87.596150779)\""
You see that the location description for this crime is "SCHOOL, PUBLIC, BUILDING"
. Those commas inside the quotes are going to cause SQLite problems. SQLite is going to think that SCHOOL
, PUBLIC
, and BUILDING
are all separate columns rather than one columns describing the location.
To resolve this, we’re going to change all the commas that separate the columns into something else besides commas, leaving the commas in elements like "SCHOOL, PUBLIC, BUILDING"
alone. It does not matter what we use to separate the fields, but it should be an unusual character that would not appear anywhere else in the dataset. Popular choices in the vertical bar (|
) and the semi-colon (;
). So let’s take a slight detour to find out how to convert a comma-separated file into a semi-colon separated file.
You’ll know if you need to convert your file if, when you try to set up your SQL database, you receive an error message about an “extra column.”
We’re going to use a while
loop to read in 1,000,000 rows of the our data file at a time. R can handle 1,000,000 rows. With the 1,000,000 rows read in, we’ll use a regular expression to replace all the commas used for separating columns with semi-colons. Then we’ll write out the resulting cleaned up rows into a new file. It is a big file so this code can take a few minutes to run to completion.
[1] 1000000
[1] 2000000
[1] 3000000
[1] 4000000
[1] 5000000
[1] 6000000
[1] 6656622
Now, let’s take a look at the first five lines of the new file we just created.
scan(what="",file="Crimes_-_2001_to_present-clean.csv",nlines=5,sep="\n")
[1] "ID;Case Number;Date;Block;IUCR;Primary Type;Description;Location Description;Arrest;Domestic;Beat;District;Ward;Community Area;FBI Code;X Coordinate;Y Coordinate;Year;Updated On;Latitude;Longitude;Location"
[2] "10000092;HY189866;03/18/2015 07:44:00 PM;047XX W OHIO ST;041A;BATTERY;AGGRAVATED: HANDGUN;STREET;false;false;1111;011;28;25;04B;1144606;1903566;2015;02/10/2018 03:50:01 PM;41.891398861;-87.744384567;\"(41.891398861, -87.744384567)\""
[3] "10000094;HY190059;03/18/2015 11:00:00 PM;066XX S MARSHFIELD AVE;4625;OTHER OFFENSE;PAROLE VIOLATION;STREET;true;false;0725;007;15;67;26;1166468;1860715;2015;02/10/2018 03:50:01 PM;41.773371528;-87.665319468;\"(41.773371528, -87.665319468)\""
[4] "10000095;HY190052;03/18/2015 10:45:00 PM;044XX S LAKE PARK AVE;0486;BATTERY;DOMESTIC BATTERY SIMPLE;APARTMENT;false;true;0222;002;4;39;08B;1185075;1875622;2015;02/10/2018 03:50:01 PM;41.81386068;-87.596642837;\"(41.81386068, -87.596642837)\""
[5] "10000096;HY190054;03/18/2015 10:30:00 PM;051XX S MICHIGAN AVE;0460;BATTERY;SIMPLE;APARTMENT;false;false;0225;002;3;40;08B;1178033;1870804;2015;02/10/2018 03:50:01 PM;41.800802415;-87.622619343;\"(41.800802415, -87.622619343)\""
You now see that semi-colons separate the columns rather than commas. That previous record that had the location description “SCHOOL, PUBLIC, BUILDING” now looks like this:
[1] "10000153;HY189345;03/18/2015 12:20:00 PM;091XX S UNIVERSITY AVE;0483;BATTERY;AGG PRO.EMP: OTHER DANG WEAPON;\"SCHOOL, PUBLIC, BUILDING\";true;false;0413;004;8;47;04B;1185475;1844606;2015;02/10/2018 03:50:01 PM;41.728740563;-87.596150779;\"(41.728740563, -87.596150779)\""
Note that the commas are still there inside the quotes. Now we will be able to tell SQLite to look for semi-colons to separate the columns.
Now that the file containing the data is ready, we can load it into SQLite. SQLite has its own way of storing and managing data. It can store multiple tables containing data in a single database. First, we’ll tell SQLite to create a new database that we will call chicagocrime
. Then we will tell SQLite to load our data file into a table called crime
.
The next step is to set up the SQL database. The following lines of code will set up the database for you. Make sure that your path is set correctly so that your database will be stored in the correct folder that you wish to work from. You will know if the database has been successfully set up if the database (stored as a .db file) is greater than 0 KB. there is no reason to run these lines of code again.
# create a connection to the database
con <- dbConnect(SQLite(), dbname="chicagocrime.db")
# peek at the first few rows of the dataset
a <- read.table("Crimes_-_2001_to_present-clean.csv",
sep=";",nrows=5,header=TRUE)
# ask SQLite what data type it plans to use to store each column (eg number, text)
variabletypes <- dbDataType(con, a)
# make sure these features are stored as TEXT
variabletypes["IUCR"] <- "TEXT"
variabletypes["Ward"] <- "TEXT"
variabletypes["District"] <- "TEXT"
variabletypes["Community.Area"] <- "TEXT"
# just in case you've already created a "crime" table, delete it
if(dbExistsTable(con, "crime")) dbRemoveTable(con, "crime")
# import the data file into the database
dbWriteTable(con, "crime", # create crime table
"Crimes_-_2001_to_present-clean.csv", # from our cleaned up file
row.names=FALSE,
header=TRUE, # first row has column names
field.types=variabletypes,
sep=";") # columns separated with ;
# does the table exist?
dbListTables(con)
# a quick check to see if all the columns are there
dbListFields(con,"crime")
# disconnect from the database to finalize
dbDisconnect(con)
[1] "crime"
[1] "ID" "Case.Number" "Date"
[4] "Block" "IUCR" "Primary.Type"
[7] "Description" "Location.Description" "Arrest"
[10] "Domestic" "Beat" "District"
[13] "Ward" "Community.Area" "FBI.Code"
[16] "X.Coordinate" "Y.Coordinate" "Year"
[19] "Updated.On" "Latitude" "Longitude"
[22] "Location"
Once you’ve successfully set up your database, there is no reason to run these lines of code again. You should never again need to turn commas into semi-colons or run dbWriteTable()
. Instead, every time you want to work with your database, you can simply need to reconnect to the database with:
con <- dbConnect(SQLite(), dbname="chicagocrime.db")
(Note that if you’re keeping you are using a cloud-based backup service like iCloud, OneDrive, or Google Drive, you might need to wait until your “db” file has completely “synced” before you can access your database.)
You’ve now created a database (called “chicagocrime.db”) containing a table called “crime” that contains those 7 million crime records.
Two important clauses with an SQL query are SELECT
and FROM
. Unlike R, SQL queries are not case-sensitive. Unlike in R, the column names in SQL aren’t case-sensitive. So if we were to type “SELECT” as “select” or “Description” as “dEsCrIpTiOn”, the SQL query would do the same thing. However, the tradition is to type SQL keywords in all uppercase to make it easier to distinguish them from table and column names.
The SELECT
clause tells SQL which columns in particular you would like to see. The FROM
clause simply tells SQL from which table it should pull the data. In this query, we are interested in only the ID
and Description
columns.
res <- dbSendQuery(con, "
SELECT ID, Description
FROM crime")
fetch(res, n = 10) # just the first 10 rows
dbClearResult(res)
ID Description
1 10000092 AGGRAVATED: HANDGUN
2 10000094 PAROLE VIOLATION
3 10000095 DOMESTIC BATTERY SIMPLE
4 10000096 SIMPLE
5 10000097 ARMED: HANDGUN
6 10000098 SIMPLE
7 10000099 DOMESTIC BATTERY SIMPLE
8 10000100 DOMESTIC BATTERY SIMPLE
9 10000101 POSS: CANNABIS 30GMS OR LESS
10 10000104 SIMPLE
What we’ve done here is to create an object called res
. res
contains the results of our query. Then, we fetch
those results. Within fetch
, we set n=10
so the first 10 rows are displayed. By convention, setting n=-1
, will display all your rows. Really large SQL queries can be memory-intensive. So if your dataset is over 25 lines long (which it probably is…..that’s why you’re using SQL!), you have to make sure that you set the value in the fetch line to something reasonable to display.
However, suppose that your dataset is over 1 million rows, and you want to work with all of them. You can set the fetch
line to something like mydata <- fetch(res, n=-1)
.
dbClearResult(res)
tells SQLite that we are all done with this query. We’ve retrieved the first 10 lines. SQLite is standing by with another 7 million rows to show us, but dbClearResult(res)
tells SQLite that we are no longer interested in this query and it can clear out whatever it has stored for us.
In the previous SQL query we just asked for ID
and Description
. Typing out all of the column names would be tiresome, so SQL lets you use a *
to select all the columns. If we want to look at the first 10 rows but all of the columns, we would use this query:
res <- dbSendQuery(con, "
SELECT *
FROM crime")
fetch(res, n = 10) # just the first 10 rows
dbClearResult(res)
ID Case.Number Date Block IUCR
1 10000092 HY189866 03/18/2015 07:44:00 PM 047XX W OHIO ST 041A
2 10000094 HY190059 03/18/2015 11:00:00 PM 066XX S MARSHFIELD AVE 4625
3 10000095 HY190052 03/18/2015 10:45:00 PM 044XX S LAKE PARK AVE 0486
4 10000096 HY190054 03/18/2015 10:30:00 PM 051XX S MICHIGAN AVE 0460
5 10000097 HY189976 03/18/2015 09:00:00 PM 047XX W ADAMS ST 031A
6 10000098 HY190032 03/18/2015 10:00:00 PM 049XX S DREXEL BLVD 0460
7 10000099 HY190047 03/18/2015 11:00:00 PM 070XX S MORGAN ST 0486
8 10000100 HY189988 03/18/2015 09:35:00 PM 042XX S PRAIRIE AVE 0486
9 10000101 HY190020 03/18/2015 10:09:00 PM 036XX S WOLCOTT AVE 1811
10 10000104 HY189964 03/18/2015 09:25:00 PM 097XX S PRAIRIE AVE 0460
Primary.Type Description Location.Description Arrest
1 BATTERY AGGRAVATED: HANDGUN STREET false
2 OTHER OFFENSE PAROLE VIOLATION STREET true
3 BATTERY DOMESTIC BATTERY SIMPLE APARTMENT false
4 BATTERY SIMPLE APARTMENT false
5 ROBBERY ARMED: HANDGUN SIDEWALK false
6 BATTERY SIMPLE APARTMENT false
7 BATTERY DOMESTIC BATTERY SIMPLE APARTMENT false
8 BATTERY DOMESTIC BATTERY SIMPLE APARTMENT false
9 NARCOTICS POSS: CANNABIS 30GMS OR LESS STREET true
10 BATTERY SIMPLE RESIDENCE PORCH/HALLWAY false
Domestic Beat District Ward Community.Area FBI.Code X.Coordinate
1 false 1111 011 28 25 04B 1144606
2 false 725 007 15 67 26 1166468
3 true 222 002 4 39 08B 1185075
4 false 225 002 3 40 08B 1178033
5 false 1113 011 28 25 03 1144920
6 false 223 002 4 39 08B 1183018
7 true 733 007 17 68 08B 1170859
8 true 213 002 3 38 08B 1178746
9 false 912 009 11 59 18 1164279
10 false 511 005 6 49 08B 1179637
Y.Coordinate Year Updated.On Latitude Longitude
1 1903566 2015 02/10/2018 03:50:01 PM 41.89140 -87.74438
2 1860715 2015 02/10/2018 03:50:01 PM 41.77337 -87.66532
3 1875622 2015 02/10/2018 03:50:01 PM 41.81386 -87.59664
4 1870804 2015 02/10/2018 03:50:01 PM 41.80080 -87.62262
5 1898709 2015 02/10/2018 03:50:01 PM 41.87806 -87.74335
6 1872537 2015 02/10/2018 03:50:01 PM 41.80544 -87.60428
7 1858210 2015 02/10/2018 03:50:01 PM 41.76640 -87.64930
8 1876914 2015 02/10/2018 03:50:01 PM 41.81755 -87.61982
9 1880656 2015 02/10/2018 03:50:01 PM 41.82814 -87.67278
10 1840444 2015 02/10/2018 03:50:01 PM 41.71745 -87.61766
Location
1 "(41.891398861, -87.744384567)"\r
2 "(41.773371528, -87.665319468)"\r
3 "(41.81386068, -87.596642837)"\r
4 "(41.800802415, -87.622619343)"\r
5 "(41.878064761, -87.743354013)"\r
6 "(41.805443345, -87.604283976)"\r
7 "(41.766402779, -87.649296123)"\r
8 "(41.817552577, -87.619818523)"\r
9 "(41.828138428, -87.672782106)"\r
10 "(41.71745472, -87.617663257)"\r
Just as SELECT
filters the columns, the WHERE
clause filters the rows. Note the use of AND
and OR
in the WHERE
clause. As you might intuitively guess, AND
and OR
are logical operators that help us further filter our rows. Here we select three columns: ID
, Description
, and Location.Description
. Also, we want only rows where * the value in the Beat
column is “611” * the value in the Arrest
column is “true” * the value in the ICUR
column is either “0486” or “0498”
Importantly, note the use of single (not double) quotation marks in the WHERE
line. The reason is that if we used double quotes, then R will think that the double quote signals the end of the query. Also note that Location.Description
has a period in its name. The period has a special meaning in SQL that we will discuss later. We use the square brakets around the name to “protect” the column name, telling SQL to treat it as a column name.
Also, note how we set the fetch()
line to the variable a
.
res <- dbSendQuery(con, "
SELECT ID, Description, [Location.Description]
FROM crime
WHERE ((Beat=611) AND
(Arrest='true')) AND
((IUCR='0486') OR (IUCR='0498'))")
a <- fetch(res, n = -1) # all the rows
dbClearResult(res)
# show the first few rows of the results
a[1:3,]
ID Description Location.Description
1 10011764 DOMESTIC BATTERY SIMPLE APARTMENT
2 10019667 DOMESTIC BATTERY SIMPLE STREET
3 10046813 DOMESTIC BATTERY SIMPLE APARTMENT
Select records from Beat 234
Select Beat, District, Ward, and Community Area for all “ASSAULT”s. Remember that, since Primary.Type
has a period in its name, you need to use square brackets like [Primary.Type]
Select records on assaults from Beat 234
Make a table in R of the number of assaults (IUCR 0560) by Ward
We’ve already covered SQL clauses SELECT
, WHERE
, and FROM
. The SQL function COUNT(*)
and GROUP BY
are also very useful. For example, the following query counts how many assaults (IUCR 0560) occurred by ward. COUNT()
is a SQL “aggregate” function, a function that performs a calculation on a group of values and returns a single number. Other SQL aggregate functions include AVG()
, MIN()
, MAX()
, and SUM()
. This query will group all the records by Ward
and then apply the aggregate function COUNT()
and report that value in a column called crimecount
.
res <- dbSendQuery(con, "
SELECT COUNT(*) AS crimecount,
Ward
FROM crime
WHERE IUCR='0560'
GROUP BY Ward")
a <- fetch(res, n = -1)
dbClearResult(res)
print(a)
crimecount Ward
1 29468
2 3973 1
3 5863 10
4 3886 11
5 3095 12
6 2917 13
7 3004 14
8 7899 15
9 7930 16
10 10152 17
11 4538 18
12 2631 19
13 9424 2
14 9714 20
15 8104 21
16 3156 22
17 3045 23
18 9227 24
19 3508 25
20 4950 26
21 7656 27
22 10688 28
23 6536 29
24 8669 3
25 3439 30
26 3405 31
27 2440 32
28 2173 33
29 8732 34
30 3429 35
31 2515 36
32 6722 37
33 2580 38
34 2085 39
35 5489 4
36 2625 40
37 2290 41
38 6828 42
39 1618 43
40 2140 44
41 2600 45
42 3656 46
43 1959 47
44 2669 48
45 3688 49
46 6842 5
47 2392 50
48 8907 6
49 8321 7
50 8089 8
51 8343 9
The GROUP BY
clause is critical. If you forget it then the result is not well defined. That is, different implementations of SQL may produce different results. The rule you should remember is that “every non-aggregate column in the SELECT
clause should appear in the GROUP BY
clause.” Here Ward
is not part of the aggregate function COUNT()
so it must appear in the GROUP BY
clause.
Count the number of crimes by Primary.Type
Count the number of crimes resulting in arrest
Count the number of crimes by Location.Description
. LocationDescription
is the variable that tells us where (e.g., a parking lot, a barbershop, a fire station, a CTA train, or a motel) a crime occurred
MAX
, MIN
, SUM
, AVG
are common (and useful) aggregating functions. The ORDER BY
clause sorts the results for us. It’s the SQL version of the sort()
command. Here is an illustration that gives the range of beat numbers in each policing district.
res <- dbSendQuery(con, "
SELECT MIN(Beat) AS min_beat,
MAX(Beat) AS max_beat,
District
FROM crime
GROUP BY District
ORDER BY District")
fetch(res, n = -1)
dbClearResult(res)
min_beat max_beat District
1 124 2535
2 111 2515 001
3 211 2133 002
4 310 2132 003
5 324 2221 004
6 333 2233 005
7 123 2424 006
8 233 2431 007
9 333 2411 008
10 134 2331 009
11 133 2534 010
12 831 2535 011
13 111 2525 012
14 411 2535 014
15 726 2533 015
16 811 2521 016
17 813 2523 017
18 111 2514 018
19 112 2533 019
20 112 2433 020
21 2112 2112 021
22 214 2234 022
23 123 2433 024
24 1011 2535 025
25 124 2535 031
Remember that the GROUP BY
clause should include every element of the SELECT
clause that is not involved with an aggregate function. We have MIN()
and MAX()
operating on Beat
, but District
is on its own and should be placed in the GROUP BY
clause.
Let’s look at our Latitude
and Longitude
columns (which, as we find in a subsequent section, will be extremely useful for mapping data points). The following query will give unexpected results.
res <- dbSendQuery(con, "
SELECT MIN(Latitude) AS min_lat,
MAX(Latitude) AS max_lat,
MIN(Longitude) AS min_lon,
MAX(Longitude) AS max_lon,
District
FROM crime
GROUP BY District
ORDER BY District")
fetch(res, n = -1)
Warning in result_fetch(res@ptr, n = n): Column `max_lat`: mixed type, first
seen values of type real, coercing other values of type string
Warning in result_fetch(res@ptr, n = n): Column `max_lon`: mixed type, first
seen values of type real, coercing other values of type string
dbClearResult(res)
min_lat max_lat min_lon max_lon District
1 41.69991 42.00030 -87.87742 -87.59533
2 36.61945 0.00000 -91.68657 0.00000 001
3 36.61945 0.00000 -91.68657 0.00000 002
4 36.61945 0.00000 -91.68657 0.00000 003
5 36.61945 0.00000 -91.68657 0.00000 004
6 36.61945 0.00000 -91.68657 0.00000 005
7 36.61945 0.00000 -91.68657 0.00000 006
8 36.61945 0.00000 -91.68657 0.00000 007
9 36.61945 0.00000 -91.68657 0.00000 008
10 36.61945 0.00000 -91.68657 0.00000 009
11 36.61945 0.00000 -91.68657 0.00000 010
12 36.61945 0.00000 -91.68657 0.00000 011
13 36.61945 0.00000 -91.68657 0.00000 012
14 36.61945 0.00000 -91.68657 0.00000 014
15 36.61945 0.00000 -91.68657 0.00000 015
16 36.61945 0.00000 -91.68657 0.00000 016
17 36.61945 0.00000 -91.68657 0.00000 017
18 36.61945 0.00000 -91.68657 0.00000 018
19 41.80933 0.00000 -87.76791 0.00000 019
20 41.79145 0.00000 -87.76303 0.00000 020
21 41.83790 41.83790 -87.62192 -87.62192 021
22 36.61945 0.00000 -91.68657 0.00000 022
23 36.61945 0.00000 -91.68657 0.00000 024
24 36.61945 0.00000 -91.68657 0.00000 025
25 41.69165 42.01939 -87.90647 -87.53528 031
The first problem is that we have some rows with blank values in Longitude
and Latitude
. Here are some of them.
fetch(dbSendQuery(con,"SELECT * FROM crime WHERE Longitude=''"), n=3)
dbClearResult(res)
Warning: Expired, result set already closed
ID Case.Number Date Block IUCR
1 10581023 HZ329792 09/01/2014 08:00:00 AM 0000X E LAKE ST 1140
2 4755307 HM367521 01/23/2002 12:00:00 AM 077XX S GREENWOOD AVE 0840
3 4757173 HM368193 05/22/2006 02:30:00 PM 074XX N ROGERS AVE 0910
Primary.Type Description
1 DECEPTIVE PRACTICE EMBEZZLEMENT
2 THEFT FINANCIAL ID THEFT: OVER $300
3 MOTOR VEHICLE THEFT AUTOMOBILE
Location.Description Arrest Domestic Beat District Ward
1 OTHER true false 111 001 42
2 APARTMENT false false 624 006 8
3 PARKING LOT/GARAGE(NON.RESID.) false false 2422 024 49
Community.Area FBI.Code X.Coordinate Y.Coordinate Year Updated.On
1 32 12 2014 03/01/2018 03:52:35 PM
2 69 06 2002 08/17/2015 03:03:40 PM
3 1 07 2006 08/17/2015 03:03:40 PM
Latitude Longitude Location
1 \r
2 \r
3 \r
Note the X.Coordinate
and the Y.Coordinate
columns. They should give the location per the State Plane Illinois East NAD 1983 projection, but in these rows they are empty.
The second problem is that there are minimum latitudes of 36.61945 and minimum longitudes of -91.68657. That’s near the Missouri/Arkansas border 500 miles south of Chicago!
fetch(dbSendQuery(con,"SELECT * FROM crime where Latitude<36.61946"), n=3)
Warning: Closing open result set, pending rows
dbClearResult(res)
Warning: Expired, result set already closed
ID Case.Number Date Block IUCR Primary.Type
1 757 G209405 04/12/2001 09:32:00 PM 056XX S NORMAL AV 0110 HOMICIDE
2 808 G259321 05/06/2001 01:30:00 AM 020XX W 55 ST 0110 HOMICIDE
3 937 G411262 07/15/2001 12:34:00 AM 030XX S HARDING ST 0110 HOMICIDE
Description Location.Description Arrest Domestic Beat District Ward
1 FIRST DEGREE MURDER STREET false false 711 007
2 FIRST DEGREE MURDER AUTO true false 915 009
3 FIRST DEGREE MURDER STREET false false 1031 010
Community.Area FBI.Code X.Coordinate Y.Coordinate Year Updated.On
1 01A 0 0 2001 05/03/2018 03:49:53 PM
2 01A 0 0 2001 08/17/2015 03:03:40 PM
3 01A 0 0 2001 05/03/2018 03:49:53 PM
Latitude Longitude Location
1 36.61945 -91.68657 "(36.619446395, -91.686565684)"\r
2 36.61945 -91.68657 "(36.619446395, -91.686565684)"\r
3 36.61945 -91.68657 "(36.619446395, -91.686565684)"\r
Note that missing X.Coordinate
and Y.Coordinate
are getting mapped to some place far from Chicago.
We can tell SQLite to make the empty or missing values NULL
, a more proper way to encode that these rows have missing coordinates. The UPDATE
clause edits our table. R will read in NULL
values as NA
. After we do the update, we can rerun the MIN()
, MAX()
query. The dataset also has some latitudes and longitudes that are very close to 0 (and Chicago is quite far from the equator), but not exactly 0. We can make those NULL
as well. We’re also going to fix the X.Coordinate
and Y.Coordinate
columns.
res <- dbSendQuery(con, "
UPDATE crime SET Latitude=NULL
WHERE (Latitude='') OR (ABS(Latitude-0.0) < 0.01) OR (Latitude < 36.7)")
Warning: Closing open result set, pending rows
dbClearResult(res)
res <- dbSendQuery(con, "
UPDATE crime SET Longitude=NULL
WHERE (Longitude='') OR (ABS(Longitude-0.0) < 0.01) OR (Longitude < -91.6)")
dbClearResult(res)
res <- dbSendQuery(con, "
UPDATE crime SET [X.Coordinate]=NULL
WHERE ([X.Coordinate]='') OR ([X.Coordinate]=0)")
dbClearResult(res)
res <- dbSendQuery(con, "
UPDATE crime SET [Y.Coordinate]=NULL
WHERE ([Y.Coordinate]='') OR ([Y.Coordinate]=0)")
dbClearResult(res)
Let’s rerun that query and check that we get more sensible results.
res <- dbSendQuery(con, "
SELECT MIN(Latitude) AS min_lat,
MAX(Latitude) AS max_lat,
MIN(Longitude) AS min_lon,
MAX(Longitude) AS max_lon,
District
FROM crime
GROUP BY District
ORDER BY District")
fetch(res, n = -1)
dbClearResult(res)
min_lat max_lat min_lon max_lon District
1 41.69991 42.00030 -87.87742 -87.59533
2 41.72827 41.98740 -87.84349 -87.54925 001
3 41.73298 41.97608 -87.70277 -87.56954 002
4 41.71424 41.79946 -87.73941 -87.55258 003
5 41.64467 41.79220 -87.72436 -87.52453 004
6 41.64459 41.88693 -87.73145 -87.54348 005
7 41.69249 42.01876 -87.77138 -87.55810 006
8 41.66806 42.01369 -87.68723 -87.57906 007
9 41.73453 42.01765 -87.80161 -87.55239 008
10 41.79018 41.97645 -87.71397 -87.58822 009
11 41.68357 41.94304 -87.74364 -87.61895 010
12 41.77163 41.90624 -87.76332 -87.62328 011
13 41.68544 41.96539 -87.76321 -87.60502 012
14 41.77688 42.01938 -87.80222 -87.65657 014
15 41.76641 41.94234 -87.77534 -87.63087 015
16 41.78464 42.01938 -87.93432 -87.58256 016
17 41.77950 42.01390 -87.75780 -87.66131 017
18 41.85926 41.96879 -87.76313 -87.60136 018
19 41.80933 41.98397 -87.76791 -87.58775 019
20 41.79145 42.00458 -87.76303 -87.62992 020
21 41.83790 41.83790 -87.62192 -87.62192 021
22 41.67709 41.85572 -87.74328 -87.58965 022
23 41.75884 42.02291 -87.79757 -87.62545 024
24 41.83930 41.94586 -87.81648 -87.64093 025
25 41.69165 42.01939 -87.90647 -87.53528 031
Now we have results that are more in line with where Chicago actually is. Make it a habit to do some checks of your data before doing too much analysis.
And what city does the following plot have the shape of? Let’s plot the location of these crimes. Plotting all 7 million would be overkill, so let’s take a random sample of 10,000 crimes. Here’s a SQL query that will do that. It uses some tricks we’ll learn more about later including the use of IN
, the use of subqueries (a query within a query), and LIMIT
. Does the shape of the plot look right?
res <- dbSendQuery(con, "
SELECT Longitude, Latitude
FROM crime
WHERE id IN (SELECT id FROM crime ORDER BY RANDOM() LIMIT 10000)")
a <- fetch(res, n = -1)
dbClearResult(res)
plot(Latitude~Longitude, data=a, pch=".", xlab="Longitude", ylab="Latitude")
Plot the location of all “ASSAULT”s for Ward 22
What is the most common (Lat,Long) for assaults in Ward 22? Add the point to your plot using the points()
function. points()
simply draws a point (or sequence of points) at the specified coordinates
res <- dbSendQuery(con, "
SELECT *
FROM crime
WHERE ((Beat=234))")
a <- fetch(res, n = -1)
dbClearResult(res)
res <- dbSendQuery(con, "
SELECT Beat, District, Ward, [Community.Area], [Primary.Type]
FROM crime
WHERE (([Primary.Type]='ASSAULT'))")
a <- fetch(res, n = -1)
dbClearResult(res)
res <- dbSendQuery(con, "
SELECT *
FROM crime
WHERE ((Beat=234) AND ([Primary.Type]='ASSAULT'))")
a <- fetch(res, n = -1)
dbClearResult(res)
# system.time() reports how long it takes to run the SQL query
system.time(
{
res <- dbSendQuery(con, "
SELECT *
FROM crime
WHERE ((IUCR='0560') AND ([Primary.Type]='ASSAULT'))")
a <- fetch(res, n = -1)
})
dbClearResult(res)
table(a$Ward)
user system elapsed
2.49 3.59 6.09
1 10 11 12 13 14 15 16 17 18 19 2
29468 3973 5863 3886 3095 2917 3004 7899 7930 10152 4538 2631 9424
20 21 22 23 24 25 26 27 28 29 3 30 31
9714 8104 3156 3045 9227 3508 4950 7656 10688 6536 8669 3439 3405
32 33 34 35 36 37 38 39 4 40 41 42 43
2440 2173 8732 3429 2515 6722 2580 2085 5489 2625 2290 6828 1618
44 45 46 47 48 49 5 50 6 7 8 9
2140 2600 3656 1959 2669 3688 6842 2392 8907 8321 8089 8343
Or, we could also try selecting all the IUCR codes and ward and then subsetting the data through R.
system.time(
{
res <- dbSendQuery(con, "
SELECT IUCR, Ward, [Primary.Type]
FROM crime")
data <- fetch(res, n = -1)
data <- subset(data, Primary.Type=="ASSAULT" & IUCR=="0560")
})
dbClearResult(res)
user system elapsed
5.46 3.89 9.35
Primary.Type
res <- dbSendQuery(con, "
SELECT COUNT(*) AS count, [Primary.Type]
FROM crime
GROUP BY [Primary.Type]")
fetch(res, n = -1)
dbClearResult(res)
count Primary.Type
1 11019 ARSON
2 411720 ASSAULT
3 1215592 BATTERY
4 383891 BURGLARY
5 222 CONCEALED CARRY LICENSE VIOLATION
6 26455 CRIM SEXUAL ASSAULT
7 762125 CRIMINAL DAMAGE
8 191154 CRIMINAL TRESPASS
9 255797 DECEPTIVE PRACTICE
10 1 DOMESTIC VIOLENCE
11 14337 GAMBLING
12 9207 HOMICIDE
13 42 HUMAN TRAFFICKING
14 14711 INTERFERENCE WITH PUBLIC OFFICER
15 3862 INTIMIDATION
16 6616 KIDNAPPING
17 13967 LIQUOR LAW VIOLATION
18 310921 MOTOR VEHICLE THEFT
19 706464 NARCOTICS
20 38 NON - CRIMINAL
21 152 NON-CRIMINAL
22 8 NON-CRIMINAL (SUBJECT SPECIFIED)
23 547 OBSCENITY
24 44552 OFFENSE INVOLVING CHILDREN
25 123 OTHER NARCOTIC VIOLATION
26 413056 OTHER OFFENSE
27 68065 PROSTITUTION
28 158 PUBLIC INDECENCY
29 47349 PUBLIC PEACE VIOLATION
30 23 RITUALISM
31 252413 ROBBERY
32 24620 SEX OFFENSE
33 3310 STALKING
34 1395322 THEFT
35 68782 WEAPONS VIOLATION
res <- dbSendQuery(con, "
SELECT COUNT(*) AS count, [Primary.Type]
FROM crime
WHERE Arrest='true'
GROUP BY [Primary.Type]")
fetch(res, n = -1)
dbClearResult(res)
count Primary.Type
1 1441 ARSON
2 95637 ASSAULT
3 277322 BATTERY
4 21957 BURGLARY
5 205 CONCEALED CARRY LICENSE VIOLATION
6 4166 CRIM SEXUAL ASSAULT
7 53857 CRIMINAL DAMAGE
8 140515 CRIMINAL TRESPASS
9 43741 DECEPTIVE PRACTICE
10 1 DOMESTIC VIOLENCE
11 14233 GAMBLING
12 4343 HOMICIDE
13 6 HUMAN TRAFFICKING
14 13476 INTERFERENCE WITH PUBLIC OFFICER
15 690 INTIMIDATION
16 739 KIDNAPPING
17 13842 LIQUOR LAW VIOLATION
18 28496 MOTOR VEHICLE THEFT
19 702233 NARCOTICS
20 6 NON - CRIMINAL
21 8 NON-CRIMINAL
22 3 NON-CRIMINAL (SUBJECT SPECIFIED)
23 453 OBSCENITY
24 9488 OFFENSE INVOLVING CHILDREN
25 88 OTHER NARCOTIC VIOLATION
26 73400 OTHER OFFENSE
27 67798 PROSTITUTION
28 157 PUBLIC INDECENCY
29 30434 PUBLIC PEACE VIOLATION
30 3 RITUALISM
31 24491 ROBBERY
32 7617 SEX OFFENSE
33 536 STALKING
34 167223 THEFT
35 54982 WEAPONS VIOLATION
Or, if we weren’t interested in differentiating based on the Primary.Type
, we could simply do the following:
res <- dbSendQuery(con, "
SELECT COUNT(*) AS count
FROM crime
WHERE Arrest='true'")
fetch(res, n = -1)
dbClearResult(res)
count
1 1853587
Location.Description
. LocationDescription
is the variable that tells us where (e.g., a parking lot, a barbershop, a fire station, a CTA train, or a motel) a crime occurredres <- dbSendQuery(con, "
SELECT COUNT(*) AS count,
[Location.Description]
FROM crime
GROUP BY [Location.Description]")
fetch(res, n = -1)
dbClearResult(res)
count Location.Description
1 3676
2 4 "CTA ""L"" PLATFORM"
3 2 "CTA ""L"" TRAIN"
4 13218 "SCHOOL, PRIVATE, BUILDING"
5 4026 "SCHOOL, PRIVATE, GROUNDS"
6 141371 "SCHOOL, PUBLIC, BUILDING"
7 28703 "SCHOOL, PUBLIC, GROUNDS"
8 128 "VEHICLE - OTHER RIDE SHARE SERVICE (E.G., UBER, LYFT)"
9 11067 ABANDONED BUILDING
10 593 AIRCRAFT
11 731 AIRPORT BUILDING NON-TERMINAL - NON-SECURE AREA
12 505 AIRPORT BUILDING NON-TERMINAL - SECURE AREA
13 674 AIRPORT EXTERIOR - NON-SECURE AREA
14 250 AIRPORT EXTERIOR - SECURE AREA
15 663 AIRPORT PARKING LOT
16 1545 AIRPORT TERMINAL LOWER LEVEL - NON-SECURE AREA
17 571 AIRPORT TERMINAL LOWER LEVEL - SECURE AREA
18 75 AIRPORT TERMINAL MEZZANINE - NON-SECURE AREA
19 610 AIRPORT TERMINAL UPPER LEVEL - NON-SECURE AREA
20 3746 AIRPORT TERMINAL UPPER LEVEL - SECURE AREA
21 84 AIRPORT TRANSPORTATION SYSTEM (ATS)
22 856 AIRPORT VENDING ESTABLISHMENT
23 16091 AIRPORT/AIRCRAFT
24 149019 ALLEY
25 748 ANIMAL HOSPITAL
26 686185 APARTMENT
27 1765 APPLIANCE STORE
28 7885 ATHLETIC CLUB
29 7465 ATM (AUTOMATIC TELLER MACHINE)
30 1097 AUTO
31 147 AUTO / BOAT / RV DEALERSHIP
32 27116 BANK
33 1 BANQUET HALL
34 35239 BAR OR TAVERN
35 12 BARBER SHOP/BEAUTY SALON
36 7624 BARBERSHOP
37 30 BASEMENT
38 650 BOAT/WATERCRAFT
39 647 BOWLING ALLEY
40 371 BRIDGE
41 2728 CAR WASH
42 345 CEMETARY
43 35782 CHA APARTMENT
44 3 CHA BREEZEWAY
45 2 CHA ELEVATOR
46 39 CHA GROUNDS
47 35 CHA HALLWAY
48 24782 CHA HALLWAY/STAIRWELL/ELEVATOR
49 6 CHA LOBBY
50 40 CHA PARKING LOT
51 55270 CHA PARKING LOT/GROUNDS
52 3 CHA PLAY LOT
53 8 CHA STAIRWELL
54 6 CHURCH
55 2 CHURCH PROPERTY
56 14681 CHURCH/SYNAGOGUE/PLACE OF WORSHIP
57 1 CLEANERS/LAUNDROMAT
58 4705 CLEANING STORE
59 15 CLUB
60 3 COACH HOUSE
61 1026 COIN OPERATED MACHINE
62 5561 COLLEGE/UNIVERSITY GROUNDS
63 1340 COLLEGE/UNIVERSITY RESIDENCE HALL
64 48546 COMMERCIAL / BUSINESS OFFICE
65 12753 CONSTRUCTION SITE
66 15302 CONVENIENCE STORE
67 2 COUNTY JAIL
68 476 CREDIT UNION
69 21298 CTA BUS
70 5893 CTA BUS STOP
71 9826 CTA GARAGE / OTHER PROPERTY
72 35934 CTA PLATFORM
73 4 CTA PROPERTY
74 3508 CTA STATION
75 90 CTA TRACKS - RIGHT OF WAY
76 23252 CTA TRAIN
77 10626 CURRENCY EXCHANGE
78 2684 DAY CARE CENTER
79 962 DELIVERY TRUCK
80 82146 DEPARTMENT STORE
81 17 DRIVEWAY
82 19321 DRIVEWAY - RESIDENTIAL
83 30228 DRUG STORE
84 7 DUMPSTER
85 1 ELEVATOR
86 1 EXPRESSWAY EMBANKMENT
87 2 FACTORY
88 6713 FACTORY/MANUFACTURING BUILDING
89 2 FARM
90 761 FEDERAL BUILDING
91 996 FIRE STATION
92 379 FOREST PRESERVE
93 1 FUNERAL PARLOR
94 64 GANGWAY
95 52 GARAGE
96 11 GARAGE/AUTO REPAIR
97 70831 GAS STATION
98 43 GAS STATION DRIVE/PROP.
99 2 GOVERNMENT BUILDING
100 13929 GOVERNMENT BUILDING/PROPERTY
101 86195 GROCERY FOOD STORE
102 84 HALLWAY
103 993 HIGHWAY/EXPRESSWAY
104 1 HORSE STABLE
105 5 HOSPITAL
106 19998 HOSPITAL BUILDING/GROUNDS
107 18 HOTEL
108 27530 HOTEL/MOTEL
109 514 HOUSE
110 1024 JAIL / LOCK-UP FACILITY
111 1 JUNK YARD/GARBAGE DUMP
112 1 KENNEL
113 1 LAGOON
114 3 LAKE
115 1085 LAKEFRONT/WATERFRONT/RIVERBANK
116 2 LAUNDRY ROOM
117 5875 LIBRARY
118 7 LIQUOR STORE
119 1 LIVERY AUTO
120 2 LIVERY STAND OFFICE
121 1 LOADING DOCK
122 7000 MEDICAL/DENTAL OFFICE
123 5 MOTEL
124 2507 MOVIE HOUSE/THEATER
125 228 NEWSSTAND
126 5 NURSING HOME
127 13164 NURSING HOME/RETIREMENT HOME
128 14 OFFICE
129 252768 OTHER
130 2871 OTHER COMMERCIAL TRANSPORTATION
131 5677 OTHER RAILROAD PROP / TRAIN DEPOT
132 51555 PARK PROPERTY
133 152 PARKING LOT
134 191224 PARKING LOT/GARAGE(NON.RESID.)
135 509 PAWN SHOP
136 17171 POLICE FACILITY/VEH PARKING LOT
137 855 POOL ROOM
138 1 POOLROOM
139 281 PORCH
140 2 PRAIRIE
141 1 PUBLIC GRAMMAR SCHOOL
142 2 PUBLIC HIGH SCHOOL
143 13 RAILROAD PROPERTY
144 1127678 RESIDENCE
145 116365 RESIDENCE PORCH/HALLWAY
146 130214 RESIDENCE-GARAGE
147 67547 RESIDENTIAL YARD (FRONT/BACK)
148 103254 RESTAURANT
149 70 RETAIL STORE
150 4 RIVER
151 4 RIVER BANK
152 2 ROOMING HOUSE
153 346 SAVINGS AND LOAN
154 11 SCHOOL YARD
155 3 SEWER
156 657450 SIDEWALK
157 116890 SMALL RETAIL STORE
158 4873 SPORTS ARENA/STADIUM
159 17 STAIRWELL
160 1750257 STREET
161 34 TAVERN
162 21639 TAVERN/LIQUOR STORE
163 6 TAXI CAB
164 7120 TAXICAB
165 3 TRAILER
166 8 TRUCK
167 1 TRUCKING TERMINAL
168 100 VACANT LOT
169 23676 VACANT LOT/LAND
170 83 VEHICLE - DELIVERY TRUCK
171 335 VEHICLE - OTHER RIDE SERVICE
172 106422 VEHICLE NON-COMMERCIAL
173 5288 VEHICLE-COMMERCIAL
174 3 VEHICLE-COMMERCIAL - ENTERTAINMENT/PARTY BUS
175 4 VEHICLE-COMMERCIAL - TROLLEY BUS
176 17 VESTIBULE
177 9112 WAREHOUSE
178 5 WOODED AREA
179 195 YARD
180 3 YMCA
res <- dbSendQuery(con, "
SELECT Latitude, Longitude
FROM crime
WHERE [Primary.Type]='ASSAULT' AND Ward='22'")
a <- fetch(res, n = -1)
dbClearResult(res)
plot(Latitude~Longitude, data=a, pch=".")
points()
function. points()
simply draws a point (or sequence of points) at the specified coordinatesres <- dbSendQuery(con, "
SELECT COUNT(*) as crimecount,
Longitude, Latitude
FROM crime
WHERE [Primary.Type]='ASSAULT' and Ward='22'
GROUP BY Longitude, Latitude")
b <- fetch(res, n=-1)
dbClearResult(res)
plot(Latitude~Longitude, data=a, pch=".")
points(b[which.max(b$crimecount), 2:3],
pch=16,
col="salmon",
cex=4)
b[which.max(b$crimecount), 2:3]
Longitude Latitude
2031 -87.70883 41.84905