Through our work with the UCR, we’ve already discussed reported crime. However, not all crimes are reported to the police. In addition, the UCR often lacks specific information about a victim-involved crime incident, such as whether the victim knew the offender or where the incident took place.
Each year, the U.S. Census Bureau conducts the National Crime Victimization Survey (NCVS), which is a valuable source of self-reported victimization data. The Census Bureau interviews a sample of people about the number and characteristics of crime victimizations they experienced during the prior 6 months. In 2015, for example, they collected data from 95,760 households and 163,880 persons.
The NCVS contains valuable information about nonfatal personal crimes such as rape or robbery as well as property crimes such as burglary. Additional information about the NCVS can be found at the BJS website. To give a sense of the type of data that the NCVS contains, refer to the Official 2012-2013 BJS Crime Victimization report.
The University of Michigan consolidates the NCVS data into a format that is easily accessible in R. We will be using 2012 and 2013 NCVS data.
First, we will download the NCVS 2012 data, ICPSR 34650. Download the files DS1, DS2, DS3, DS4, and DS5 in R format, along with DS0, the Codebook (which is in PDF format). We will refer to the codebook frequently. For the DS1 through DS5 files, we are interested in the .rda files.
Next, download the NCVS 2013 data, ICPSR 35164. Same drill as above - retrieve DS1, DS2, DS3, DS4, and DS5 in R format.
All told, you should have ten .rda files and one PDF codebook. The codebook is extremely important for understanding what the variable names stand for, so become familiar with it as soon as you can. For now, we won’t be using the DS5 files much. Also, the file names are admittedly a bit unwieldy with all the numbers, so it might be a good idea to rename them to something that helps you quickly distinguish among the files. We’ve created subfolders called NCVS2012 and NCVS2013 that contain the files extracted from the data download. Here are the files we have in our NCVS2012 and NCVS2013 subfolders.
list.files("NCVS2012/",recursive = TRUE)
[1] "34650-Codebook.pdf" "34650-descriptioncitation.pdf"
[3] "34650-manifest.txt" "34650-related_literature.txt"
[5] "DS0001/34650-0001-Data.rda" "DS0002/34650-0002-Data.rda"
[7] "DS0003/34650-0003-Data.rda" "DS0004/34650-0004-Data.rda"
[9] "DS0005/34650-0005-Data.rda" "factor_to_numeric_icpsr.R"
[11] "series-95-related_literature.txt" "TermsOfUse.html"
list.files("NCVS2013/",recursive = TRUE)
[1] "35164-Codebook.pdf" "35164-descriptioncitation.pdf"
[3] "35164-manifest.txt" "35164-related_literature.txt"
[5] "DS0001/35164-0001-Data.rda" "DS0002/35164-0002-Data.rda"
[7] "DS0003/35164-0003-Data.rda" "DS0004/35164-0004-Data.rda"
[9] "DS0005/35164-0005-Data.rda" "factor_to_numeric_icpsr.R"
[11] "series-95-related_literature.txt" "TermsOfUse.html"
Let’s see what’s in these .rda files. The DS1s for both 2012 and 2013 are the address record-type files. First, 2012:
load("NCVS2012/DS0001/34650-0001-Data.rda")
ls()
[1] "da34650.0001"
head(da34650.0001)
V1001 YEARQ IDHH V1002
1 (1) Address record 2012.1 2501017260961929294229224 27296
2 (1) Address record 2012.1 2501051210759582293728435 24034
3 (1) Address record 2012.1 2501286218428920608853213 26233
4 (1) Address record 2012.1 2501382697440982298228224 27298
5 (1) Address record 2012.1 2501533299154388298804435 24033
6 (1) Address record 2012.1 2501586708146353299320324 27299
V1003 V1004 V1005 V1006 V1008 V1009 V1010
1 (121) 2012, 1st quarter 25 01017260961929294229 2 24 2012 6172013
2 (121) 2012, 1st quarter 25 01051210759582293728 4 35 2012 6172013
3 (121) 2012, 1st quarter 25 01286218428920608853 2 13 2012 6172013
4 (121) 2012, 1st quarter 25 01382697440982298228 2 24 2012 6172013
5 (121) 2012, 1st quarter 25 01533299154388298804 4 35 2012 6172013
6 (121) 2012, 1st quarter 25 01586708146353299320 3 24 2012 6172013
As you can see, the DS1 for 2012 contains a unique identifier for each interviewed household. Let’s load the address record-type file for 2013.
load("NCVS2013/DS0001/35164-0001-Data.rda")
Let’s give these address record-type files for 2012 and 2013 more useful names.
dataAddr12 <- da34650.0001
dataAddr13 <- da35164.0001
By contrast, DS2 contains household information. Let’s load the household data and give them more useful names.
load("NCVS2012/DS0002/34650-0002-Data.rda")
load("NCVS2013/DS0002/35164-0002-Data.rda")
dataHH12 <- da34650.0002
dataHH13 <- da35164.0002
The DS3 files contain person-specific information, whereas the DS4 files provide incident information. Let’s load them and give them useful names.
load("NCVS2012/DS0003/34650-0003-Data.rda")
load("NCVS2013/DS0003/35164-0003-Data.rda")
dataPers12 <- da34650.0003
dataPers13 <- da35164.0003
load("NCVS2012/DS0004/34650-0004-Data.rda")
load("NCVS2013/DS0004/35164-0004-Data.rda")
dataInc12 <- da34650.0004
dataInc13 <- da35164.0004
Now that we’ve loaded and renamed all the files we’ll need, we can remove objects from our working environment that we no longer need. We can use rm() to accomplish this:
rm(da34650.0001,da34650.0002,da34650.0003,da34650.0004,
da35164.0001,da35164.0002,da35164.0003,da35164.0004)
Let’s examine in a bit more detail the first three rows of the person file. The dataset contains 240 columns so we will just show the first 40 columns here. Note IDHH (household ID), IDPER (person ID), and the relationship between the first two rows. Also, note that V3077 (Variable #3077) refers to who responded to the survey.
dataPers12[1:3, 1:40]
V3001 YEARQ IDHH
1 (3) Person record 2012.1 2501017260961929294229224
2 (3) Person record 2012.1 2501017260961929294229224
3 (3) Person record 2012.1 2501051210759582293728435
IDPER V3002 V3003 V3004
1 250101726096192929422922401 27296 (121) 2012, 1st quarter 25
2 250101726096192929422922402 27296 (121) 2012, 1st quarter 25
3 250105121075958229372843501 24034 (121) 2012, 1st quarter 25
V3005 V3006 V3008 V3009 V3010 V3011
1 01017260961929294229 2 24 1 1 (2) Telephone/self
2 01017260961929294229 2 24 2 2 (2) Telephone/self
3 01051210759582293728 4 35 1 1 (2) Telephone/self
V3012 V3013 V3014 V3015 V3016
1 (11) Reference person 22 22 (1) Married (1) Married
2 (02) Wife 18 18 (1) Married (1) Married
3 (11) Reference person 28 28 (5) Never married (6) Not inter last
V3017 V3018 V3019 V3020 V3023A
1 (1) Male (1) Male (1) Yes (28) High school grad (02) Black only
2 (2) Female (2) Female (2) No (28) High school grad (01) White only
3 (1) Male (1) Male (2) No (40) Some college(no degree) (01) White only
V3024 V3025 V3026 V3027 V3031 V3032 V3033 V3034 V3035 V3036
1 (2) No (02) February 27 2012 NA 9 NA (2) No NA <NA>
2 (1) Yes (02) February 2 2012 9 NA 3 (2) No NA (2) No
3 (2) No (03) March 11 2012 5 NA 3 (1) Yes 1 (1) Yes
V3037 V3038 V3039 V3040 V3041 V3042 V3043
1 NA <NA> NA (2) No NA (2) No NA
2 NA (2) No NA (2) No NA (2) No NA
3 1 (2) No NA (2) No NA (1) Yes 2
Let’s examine the corresponding household information. This dataset also has a lot of features so we will just show here the first 53 of 280 columns.
subset(dataHH12, IDHH=="2501017260961929294229224")[,1:53]
V2001 YEARQ IDHH V2002
1 (2) Household record 2012.1 2501017260961929294229224 27296
V2003 V2004 V2005 V2006 V2008 V2009
1 (121) 2012, 1st quarter 25 01017260961929294229 2 24 0
V2010 V2011 V2012 V2013 V2014
1 (1) Unit in smpl/prev (1) Same hhld 2 (998) Residue (2) Rented for cash
V2015 V2016 V2017 V2018 V2019
1 (2) Rented for cash (1) Urban (1) Urban <NA> (7) Item blank
V2020 V2021 V2022 V2023 V2024
1 (01) House/apt/flat (01) House/apt/flat (1) Phone/unit (1) Yes (04) Four
V2025 V2025A V2025B V2026 V2027 V2028 V2029
1 (1) Yes (1) Yes (1) Yes (07) 17,500-19,999 <NA> NA NA
V2030 V2031 V2032 V2033 V2034 V2035
1 (300) Interviewed hhld <NA> (02) Wife 18 (1) Married (1) Married
V2036 V2037 V2038 V2040A V2041 V2042
1 (2) Female (2) No (28) High school grad (01) White only (1) Yes 22
V2043 V2044 V2045 V2046 V2047
1 (1) Married (1) Married (1) Male (1) Yes (28) High school grad
V2049A V2050 V2051 V2052
1 (02) Black only (2) No NA NA
And the corresponding incident file (just the first 43 of 950 columns):
dataInc12[1:3, 1:43]
V4001 YEARQ IDHH
1 (4) Incident record 2012.1 2501051210759582293728435
2 (4) Incident record 2012.1 2501051210759582293728435
3 (4) Incident record 2012.1 2501051210759582293728435
IDPER V4002 V4003 V4004
1 250105121075958229372843501 24034 (121) 2012, 1st quarter 25
2 250105121075958229372843501 24034 (121) 2012, 1st quarter 25
3 250105121075958229372843501 24034 (121) 2012, 1st quarter 25
V4005 V4006 V4008 V4009 V4010 V4011 V4012
1 01051210759582293728 4 35 1 1 (36) 36:Indiv scrn quest 1
2 01051210759582293728 4 35 1 1 (37) 37:Hhld scrn quest 1
3 01051210759582293728 4 35 1 1 (41) 41:Indiv scrn quest 1
V4013 V4014 V4015 V4016 V4017 V4018 V4019
1 (2) Bef mov this add (09) September 2011 1 (1) 1-5 incidents <NA> <NA>
2 (2) Bef mov this add (09) September 2011 1 (1) 1-5 incidents <NA> <NA>
3 (2) Bef mov this add (09) September 2011 2 (1) 1-5 incidents <NA> <NA>
V4021B V4022 V4023 V4023B V4024 V4025
1 (01) Aft 6am-12am (4) Diff city etc (2) No (2) No (02) R/hme-det bldg (2) No
2 (01) Aft 6am-12am (4) Diff city etc (2) No (2) No (01) R/hme-own dwell (2) No
3 (06) Aft 9pm-12pm (4) Diff city etc (2) No (2) No (12) Comm-rest/bar <NA>
V4026 V4027 V4028 V4029 V4030 V4031 V4032 V4033 V4034
1 (1) Yes <NA> (1) Yes (1) At least 1 entry (0) No (0) No (0) No (0) No (0) No
2 (1) Yes <NA> (2) No <NA> <NA> <NA> <NA> <NA> <NA>
3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
V4035 V4036 V4037 V4038 V4039 V4040 V4041A
1 (0) No (1) Yes (0) No (0) No (0) No out of range <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA> (04) Unlk door/win <NA>
3 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
Let’s look at the month (V4014) and year (V4015) of the crime incident:
with(dataInc12, table(V4014,V4015))
with(dataInc13, table(V4014,V4015))
V4015
V4014 2011 2012
(01) January 0 728
(02) February 0 658
(03) March 0 705
(04) April 0 751
(05) May 0 768
(06) June 0 825
(07) July 159 670
(08) August 296 560
(09) September 366 426
(10) October 492 298
(11) November 608 139
(12) December 766 0
(98) Residue 0 0
V4015
V4014 2012 2013
(1) January 0 566
(2) February 0 580
(3) March 0 615
(4) April 0 526
(5) May 0 688
(6) June 0 649
(7) July 144 580
(8) August 245 474
(9) September 306 306
(10) October 440 238
(11) November 557 116
(12) December 697 0
(98) Residue 0 0
Next, we can create a 2012 incident dataframe. Importantly, the 2012 data contain incidents that occurred in 2011 as well as 2012, all of which were self-reported to the Census Bureau in 2012. Likewise, the 2013 data contain incidents that occurred in 2012 as well as 2013. If we want to analyze crime that occurred only in 2012, we need to subset the data accordingly. We will combine the 2012 and 2013 incident dataframes and then subset this new dataframe to exclude 2011 and 2013. As we can see in the Codebook PDF, the variable V4015 refers to the year of occurrence. (Helpful hint: the numbering of the variables corresponds to the numbering of the dataset. The incident-level file is DS4, and many of the variables in DS4 are V4XXX.)
rbind() binds rows. This works when the columns in two datasets are exactly the same.
dataInc <- rbind(dataInc12,dataInc13)
table(dataInc$V4015) # year crime occurred

2011 2012 2013
2687 8917 5338

dataInc <- subset(dataInc, V4015==2012)
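As an aside, here is a toy illustration of rbind()’s requirement that the columns match (the data frames x and y below are made up):

```r
# Toy data frames with identical columns
x <- data.frame(id = 1:2, year = 2012)
y <- data.frame(id = 3:4, year = 2013)
xy <- rbind(x, y)   # 4 rows; the columns must match by name
xy
```

If the two data frames had different columns, rbind() would stop with an error.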
We also want to exclude crimes that happened outside the United States. According to the Codebook, V4022 refers to location. The is.na(V4022) condition keeps incidents whose location is unknown (NA), which subset() would otherwise drop.
dataInc <- subset(dataInc, (V4022!="(1) Outside U.S.") | is.na(V4022))
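The is.na(V4022) condition matters because subset() silently drops rows where the condition evaluates to NA. A minimal demonstration with a made-up data frame:

```r
d <- data.frame(loc = c("US", "Outside", NA))
subset(d, loc != "Outside")               # drops the NA row as well
subset(d, loc != "Outside" | is.na(loc))  # keeps the NA row
```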
A lot of crimes happen in a series. The BJS convention is to include up to 10 occurrences in a series crime.
i <- with(dataInc, which((V4019=="(2) No (is series)") & (V4016>=11) & (V4016<=996)))
dataInc$V4016[i] <- 10
dataInc$V4016[dataInc$V4016>=997] <- NA
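A toy version of this capping rule, with made-up series counts:

```r
n <- c(3, 12, 500, 998)        # made-up incident counts
n[n >= 11 & n <= 996] <- 10    # cap series crimes at 10 occurrences
n[n >= 997] <- NA              # 997+ are missing-value codes
n                              # 3 10 10 NA
```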
Also, BJS analyses of NCVS data generally use weights because NCVS is survey data. We want to weight the survey data so that they are representative of the wider U.S. population! There are three NCVS weight categories: household, personal, and incident.
For more information about NCVS weights, consult the section on Weighting Information found at this ICPSR resource guide to the NCVS: (https://www.icpsr.umich.edu/icpsrweb/NACJD/NCVS/accuracy.jsp).
To that end, let’s update the weight for series crimes and create a “data year” weight.
i <- which(dataInc$V4019=="(2) No (is series)")
dataInc$WGTVICDY <- dataInc$WGTVICCY
dataInc$WGTVICDY[i] <- with(dataInc, WGTVICDY[i] * V4016[i])
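To see what this weight update does, here is a sketch with made-up weights and series counts (w and k are stand-ins for WGTVICCY and V4016):

```r
# Made-up example: the second incident is a series representing 6 incidents
w <- c(100, 100)          # collection year weights (stand-ins for WGTVICCY)
k <- c(1, 6)              # incident counts (stand-ins for V4016)
series <- c(FALSE, TRUE)  # flags the series crimes
w[series] <- w[series] * k[series]
w                         # 100 600
```

A series crime reported as 6 incidents now carries 6 times the weight of a single incident.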
We can also tabulate total weight by crime type to estimate the count of a crime. As the Codebook instructs, V4529 is the variable for crime type.
aggregate(WGTVICDY~V4529, data=dataInc, sum)
V4529 WGTVICDY
1 (01) Completed rape 74309.666
2 (02) Attempted rape 59501.772
3 (03) Sex aslt w s aslt 41212.611
4 (04) Sex aslt w m aslt 6515.781
5 (05) Rob w inj s aslt 79343.272
6 (06) Rob w inj m aslt 77564.887
7 (07) Rob wo injury 176027.246
8 (08) At rob inj s asl 28969.151
9 (09) At rob inj m asl 26869.716
10 (10) At rob w aslt 148857.011
11 (11) Ag aslt w injury 385348.494
12 (12) At ag aslt w wea 271055.951
13 (13) Thr aslt w weap 421411.004
14 (14) Simp aslt w inj 954981.736
15 (15) Sex aslt wo inj 32580.327
16 (16) Unw sex wo force 15992.059
17 (17) Asl wo weap, wo inj 2005635.943
18 (18) Verbal thr rape 39745.499
19 (19) Ver thr sex aslt 15369.782
20 (20) Verbal thr aslt 2019545.074
21 (21) Purse snatching 15990.538
22 (22) At purse snatch 7272.660
23 (23) Pocket picking 126418.096
24 (31) Burg, force ent 1215286.994
25 (32) Burg, ent wo for 1758044.551
26 (33) Att force entry 711352.327
27 (40) Motor veh theft 480278.161
28 (41) At mtr veh theft 165996.837
29 (54) Theft < $10 1115139.162
30 (55) Theft $10-$49 2899929.059
31 (56) Theft $50-$249 4918627.396
32 (57) Theft $250+ 3790419.581
33 (58) Theft value NA 1369499.977
34 (59) Attempted theft 686151.735
35 (1) Completed rape 54822.944
36 (2) Attempted rape 1640.455
37 (3) Sex aslt w s aslt 5774.439
38 (5) Rob w inj s aslt 53467.958
39 (6) Rob w inj m aslt 64188.001
40 (7) Rob wo injury 59359.504
41 (9) At rob inj m asl 10626.371
As you can see, there are some irregularities with the coding of crime types. Sometimes a type is coded as “(01)”, but other times it is coded as “(1)”. Let’s standardize this coding using regular expressions.
dataInc$V4529 <- gsub("\\(([1-9])\\)", "(0\\1)", dataInc$V4529)
aggregate(WGTVICDY~V4529, data=dataInc, sum)
V4529 WGTVICDY
1 (01) Completed rape 129132.610
2 (02) Attempted rape 61142.227
3 (03) Sex aslt w s aslt 46987.050
4 (04) Sex aslt w m aslt 6515.781
5 (05) Rob w inj s aslt 132811.230
6 (06) Rob w inj m aslt 141752.888
7 (07) Rob wo injury 235386.750
8 (08) At rob inj s asl 28969.151
9 (09) At rob inj m asl 37496.087
10 (10) At rob w aslt 148857.011
11 (11) Ag aslt w injury 385348.494
12 (12) At ag aslt w wea 271055.951
13 (13) Thr aslt w weap 421411.004
14 (14) Simp aslt w inj 954981.736
15 (15) Sex aslt wo inj 32580.327
16 (16) Unw sex wo force 15992.059
17 (17) Asl wo weap, wo inj 2005635.943
18 (18) Verbal thr rape 39745.499
19 (19) Ver thr sex aslt 15369.782
20 (20) Verbal thr aslt 2019545.074
21 (21) Purse snatching 15990.538
22 (22) At purse snatch 7272.660
23 (23) Pocket picking 126418.096
24 (31) Burg, force ent 1215286.994
25 (32) Burg, ent wo for 1758044.551
26 (33) Att force entry 711352.327
27 (40) Motor veh theft 480278.161
28 (41) At mtr veh theft 165996.837
29 (54) Theft < $10 1115139.162
30 (55) Theft $10-$49 2899929.059
31 (56) Theft $50-$249 4918627.396
32 (57) Theft $250+ 3790419.581
33 (58) Theft value NA 1369499.977
34 (59) Attempted theft 686151.735
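To see what the regular expression itself does, here is a quick check on a few made-up strings:

```r
codes <- c("(1) Completed rape", "(01) Completed rape", "(40) Motor veh theft")
# pad a lone digit in parentheses with a leading zero; two-digit codes are untouched
gsub("\\(([1-9])\\)", "(0\\1)", codes)
# "(01) Completed rape" "(01) Completed rape" "(40) Motor veh theft"
```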
Now, we can use the NCVS incident data to find out how many car thefts occurred in 2012.
with(subset(dataInc, V4529=="(40) Motor veh theft"),
sum(WGTVICDY))
[1] 480278.2
Also, note that the definition of rape changed in 2013.
with(subset(dataInc,V4529=="(01) Completed rape"),
sum(WGTVICDY))
[1] 129132.6
So far, we’ve created a dataframe and worked with weights for the incident data. However, the household and person files also contain information we might need. Let’s first create a 2012 data year household dataframe, much as we did with the incident data. Note that YEARQ refers to the year and quarter of the interview. The variable V2130 is the month allocated from the panel/rotation number, which refers to the process through which interviews are conducted.
dataHH <- rbind(dataHH12,dataHH13)
dataHH <- subset(dataHH, YEARQ>=2012.1 & YEARQ<=2013.2)
Let’s make the “month allocated” coding uniform by using a regular expression to delete the “0” that follows an opening parenthesis.
table(dataHH$V2130)
dataHH$V2130 <- gsub("\\(0", "\\(", dataHH$V2130)
table(dataHH$V2130)
(01) January (02) February (03) March (04) April (05) May
10602 10567 10695 10614 10511
(06) June (07) July (08) August (09) September (10) October
10659 10572 10624 10678 10692
(11) November (12) December (1) January (2) February (3) March
10597 10630 10612 10573 10702
(4) April (5) May (6) June (7) July (8) August
10720 10661 10603 0 0
(9) September
0
(1) January (10) October (11) November (12) December (2) February
21214 10692 10597 10630 21140
(3) March (4) April (5) May (6) June (7) July
21397 21334 21172 21262 10572
(8) August (9) September
10624 10678
When you view the table again, you can see that the original 21 months listed were condensed into 12.
Next, create a 2012 data year person dataframe. First we need to fix an incompatibility between the 2012 and 2013 files: factor levels in 2012 look like “(1) Yes”, but in 2013 they are just “1”.
i <- sapply(dataPers12, levels)  # factor levels for each variable
i <- i[!sapply(i, is.null)]      # keep only the factor variables (non-factors return NULL)
i <- sapply(i, function(x) all(substring(x,1,1)=="("))  # TRUE when every level starts with "("
var.fix <- names(i)[i]           # names of variables whose factor levels begin with "("
for(xj in var.fix)               # loop over each such variable
{
  dataPers12[,xj] <- gsub("\\(([0-9]+)\\).*","\\1",dataPers12[,xj])  # keep only the number inside the parentheses
  dataPers12[,xj] <- as.numeric(dataPers12[,xj])  # convert to numeric
}
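To see what the loop does to a single variable, here is a toy example with a made-up factor:

```r
f <- factor(c("(1) Married", "(5) Never married"))
# strip everything but the number inside the parentheses, then convert
as.numeric(gsub("\\(([0-9]+)\\).*", "\\1", f))   # 1 5
```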
Then, stack the 2012 and 2013 data frames using rbind().
dataPers <- rbind(dataPers12, dataPers13)
dataPers <- subset(dataPers, YEARQ>=2012.1 & YEARQ<=2013.2)
Now that we’ve created a person dataframe and an incident dataframe, we can merge them together. We will use merge() to pull age, marital status, and sex into the incident data. The merge() function has several parameters that tell R which features should be used for matching and which ones should be merged. Here we tell merge() to use a pair of features from the incident data (IDPER and YEARQ) and look up the row in dataPers with the same values of IDPER and YEARQ. We’ve selected only five columns from dataPers: IDPER, YEARQ, V3014, V3015, and V3018. merge() uses the first two to identify matching rows, and the last three will be attached as new columns to dataInc.
a <- merge(dataInc, # incident data
dataPers[,c("IDPER","YEARQ", # IDPER & YEARQ unique IDs of person
"V3014", # age
"V3015", # marital status
"V3018")], # sex
by=c("IDPER","YEARQ"), # variables used to merge
all.x=TRUE) # keep all incidents, even if not matched
# a should have the same number of rows as dataInc, but 3 additional new columns
dim(dataInc)
[1] 8852 951
dim(a)
[1] 8852 954
# replace dataInc with a, now containing age, marital, and sex
dataInc <- a
# check merge for first incident
dataInc[1,c("IDPER","YEARQ","V3014","V3015","V3018")]
IDPER YEARQ V3014 V3015 V3018
1 250105121075958229372843501 2012.3 28 3 1
# check dataPers for this person's age, marital, and sex
subset(dataPers, IDPER=="250105121075958229372843501" & YEARQ==2012.3,
select = c("IDPER","YEARQ","V3014","V3015","V3018"))
IDPER YEARQ V3014 V3015 V3018
95199 250105121075958229372843501 2012.3 28 3 1
We can see that the first row of dataInc now has three additional columns, and that they have the correct values merged from the dataPers data.
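As a toy illustration of how merge() matches rows on key columns (the data frames inc and pers below are made up):

```r
inc  <- data.frame(IDPER = c("a", "a", "b"),
                   crime = c("theft", "assault", "theft"))
pers <- data.frame(IDPER = c("a", "b"), age = c(28, 44))
m <- merge(inc, pers, by = "IDPER", all.x = TRUE)  # each incident gains an age
m
```

Person “a” appears in two incidents, so their age is attached to both rows.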
Let’s give these new columns better names.
names(dataInc)[names(dataInc)=="V3014"] <- "age"
names(dataInc)[names(dataInc)=="V3015"] <- "marital"
names(dataInc)[names(dataInc)=="V3018"] <- "sex"
Let’s also create a new variable that breaks age into age categories.
dataInc$ageGroup <- cut(dataInc$age, breaks=c(0,16,21,35,45,60,110))
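cut() assigns each value to a right-closed interval defined by the breaks. A quick check with made-up ages:

```r
ages <- c(15, 16, 17, 40, 70)   # made-up ages
ageGroup <- cut(ages, breaks = c(0, 16, 21, 35, 45, 60, 110))
ageGroup   # (0,16] (0,16] (16,21] (35,45] (60,110]
```

Note that 16 falls in (0,16] because the intervals are closed on the right by default.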
Note that “8” is a missing value indicator for marital status. Always refer to the Codebook if you are not sure what a variable or a categorical variable value means.
dataInc$marital[dataInc$marital==8] <- NA
Factor variables in R put meaningful labels on categorical variables. Instead of working with the numbers 1-5 for marital status, let’s assign the number values their actual corresponding names.
dataInc$marital <- factor(dataInc$marital, levels=1:5,
labels=c("married","widowed","divorced",
"separated","never married"))
dataInc$sex <- factor(dataInc$sex, levels=1:2,
labels=c("male","female"))
Let’s get estimated counts by age group and sex.
aggregate(WGTVICDY~ageGroup+sex, data=dataInc, FUN=sum)
ageGroup sex WGTVICDY
1 (0,16] male 1198909.6
2 (16,21] male 1274033.7
3 (21,35] male 3539889.7
4 (35,45] male 2095416.6
5 (45,60] male 3024668.5
6 (60,110] male 1337477.9
7 (0,16] female 887078.5
8 (16,21] female 1243057.6
9 (21,35] female 4320788.8
10 (35,45] female 2307591.3
11 (45,60] female 3240564.4
12 (60,110] female 1921647.3
We can also examine common crime types by sex. As before, aggregate() will total up the weights, but as you saw in the ageGroup/sex example above, aggregate() produces results in a long format. Sometimes this is useful, but sometimes we want our results side-by-side. We will use reshape() to convert the “long format” results from aggregate() to a “wide format”.
a <- aggregate(WGTVICDY~V4529+sex, data=dataInc, FUN=sum)
a <- reshape(a, timevar="sex", idvar="V4529", direction="wide")
a[is.na(a)] <- 0
names(a) <- c("crimeType","male","female")
a
crimeType male female
1 (01) Completed rape 6318.130 122814.480
2 (02) Attempted rape 42077.861 19064.366
3 (03) Sex aslt w s aslt 38218.021 8769.029
4 (05) Rob w inj s aslt 80534.437 52276.793
5 (06) Rob w inj m aslt 35610.607 106142.282
6 (07) Rob wo injury 150662.017 84724.733
7 (08) At rob inj s asl 22330.349 6638.802
8 (09) At rob inj m asl 12200.917 25295.171
9 (10) At rob w aslt 104657.340 44199.671
10 (11) Ag aslt w injury 188925.090 196423.404
11 (12) At ag aslt w wea 185157.394 85898.556
12 (13) Thr aslt w weap 237527.692 183883.312
13 (14) Simp aslt w inj 448773.257 506208.479
14 (15) Sex aslt wo inj 3119.587 29460.740
15 (16) Unw sex wo force 2957.926 13034.133
16 (17) Asl wo weap, wo inj 1042741.375 962894.567
17 (18) Verbal thr rape 26408.008 13337.490
18 (19) Ver thr sex aslt 9298.262 6071.520
19 (20) Verbal thr aslt 1099721.249 919823.826
20 (23) Pocket picking 81230.111 45187.984
21 (31) Burg, force ent 609106.185 606180.810
22 (32) Burg, ent wo for 741492.194 1016552.357
23 (33) Att force entry 269383.309 441969.018
24 (40) Motor veh theft 256959.885 223318.276
25 (41) At mtr veh theft 87364.540 78632.297
26 (54) Theft < $10 444360.185 670778.978
27 (55) Theft $10-$49 1217450.179 1682478.881
28 (56) Theft $50-$249 2261589.762 2657037.634
29 (57) Theft $250+ 1825854.971 1964564.610
30 (58) Theft value NA 588405.556 781094.421
31 (59) Attempted theft 349959.481 336192.254
35 (04) Sex aslt w m aslt 0.000 6515.781
52 (21) Purse snatching 0.000 15990.538
53 (22) At purse snatch 0.000 7272.660
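To see what reshape() is doing, here is a minimal long-to-wide example with made-up data:

```r
long <- data.frame(crime = c("theft", "theft", "rape", "rape"),
                   sex   = c("male", "female", "male", "female"),
                   wgt   = c(10, 20, 1, 5))
# one row per crime; the sex categories become the columns wgt.male, wgt.female
wide <- reshape(long, timevar = "sex", idvar = "crime", direction = "wide")
wide
```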
We can then convert this result to column percentages. To obtain a column percentage, we divide counts for an individual cell by the total number of counts for the column. So, the sum of all the values in the male column should equal 100:
temp <- a
temp$male <- with(temp, 100*male/ sum(male))
temp$female <- with(temp, 100*female/sum(female))
colSums(temp[,-1]) # check that the columns sum to 100
male female
100 100
temp$ratio <- temp$female/temp$male
temp[order(-temp$ratio),]
crimeType male female ratio
35 (04) Sex aslt w m aslt 0.00000000 0.04680632 Inf
52 (21) Purse snatching 0.00000000 0.11486855 Inf
53 (22) At purse snatch 0.00000000 0.05224339 Inf
1 (01) Completed rape 0.05066503 0.88224180 17.4132299
14 (15) Sex aslt wo inj 0.02501594 0.21163218 8.4598928
15 (16) Unw sex wo force 0.02371958 0.09363112 3.9474183
5 (06) Rob w inj m aslt 0.28556116 0.76247652 2.6700989
8 (09) At rob inj m asl 0.09783905 0.18170868 1.8572204
23 (33) Att force entry 2.16018250 3.17489877 1.4697364
26 (54) Theft < $10 3.56332060 4.81856254 1.3522675
27 (55) Theft $10-$49 9.76272278 12.08614160 1.2379888
22 (32) Burg, ent wo for 5.94601969 7.30243682 1.2281219
30 (58) Theft value NA 4.71841922 5.61101711 1.1891731
28 (56) Theft $50-$249 18.13566934 19.08691602 1.0524517
13 (14) Simp aslt w inj 3.59870899 3.63636503 1.0104638
29 (57) Theft $250+ 14.64151570 14.11251359 0.9638697
10 (11) Ag aslt w injury 1.51498871 1.41101390 0.9313692
21 (31) Burg, force ent 4.88441739 4.35451951 0.8915126
31 (59) Attempted theft 2.80632214 2.41504796 0.8605740
16 (17) Asl wo weap, wo inj 8.36173435 6.91698435 0.8272189
25 (41) At mtr veh theft 0.70057552 0.56485766 0.8062766
24 (40) Motor veh theft 2.06055917 1.60421408 0.7785334
19 (20) Verbal thr aslt 8.81865547 6.60758428 0.7492734
12 (13) Thr aslt w weap 1.90473257 1.32093174 0.6934998
18 (19) Ver thr sex aslt 0.07456269 0.04361496 0.5849435
4 (05) Rob w inj s aslt 0.64580498 0.37553204 0.5814945
6 (07) Rob wo injury 1.20815745 0.60862286 0.5037612
20 (23) Pocket picking 0.65138358 0.32460935 0.4983382
17 (18) Verbal thr rape 0.21176560 0.09581030 0.4524356
11 (12) At ag aslt w wea 1.48477559 0.61705507 0.4155881
2 (02) Attempted rape 0.33742202 0.13694949 0.4058700
9 (10) At rob w aslt 0.83924634 0.31750977 0.3783273
7 (08) At rob inj s asl 0.17906688 0.04769005 0.2663253
3 (03) Sex aslt w s aslt 0.30646999 0.06299260 0.2055425
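Incidentally, base R’s prop.table() computes such column proportions directly; a toy example with a made-up matrix of counts:

```r
m <- matrix(c(10, 30, 20, 40), nrow = 2)   # made-up 2x2 count matrix
pct <- 100 * prop.table(m, margin = 2)     # margin = 2 gives column percentages
colSums(pct)                               # each column sums to 100
```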
Or we can compute row percentages to determine what percentage of each crime type’s victimizations involve male versus female victims.
temp <- a
row.total <- with(temp, male+female)
temp$male <- with(temp, 100*male/ row.total)
temp$female <- with(temp, 100*female/row.total)
rowSums(temp[,-1]) # check that the rows sum to 100
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
 21  22  23  24  25  26  27  28  29  30  31  35  52  53
100 100 100 100 100 100 100 100 100 100 100 100 100 100
temp$ratio <- temp$female/temp$male
temp[order(-temp$ratio),]
crimeType male female ratio
35 (04) Sex aslt w m aslt 0.000000 100.00000 Inf
52 (21) Purse snatching 0.000000 100.00000 Inf
53 (22) At purse snatch 0.000000 100.00000 Inf
1 (01) Completed rape 4.892745 95.10725 19.4384234
14 (15) Sex aslt wo inj 9.575063 90.42494 9.4437952
15 (16) Unw sex wo force 18.496217 81.50378 4.4065110
5 (06) Rob w inj m aslt 25.121609 74.87839 2.9806367
8 (09) At rob inj m asl 32.539173 67.46083 2.0732188
23 (33) Att force entry 37.869182 62.13082 1.6406696
26 (54) Theft < $10 39.847958 60.15204 1.5095389
27 (55) Theft $10-$49 41.982068 58.01793 1.3819694
22 (32) Burg, ent wo for 42.177099 57.82290 1.3709549
30 (58) Theft value NA 42.964992 57.03501 1.3274763
28 (56) Theft $50-$249 45.980099 54.01990 1.1748539
13 (14) Simp aslt w inj 46.992863 53.00714 1.1279827
29 (57) Theft $250+ 48.170260 51.82974 1.0759697
10 (11) Ag aslt w injury 49.027074 50.97293 1.0396894
21 (31) Burg, force ent 50.120357 49.87964 0.9951973
31 (59) Attempted theft 51.003220 48.99678 0.9606605
16 (17) Asl wo weap, wo inj 51.990561 48.00944 0.9234261
25 (41) At mtr veh theft 52.630244 47.36976 0.9000482
24 (40) Motor veh theft 53.502305 46.49770 0.8690784
19 (20) Verbal thr aslt 54.453910 45.54609 0.8364154
12 (13) Thr aslt w weap 56.364853 43.63515 0.7741553
18 (19) Ver thr sex aslt 60.497034 39.50297 0.6529736
4 (05) Rob w inj s aslt 60.638274 39.36173 0.6491235
6 (07) Rob wo injury 64.006159 35.99384 0.5623496
20 (23) Pocket picking 64.255130 35.74487 0.5562960
17 (18) Verbal thr rape 66.442765 33.55724 0.5050548
11 (12) At ag aslt w wea 68.309658 31.69034 0.4639218
2 (02) Attempted rape 68.819641 31.18036 0.4530735
9 (10) At rob w aslt 70.307297 29.69270 0.4223275
7 (08) At rob inj s asl 77.083202 22.91680 0.2972995
3 (03) Sex aslt w s aslt 81.337349 18.66265 0.2294475