I will be attempting to answer a few questions regarding baseball data. One from the view of a GM and one from the view of being the commissioner of Major League Baseball.
The first question, from the view of a GM is:
## 'data.frame': 19617 obs. of 5 variables:
## $ playerID : Factor w/ 19617 levels "aardsda01","aaronha01",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ nameFirst: Factor w/ 2446 levels "","A. J.","Aaron",..: 540 965 2203 632 80 814 1217 683 170 352 ...
## $ nameLast : Factor w/ 10034 levels "Aardsma","Aaron",..: 1 2 2 3 4 4 5 6 7 7 ...
## $ bbrefID : Factor w/ 19617 levels "","aardsda01",..: 2 3 4 5 6 7 8 9 10 11 ...
## $ address : chr "Denver, CO, USA" "Mobile, AL, USA" "Mobile, AL, USA" "Orange, CA, USA" ...
## 'data.frame': 19526 obs. of 2 variables:
## $ bbrefID : Factor w/ 19744 levels "","aardsda01",..: 2 3 4 5 6 7 8 9 10 11 ...
## $ fg_playerid: int 1902 1000001 1000002 1000003 506 4994 1000004 1000005 1000006 1000007 ...
## 'data.frame': 4029 obs. of 3 variables:
## $ Name : Factor w/ 3991 levels "A.J. Ellis","A.J. Hinch",..: 199 207 3939 3815 1795 1696 3803 3625 3350 3563 ...
## $ wOBA : num 0.513 0.435 0.409 0.445 0.408 0.403 0.436 0.493 0.459 0.435 ...
## $ fg_playerid: int 1011327 1109 1008315 1002378 1013485 1000001 1012309 1014040 1006030 1009405 ...
MLB Definition: WAR measures a player’s value in all facets of the game by deciphering how many more wins he’s worth than a replacement-level player at his same position (e.g., a Minor League replacement or a readily available fill-in free agent).
I found that WAR was higher across the board for older players/players who had played more seasons. I normalized WAR by dividing the player’s WAR by the number of innings the player had pitched over their career and multiplying by 100. I named the new statistic iWAR.
## 'data.frame': 3126 obs. of 3 variables:
## $ Name : Factor w/ 3100 levels "A.J. Burnett",..: 2539 666 3021 1235 2419 2280 168 1154 2758 2359 ...
## $ fg_playerid: int 815 1014369 1006511 104 60 1011348 1001098 1010210 1001964 1000128 ...
## $ iWAR : num 2.72 1.79 1.98 2.33 2.67 ...
I combined these 4 datasets together to get one dataset to query from to try to answer the research question.
## 'data.frame': 6936 obs. of 7 variables:
## $ bbrefID : Factor w/ 19617 levels "","aardsda01",..: 2 3 4 5 7 9 10 11 14 16 ...
## $ playerID : Factor w/ 19617 levels "aardsda01","aaronha01",..: 1 2 3 4 6 8 9 10 13 15 ...
## $ address : chr "Denver, CO, USA" "Mobile, AL, USA" "Mobile, AL, USA" "Orange, CA, USA" ...
## $ fg_playerid: int 1902 1000001 1000002 1000003 4994 1000005 1000006 1000007 1000010 1000012 ...
## $ Name : Factor w/ 6838 levels "A.J. Ellis","A.J. Hinch",..: 4693 1696 3732 4772 4956 1212 4146 668 5101 5390 ...
## $ wOBA : num 0 0.403 0.282 0 0 0.319 0 0.356 0 0 ...
## $ iWAR : num 0.326 0 0 1.055 0.221 ...
The next issue I ran into was not having coordinates for the players birthplaces. The leaflet package works better with longitude and latitude coordinates. I found google has an API I was able to get access for free with the ggmap package. I created a for loop to run through the data frame and find the longitude and latitude for each row, from the address field.
cm.all_ex <- sample_n(cm.all,5,replace = FALSE)
kable(head(cm.all_ex))
bbrefID | playerID | address | fg_playerid | Name | wOBA | iWAR |
---|---|---|---|---|---|---|
kollodo01 | kollodo01 | Posen, IL, USA | 1007087 | Don Kolloway | 0.308 | 0.000 |
higgibo02 | higgibo02 | Philadelphia, PA, USA | 469 | Bobby Higginson | 0.354 | 0.000 |
macdomi01 | macdomi01 | Las Vegas, NV, USA | 612 | Mike MacDougal | 0.000 | 0.660 |
blackbu02 | blackbu02 | San Mateo, CA, USA | 1001007 | Bud Black | 0.000 | 0.886 |
smithal01 | smithal01 | New York, NY, USA | 1012087 | Aleck Smith | 0.291 | 0.000 |
register_google(key = "AIzaSyB0wuJ0t_q77XycPYc19W0lqZe4-cVFen4")
for(i in 1:nrow(cm.all_ex)){
result <- geocode(cm.all_ex$address[i], output = "latlona", source = "google")
cm.all_ex$lon[i] <- as.numeric(result[1])
cm.all_ex$lat[i] <- as.numeric(result[2])
cm.all_ex$geoAddress[i] <- as.character(result[3])
}
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Posen,+IL,+USA&key=xxx-cVFen4
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Philadelphia,+PA,+USA&key=xxx-cVFen4
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Las+Vegas,+NV,+USA&key=xxx-cVFen4
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=San+Mateo,+CA,+USA&key=xxx-cVFen4
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=New+York,+NY,+USA&key=xxx-cVFen4
kable(head(cm.all_ex))
bbrefID | playerID | address | fg_playerid | Name | wOBA | iWAR | lon | lat | geoAddress |
---|---|---|---|---|---|---|---|---|---|
kollodo01 | kollodo01 | Posen, IL, USA | 1007087 | Don Kolloway | 0.308 | 0.000 | -87.68144 | 41.63170 | posen, il, usa |
higgibo02 | higgibo02 | Philadelphia, PA, USA | 469 | Bobby Higginson | 0.354 | 0.000 | -75.16522 | 39.95258 | philadelphia, pa, usa |
macdomi01 | macdomi01 | Las Vegas, NV, USA | 612 | Mike MacDougal | 0.000 | 0.660 | -115.13983 | 36.16994 | las vegas, nv, usa |
blackbu02 | blackbu02 | San Mateo, CA, USA | 1001007 | Bud Black | 0.000 | 0.886 | -122.32553 | 37.56299 | san mateo, ca, usa |
smithal01 | smithal01 | New York, NY, USA | 1012087 | Aleck Smith | 0.291 | 0.000 | -74.00597 | 40.71278 | new york, ny, usa |
Now the data is prepped to be mapped and locate hotspots.
#All
leaflet(bb_plyrs) %>% addProviderTiles(providers$CartoDB.Positron) %>%
addHeatmap(lng = ~lon, lat = ~lat,intensity = ~wOBA, radius = 12) %>%
addMarkers(clusterOptions = markerClusterOptions()) %>%
addControl("All - Batter Heatmap wOBA", position = "topright") %>%
setView(-10.50781, 26.513789, 1)
## Assuming "lon" and "lat" are longitude and latitude, respectively
Potential hotspot #1
leaflet(bb_plyrs) %>% addProviderTiles(providers$CartoDB.Positron) %>% setView(-80.50781, 39.513789, 5) %>%
addHeatmap(lng = ~lon, lat = ~lat,intensity = ~wOBA, radius = 12) %>%
addMarkers(clusterOptions = markerClusterOptions()) %>%
addControl("Northeast - Batter Heatmap wOBA", position = "topright")
## Assuming "lon" and "lat" are longitude and latitude, respectively
Potential hotspot #2
leaflet(bb_plyrs) %>% addProviderTiles(providers$CartoDB.Positron) %>% setView(-78.50781, 22.513789, 4) %>%
addHeatmap(lng = ~lon, lat = ~lat,intensity = ~wOBA, radius = 12) %>%
addMarkers(clusterOptions = markerClusterOptions()) %>%
addControl("Caribbean Islands - Batter Heatmap wOBA", position = "topright")
## Assuming "lon" and "lat" are longitude and latitude, respectively
leaflet(bb_plyrs) %>% addProviderTiles(providers$CartoDB.Positron) %>%
addHeatmap(lng = ~lon, lat = ~lat,intensity = ~iWAR, radius = 12) %>%
addMarkers(clusterOptions = markerClusterOptions()) %>%
addControl("All - Pitcher Heatmap iWAR", position = "topright") %>%
setView(-10.50781, 26.513789, 1)
## Assuming "lon" and "lat" are longitude and latitude, respectively
Potential hotspot #1
leaflet(bb_plyrs) %>% addProviderTiles(providers$CartoDB.Positron) %>% setView(-75.50781, 41.513789, 6) %>%
addHeatmap(lng = ~lon, lat = ~lat,intensity = ~iWAR, radius = 12) %>%
addMarkers(clusterOptions = markerClusterOptions()) %>%
addControl("Northeast - Pitcher Heatmap iWAR", position = "topright")
## Assuming "lon" and "lat" are longitude and latitude, respectively
Potential hotspot #2
leaflet(bb_plyrs) %>% addProviderTiles(providers$CartoDB.Positron) %>% setView(-113.50781, 38.513789, 4) %>%
addHeatmap(lng = ~lon, lat = ~lat,intensity = ~iWAR, radius = 12) %>%
addMarkers(clusterOptions = markerClusterOptions()) %>%
addControl("West Coast - Pitcher Heatmap iWAR", position = "topright")
## Assuming "lon" and "lat" are longitude and latitude, respectively
Potential hotspot #3
leaflet(bb_plyrs) %>% addProviderTiles(providers$CartoDB.Positron) %>% setView(134.50781, 35.513789, 5) %>%
addHeatmap(lng = ~lon, lat = ~lat,intensity = ~iWAR, radius = 12) %>%
addMarkers(clusterOptions = markerClusterOptions()) %>%
addControl("Asia - Pitcher Heatmap iWAR", position = "topright")
## Assuming "lon" and "lat" are longitude and latitude, respectively
The second question, from the view of the commissioner of Major League Baseball:
## 'data.frame': 252 obs. of 6 variables:
## $ park.key : Factor w/ 252 levels "ALB01","ALT01",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ park : Factor w/ 241 levels "23rd Street Park",..: 174 43 10 11 167 17 218 205 119 140 ...
## $ park.alias: Factor w/ 57 levels "","3Com Park",..: 1 1 16 1 45 1 1 1 1 1 ...
## $ city : Factor w/ 85 levels "Albany","Altoona",..: 1 2 3 4 4 5 5 5 6 6 ...
## $ state : Factor w/ 36 levels "","AZ","CA","CO",..: 25 28 3 32 32 9 9 9 16 16 ...
## $ country : Factor w/ 6 levels "AU","CA","JP",..: 6 6 6 6 6 6 6 6 6 6 ...
## 'data.frame': 30 obs. of 12 variables:
## $ yearID : int 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 3 4 5 16 33 35 38 45 51 52 ...
## $ franchID: Factor w/ 120 levels "ALT","ANA","ARI",..: 3 5 6 14 29 26 30 32 38 41 ...
## $ divID : Factor w/ 4 levels "","C","E","W": 4 3 3 3 2 2 2 2 4 2 ...
## $ W : int 82 90 47 108 62 95 67 91 91 64 ...
## $ L : int 80 72 115 54 100 68 95 71 72 98 ...
## $ DivWin : Factor w/ 3 levels "","N","Y": 2 3 2 3 2 2 2 3 2 2 ...
## $ WCWin : Factor w/ 3 levels "","N","Y": 2 2 2 2 2 3 2 2 3 2 ...
## $ LgWin : Factor w/ 3 levels "","N","Y": 2 2 2 3 2 2 2 2 2 2 ...
## $ WSWin : Factor w/ 3 levels "","N","Y": 2 2 2 3 2 2 2 2 2 2 ...
## $ name : Factor w/ 139 levels "Altoona Mountain City",..: 3 4 8 16 41 37 48 52 56 59 ...
## $ park : Factor w/ 215 levels "","23rd Street Grounds",..: 37 189 133 68 198 211 75 147 49 44 ...
Merge together the two data frames to create the address of the active team’s stadium, condense down to only helpful fields.
park | name | address |
---|---|---|
Angel Stadium of Anaheim | Los Angeles Angels of Anaheim | Anaheim, CA, US |
AT&T Park | San Francisco Giants | San Francisco, CA, US |
Busch Stadium III | St. Louis Cardinals | St. Louis, MO, US |
Chase Field | Arizona Diamondbacks | Phoenix, AZ, US |
Citi Field | New York Mets | New York, NY, US |
I need to geolocate the stadiums based on the address field, I used the same for loop as before.
park | name | address | lon | lat | geoAddress |
---|---|---|---|---|---|
Angel Stadium of Anaheim | Los Angeles Angels of Anaheim | Anaheim, CA, US | -117.91430 | 33.83659 | anaheim, ca, usa |
AT&T Park | San Francisco Giants | San Francisco, CA, US | -122.41942 | 37.77493 | san francisco, ca, usa |
Busch Stadium III | St. Louis Cardinals | St. Louis, MO, US | -90.19940 | 38.62700 | st. louis, mo, usa |
Chase Field | Arizona Diamondbacks | Phoenix, AZ, US | -112.07404 | 33.44838 | phoenix, az, usa |
Citi Field | New York Mets | New York, NY, US | -74.00597 | 40.71278 | new york, ny, usa |
Now we can map where the current MLB teams are located.
After doing some research, I found that more than half of MLB fans are white males over the age of 55, another large portion of the fan demographics is white males ages 35-55. For this research question I consider this demographic to be the base demographic.
For the future demographic, I looked at the demographics of the players in the MLB. Over 30% of the players are Hispanic, this is growing every year. For my analysis I am considering Hispanic males under 30 as the future demographic.
If the MLB can find a place that has a large number of both the base and future demographics, that might be a good market to put a new franchise.
I need census data to merge together with the active teams to help answer this question. I found a package “tidycensus” it queries on the Census Bureau data warehouse.
I used the Census Bureau variable ids to query off of their tables and only pull populations by county for white males over 35 and Hispanic males under 30. I also filtered out outlier counties, these were mostly covered by active MLB franchises anyway.
vars <- c("B01001A_011","B01001A_012","B01001A_013","B01001A_014","B01001A_015","B01001A_016",'B01001I_003', 'B01001I_004', 'B01001I_005', 'B01001I_006', 'B01001I_007', 'B01001I_008', 'B01001I_009')
pop <- get_acs(geography = "county",
variables = vars,
geometry = TRUE) %>%
group_by(GEOID,NAME) %>%
summarise(sub_pop = sum(estimate)) %>%
filter(sub_pop >= 40000 & sub_pop <= 90000)
## Getting data from the 2013-2017 5-year ACS
pander(head(pop))
GEOID | NAME | sub_pop |
---|---|---|
01003 | Baldwin County, Alabama | 53990 |
01089 | Madison County, Alabama | 73607 |
01097 | Mobile County, Alabama | 70005 |
01117 | Shelby County, Alabama | 50001 |
02020 | Anchorage Municipality, Alaska | 59893 |
04003 | Cochise County, Arizona | 42308 |
geometry |
---|
MULTIPOLYGON (((-88.03 30.2… |
MULTIPOLYGON (((-86.79 34.5… |
MULTIPOLYGON (((-88.05 30.5… |
MULTIPOLYGON (((-87.03 33.2… |
MULTIPOLYGON (((-150.1 61.1… |
MULTIPOLYGON (((-110.5 31.5… |
Now I can map the active teams with the census data to see if there are holes in the locations of active MLB franchises and the demographics I have identified.
pop %>%
st_transform(crs = "+init=epsg:4326") %>%
leaflet() %>%
addProviderTiles(provider = "CartoDB.Positron") %>%
addMarkers(data = active_teams,~lon, ~lat, icon = stadium_icon, layerId = 2) %>%
addPolygons(popup = ~ paste(NAME,"-",sub_pop),
layerId = 1,
stroke = FALSE,
smoothFactor = 0,
fillOpacity = 20,
color = ~ pop_pal(sub_pop)) %>%
addControl("Site Suitability - New MLB Team", position = "topright") %>%
addLegend("bottomleft",
pal = pop_pal,
values = ~sub_pop,
title = "Population:<br>White Males > 35 &<br>Hispanic Males < 30",
opacity = 1) %>%
setView(lng =-96.62500, lat=29.65538, 3)
I identified the Carolina’s as a possible market for a new MLB franchise. I then created a buffer for that area and calculated the base and future demographic population within that buffer to see a possible fan base for the MLB franchise.
pop %>%
st_transform(crs = "+init=epsg:4326") %>%
leaflet() %>%
addProviderTiles(provider = "CartoDB.Positron") %>%
addMarkers(data = active_teams,~lon, ~lat, icon = stadium_icon, layerId = 2) %>%
addMarkers(lng =-80.925293, lat=34.488448, layerId = 3, label = paste("Buffer Zone Target Demo Pop:",formatC(buff_pop[[1]], format="f", big.mark=",", digits=0)),labelOptions = labelOptions(noHide = T)) %>%
addPolygons(popup = ~ paste(NAME,"-",sub_pop),
layerId = 1,
stroke = FALSE,
smoothFactor = 0,
fillOpacity = 20,
color = ~ pop_pal(sub_pop)) %>%
addPolygons(data=NewTeamBuf) %>%
addMeasure( position = "bottomleft",
primaryLengthUnit = "miles") %>%
addControl("Site Suitability - New MLB Team", position = "topright") %>%
addLegend("bottomright",
pal = pop_pal,
values = ~sub_pop,
title = "Population:<br>White Males > 35 &<br>Hispanic Males < 30",
opacity = 1) %>%
setView(lng =-78.925293, lat=33.988448, 6)
The possible fan base would be about 1 million people. This buffer is about a 3-hour drive from the proposed stadium site. A 3-hour drive seems reasonable for fans to drive to games4, and the teams games will be on TV and radio within this radius.
This is one small step the MLB would take to do market analysis for a new franchise, given more time I would have looked at a wider spectrum of demographics, economics, and other levels of baseball teams in the area.