Research Question

I will be attempting to answer a few questions regarding baseball data. One from the view of a GM and one from the view of being the commissioner of Major League Baseball.

The first question, from the view of a GM is:

  1. Looking at historical Major League Baseball data, back to the early 1900s; can we identify “hot spot” locations where elite pitchers and hitters were born? We will use this data to help decide where to send our scout to try and find the next elite player.

Data Wrangling

The Data

Load in 4 separate .CSV files:

  1. all_ppl = a historical list of players who have played in the MLB
## 'data.frame':    19617 obs. of  5 variables:
##  $ playerID : Factor w/ 19617 levels "aardsda01","aaronha01",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ nameFirst: Factor w/ 2446 levels "","A. J.","Aaron",..: 540 965 2203 632 80 814 1217 683 170 352 ...
##  $ nameLast : Factor w/ 10034 levels "Aardsma","Aaron",..: 1 2 2 3 4 4 5 6 7 7 ...
##  $ bbrefID  : Factor w/ 19617 levels "","aardsda01",..: 2 3 4 5 6 7 8 9 10 11 ...
##  $ address  : chr  "Denver, CO, USA" "Mobile, AL, USA" "Mobile, AL, USA" "Orange, CA, USA" ...
  1. crswlk = a cross walk to lookup playerIDs to properly merge location information with statistics.
## 'data.frame':    19526 obs. of  2 variables:
##  $ bbrefID    : Factor w/ 19744 levels "","aardsda01",..: 2 3 4 5 6 7 8 9 10 11 ...
##  $ fg_playerid: int  1902 1000001 1000002 1000003 506 4994 1000004 1000005 1000006 1000007 ...
  1. bat = a file pulled from Fangraphs which includes hitting statistics for every player. I condensed the fie down to only include the statistic I will be gauging for batters: wOBA. wOBA stands for “Weighted On Base Average”, the statistic was created by Tom Tango in 1997.
  • Definition from Fangraphs: Weighted On-Base Average combines all the different aspects of hitting into one metric, weighting each of them in proportion to their actual run value. While batting average, on-base percentage, and slugging percentage fall short in accuracy and scope, wOBA measures and captures offensive value more accurately and comprehensively.
## 'data.frame':    4029 obs. of  3 variables:
##  $ Name       : Factor w/ 3991 levels "A.J. Ellis","A.J. Hinch",..: 199 207 3939 3815 1795 1696 3803 3625 3350 3563 ...
##  $ wOBA       : num  0.513 0.435 0.409 0.445 0.408 0.403 0.436 0.493 0.459 0.435 ...
##  $ fg_playerid: int  1011327 1109 1008315 1002378 1013485 1000001 1012309 1014040 1006030 1009405 ...
  1. pitch = a file pulled from Fangraphs which includes pitching statistics for every player. I condensed the file down to only include the statistic I will be gauging for pitchers: WAR.
  • MLB Definition: WAR measures a player’s value in all facets of the game by deciphering how many more wins he’s worth than a replacement-level player at his same position (e.g., a Minor League replacement or a readily available fill-in free agent).

  • I found that WAR was higher across the board for older players/players who had played more seasons. I normalized WAR by dividing the player’s WAR by the number of innings the player had pitched over their career and multiplying by 100. I named the new statistic iWAR.

## 'data.frame':    3126 obs. of  3 variables:
##  $ Name       : Factor w/ 3100 levels "A.J. Burnett",..: 2539 666 3021 1235 2419 2280 168 1154 2758 2359 ...
##  $ fg_playerid: int  815 1014369 1006511 104 60 1011348 1001098 1010210 1001964 1000128 ...
##  $ iWAR       : num  2.72 1.79 1.98 2.33 2.67 ...

I combined these 4 datasets together to get one dataset to query from to try to answer the research question.

## 'data.frame':    6936 obs. of  7 variables:
##  $ bbrefID    : Factor w/ 19617 levels "","aardsda01",..: 2 3 4 5 7 9 10 11 14 16 ...
##  $ playerID   : Factor w/ 19617 levels "aardsda01","aaronha01",..: 1 2 3 4 6 8 9 10 13 15 ...
##  $ address    : chr  "Denver, CO, USA" "Mobile, AL, USA" "Mobile, AL, USA" "Orange, CA, USA" ...
##  $ fg_playerid: int  1902 1000001 1000002 1000003 4994 1000005 1000006 1000007 1000010 1000012 ...
##  $ Name       : Factor w/ 6838 levels "A.J. Ellis","A.J. Hinch",..: 4693 1696 3732 4772 4956 1212 4146 668 5101 5390 ...
##  $ wOBA       : num  0 0.403 0.282 0 0 0.319 0 0.356 0 0 ...
##  $ iWAR       : num  0.326 0 0 1.055 0.221 ...

The next issue I ran into was not having coordinates for the players birthplaces. The leaflet package works better with longitude and latitude coordinates. I found google has an API I was able to get access for free with the ggmap package. I created a for loop to run through the data frame and find the longitude and latitude for each row, from the address field.

cm.all_ex <- sample_n(cm.all,5,replace = FALSE)
kable(head(cm.all_ex))
bbrefID playerID address fg_playerid Name wOBA iWAR
kollodo01 kollodo01 Posen, IL, USA 1007087 Don Kolloway 0.308 0.000
higgibo02 higgibo02 Philadelphia, PA, USA 469 Bobby Higginson 0.354 0.000
macdomi01 macdomi01 Las Vegas, NV, USA 612 Mike MacDougal 0.000 0.660
blackbu02 blackbu02 San Mateo, CA, USA 1001007 Bud Black 0.000 0.886
smithal01 smithal01 New York, NY, USA 1012087 Aleck Smith 0.291 0.000
register_google(key = "AIzaSyB0wuJ0t_q77XycPYc19W0lqZe4-cVFen4")
for(i in 1:nrow(cm.all_ex)){
  result <- geocode(cm.all_ex$address[i], output = "latlona", source = "google")
  cm.all_ex$lon[i] <- as.numeric(result[1])
  cm.all_ex$lat[i] <- as.numeric(result[2])
  cm.all_ex$geoAddress[i] <- as.character(result[3])
}
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Posen,+IL,+USA&key=xxx-cVFen4
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Philadelphia,+PA,+USA&key=xxx-cVFen4
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Las+Vegas,+NV,+USA&key=xxx-cVFen4
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=San+Mateo,+CA,+USA&key=xxx-cVFen4
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=New+York,+NY,+USA&key=xxx-cVFen4
kable(head(cm.all_ex))
bbrefID playerID address fg_playerid Name wOBA iWAR lon lat geoAddress
kollodo01 kollodo01 Posen, IL, USA 1007087 Don Kolloway 0.308 0.000 -87.68144 41.63170 posen, il, usa
higgibo02 higgibo02 Philadelphia, PA, USA 469 Bobby Higginson 0.354 0.000 -75.16522 39.95258 philadelphia, pa, usa
macdomi01 macdomi01 Las Vegas, NV, USA 612 Mike MacDougal 0.000 0.660 -115.13983 36.16994 las vegas, nv, usa
blackbu02 blackbu02 San Mateo, CA, USA 1001007 Bud Black 0.000 0.886 -122.32553 37.56299 san mateo, ca, usa
smithal01 smithal01 New York, NY, USA 1012087 Aleck Smith 0.291 0.000 -74.00597 40.71278 new york, ny, usa

Visualization

Now the data is prepped to be mapped and locate hotspots.

Batters - wOBA:

#All
leaflet(bb_plyrs) %>% addProviderTiles(providers$CartoDB.Positron) %>%
  addHeatmap(lng = ~lon, lat = ~lat,intensity = ~wOBA, radius = 12) %>%
  addMarkers(clusterOptions = markerClusterOptions()) %>%
  addControl("All - Batter Heatmap wOBA", position = "topright") %>%
  setView(-10.50781, 26.513789, 1)
## Assuming "lon" and "lat" are longitude and latitude, respectively

Potential hotspot #1

leaflet(bb_plyrs) %>% addProviderTiles(providers$CartoDB.Positron) %>% setView(-80.50781, 39.513789, 5) %>%
  addHeatmap(lng = ~lon, lat = ~lat,intensity = ~wOBA, radius = 12) %>%
  addMarkers(clusterOptions = markerClusterOptions()) %>%
  addControl("Northeast - Batter Heatmap wOBA", position = "topright") 
## Assuming "lon" and "lat" are longitude and latitude, respectively

Potential hotspot #2

leaflet(bb_plyrs) %>% addProviderTiles(providers$CartoDB.Positron) %>% setView(-78.50781, 22.513789, 4) %>%
  addHeatmap(lng = ~lon, lat = ~lat,intensity = ~wOBA, radius = 12) %>%
  addMarkers(clusterOptions = markerClusterOptions()) %>%
  addControl("Caribbean Islands - Batter Heatmap wOBA", position = "topright")
## Assuming "lon" and "lat" are longitude and latitude, respectively

Pitchers - iWAR:

leaflet(bb_plyrs) %>% addProviderTiles(providers$CartoDB.Positron) %>%
  addHeatmap(lng = ~lon, lat = ~lat,intensity = ~iWAR, radius = 12) %>%
  addMarkers(clusterOptions = markerClusterOptions()) %>%
  addControl("All - Pitcher Heatmap iWAR", position = "topright") %>%
  setView(-10.50781, 26.513789, 1)
## Assuming "lon" and "lat" are longitude and latitude, respectively

Potential hotspot #1

leaflet(bb_plyrs) %>% addProviderTiles(providers$CartoDB.Positron) %>% setView(-75.50781, 41.513789, 6) %>%
  addHeatmap(lng = ~lon, lat = ~lat,intensity = ~iWAR, radius = 12) %>%
  addMarkers(clusterOptions = markerClusterOptions()) %>%
  addControl("Northeast - Pitcher Heatmap iWAR", position = "topright")
## Assuming "lon" and "lat" are longitude and latitude, respectively

Potential hotspot #2

leaflet(bb_plyrs) %>% addProviderTiles(providers$CartoDB.Positron) %>% setView(-113.50781, 38.513789, 4) %>%
  addHeatmap(lng = ~lon, lat = ~lat,intensity = ~iWAR, radius = 12) %>%
  addMarkers(clusterOptions = markerClusterOptions()) %>%
  addControl("West Coast - Pitcher Heatmap iWAR", position = "topright") 
## Assuming "lon" and "lat" are longitude and latitude, respectively

Potential hotspot #3

leaflet(bb_plyrs) %>% addProviderTiles(providers$CartoDB.Positron) %>% setView(134.50781, 35.513789, 5) %>%
  addHeatmap(lng = ~lon, lat = ~lat,intensity = ~iWAR, radius = 12) %>%
  addMarkers(clusterOptions = markerClusterOptions()) %>%
  addControl("Asia - Pitcher Heatmap iWAR", position = "topright") 
## Assuming "lon" and "lat" are longitude and latitude, respectively

Research Question

The second question, from the view of the commissioner of Major League Baseball:

  1. If we wanted to expand the MLB and add a new franchise, where in the United States would make the most sense to put the new franchise?

Data Wrangling

The Data

Load in 2 separate .CSV files:

  1. parks = A list of all stadiums in MLB history.
## 'data.frame':    252 obs. of  6 variables:
##  $ park.key  : Factor w/ 252 levels "ALB01","ALT01",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ park      : Factor w/ 241 levels "23rd Street Park",..: 174 43 10 11 167 17 218 205 119 140 ...
##  $ park.alias: Factor w/ 57 levels "","3Com Park",..: 1 1 16 1 45 1 1 1 1 1 ...
##  $ city      : Factor w/ 85 levels "Albany","Altoona",..: 1 2 3 4 4 5 5 5 6 6 ...
##  $ state     : Factor w/ 36 levels "","AZ","CA","CO",..: 25 28 3 32 32 9 9 9 16 16 ...
##  $ country   : Factor w/ 6 levels "AU","CA","JP",..: 6 6 6 6 6 6 6 6 6 6 ...
  1. all_teams = A list of every team in MLB history. I filtered it down to just active teams.
## 'data.frame':    30 obs. of  12 variables:
##  $ yearID  : int  2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
##  $ teamID  : Factor w/ 149 levels "ALT","ANA","ARI",..: 3 4 5 16 33 35 38 45 51 52 ...
##  $ franchID: Factor w/ 120 levels "ALT","ANA","ARI",..: 3 5 6 14 29 26 30 32 38 41 ...
##  $ divID   : Factor w/ 4 levels "","C","E","W": 4 3 3 3 2 2 2 2 4 2 ...
##  $ W       : int  82 90 47 108 62 95 67 91 91 64 ...
##  $ L       : int  80 72 115 54 100 68 95 71 72 98 ...
##  $ DivWin  : Factor w/ 3 levels "","N","Y": 2 3 2 3 2 2 2 3 2 2 ...
##  $ WCWin   : Factor w/ 3 levels "","N","Y": 2 2 2 2 2 3 2 2 3 2 ...
##  $ LgWin   : Factor w/ 3 levels "","N","Y": 2 2 2 3 2 2 2 2 2 2 ...
##  $ WSWin   : Factor w/ 3 levels "","N","Y": 2 2 2 3 2 2 2 2 2 2 ...
##  $ name    : Factor w/ 139 levels "Altoona Mountain City",..: 3 4 8 16 41 37 48 52 56 59 ...
##  $ park    : Factor w/ 215 levels "","23rd Street Grounds",..: 37 189 133 68 198 211 75 147 49 44 ...

Merge together the two data frames to create the address of the active team’s stadium, condense down to only helpful fields.

park name address
Angel Stadium of Anaheim Los Angeles Angels of Anaheim Anaheim, CA, US
AT&T Park San Francisco Giants San Francisco, CA, US
Busch Stadium III St. Louis Cardinals St. Louis, MO, US
Chase Field Arizona Diamondbacks Phoenix, AZ, US
Citi Field New York Mets New York, NY, US

I need to geolocate the stadiums based on the address field, I used the same for loop as before.

park name address lon lat geoAddress
Angel Stadium of Anaheim Los Angeles Angels of Anaheim Anaheim, CA, US -117.91430 33.83659 anaheim, ca, usa
AT&T Park San Francisco Giants San Francisco, CA, US -122.41942 37.77493 san francisco, ca, usa
Busch Stadium III St. Louis Cardinals St. Louis, MO, US -90.19940 38.62700 st. louis, mo, usa
Chase Field Arizona Diamondbacks Phoenix, AZ, US -112.07404 33.44838 phoenix, az, usa
Citi Field New York Mets New York, NY, US -74.00597 40.71278 new york, ny, usa

Analysis

Now we can map where the current MLB teams are located.

After doing some research, I found that more than half of MLB fans are white males over the age of 55, another large portion of the fan demographics is white males ages 35-55. For this research question I consider this demographic to be the base demographic.

For the future demographic, I looked at the demographics of the players in the MLB. Over 30% of the players are Hispanic, this is growing every year. For my analysis I am considering Hispanic males under 30 as the future demographic.

If the MLB can find a place that has a large number of both the base and future demographics, that might be a good market to put a new franchise.

I need census data to merge together with the active teams to help answer this question. I found a package “tidycensus” it queries on the Census Bureau data warehouse.

I used the Census Bureau variable ids to query off of their tables and only pull populations by county for white males over 35 and Hispanic males under 30. I also filtered out outlier counties, these were mostly covered by active MLB franchises anyway.

vars <- c("B01001A_011","B01001A_012","B01001A_013","B01001A_014","B01001A_015","B01001A_016",'B01001I_003', 'B01001I_004', 'B01001I_005', 'B01001I_006', 'B01001I_007', 'B01001I_008', 'B01001I_009')
pop <- get_acs(geography = "county", 
                  variables = vars,
                  geometry = TRUE) %>% 
  group_by(GEOID,NAME) %>% 
  summarise(sub_pop = sum(estimate)) %>%
  filter(sub_pop >= 40000 & sub_pop <= 90000)
## Getting data from the 2013-2017 5-year ACS
pander(head(pop))
Table continues below
GEOID NAME sub_pop
01003 Baldwin County, Alabama 53990
01089 Madison County, Alabama 73607
01097 Mobile County, Alabama 70005
01117 Shelby County, Alabama 50001
02020 Anchorage Municipality, Alaska 59893
04003 Cochise County, Arizona 42308
geometry
MULTIPOLYGON (((-88.03 30.2…
MULTIPOLYGON (((-86.79 34.5…
MULTIPOLYGON (((-88.05 30.5…
MULTIPOLYGON (((-87.03 33.2…
MULTIPOLYGON (((-150.1 61.1…
MULTIPOLYGON (((-110.5 31.5…

Now I can map the active teams with the census data to see if there are holes in the locations of active MLB franchises and the demographics I have identified.

pop %>% 
  st_transform(crs = "+init=epsg:4326") %>%
  leaflet() %>%
  addProviderTiles(provider = "CartoDB.Positron") %>%
  addMarkers(data = active_teams,~lon, ~lat, icon = stadium_icon, layerId = 2) %>%
  addPolygons(popup = ~ paste(NAME,"-",sub_pop),
              layerId = 1,
              stroke = FALSE,
              smoothFactor = 0,
              fillOpacity = 20,
              color = ~ pop_pal(sub_pop)) %>%
  addControl("Site Suitability - New MLB Team", position = "topright") %>%
  addLegend("bottomleft",
            pal = pop_pal,
            values = ~sub_pop,
            title = "Population:<br>White Males > 35 &<br>Hispanic Males < 30",
            opacity = 1) %>%
  setView(lng =-96.62500, lat=29.65538, 3)

I identified the Carolina’s as a possible market for a new MLB franchise. I then created a buffer for that area and calculated the base and future demographic population within that buffer to see a possible fan base for the MLB franchise.

pop %>% 
  st_transform(crs = "+init=epsg:4326") %>%
  leaflet() %>%
  addProviderTiles(provider = "CartoDB.Positron") %>%
  addMarkers(data = active_teams,~lon, ~lat, icon = stadium_icon, layerId = 2) %>%
  addMarkers(lng =-80.925293, lat=34.488448, layerId = 3, label = paste("Buffer Zone Target Demo Pop:",formatC(buff_pop[[1]], format="f", big.mark=",", digits=0)),labelOptions = labelOptions(noHide = T)) %>%
  addPolygons(popup = ~ paste(NAME,"-",sub_pop),
              layerId = 1,
              stroke = FALSE,
              smoothFactor = 0,
              fillOpacity = 20,
              color = ~ pop_pal(sub_pop)) %>%
  addPolygons(data=NewTeamBuf) %>%
  addMeasure( position = "bottomleft",
              primaryLengthUnit = "miles") %>%
  addControl("Site Suitability - New MLB Team", position = "topright") %>%
  addLegend("bottomright",
            pal = pop_pal,
            values = ~sub_pop,
            title = "Population:<br>White Males > 35 &<br>Hispanic Males < 30",
            opacity = 1) %>%
  setView(lng =-78.925293, lat=33.988448, 6) 

The possible fan base would be about 1 million people. This buffer is about a 3-hour drive from the proposed stadium site. A 3-hour drive seems reasonable for fans to drive to games4, and the teams games will be on TV and radio within this radius.

This is one small step the MLB would take to do market analysis for a new franchise, given more time I would have looked at a wider spectrum of demographics, economics, and other levels of baseball teams in the area.

Questions?