The City of Chicago Data Portal (https://data.cityofchicago.org/) holds a rich repository of data regarding the city’s infrastructure. Data sets can be downloaded for free and are available, at different organizational levels, in a variety of categories ranging from administration and finance information to sanitation and public safety requests.
Here, we will focus on data sets that contain information about Chicago Public Libraries (CPL), such as location, hours of operation, number of monthly visitors, etc. Public libraries are crucial institutions that provide free access to information and resources for all. They promote lifelong learning, support education, and bridge the digital divide by offering computers and internet access. Libraries serve as community hubs, fostering social connections and cultural engagement. They contribute to democracy by providing unbiased information and resources for civic engagement.
Thus, it is important that they are maintained and that adequate resources are provided, especially to libraries that see many visitors per month. For this project, we are particularly interested in the seasonality behind the use of public libraries. Is there a time of the year during which they experience more visitors (e.g., summer vs. winter)? Has there been a decline or increase in the use of public libraries? And if so, did the COVID-19 pandemic contribute to these changes?
We hope to answer all of these questions within the following exploration.
# Import Libraries
import os
import re
import folium
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy import stats
from scipy.stats import ttest_ind
from geopy import distance
from shapely.geometry import Point
from pandas_profiling import ProfileReport
from IPython.display import display, HTML
from IPython.display import clear_output
from sklearn.linear_model import LinearRegression
Let's start our investigation by first looking at how public libraries are distributed within the city of Chicago. For that, we will be using the following data:
The data set contains a total of 81 rows. Each row in the data set represents a library location in Chicago and includes the following fields:
Please note that branches may experience short-term closures due to utility interruptions, HVAC problems, and other unforeseen issues. For more up-to-date hours, please see https://chipublib.bibliocommons.com/locations or the Hours & Locations link at the top of any page on https://www.chipublib.org. For current closures, please see https://www.chipublib.org/news/cpl-location-closures.
# Load local copy of the data set
# libraries_df = pd.read_csv("~/Raw_Data/Library_Locations/Libraries_Locations_Contact_Information_and_Usual_Hours_of_Operation.csv")
# Or download a more recent copy from URL
url = "https://data.cityofchicago.org/api/views/x8fc-8rcq/rows.csv?accessType=DOWNLOAD"
libraries_df = pd.read_csv(url)
libraries_df.head()
NAME | HOURS OF OPERATION | ADDRESS | CITY | STATE | ZIP | PHONE | WEBSITE | LOCATION | |
---|---|---|---|---|---|---|---|---|---|
0 | Vodak-East Side | Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... | 3710 E. 106th St. | Chicago | IL | 60617 | (312) 747-5500 | https://www.chipublib.org/locations/71/ | (41.70283443594318, -87.61428978448026) |
1 | Albany Park | Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... | 3401 W. Foster Ave. | Chicago | IL | 60625 | (773) 539-5450 | https://www.chipublib.org/locations/3/ | (41.97557881655979, -87.71361314512697) |
2 | Avalon | Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... | 8148 S. Stony Island Ave. | Chicago | IL | 60617 | (312) 747-5234 | https://www.chipublib.org/locations/8/ | (41.746393038286826, -87.5860053710736) |
3 | Brainerd | Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... | 1350 W. 89th St. | Chicago | IL | 60620 | (312) 747-6291 | https://www.chipublib.org/locations/13/ | (41.73244482025524, -87.65772892721816) |
4 | Popular Library at Water Works | Mon. - Thurs., 10-6; Fri. & Sat., 9-5; Sun., 1-5 | 163 E. Pearson St. | Chicago | IL | 60611 | (312) 742-8811 | https://www.chipublib.org/locations/73/ | (41.897484072390675, -87.62337776811282) |
libraries_df.tail()
NAME | HOURS OF OPERATION | ADDRESS | CITY | STATE | ZIP | PHONE | WEBSITE | LOCATION | |
---|---|---|---|---|---|---|---|---|---|
76 | Woodson Regional Library | Mon. - Thurs., 9-8; Fri. & Sat., 9-5; Sun., 1-5 | 9525 S. Halsted St. | Chicago | IL | 60628 | (312) 747-6900 | https://www.chipublib.org/locations/81/ | (41.720694885749005, -87.64304817213312) |
77 | Mayfair | Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... | 4400 W. Lawrence Ave. | Chicago | IL | 60630 | (312) 744-1254 | https://www.chipublib.org/locations/49/ | (41.968242773953044, -87.737968778247) |
78 | Chicago Bee | Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... | 3647 S. State St. | Chicago | IL | 60609 | (312) 747-6872 | https://www.chipublib.org/locations/18/ | (41.82824306445502, -87.6263495444489) |
79 | Uptown | Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... | 929 W. Buena Ave. | Chicago | IL | 60613 | (312) 744-8400 | https://www.chipublib.org/locations/70/ | (41.95832305807637, -87.65424744448335) |
80 | Sherman Park | Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... | 5440 S. Racine Ave. | Chicago | IL | 60609 | (312) 747-0477 | https://www.chipublib.org/locations/64/ | (41.79476901885989, -87.65502837616037) |
# Inspect data set structure
libraries_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 81 entries, 0 to 80 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NAME 81 non-null object 1 HOURS OF OPERATION 81 non-null object 2 ADDRESS 81 non-null object 3 CITY 81 non-null object 4 STATE 81 non-null object 5 ZIP 81 non-null int64 6 PHONE 81 non-null object 7 WEBSITE 81 non-null object 8 LOCATION 81 non-null object dtypes: int64(1), object(8) memory usage: 5.8+ KB
Note, the 'LOCATION' column values are currently strings in the format of '(latitude, longitude)'. Before they can be used in a mapping function, they need to be converted into a tuple of floats.
For mapping, we will use the Python library folium because it makes it easy to create interactive maps directly from Python, and it integrates well with pandas data structures. The alternative would be to use a JavaScript library like Leaflet directly, but this would require writing JavaScript code and wouldn't integrate as smoothly with our pandas dataframe.
# Define a function to parse location strings into tuples of floats
def parse_location(location_str):
# Remove parentheses from the location string and split it into two parts by the comma
lat_str, lon_str = location_str[1:-1].split(", ")
# Convert the parts to floats and return them as a tuple
return float(lat_str), float(lon_str)
# Remove leading and trailing white spaces from all column names in the dataframe
# This is to prevent issues where a column name can't be found due to unexpected white spaces
libraries_df.columns = libraries_df.columns.str.strip()
# Create a new folium Map object
chicago_map = folium.Map(location = [41.8781, -87.6298], zoom_start = 11)
# Iterate over each row in the library dataframe
for idx, row in libraries_df.iterrows():
# Use the parse_location function to convert the location string into a tuple of floats
location = parse_location(row['LOCATION'])
# Add a marker to the map at the given location
# The tooltip includes the label 'Public Library:' followed by the name of the library (with HTML bold formatting)
tooltip_text = f'<b>Public Library:</b> {row["NAME"]}'
folium.Marker(location, tooltip = tooltip_text).add_to(chicago_map)
# Display the map
# Note, this will only work in an interactive environment like a Jupyter notebook
# If you're running this script from a file, you may need to save the map to an HTML file and open it in a web browser
chicago_map
Determining whether libraries are "equally spaced" can be a bit complex due to the irregular shape of most city boundaries and the fact that population density can vary significantly across a city. However, there are a few different ways we could analyze the distribution of libraries across the city.
One way is to create a heatmap of library locations. This would give us a visual representation of areas with a high density of libraries versus areas with a low density.
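As a quick illustration of this first option, here is a minimal sketch using folium's HeatMap plugin together with the libraries_df and parse_location defined above; with only 81 points the result is coarse, but it still highlights where branches cluster.
# Sketch: heat map of library locations using folium's HeatMap plugin
from folium.plugins import HeatMap
# Create a base map centered on Chicago
heat_map = folium.Map(location = [41.8781, -87.6298], zoom_start = 11)
# HeatMap expects a list of [lat, lon] pairs (an optional weight can be appended to each pair)
heat_data = [list(parse_location(loc)) for loc in libraries_df['LOCATION']]
HeatMap(heat_data, radius = 25).add_to(heat_map)
# Display the heat map (works in an interactive environment like a Jupyter notebook)
heat_map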
Another way is to calculate the nearest neighbor distance for each library, which is the distance to the closest other library. If the libraries were perfectly equally spaced, all these distances would be the same. By calculating these distances and looking at their distribution (e.g., by calculating the mean and standard deviation or creating a histogram), we can get an idea of how equally spaced the libraries are.
# Define a function to calculate distances from each library to all other libraries
def calculate_distances(data):
# Initialize an empty list to hold the minimum distances for each library
distances = []
# Iterate over each row (library) in the data
for idx1, row1 in data.iterrows():
# Initialize an empty list to hold the distances from the current library to all other libraries
library_distances = []
# Parse the location of the current library
location1 = parse_location(row1['LOCATION'])
# Again iterate over each row (library) in the data
for idx2, row2 in data.iterrows():
# If the second library is not the same as the current library
if idx1 != idx2:
# Parse the location of the second library
location2 = parse_location(row2['LOCATION'])
# Calculate the distance between the current library and the second library
dist = distance.distance(location1, location2).miles
# Add the calculated distance to the list of distances for the current library
library_distances.append(dist)
# Find the minimum distance for the current library and add it to the list of minimum distances
distances.append(min(library_distances))
# Return the list of minimum distances
return distances
# Call the function to calculate the minimum distance for each library
min_distances = calculate_distances(libraries_df)
# Calculate the mean (average) of the minimum distances
mean_min_dist = round(sum(min_distances) / len(min_distances), 2)
print(f"Mean minimum distance: {mean_min_dist} miles")
Mean minimum distance: 1.22 miles
# Create a histogram of the minimum distances
plt.hist(min_distances, bins = 5, edgecolor = 'black')
plt.xlabel('Minimum Distance (miles)')
plt.ylabel('Frequency')
plt.title('Histogram of Minimum Distances between Libraries')
plt.show()
The result we got, 1.22 miles, is the average distance to the nearest neighboring library for each library in our data set.
This number by itself doesn't tell us whether the libraries are "too close" or "too far apart", because that depends on many factors including the size and population density of Chicago, and what we consider to be an acceptable distance to a library.
However, it can provide some useful context. For example, if we know that the average block in Chicago is about 0.1 miles long, this result suggests that the average library has another library within a distance of about 1.2 miles (i.e., about 12 blocks).
To better interpret this number, we might want to compare it to other cities, or to some standard of what we believe is a reasonable distance to a library. We might also want to consider other factors, like how this distance varies across different neighborhoods, and whether there are any areas that are particularly underserved by libraries.
For example, we could calculate the standard deviation of these minimum distances to see how much they vary. If the standard deviation is small, it suggests that most libraries are about the same distance from their nearest neighbor. If it's large, it means there's a lot of variation, and some libraries are much further from their nearest neighbor than others.
# Calculate the standard deviation of the minimum distances
std_min_dist = round(np.std(min_distances), 2)
print(f"Standard deviation of minimum distances: {std_min_dist} miles")
Standard deviation of minimum distances: 0.34 miles
The standard deviation of 0.34 miles tells us about the variation or dispersion of the minimum distances between libraries. In other words, it gives us a measure of how much these distances typically deviate from the average minimum distance, which was calculated to be 1.22 miles.
A small standard deviation (relative to the mean) indicates that the data points tend to be very close to the mean. In this context, a small standard deviation would mean that most libraries have their nearest neighboring library at a distance close to the average of 1.22 miles. While a high standard deviation indicates that the data points are spread out over a large range of values. In this context, a large standard deviation would mean that the distances to the nearest neighboring library vary widely from the average of 1.22 miles.
With a standard deviation of 0.34 miles, we can say that while the average distance to the nearest library is 1.22 miles, the distances vary somewhat, and it's not uncommon for a library to have its nearest neighbor anywhere between roughly 0.88 miles (1.22 - 0.34) and 1.56 miles (1.22 + 0.34) away.
However, remember that the standard deviation is just a rough guideline - about 68% of libraries will have their nearest neighbor within this range if the data follows a normal distribution. There may still be some libraries with much larger or smaller distances to their nearest neighbor.
This suggests that while libraries in Chicago are somewhat evenly distributed, there is still some variation, and some areas of the city might be better served than others.
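As a quick sanity check on that 68% rule of thumb, we can count how many libraries actually have their nearest neighbor within one standard deviation of the mean, reusing the min_distances, mean_min_dist, and std_min_dist values computed above.
# Share of libraries whose nearest-neighbor distance falls within one standard deviation of the mean
lower_bound = mean_min_dist - std_min_dist
upper_bound = mean_min_dist + std_min_dist
within_one_std = [d for d in min_distances if lower_bound <= d <= upper_bound]
share_within = len(within_one_std) / len(min_distances)
print(f"Libraries with nearest neighbor between {lower_bound:.2f} and {upper_bound:.2f} miles: {share_within:.0%}")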
Let's continue our investigation by looking at how this distance varies across different neighborhoods, and whether there are any areas that are particularly underserved by libraries by incorporating the following data:
Current community area boundaries in Chicago.
# Create a new folium Map object
chicago_map2 = folium.Map(location = [41.8781, -87.6298], zoom_start = 11)
# Load GeoJSON file with community areas from a local copy
# community_areas_gdf = gpd.read_file("~/Raw_Data/Boundaries_Community_Areas/Boundaries_Community_Areas_Current.geojson")
# Load GeoJSON file with community areas from a URL
community_areas_url = "https://data.cityofchicago.org/api/geospatial/cauq-8yn6?method=export&format=GeoJSON"
community_areas_gdf = gpd.read_file(community_areas_url)
# Make sure 'community' is in the columns of community_areas_gdf
assert 'community' in community_areas_gdf.columns, "'community' column not found in community_areas_gdf"
# Add the community areas layer to the map
community_layer = folium.GeoJson(community_areas_gdf, name = "Community Areas")
# Add tooltips to the community layer using the 'community' column
# The tooltip includes the label 'Community Area:' followed by the name of the community area
tooltips = folium.features.GeoJsonTooltip(fields = ['community'], labels = True, sticky = True,
aliases = ['Community Area: '])
community_layer.add_child(tooltips)
# Add the community layer to the map
community_layer.add_to(chicago_map2)
# Iterate over each row in the library dataframe
for idx, row in libraries_df.iterrows():
# Use the parse_location function to convert the location string into a tuple of floats
location = parse_location(row['LOCATION'])
# Add a marker to the map at the given location
# The tooltip includes the label 'Public Library:' followed by the name of the library (with HTML bold formatting)
tooltip_text = f'<b>Public Library:</b> {row["NAME"]}'
folium.Marker(location, tooltip = tooltip_text).add_to(chicago_map2)
# Display the map
chicago_map2
By adding the community area layer to the map, we can visually discern if there are any community areas that do not have public libraries within their boundaries. Out of 77 community areas, it seems that only 16 of them don't have a public library positioned exactly inside their boundaries; the list includes: O'Hare, Edison Park, Lincoln Park, Belmont Cragin, Hermosa, East Garfield Park, Near South Side, Oakland, Hyde Park, Washington Park, Chatham, Burnside, Calumet Heights, East Side, and Auburn Gresham. However, note that in most of these cases a public library lies right at the boundary between two community areas.
This is not necessarily a bad thing, as most public libraries seem to be evenly dispersed across Chicago. To make the investigation more rigorous, we could select points within each community area (e.g., its centroid) and calculate distances to the nearest public library, as sketched below. Additionally, connecting to an API like Google Maps could give us additional information regarding travel distances and how different modes of transportation might affect them, giving us further insight into which neighborhoods might be underserved by libraries.
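Here is a rough sketch of the centroid idea, reusing parse_location, libraries_df, and community_areas_gdf from above. Note that taking centroids directly in latitude/longitude is only an approximation, and a centroid can occasionally fall outside an irregularly shaped area.
# Sketch: distance from each community area's centroid to the nearest public library
library_points = [parse_location(loc) for loc in libraries_df['LOCATION']]
records = []
for _, area in community_areas_gdf.iterrows():
    # Shapely points store coordinates as (x, y) = (longitude, latitude)
    centroid = area['geometry'].centroid
    centroid_latlon = (centroid.y, centroid.x)
    # Distance (in miles) from the centroid to the closest library
    nearest_miles = min(distance.distance(centroid_latlon, pt).miles for pt in library_points)
    records.append({'COMMUNITY': area['community'], 'MILES_TO_NEAREST_LIBRARY': round(nearest_miles, 2)})
# Community areas whose centers are furthest from a public library
centroid_distances = pd.DataFrame(records).sort_values('MILES_TO_NEAREST_LIBRARY', ascending = False)
centroid_distances.head(10)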
The "HOURS OF OPERATION" field in our data appears to include operating hours for different days of the week in a semi-structured format.
To make better sense of it, we could separate the hours of operation for each day of the week into separate columns. This process involves parsing the text and can be a bit complex because the format isn't perfectly consistent (for example, sometimes there's a space after the comma, and sometimes there isn't). Let's create a function that extracts the hours for each day of the week and applies it to the "HOURS OF OPERATION" column. We'll create a new dataframe that includes these parsed hours, and then merge it with the original dataframe.
Regular expressions allow us to match and extract information from strings based on patterns. The structure of the "HOURS OF OPERATION" strings seems to follow this pattern:
# Regular expression patterns for each day
# Note: each pattern expects the captured hours to be followed by a semicolon, so day ranges
# such as 'Mon. - Thurs.' only assign hours to the explicitly named days, and hours at the very
# end of the string (typically Sunday) are not captured; unmatched days default to an empty string
patterns = {
"MON": r"Mon\..*?,\s*(.*?);",
"TUES": r"Tues\..*?,\s*(.*?);",
"WED": r"Wed\..*?,\s*(.*?);",
"THURS": r"Thurs\..*?,\s*(.*?);",
"FRI": r"Fri\..*?,\s*(.*?);",
"SAT": r"Sat\..*?,\s*(.*?);",
"SUN": r"Sun\..*?,\s*(.*?);",
}
# Function to parse hours of operation using regex
def parse_hours_regex(hours_str):
hours_dict = {}
for day, pattern in patterns.items():
match = re.search(pattern, hours_str)
hours_dict[day] = match.group(1).strip() if match else ""
return hours_dict
# Apply the function to the "HOURS OF OPERATION" column
hours_df = libraries_df['HOURS OF OPERATION'].apply(parse_hours_regex).apply(pd.Series)
# Merge the hours dataframe with the original dataframe
parsed_data = pd.concat([libraries_df, hours_df], axis = 1)
parsed_data.head()
NAME | HOURS OF OPERATION | ADDRESS | CITY | STATE | ZIP | PHONE | WEBSITE | LOCATION | MON | TUES | WED | THURS | FRI | SAT | SUN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Vodak-East Side | Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... | 3710 E. 106th St. | Chicago | IL | 60617 | (312) 747-5500 | https://www.chipublib.org/locations/71/ | (41.70283443594318, -87.61428978448026) | Noon-8 | 10-6 | Noon-8 | 10-6 | 9-5 | 9-5 | |
1 | Albany Park | Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... | 3401 W. Foster Ave. | Chicago | IL | 60625 | (773) 539-5450 | https://www.chipublib.org/locations/3/ | (41.97557881655979, -87.71361314512697) | 10-6 | Noon-8 | 10-6 | Noon-8 | 9-5 | 9-5 | |
2 | Avalon | Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... | 8148 S. Stony Island Ave. | Chicago | IL | 60617 | (312) 747-5234 | https://www.chipublib.org/locations/8/ | (41.746393038286826, -87.5860053710736) | Noon-8 | 10-6 | Noon-8 | 10-6 | 9-5 | 9-5 | |
3 | Brainerd | Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... | 1350 W. 89th St. | Chicago | IL | 60620 | (312) 747-6291 | https://www.chipublib.org/locations/13/ | (41.73244482025524, -87.65772892721816) | 10-6 | Noon-8 | 10-6 | Noon-8 | 9-5 | 9-5 | |
4 | Popular Library at Water Works | Mon. - Thurs., 10-6; Fri. & Sat., 9-5; Sun., 1-5 | 163 E. Pearson St. | Chicago | IL | 60611 | (312) 742-8811 | https://www.chipublib.org/locations/73/ | (41.897484072390675, -87.62337776811282) | 10-6 | 10-6 | 9-5 | 9-5 |
Now that we have separated the hours of operation by day of the week, we can also compute the total hours of operation per week, by first converting the hours of operation into a numerical format. Currently, the hours of operation are stored as strings in a 12-hour clock format, such as "Noon-8" or "10-6".
We'll start by defining a function that converts a string representing a range of hours into a number representing the total hours. This function will handle different ways of representing times, such as "Noon" or "10".
Then, we'll apply this function to the hours of operation for each day, sum up the total hours for each week, and store the result in a new column.
# Function to calculate total hours from a time range string
def calculate_hours(time_range_str):
if not time_range_str:
return 0
# Convert "Noon" to "12"
time_range_str = time_range_str.replace("Noon", "12")
# Split the string into start time and end time
start_str, end_str = time_range_str.split("-")
# Convert start and end times to integers
start = int(start_str.strip())
end = int(end_str.strip())
# If the end time is less than the start time, add 12 to it (to handle times in the evening)
if end < start:
end += 12
# Calculate the total hours
total_hours = end - start
return total_hours
# Apply the function to the hours for each day and sum up the total hours for each week
parsed_data["TOTAL HOURS"] = parsed_data[["MON", "TUES", "WED", "THURS", "FRI", "SAT", "SUN"]].applymap(calculate_hours).sum(axis=1)
parsed_data.head()
NAME | HOURS OF OPERATION | ADDRESS | CITY | STATE | ZIP | PHONE | WEBSITE | LOCATION | MON | TUES | WED | THURS | FRI | SAT | SUN | TOTAL HOURS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Vodak-East Side | Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... | 3710 E. 106th St. | Chicago | IL | 60617 | (312) 747-5500 | https://www.chipublib.org/locations/71/ | (41.70283443594318, -87.61428978448026) | Noon-8 | 10-6 | Noon-8 | 10-6 | 9-5 | 9-5 | 48 | |
1 | Albany Park | Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... | 3401 W. Foster Ave. | Chicago | IL | 60625 | (773) 539-5450 | https://www.chipublib.org/locations/3/ | (41.97557881655979, -87.71361314512697) | 10-6 | Noon-8 | 10-6 | Noon-8 | 9-5 | 9-5 | 48 | |
2 | Avalon | Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... | 8148 S. Stony Island Ave. | Chicago | IL | 60617 | (312) 747-5234 | https://www.chipublib.org/locations/8/ | (41.746393038286826, -87.5860053710736) | Noon-8 | 10-6 | Noon-8 | 10-6 | 9-5 | 9-5 | 48 | |
3 | Brainerd | Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... | 1350 W. 89th St. | Chicago | IL | 60620 | (312) 747-6291 | https://www.chipublib.org/locations/13/ | (41.73244482025524, -87.65772892721816) | 10-6 | Noon-8 | 10-6 | Noon-8 | 9-5 | 9-5 | 48 | |
4 | Popular Library at Water Works | Mon. - Thurs., 10-6; Fri. & Sat., 9-5; Sun., 1-5 | 163 E. Pearson St. | Chicago | IL | 60611 | (312) 742-8811 | https://www.chipublib.org/locations/73/ | (41.897484072390675, -87.62337776811282) | 10-6 | 10-6 | 9-5 | 9-5 | 32 |
Now, let's visualize all this information.
# Convert hours to numeric for plot
hours_numeric_df = hours_df.applymap(calculate_hours)
# Remove Sunday from the hours dataframe since the regex above did not capture any values for it
# (Sunday hours, when present, appear at the end of the string without a trailing semicolon)
hours_numeric_df = hours_numeric_df.drop(columns="SUN")
# Create histograms for hours of operation for each day
plt.figure(figsize = (14, 10))
for i, day in enumerate(["MON", "TUES", "WED", "THURS", "FRI", "SAT"], start = 1):
plt.subplot(2, 3, i)
plt.hist(hours_numeric_df[day], bins = range(0, 13), alpha = 0.7, color = 'skyblue', edgecolor = 'black')
plt.title(f"{day} Hours of Operation")
plt.xlabel("Hours")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
# Create histogram for total hours per week
plt.figure(figsize = (10, 6))
plt.hist(parsed_data["TOTAL HOURS"], bins = range(0, 60, 2), alpha = 0.7, color = 'skyblue', edgecolor = 'black')
plt.title("Total Hours of Operation Per Week")
plt.xlabel("Hours")
plt.ylabel("Frequency")
plt.show()
The histograms at the top show the distribution of hours of operation for each day from Monday to Saturday across all libraries. Each bar represents the number of libraries operating for a given number of hours. For example, in the "MON Hours of Operation" histogram, the bar over "8" hours means that there are over 70 libraries that operate for 8 hours on Mondays.
The histogram at the bottom shows the distribution of total hours of operation per week across all libraries. Each bar represents the number of libraries operating for a given total number of hours per week. For example, the tallest bar over "48" hours means that the most common total hours of operation per week is around 48 hours.
Let's dig a little deeper and obtain the names of the public libraries that are open fewer than the most common value of 48 hours per week.
# Filter the dataframe for libraries open less than 48 hours per week
less_than_48 = parsed_data[parsed_data["TOTAL HOURS"] < 48][["NAME", "TOTAL HOURS"]]
# Sort by total hours
less_than_48 = less_than_48.sort_values("TOTAL HOURS")
less_than_48
NAME | TOTAL HOURS | |
---|---|---|
42 | Galewood-Mont Clare | 0 |
4 | Popular Library at Water Works | 32 |
5 | Little Italy | 32 |
13 | Portage-Cragin | 32 |
33 | Archer Heights | 32 |
47 | West Pullman | 32 |
51 | Austin | 32 |
26 | Sulzer Regional Library | 38 |
49 | Legler Regional | 38 |
60 | Harold Washtington Library Center | 38 |
76 | Woodson Regional Library | 38 |
Compared to the rest, it appears that most of these libraries are open for 32 or 38 hours per week, with one library (Galewood-Mont Clare) having a total of 0 hours of operation listed, which might be a data error or indicate that it's currently closed. Based on the list provided at https://www.chipublib.org/news/cpl-location-closures, we can see that Galewood-Mont Clare is currently closed. Also note that the low totals for the regional libraries and Water Works partly reflect the parsing limitations mentioned above (day ranges such as 'Mon. - Thurs.' and trailing Sunday hours are not captured), rather than genuinely shorter schedules.
As it stands, this data set does not provide any additional information that could tell us whether differences in hours of operation, for example, affect the way people utilize Chicago Public Libraries. Next, we will incorporate visitor count data to answer this and other questions relevant to our investigation.
But first, let's save the changes we made for later use.
# Melt the dataframe to long format (long format is better for analysis)
location_long_format = pd.melt(parsed_data, id_vars = ['NAME', 'HOURS OF OPERATION', 'ADDRESS', 'CITY', 'STATE', 'ZIP',
'PHONE', 'WEBSITE', 'LOCATION', 'TOTAL HOURS'],
value_vars = ["MON", "TUES", "WED", "THURS", "FRI", "SAT", "SUN"],
var_name = 'DAY', value_name = 'HOURS')
# Convert hours to numerical format
location_long_format['HOURS'] = location_long_format['HOURS'].apply(calculate_hours)
# Rename specific columns
column_rename_dict = {'TOTAL HOURS': 'TOTAL HOURS WEEK', 'HOURS': 'HOURS OPEN'}
location_long_format = location_long_format.rename(columns=column_rename_dict)
# Specify the saving path for the CSV file
saving_path = '~/Processed_Data/Library_Locations_Processed.csv'
# Save the long format DataFrame to a CSV file
location_long_format.to_csv(saving_path, index = False)
location_long_format.head()
NAME | HOURS OF OPERATION | ADDRESS | CITY | STATE | ZIP | PHONE | WEBSITE | LOCATION | TOTAL HOURS WEEK | DAY | HOURS OPEN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Vodak-East Side | Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... | 3710 E. 106th St. | Chicago | IL | 60617 | (312) 747-5500 | https://www.chipublib.org/locations/71/ | (41.70283443594318, -87.61428978448026) | 48 | MON | 8 |
1 | Albany Park | Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... | 3401 W. Foster Ave. | Chicago | IL | 60625 | (773) 539-5450 | https://www.chipublib.org/locations/3/ | (41.97557881655979, -87.71361314512697) | 48 | MON | 8 |
2 | Avalon | Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... | 8148 S. Stony Island Ave. | Chicago | IL | 60617 | (312) 747-5234 | https://www.chipublib.org/locations/8/ | (41.746393038286826, -87.5860053710736) | 48 | MON | 8 |
3 | Brainerd | Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... | 1350 W. 89th St. | Chicago | IL | 60620 | (312) 747-6291 | https://www.chipublib.org/locations/13/ | (41.73244482025524, -87.65772892721816) | 48 | MON | 8 |
4 | Popular Library at Water Works | Mon. - Thurs., 10-6; Fri. & Sat., 9-5; Sun., 1-5 | 163 E. Pearson St. | Chicago | IL | 60611 | (312) 742-8811 | https://www.chipublib.org/locations/73/ | (41.897484072390675, -87.62337776811282) | 32 | MON | 8 |
We will continue our investigation into Chicago Public Libraries by adding information on the number of visitors that each location experiences per month and year. The Chicago Data Portal has multiple data sets, similarly structured to the one below for 2011.
The data set contains a total of 79 rows. Each row in the data set represents a library location in Chicago and includes the following fields:
This data set provides information on the number of visitors to each library in the Chicago Public Library system, which includes the Harold Washington Library Center, Sulzer and Woodson regional libraries, and over 70 neighborhood branches. Please note that some locations may experience sporadic closures due to facilities upgrades, emergency closures due to heating or air conditioning issues, or area power outages.
An asterisk (*) in the 'LOCATION' name signifies that the count does not reflect the total building visitor count due to the location of the traffic counter. Community room and program traffic are not included in the totals.
# Load the 2011 visitor count data set from URL
visitors_2011 = pd.read_csv("https://data.cityofchicago.org/api/views/xxwy-zyzu/rows.csv?accessType=DOWNLOAD")
visitors_2011.head()
LOCATION | ADDRESS | CITY | ZIP CODE | JANUARY | FEBRUARY | MARCH | APRIL | MAY | JUNE | JULY | AUGUST | SEPTEMBER | OCTOBER | NOVEMBER | DECEMBER | YTD | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Albany Park | 5150 N. Kimball Avenue | CHICAGO | 60625.0 | 9604.0 | 10500.0 | 9050.0 | 9300.0 | 8271.0 | 10984.0 | 9986.0 | 11078.0 | 9453.0 | 10213.0 | 9377.0 | 9609.0 | 117425.0 |
1 | Altgeld | 13281 S. Corliss Avenue | CHICAGO | 60827.0 | 5809.0 | 3899.0 | 5207.0 | 5201.0 | 4494.0 | 5760.0 | 3653.0 | 2414.0 | 4552.0 | 6891.0 | 5698.0 | 5079.0 | 58657.0 |
2 | Archer Heights* | 5055 S. Archer Avenue | CHICAGO | 60632.0 | 9829.0 | 9394.0 | 11342.0 | 11114.0 | 9365.0 | 11247.0 | 10329.0 | 11231.0 | 10373.0 | 11364.0 | 10011.0 | 9054.0 | 124653.0 |
3 | Austin | 5615 W. Race Avenue | CHICAGO | 60644.0 | 6713.0 | 6250.0 | 7054.0 | 9139.0 | 8857.0 | 9586.0 | 8352.0 | 10359.0 | 9151.0 | 10016.0 | 8461.0 | 8368.0 | 102306.0 |
4 | Austin-Irving | 6100 W. Irving Park Road | CHICAGO | 60634.0 | 11556.0 | 9904.0 | 13214.0 | 13064.0 | 10969.0 | 12587.0 | 12596.0 | 13638.0 | 12542.0 | 13286.0 | 11868.0 | 10628.0 | 145852.0 |
visitors_2011.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 81 entries, 0 to 80 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 LOCATION 80 non-null object 1 ADDRESS 79 non-null object 2 CITY 79 non-null object 3 ZIP CODE 79 non-null float64 4 JANUARY 80 non-null float64 5 FEBRUARY 80 non-null float64 6 MARCH 79 non-null float64 7 APRIL 79 non-null float64 8 MAY 79 non-null float64 9 JUNE 78 non-null float64 10 JULY 78 non-null float64 11 AUGUST 77 non-null float64 12 SEPTEMBER 78 non-null float64 13 OCTOBER 78 non-null float64 14 NOVEMBER 78 non-null float64 15 DECEMBER 78 non-null float64 16 YTD 79 non-null float64 dtypes: float64(14), object(3) memory usage: 10.9+ KB
I inspected all the data sets, from 2011 to 2023, and noticed a few things:
These discrepancies make it difficult to analyze the data in aggregate and should be corrected. The code below cleans the data sets in an automated fashion and makes it possible to merge them all together. The format of each data set is also changed from wide to long to accommodate future exploration, particularly time series analysis. Additional changes, such as naming conventions, were also implemented to improve readability and clarity. Note, columns that include location information (address, city, zip code, etc.) were removed, as this information can be incorporated later on from the previous data set (i.e., 'Libraries - Locations, Contact Information, and Usual Hours of Operation').
We will clean local copies of the data sets as it is easier to iterate through them. However, it should be noted that the data for 2023 is incomplete and should be redownloaded at a later date, if you wish to repeat the analysis.
# Get list of all the CSV files within a specific directory
def get_csv_files_in_path(main_path, subdirectory):
# Combine the main path and subdirectory to get the full directory path
path = os.path.join(main_path, subdirectory)
# Check if the given path exists and is a directory
if not os.path.exists(path) or not os.path.isdir(path):
print(f"Invalid path: {path}")
return []
# Get a list of all files in the directory
file_list = os.listdir(path)
# Filter the list to include only CSV files
csv_files = [os.path.join(path, file) for file in file_list if file.lower().endswith('.csv')]
return csv_files
# Example usage
main_path_to_files = "~/Raw_Data"
subdirectory_path = "Visitors_By_Location"
csv_files = get_csv_files_in_path(main_path_to_files, subdirectory_path)
print("CSV files in the specified path:")
for csv_file in csv_files:
print(csv_file)
CSV files in the specified path: /Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2023_Visitors_by_Location.csv /Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2014_Visitors_by_Location.csv /Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2018_Visitors_by_Location.csv /Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2015_Visitors_by_Location.csv /Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2011_Visitors_by_Location.csv /Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2019_Visitors_by_Location.csv /Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2013_Visitors_by_Location.csv /Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2020_Visitors_by_Location.csv /Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2017_Visitors_by_Location.csv /Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2021_Visitors_by_Location.csv /Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2016_Visitors_by_Location.csv /Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2012_Visitors_by_Location.csv
Again, all these files pertain to visitor count data for each library for the years 2011 to 2023. Let's process them!
The following process_csv_files function reads and processes a list of CSV files, applies transformations to the data, adds a 'SEASON' column, saves the processed files, and returns the merged dataframe containing the processed data from all the files.
def process_csv_files(csv_files, new_directory):
processed_dfs = [] # List to store processed dataframes
# Define a function to map month number to season
def month_to_season(month):
if month in [3, 4, 5]:
return 'Spring'
elif month in [6, 7, 8]:
return 'Summer'
elif month in [9, 10, 11]:
return 'Fall'
else:
return 'Winter'
for file_path in csv_files:
# Read the CSV file
df = pd.read_csv(file_path)
# Replace missing values with 0
df = df.fillna(0)
# Define the data types for each column
data_types = {
'JANUARY': int,
'FEBRUARY': int,
'MARCH': int,
'APRIL': int,
'MAY': int,
'JUNE': int,
'JULY': int,
'AUGUST': int,
'SEPTEMBER': int,
'OCTOBER': int,
'NOVEMBER': int,
'DECEMBER': int,
'YTD': int,
}
# Extract the year from the file name using regular expression
match = re.search(r"Libraries_(\d{4})", file_path)
if match:
year = int(match.group(1))
# Remove the 'LOCATION' column for data after 2019
if year > 2019:
df = df.drop(columns = df.columns[df.columns.str.contains(r'(?i)^.*location.*\d*$')])
# Check if the 'ZIP' or 'ZIP CODE' column exists and drop them
if 'ZIP' in df.columns:
df = df.drop(columns = ['ZIP'])
if 'ZIP CODE' in df.columns:
df = df.drop(columns = ['ZIP CODE'])
# Check if the 'LOCATION' or 'BRANCH' column exists
if 'LOCATION' in df.columns:
# Rename the 'LOCATION' column to 'NAME'
df = df.rename(columns = {'LOCATION': 'NAME'})
elif 'BRANCH' in df.columns:
# Rename the 'BRANCH' column to 'NAME'
df = df.rename(columns = {'BRANCH': 'NAME'})
# Check if the 'ADDRESS' column exists
if 'ADDRESS' in df.columns:
# Remove the 'ADDRESS' column
df = df.drop(columns = ['ADDRESS'])
# Check if the 'CITY' column exists
if 'CITY' in df.columns:
# Remove the 'CITY' column
df = df.drop(columns = ['CITY'])
# Apply the specified data types to the DataFrame
df = df.astype(data_types)
# Change column names
df = df.rename(columns = {'JANUARY': 'JAN', 'FEBRUARY': 'FEB', 'MARCH': 'MAR', 'APRIL': 'APR', 'MAY': 'MAY',
'JUNE': 'JUN', 'JULY': 'JUL', 'AUGUST': 'AUG', 'SEPTEMBER': 'SEP', 'OCTOBER': 'OCT',
'NOVEMBER': 'NOV', 'DECEMBER': 'DEC', 'YTD': 'TOTAL_VISITORS_YEAR'})
# Add the 'Year' column
df['YEAR'] = year
# Add a column for PROBLEM_WITH_COUNTER
df['PROBLEM_WITH_COUNTER'] = df['NAME'].str.contains(r'\*', regex = True)
# Remove trailing white spaces from the 'NAME' column
df['NAME'] = df['NAME'].str.strip()
# Remove asterisks from the 'NAME' column
df['NAME'] = df['NAME'].str.replace('\*', '', regex = True)
# Remove rows where 'Name' is empty or equals 'Total'
df = df[df['NAME'].notna()]
df = df[~df['NAME'].str.lower().str.contains('total')]
# Convert from wide to long format
df = df.melt(id_vars = ['NAME', 'YEAR', 'TOTAL_VISITORS_YEAR', 'PROBLEM_WITH_COUNTER'],
var_name = 'MONTH', value_name = 'VISITOR_COUNT')
# Add a column for month number
df['MONTH_NUMBER'] = pd.to_datetime(df['MONTH'], format = '%b', errors = 'coerce').dt.month
# Add the 'SEASON' column
df['SEASON'] = df['MONTH_NUMBER'].apply(month_to_season)
# Rearrange the columns if desired
df = df[['NAME', 'YEAR', 'TOTAL_VISITORS_YEAR', 'PROBLEM_WITH_COUNTER', 'MONTH', 'MONTH_NUMBER', 'SEASON', 'VISITOR_COUNT']]
# Rearrange the rows by name, year, and month number
df = df.sort_values(by = ['NAME', 'YEAR', 'MONTH_NUMBER'])
# Reset the index if desired
df = df.reset_index(drop = True)
# Create the new file name
file_name = os.path.basename(file_path)
new_file_name = os.path.splitext(file_name)[0] + '_long_format.csv'
# Construct the new file path
new_file_path = os.path.join(new_directory, new_file_name)
# Save the DataFrame to a new CSV file
df.to_csv(new_file_path, index = False)
processed_dfs.append(df) # Store the processed DataFrame
# Merge all the processed dataframes
merged_df = pd.concat(processed_dfs, ignore_index = True)
# Rearrange the rows in the merged dataframe by name, year, and month number
merged_df = merged_df.sort_values(by = ['NAME', 'YEAR', 'MONTH_NUMBER'])
# Reset the index of the merged dataframe
merged_df = merged_df.reset_index(drop = True)
# Save the merged dataframe to a separate CSV file
merged_file_path = os.path.join(new_directory, 'Visitors_by_Location_merged_data_long_format.csv')
merged_df.to_csv(merged_file_path, index = False)
print("Done processing csv files!")
print(f"Clean files saved to: {new_directory}.")
return merged_df
# Example usage
new_directory = "~/Processed_Data" # Path where you want the processed files saved
merged_df = process_csv_files(csv_files, new_directory)
Done processing csv files! Clean files saved to: /Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Processed_Data.
Here's a more detailed summary of what the code above does:

1. Initializes an empty list, processed_dfs, to store the processed dataframes.
2. For each file in the csv_files list: reads the CSV, cleans and standardizes the columns, converts the data from wide to long format, saves it to a new CSV file, and appends the result to the processed_dfs list.
3. Concatenates all the processed dataframes into merged_df.
4. Sorts the merged_df dataframe by 'NAME', 'YEAR', and 'MONTH_NUMBER'.
5. Saves and returns the merged_df dataframe.

Finally, we can also incorporate the previous data set, which includes information about library location, hours of operation, etc.:
# Merge the two data sets on 'NAME' column
merged_full_df = pd.merge(merged_df, location_long_format, on = 'NAME')
# Save the full merged dataframe
merged_full_df.to_csv("~/Processed_Data/Public_Library_Location_Visitor_Count_2011_2023.csv", index = False)
merged_full_df.head()
NAME | YEAR | TOTAL_VISITORS_YEAR | PROBLEM_WITH_COUNTER | MONTH | MONTH_NUMBER | SEASON | VISITOR_COUNT | HOURS OF OPERATION | ADDRESS | CITY | STATE | ZIP | PHONE | WEBSITE | LOCATION | TOTAL HOURS WEEK | DAY | HOURS OPEN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Albany Park | 2011 | 117425 | False | JAN | 1 | Winter | 9604 | Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... | 3401 W. Foster Ave. | Chicago | IL | 60625 | (773) 539-5450 | https://www.chipublib.org/locations/3/ | (41.97557881655979, -87.71361314512697) | 48 | MON | 8 |
1 | Albany Park | 2011 | 117425 | False | JAN | 1 | Winter | 9604 | Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... | 3401 W. Foster Ave. | Chicago | IL | 60625 | (773) 539-5450 | https://www.chipublib.org/locations/3/ | (41.97557881655979, -87.71361314512697) | 48 | TUES | 8 |
2 | Albany Park | 2011 | 117425 | False | JAN | 1 | Winter | 9604 | Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... | 3401 W. Foster Ave. | Chicago | IL | 60625 | (773) 539-5450 | https://www.chipublib.org/locations/3/ | (41.97557881655979, -87.71361314512697) | 48 | WED | 8 |
3 | Albany Park | 2011 | 117425 | False | JAN | 1 | Winter | 9604 | Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... | 3401 W. Foster Ave. | Chicago | IL | 60625 | (773) 539-5450 | https://www.chipublib.org/locations/3/ | (41.97557881655979, -87.71361314512697) | 48 | THURS | 8 |
4 | Albany Park | 2011 | 117425 | False | JAN | 1 | Winter | 9604 | Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... | 3401 W. Foster Ave. | Chicago | IL | 60625 | (773) 539-5450 | https://www.chipublib.org/locations/3/ | (41.97557881655979, -87.71361314512697) | 48 | FRI | 8 |
merged_full_df.tail()
NAME | YEAR | TOTAL_VISITORS_YEAR | PROBLEM_WITH_COUNTER | MONTH | MONTH_NUMBER | SEASON | VISITOR_COUNT | HOURS OF OPERATION | ADDRESS | CITY | STATE | ZIP | PHONE | WEBSITE | LOCATION | TOTAL HOURS WEEK | DAY | HOURS OPEN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
75175 | Wrightwood-Ashburn | 2023 | 26008 | False | DEC | 12 | Winter | 0 | Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... | 8530 S. Kedzie Ave. | Chicago | IL | 60652 | (312) 747-2696 | https://www.chipublib.org/locations/82/ | (41.73795704970435, -87.70221598421591) | 48 | WED | 8 |
75176 | Wrightwood-Ashburn | 2023 | 26008 | False | DEC | 12 | Winter | 0 | Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... | 8530 S. Kedzie Ave. | Chicago | IL | 60652 | (312) 747-2696 | https://www.chipublib.org/locations/82/ | (41.73795704970435, -87.70221598421591) | 48 | THURS | 8 |
75177 | Wrightwood-Ashburn | 2023 | 26008 | False | DEC | 12 | Winter | 0 | Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... | 8530 S. Kedzie Ave. | Chicago | IL | 60652 | (312) 747-2696 | https://www.chipublib.org/locations/82/ | (41.73795704970435, -87.70221598421591) | 48 | FRI | 8 |
75178 | Wrightwood-Ashburn | 2023 | 26008 | False | DEC | 12 | Winter | 0 | Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... | 8530 S. Kedzie Ave. | Chicago | IL | 60652 | (312) 747-2696 | https://www.chipublib.org/locations/82/ | (41.73795704970435, -87.70221598421591) | 48 | SAT | 8 |
75179 | Wrightwood-Ashburn | 2023 | 26008 | False | DEC | 12 | Winter | 0 | Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... | 8530 S. Kedzie Ave. | Chicago | IL | 60652 | (312) 747-2696 | https://www.chipublib.org/locations/82/ | (41.73795704970435, -87.70221598421591) | 48 | SUN | 0 |
To summarize, our full merged data set contains the following attributes:
Data profiling involves analyzing the contents and structure of a data set to gain insights into its characteristics. Here's an example of how we can perform data profiling on the full merged dataframe to further review it:
def perform_data_profiling(df):
# Display general information about the dataframe
print("Data Profile:")
print("-" * 50)
print(df.info())
print("-" * 50)
# Display summary statistics
print("Summary Statistics:")
print("-" * 50)
print(df.describe())
print("-" * 50)
# Display the first few rows of the dataframe
print("Sample Data:")
print("-" * 50)
print(df.head())
print("-" * 50)
# Display the number of unique values in each column
print("Unique Value Counts:")
print("-" * 50)
for column in df.columns:
unique_values = df[column].nunique()
print(f"{column}: {unique_values} unique values")
print("-" * 50)
# Display missing value information
print("Missing Value Counts:")
print("-" * 50)
print(df.isnull().sum())
print("-" * 50)
# Display data types of each column
print("Data Types:")
print("-" * 50)
print(df.dtypes)
print("-" * 50)
# Display correlations between numeric columns
numeric_columns = df.select_dtypes(include = 'number').columns
if len(numeric_columns) > 1:
print("Correlation Matrix:")
print("-" * 50)
print(df[numeric_columns].corr())
print("-" * 50)
# Perform data profiling on the merged dataframe
perform_data_profiling(merged_full_df)
Data Profile: -------------------------------------------------- <class 'pandas.core.frame.DataFrame'> Int64Index: 75180 entries, 0 to 75179 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NAME 75180 non-null object 1 YEAR 75180 non-null int64 2 TOTAL_VISITORS_YEAR 75180 non-null int64 3 PROBLEM_WITH_COUNTER 75180 non-null object 4 MONTH 75180 non-null object 5 MONTH_NUMBER 75180 non-null int64 6 SEASON 75180 non-null object 7 VISITOR_COUNT 75180 non-null int64 8 HOURS OF OPERATION 75180 non-null object 9 ADDRESS 75180 non-null object 10 CITY 75180 non-null object 11 STATE 75180 non-null object 12 ZIP 75180 non-null int64 13 PHONE 75180 non-null object 14 WEBSITE 75180 non-null object 15 LOCATION 75180 non-null object 16 TOTAL HOURS WEEK 75180 non-null int64 17 DAY 75180 non-null object 18 HOURS OPEN 75180 non-null int64 19 Z_SCORE 75180 non-null float64 dtypes: float64(1), int64(7), object(12) memory usage: 14.1+ MB None -------------------------------------------------- Summary Statistics: -------------------------------------------------- YEAR TOTAL_VISITORS_YEAR MONTH_NUMBER VISITOR_COUNT \ count 75180.000000 7.518000e+04 75180.000000 75180.000000 mean 2016.737430 8.269647e+04 6.500000 6891.372439 std 3.621391 7.017199e+04 3.452075 6257.785494 min 2011.000000 0.000000e+00 1.000000 0.000000 25% 2014.000000 4.366100e+04 3.750000 3520.000000 50% 2017.000000 7.505200e+04 6.500000 6403.000000 75% 2020.000000 1.096040e+05 9.250000 9338.250000 max 2023.000000 1.405964e+06 12.000000 141696.000000 ZIP TOTAL HOURS WEEK HOURS OPEN Z_SCORE count 75180.000000 75180.000000 75180.000000 75180.000000 mean 60633.706145 46.248045 6.606864 0.597374 std 28.428708 6.669656 3.061526 0.801968 min 60605.000000 0.000000 0.000000 0.000060 25% 60617.000000 48.000000 8.000000 0.214833 50% 60630.000000 48.000000 8.000000 0.467321 75% 60643.000000 48.000000 8.000000 0.857447 max 60827.000000 48.000000 11.000000 21.542049 -------------------------------------------------- Sample Data: -------------------------------------------------- NAME YEAR TOTAL_VISITORS_YEAR PROBLEM_WITH_COUNTER MONTH \ 0 Albany Park 2011 117425 False JAN 1 Albany Park 2011 117425 False JAN 2 Albany Park 2011 117425 False JAN 3 Albany Park 2011 117425 False JAN 4 Albany Park 2011 117425 False JAN MONTH_NUMBER SEASON VISITOR_COUNT \ 0 1 Winter 9604 1 1 Winter 9604 2 1 Winter 9604 3 1 Winter 9604 4 1 Winter 9604 HOURS OF OPERATION ADDRESS \ 0 Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 3401 W. Foster Ave. 1 Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 3401 W. Foster Ave. 2 Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 3401 W. Foster Ave. 3 Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 3401 W. Foster Ave. 4 Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 3401 W. Foster Ave. 
CITY STATE ZIP PHONE \ 0 Chicago IL 60625 (773) 539-5450 1 Chicago IL 60625 (773) 539-5450 2 Chicago IL 60625 (773) 539-5450 3 Chicago IL 60625 (773) 539-5450 4 Chicago IL 60625 (773) 539-5450 WEBSITE \ 0 https://www.chipublib.org/locations/3/ 1 https://www.chipublib.org/locations/3/ 2 https://www.chipublib.org/locations/3/ 3 https://www.chipublib.org/locations/3/ 4 https://www.chipublib.org/locations/3/ LOCATION TOTAL HOURS WEEK DAY \ 0 (41.97557881655979, -87.71361314512697) 48 MON 1 (41.97557881655979, -87.71361314512697) 48 TUES 2 (41.97557881655979, -87.71361314512697) 48 WED 3 (41.97557881655979, -87.71361314512697) 48 THURS 4 (41.97557881655979, -87.71361314512697) 48 FRI HOURS OPEN Z_SCORE 0 8 0.433483 1 8 0.433483 2 8 0.433483 3 8 0.433483 4 8 0.433483 -------------------------------------------------- Unique Value Counts: -------------------------------------------------- NAME: 80 unique values YEAR: 12 unique values TOTAL_VISITORS_YEAR: 885 unique values PROBLEM_WITH_COUNTER: 2 unique values MONTH: 12 unique values MONTH_NUMBER: 12 unique values SEASON: 4 unique values VISITOR_COUNT: 6961 unique values HOURS OF OPERATION: 6 unique values ADDRESS: 80 unique values CITY: 1 unique values STATE: 1 unique values ZIP: 49 unique values PHONE: 80 unique values WEBSITE: 80 unique values LOCATION: 80 unique values TOTAL HOURS WEEK: 4 unique values DAY: 7 unique values HOURS OPEN: 3 unique values Z_SCORE: 6961 unique values -------------------------------------------------- Missing Value Counts: -------------------------------------------------- NAME 0 YEAR 0 TOTAL_VISITORS_YEAR 0 PROBLEM_WITH_COUNTER 0 MONTH 0 MONTH_NUMBER 0 SEASON 0 VISITOR_COUNT 0 HOURS OF OPERATION 0 ADDRESS 0 CITY 0 STATE 0 ZIP 0 PHONE 0 WEBSITE 0 LOCATION 0 TOTAL HOURS WEEK 0 DAY 0 HOURS OPEN 0 Z_SCORE 0 dtype: int64 -------------------------------------------------- Data Types: -------------------------------------------------- NAME object YEAR int64 TOTAL_VISITORS_YEAR int64 PROBLEM_WITH_COUNTER object MONTH object MONTH_NUMBER int64 SEASON object VISITOR_COUNT int64 HOURS OF OPERATION object ADDRESS object CITY object STATE object ZIP int64 PHONE object WEBSITE object LOCATION object TOTAL HOURS WEEK int64 DAY object HOURS OPEN int64 Z_SCORE float64 dtype: object -------------------------------------------------- Correlation Matrix: -------------------------------------------------- YEAR TOTAL_VISITORS_YEAR MONTH_NUMBER \ YEAR 1.000000e+00 -3.583990e-01 4.462930e-17 TOTAL_VISITORS_YEAR -3.583990e-01 1.000000e+00 -7.013253e-17 MONTH_NUMBER 4.462930e-17 -7.013253e-17 1.000000e+00 VISITOR_COUNT -3.349105e-01 9.344625e-01 -1.777759e-02 ZIP -2.902156e-02 -9.906362e-02 -7.209972e-15 TOTAL HOURS WEEK -3.199835e-02 4.678836e-02 6.400582e-17 HOURS OPEN -9.958526e-03 1.456147e-02 3.840921e-17 Z_SCORE 1.401029e-01 6.466537e-01 3.503484e-02 VISITOR_COUNT ZIP TOTAL HOURS WEEK \ YEAR -0.334910 -2.902156e-02 -3.199835e-02 TOTAL_VISITORS_YEAR 0.934463 -9.906362e-02 4.678836e-02 MONTH_NUMBER -0.017778 -7.209972e-15 6.400582e-17 VISITOR_COUNT 1.000000 -9.257124e-02 4.372197e-02 ZIP -0.092571 1.000000e+00 -2.547387e-01 TOTAL HOURS WEEK 0.043722 -2.547387e-01 1.000000e+00 HOURS OPEN 0.013607 -7.927978e-02 3.112200e-01 Z_SCORE 0.655241 -3.663064e-02 -8.461968e-02 HOURS OPEN Z_SCORE YEAR -9.958526e-03 0.140103 TOTAL_VISITORS_YEAR 1.456147e-02 0.646654 MONTH_NUMBER 3.840921e-17 0.035035 VISITOR_COUNT 1.360715e-02 0.655241 ZIP -7.927978e-02 -0.036631 TOTAL HOURS WEEK 3.112200e-01 -0.084620 HOURS OPEN 1.000000e+00 -0.026335 Z_SCORE 
-2.633533e-02 1.000000 --------------------------------------------------
We can also perform data profiling using pandas_profiling. Pandas profiling is a powerful Python tool for exploratory data analysis, providing a quick and comprehensive overview of data sets. It automates the generation of statistical summaries, data visualizations, and insights, saving time and effort for data analysts and scientists. The HTML report generated by pandas profiling facilitates data documentation and easy sharing of analysis results.
# Perform data profiling using pandas_profiling
profile = ProfileReport(merged_full_df, title = 'Pandas Profiling Report: Merged Dataset')
profile.to_file("visitors_by_location_data_profiling_report.html")
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]
Export report to file: 0%| | 0/1 [00:00<?, ?it/s]
To open the HTML file generated by pandas profiling in a Jupyter notebook, you can use the IPython.display module, or click on the file to open the document in your browser.
# Specify the file path of the HTML report
html_file = "visitors_by_location_data_profiling_report.html"
# Display the HTML report in the notebook
display(HTML(filename = html_file))
# Clear the displayed output
# clear_output()
The 'Overview' section of the report confirms that the processed data does not have any duplicate rows or missing values/cells, while the rest of the summary statistics and plots give us important information we might want to consider moving forward, such as the negative correlation between the variables 'YEAR' and 'VISITOR_COUNT'.
We can group the data by 'SEASON' and compute the average 'VISITOR_COUNT' for each season. This would tell us whether there's a significant difference in library usage across seasons. Similarly, we can examine monthly trends to capture finer temporal patterns.
# Set the style of the plots
sns.set_style("whitegrid")
# Define a color palette for the seasons
season_colors = {'Spring': '#9ACD32', 'Summer': '#FF0000', 'Fall': '#FFA500', 'Winter': '#87CEEB'}
# Question 1: Is there a time of the year for which they experience more visitors (e.g., summer vs winter)?
seasonal_data_avg = merged_full_df.groupby('SEASON')['VISITOR_COUNT'].mean().reset_index()
# Plot the average visitor count by season
plt.figure(figsize = (8, 6))
sns.barplot(x ='SEASON', y = 'VISITOR_COUNT', data = seasonal_data_avg,
order = ['Spring', 'Summer', 'Fall', 'Winter'],
palette = season_colors)
plt.title('Average Visitor Count by Season')
plt.ylabel('Average Number of Visitors')
plt.show()
The bar chart represents the average visitor count for each season. As we can see, the average number of visitors is highest in the summer and lowest in the winter among Chicago Public Libraries.
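As mentioned earlier, we can also look at finer monthly patterns. The sketch below averages VISITOR_COUNT by month number across all years and libraries in merged_full_df.
# Sketch: average visitor count by month, across all years and libraries
monthly_avg = merged_full_df.groupby('MONTH_NUMBER')['VISITOR_COUNT'].mean().reset_index()
plt.figure(figsize = (10, 5))
sns.lineplot(x = 'MONTH_NUMBER', y = 'VISITOR_COUNT', data = monthly_avg, marker = 'o', color = 'b')
plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.title('Average Visitor Count by Month')
plt.xlabel('Month')
plt.ylabel('Average Number of Visitors')
plt.show()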
# Define custom colors for each season (using hex codes)
colors = {
'Spring': '#9ACD32', # Pastel green
'Summer': '#FF0000', # Red
'Fall': '#FFA500', # Orange
'Winter': '#87CEEB' # Light blue
}
# Aggregate total visitor counts by year and season (one column per season)
count_by_year_season = merged_full_df.groupby(['YEAR', 'SEASON'])['VISITOR_COUNT'].sum().unstack()
# Plot the counts of visitors by year and season with custom colors
fig, ax = plt.subplots(figsize = (12, 8))
count_by_year_season.plot(kind = 'bar', stacked = True, ax = ax, color = [colors[col] for col in count_by_year_season.columns])
ax.set_title('Number of Visitors to Libraries Each Year by Season')
ax.set_xlabel('Year')
ax.set_ylabel('Number of Visitors')
# Remove scientific notation and set the y-axis labels to regular format
ax.yaxis.set_major_formatter('{:.0f}'.format)
legend = ax.legend(title = 'Seasons')
plt.setp(legend.get_title(), fontsize = 'medium')
plt.xticks(rotation = 45)
plt.tight_layout()
plt.show()
Looking at visitor counts for each season over time highlights not only the patterns mentioned above but also a decrease in the number of visitors over time for each of the seasons.
We can also plot the total visitor count for each year to visualize overall changes in library usage over time.
# Set the style of the plots
sns.set_style("whitegrid")
# Question 2: Has there been a decline or increase in the use of public libraries?
yearly_data = merged_full_df.groupby('YEAR')['VISITOR_COUNT'].sum().reset_index()
# Include years 2011 and 2023 if missing and set their visitor count to 0
yearly_data = yearly_data.set_index('YEAR').reindex(range(2011, 2024)).fillna(0).reset_index()
# Plot the visitor count by year
plt.figure(figsize = (12, 6))
sns.lineplot(x = 'YEAR', y = 'VISITOR_COUNT', data = yearly_data, marker = 'o', color = 'b')
plt.title('Total Visitor Count by Year')
plt.ylabel('Number of Visitors')
# Format y-axis tick labels to show the count directly without scientific notation
plt.ticklabel_format(style = 'plain', axis = 'y')
# Add a vertical line for the start of the COVID-19 pandemic (assumed in January 2020)
plt.axvline(x = 2020, color = 'red', linestyle = '--', label = 'Start of COVID-19 Pandemic')
plt.legend() # Show the legend with the label for the vertical line
plt.xticks(rotation = 45) # Rotate x-axis labels for better readability
plt.tight_layout() # Adjust layout to avoid cropping labels
plt.show()
The line plot above shows the total number of visitors across Chicago Public Libraries each year. There is a clear trend showing an overall decrease in the number of visitors over time, which may be due to various factors such as the rise of digital media and the internet.
We can compare visitor counts before and after 2020 (when the COVID-19 pandemic started to significantly impact public life). We can visualize this using a line plot or a bar chart and confirm the significance of the observed differences using a statistical test.
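Before running the test, a quick side-by-side bar chart of the two period means makes the comparison concrete (a minimal sketch using the same merged_full_df columns as above; the period labels are illustrative):
# Sketch: compare mean monthly visitor counts before 2020 vs. from 2020 onwards
period = np.where(merged_full_df['YEAR'] < 2020, 'Pre-2020', '2020 onwards')
period_means = merged_full_df.groupby(period)['VISITOR_COUNT'].mean()
plt.figure(figsize = (6, 5))
period_means.reindex(['Pre-2020', '2020 onwards']).plot(kind = 'bar', color = ['steelblue', 'indianred'])
plt.title('Average Monthly Visitor Count per Branch: Pre-2020 vs. 2020 Onwards')
plt.ylabel('Average Number of Visitors')
plt.xticks(rotation = 0)
plt.tight_layout()
plt.show()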
An independent samples t-test is appropriate here because it compares the means of two independent groups: visitor counts before the COVID-19 pandemic (pre-COVID) and visitor counts from 2020 onwards (during COVID). The test determines whether there is a statistically significant difference between the average visitor counts in the two periods. The independence assumption is reasonable, since observations in one period are unrelated to those in the other, and with samples this large the t-test is robust to moderate departures from normality. Additionally, passing `equal_var = False` means the two groups are not assumed to have equal variances (i.e., Welch's t-test), which is appropriate for this analysis.
# Question 3: Did the COVID-19 pandemic contribute to any more changes?
# Let's consider years before 2020 (pre-COVID) and from 2020 onwards (during COVID)
pre_covid = merged_full_df[merged_full_df['YEAR'] < 2020]['VISITOR_COUNT']
during_covid = merged_full_df[merged_full_df['YEAR'] >= 2020]['VISITOR_COUNT']
# Calculate the average visitor count
average_pre_covid = pre_covid.mean()
average_during_covid = during_covid.mean()
# Print the average visitor count
print(f'Average visitor count before COVID-19: {average_pre_covid:.0f}')
print(f'Average visitor count during COVID-19: {average_during_covid:.0f}')
# Conduct the independent samples t-test
t_stat, p_value = stats.ttest_ind(pre_covid, during_covid, equal_var = False)
# Print the results
print(f'T-test statistic: {t_stat:.2f}')
print(f'P-value: {p_value:.5f}')
Average visitor count before COVID-19: 8283
Average visitor count during COVID-19: 3092
T-test statistic: 111.10
P-value: 0.00000
The average visitor count before the COVID-19 pandemic was 8283, while during the pandemic it dropped to 3092. The t-test statistic of 111.10 reflects a large difference in means between the two periods, and the p-value, which rounds to 0.00000 at five decimal places, indicates that this difference is statistically significant.
Based on this test, it appears that the COVID-19 pandemic had a significant impact on visitor counts, leading to a considerable reduction in visitors during the pandemic compared to the pre-COVID period. However, we need to account for the overall decreasing trend in library usage when assessing the impact of COVID-19. One way to do this would be to compare the observed visitor count in 2020 and 2021 with the count predicted by a regression line fitted to the pre-2020 data. If the observed count is significantly lower than the predicted count, this would suggest an additional impact of COVID-19.
yearly_visitors = merged_full_df.groupby('YEAR')['TOTAL_VISITORS_YEAR'].sum()
# Restrict to the years before 2020 for fitting the trend line
pre_2020_data = yearly_visitors[yearly_visitors.index < 2020]
# Fit a linear regression to the pre-2020 data
model = LinearRegression()
model.fit(pre_2020_data.index.values.reshape(-1, 1), pre_2020_data.values)
# Predict visitor count for 2020, 2021, 2022, and 2023
predicted_2020 = model.predict(np.array([[2020]]))[0]
predicted_2021 = model.predict(np.array([[2021]]))[0]
predicted_2022 = model.predict(np.array([[2022]]))[0]
predicted_2023 = model.predict(np.array([[2023]]))[0]
# Plot the observed and predicted visitor count
plt.figure(figsize = (10, 6))
sns.lineplot(x = yearly_visitors.index, y = yearly_visitors.values, label = 'Observed')
plt.plot([2020, 2021, 2022, 2023], [predicted_2020, predicted_2021, predicted_2022, predicted_2023],
'r--', label = 'Predicted')
plt.title('Total Visitors per Year')
plt.xlabel('Year')
plt.ylabel('Total Visitors')
plt.legend()
# Disable scientific notation for the y-axis tick labels
plt.ticklabel_format(axis = 'y', style = 'plain', useMathText = False)
plt.show()
The line plot shows the total number of visitors to public libraries each year. The blue line represents the observed visitor count, while the dashed red line represents the count predicted by a regression line fitted to the data from before 2020.
We can see that the observed visitor counts for 2020 and 2021 are significantly lower than the predicted counts. This suggests that the COVID-19 pandemic likely had an additional impact on library usage beyond the overall decreasing trend.
The decrease in visitors during the pandemic could be due to various factors, such as library closures, restrictions on public gatherings, and people avoiding public places to reduce the risk of virus transmission.
Note that this analysis is quite simplified. A more robust analysis might account for other factors that could affect library usage, such as population changes and the availability of digital resources. Additionally, the linear trend line is a very simple model of the pre-2020 trend, and a more complex model might provide a more accurate prediction.
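As one example, a quadratic trend fitted to the same pre-2020 yearly totals gives an alternative baseline (a minimal sketch reusing pre_2020_data and yearly_visitors from above; the second-degree polynomial is purely illustrative, not a recommendation):
# Sketch: quadratic trend fitted to the pre-2020 yearly totals, extrapolated to 2020-2023
years = pre_2020_data.index.values.astype(float)
x_shifted = years - years.min()                    # shift years so the polynomial fit stays well conditioned
quad_coeffs = np.polyfit(x_shifted, pre_2020_data.values, deg = 2)
quad_trend = np.poly1d(quad_coeffs)
future_years = np.array([2020, 2021, 2022, 2023], dtype = float)
quad_predictions = quad_trend(future_years - years.min())
plt.figure(figsize = (10, 6))
plt.plot(yearly_visitors.index, yearly_visitors.values, marker = 'o', label = 'Observed')
plt.plot(future_years, quad_predictions, 'g--', label = 'Quadratic trend (fit to pre-2020 data)')
plt.title('Observed Totals vs. Quadratic Pre-2020 Trend')
plt.xlabel('Year')
plt.ylabel('Total Visitors')
plt.ticklabel_format(axis = 'y', style = 'plain')
plt.legend()
plt.show()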
Here we will use the Z-score method to detect outliers in the 'VISITOR_COUNT' column. We will then remove these outliers from the data set and try to improve our statistical testing.
# Calculate Z-scores
merged_full_df['Z_SCORE'] = np.abs(stats.zscore(merged_full_df['VISITOR_COUNT']))
# Filter out outliers (we define outliers as data points with a Z-score greater than 3)
df_filtered_no_outliers = merged_full_df[merged_full_df['Z_SCORE'] <= 3]
# Drop the 'Z_SCORE' column as we no longer need it
df_filtered_no_outliers = df_filtered_no_outliers.drop('Z_SCORE', axis = 1)
# Segmentation Analysis
# Here we will perform separate analyses for each library
# As an example, let's compare the total visitor count for each library in 2023
df_2023 = df_filtered_no_outliers[df_filtered_no_outliers['YEAR'] == 2023]
visitor_counts_2023 = df_2023.groupby('NAME')['VISITOR_COUNT'].sum().sort_values(ascending = False)
# Display the libraries with the highest and lowest visitor counts in 2023
visitor_counts_2023
NAME
Sulzer Regional Library              709492
Chinatown                            518014
Lincoln Park                         417928
Merlo                                382445
Logan Square                         380919
                                        ...
Sherman Park                          39837
Chicago Bee                           37443
Brainerd                              37079
Harold Washtington Library Center         0
Galewood-Mont Clare                       0
Name: VISITOR_COUNT, Length: 80, dtype: int64
The output above shows the total visitor count for each library in 2023, sorted in descending order. The Sulzer Regional Library had the highest visitor count, while the Galewood-Mont Clare and Harold Washington Library Center had zero visitors.
Please note that these results are based on the filtered dataset where outliers in visitor count (defined as having a Z-score greater than 3) have been removed.
Next, let's repeat the hypothesis test on the outlier-filtered data to assess whether the observed decrease in visitor count from 2020 onwards is statistically significant. We'll again use a t-test, which compares the mean visitor count before and after 2020. The null hypothesis is that the two means are equal; the (two-sided) alternative is that they differ, with the expectation that the mean visitor count after 2020 is lower.
Before we conduct the t-test, let's check the assumptions. The t-test assumes that the data are normally distributed and that the variances of the two groups are equal. Let's check these assumptions first.
# Check the distribution of visitor counts before and after 2020
before_2020 = df_filtered_no_outliers[df_filtered_no_outliers['YEAR'] < 2020]['VISITOR_COUNT']
after_2020 = df_filtered_no_outliers[df_filtered_no_outliers['YEAR'] >= 2020]['VISITOR_COUNT']
# Plot histograms
plt.figure(figsize = (12, 6))
plt.hist(before_2020, bins = 30, alpha = 0.5, label = 'Before 2020')
plt.hist(after_2020, bins = 30, alpha = 0.5, label = 'After 2020')
plt.title('Distribution of Visitor Count Before and After 2020')
plt.xlabel('Visitor Count')
plt.ylabel('Frequency')
plt.legend()
plt.show()
# Perform Levene's test for equal variances
from scipy.stats import levene
stat, p = levene(before_2020, after_2020)
print(f'Levene\'s test for equal variances: p = {p}')
Levene's test for equal variances: p = 3.3231360711195104e-300
The histograms show the distribution of visitor counts before and after 2020. Both distributions are positively skewed, suggesting that the assumption of normality may not hold. However, the t-test is quite robust to violations of this assumption, especially for large sample sizes.
Levene's test for equal variances gives a very small p-value, indicating that the variances of the two groups are significantly different. This violates one of the assumptions of the t-test. In this case, we can use a version of the t-test that does not assume equal variances, known as Welch's t-test.
Let's proceed with Welch's t-test to compare the mean visitor count before and after 2020.
# Perform Welch's t-test
t_stat, p = ttest_ind(before_2020, after_2020, equal_var = False)
# Print the results of the Welch's t-test
print(f'Welch\'s t-test statistic: {t_stat:.2f}')
print(f'p-value: {p:.5f}')
Welch's t-test statistic: 196.75
p-value: 0.00000
Welch's t-test gives a very small p-value, which is less than 0.05. This indicates that we can reject the null hypothesis of equal means, and conclude that the mean visitor count is significantly lower after 2020 compared to before 2020. This provides statistical evidence for a decrease in library usage after 2020, supporting the visual evidence from the regression line plot.
Please note that this result should be interpreted with caution, as the t-test does not prove causality. The observed decrease in library usage could be due to the COVID-19 pandemic, but it could also be influenced by other factors.
Finally, as mentioned earlier, a more complex analysis might involve multiple regression or time-series analysis to model the relationship between visitor count and multiple variables simultaneously, and to handle temporal patterns in the data. However, these types of analysis are quite complex and beyond the scope of this discussion. They would also require additional data that is not available in the current data set.
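For a flavor of what such a model might look like, the sketch below fits an ordinary least squares regression of monthly visitor counts on a linear year trend plus season indicators using statsmodels (already imported as sm). It only uses the 'YEAR', 'SEASON', and 'VISITOR_COUNT' columns assumed to be in merged_full_df, and falls far short of the multivariate or time-series modeling described above; it is an illustrative setup, not a full analysis.
# Sketch: OLS of monthly visitor counts on a year trend plus season dummies
X = pd.get_dummies(merged_full_df[['YEAR', 'SEASON']], columns = ['SEASON'], drop_first = True)
X = sm.add_constant(X.astype(float))            # add an intercept; cast dummy columns to float
y = merged_full_df['VISITOR_COUNT'].astype(float)
ols_model = sm.OLS(y, X).fit()
print(ols_model.summary())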
The following queries could be used to gain additional insights into per-library usage patterns.
merged_full_df.groupby('NAME')['TOTAL_VISITORS_YEAR'].max()
NAME
Albany Park                 159397
Altgeld                      63052
Archer Heights              124653
Austin                      103092
Austin-Irving               145852
                              ...
West Pullman                121069
West Town                   146851
Whitney M. Young, Jr.       128395
Woodson Regional Library    214173
Wrightwood-Ashburn           89580
Name: TOTAL_VISITORS_YEAR, Length: 80, dtype: int64
merged_full_df.groupby('NAME')['TOTAL_VISITORS_YEAR'].min()
NAME
Albany Park                   421
Altgeld                      5666
Archer Heights              19245
Austin                      19049
Austin-Irving               16357
                              ...
West Pullman                15932
West Town                   35757
Whitney M. Young, Jr.           0
Woodson Regional Library    30324
Wrightwood-Ashburn          12274
Name: TOTAL_VISITORS_YEAR, Length: 80, dtype: int64
merged_full_df.groupby(['NAME', 'YEAR'])['VISITOR_COUNT'].agg(['sum', 'mean'])
NAME | YEAR | sum | mean
---|---|---|---
Albany Park | 2011 | 821975 | 9785.416667
  | 2012 | 549395 | 6540.416667
  | 2013 | 2947 | 35.083333
  | 2014 | 371812 | 4426.333333
  | 2015 | 1076257 | 12812.583333
... | ... | ... | ...
Wrightwood-Ashburn | 2018 | 368270 | 4384.166667
  | 2019 | 331471 | 3946.083333
  | 2020 | 149366 | 1778.166667
  | 2021 | 85918 | 1022.833333
  | 2023 | 182056 | 2167.333333

895 rows × 2 columns
merged_full_df.groupby(['NAME', 'MONTH'])['VISITOR_COUNT'].agg(['sum', 'mean'])
NAME | MONTH | sum | mean
---|---|---|---
Albany Park | APR | 637819 | 7593.083333
  | AUG | 676844 | 8057.666667
  | DEC | 630105 | 7501.250000
  | FEB | 642061 | 7643.583333
  | JAN | 673918 | 8022.833333
... | ... | ... | ...
Wrightwood-Ashburn | MAR | 406714 | 4841.833333
  | MAY | 348362 | 4147.166667
  | NOV | 397754 | 4735.166667
  | OCT | 439789 | 5235.583333
  | SEP | 409878 | 4879.500000

960 rows × 2 columns
merged_full_df.groupby(['NAME', 'SEASON'])['VISITOR_COUNT'].agg(['sum', 'mean'])
NAME | SEASON | sum | mean
---|---|---|---
Albany Park | Fall | 2150246 | 8532.722222
  | Spring | 1944299 | 7715.472222
  | Summer | 2027312 | 8044.888889
  | Winter | 1946084 | 7722.555556
Altgeld | Fall | 875259 | 3473.250000
... | ... | ... | ...
Woodson Regional Library | Winter | 770105 | 9167.916667
Wrightwood-Ashburn | Fall | 1247421 | 4950.083333
  | Spring | 1153495 | 4577.361111
  | Summer | 1257305 | 4989.305556
  | Winter | 1150429 | 4565.194444

320 rows × 2 columns
Each of these queries returns a Series or DataFrame indexed by library name (and, where applicable, by year, month, or season), with the aggregated visitor counts as values. The `groupby` method groups the data by the specified columns, and the `agg` method applies one or more aggregating functions to each group. These results could provide a useful starting point for future research.
# Find the library with the highest attendance
highest_attendance_library = merged_full_df.groupby('NAME')['TOTAL_VISITORS_YEAR'].sum().idxmax()
highest_attendance_count = merged_full_df.groupby('NAME')['TOTAL_VISITORS_YEAR'].sum().max()
# Find the library with the lowest attendance
lowest_attendance_library = merged_full_df.groupby('NAME')['TOTAL_VISITORS_YEAR'].sum().idxmin()
lowest_attendance_count = merged_full_df.groupby('NAME')['TOTAL_VISITORS_YEAR'].sum().min()
print("Library with highest attendance:", highest_attendance_library, "with attendance count of", highest_attendance_count)
print("Library with lowest attendance:", lowest_attendance_library, "with attendance count of", lowest_attendance_count)
Library with highest attendance: Harold Washtington Library Center with attendance count of 225055320
Library with lowest attendance: Legler Regional with attendance count of 8639904
The Jupyter notebook provided carries out an extensive analysis of the public libraries in the city of Chicago. The analysis begins with an exploration of the libraries' geographical distribution across the city, achieved by using the Python library `folium` to generate interactive maps. The data used in the analysis includes the libraries' names, hours of operation, addresses, contact information, and geographical locations.
One of the key insights drawn from the spatial analysis is the distribution of libraries across the city. A heatmap of library locations provides a visual representation of areas with a high density of libraries versus areas with a low density. The analysis also explores the nearest neighbor distance for each library, which gives an idea of how evenly spaced the libraries are. The notebook then analyzes the number of monthly visitors to these libraries through both visualization and statistical testing, as well as any changes that might have occurred in attendance over time (2011-2023) due to important events such as the COVID-19 pandemic.
In conclusion, this project is a comprehensive exploration of the public libraries in Chicago, focusing on their spatial distribution and their usage patterns. The findings could help in strategic decision-making related to the placement of new libraries, resource allocation, and understanding the impact of factors like the COVID-19 pandemic on library usage.
For a more robust analysis, we can consider incorporating additional data such as population and temperature.
Population Data: The number of visitors to a library is likely to be influenced by the population of the area it serves. For instance, if the population of a city decreases, we might expect the number of library visitors to decrease as well. To account for this, we could incorporate population data into our analysis. We can obtain this data from the U.S. Census Bureau, which provides population estimates for cities and towns. Once we have this data, we could adjust the visitor counts by the population to get a per capita visitor count, which might provide a more accurate measure of library usage.
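As a rough illustration of the per capita adjustment, the sketch below merges a hypothetical chicago_population table into the visitor data; the table, its column names, and its values are placeholders rather than Census Bureau figures.
# Sketch: per capita visitor counts, using a hypothetical population table
chicago_population = pd.DataFrame({
    'YEAR': [2018, 2019, 2020],                # placeholder years
    'POPULATION': [2700000, 2695000, 2745000]  # placeholder values, not official estimates
})
per_capita_df = merged_full_df.merge(chicago_population, on = 'YEAR', how = 'inner')
per_capita_df['VISITS_PER_1000_RESIDENTS'] = per_capita_df['VISITOR_COUNT'] / per_capita_df['POPULATION'] * 1000
visits_per_capita_by_year = per_capita_df.groupby('YEAR')['VISITS_PER_1000_RESIDENTS'].sum()
print(visits_per_capita_by_year)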
Temperature Data: The weather could also influence library usage. For example, people might be more likely to visit a library if it's too hot or too cold outside. To account for this, we could incorporate historical temperature data into our analysis. This data can be obtained from various sources, such as the National Centers for Environmental Information (NCEI). We could then examine the correlation between temperature and library usage, and include temperature as a factor in our regression model.
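Similarly, a monthly temperature series could be merged in and correlated with visitor counts, as sketched below; the monthly_temps table and its values are hypothetical placeholders rather than NCEI data, and the MONTH codes are assumed to match the three-letter codes used above.
# Sketch: correlation between average monthly temperature and visitor counts,
# using a hypothetical temperature table keyed on YEAR and MONTH
monthly_temps = pd.DataFrame({
    'YEAR': [2019, 2019, 2019],
    'MONTH': ['JAN', 'JUL', 'OCT'],       # placeholder months, matching the MONTH codes above
    'AVG_TEMP_F': [25.0, 76.0, 55.0]      # placeholder temperatures, not NCEI measurements
})
temps_df = merged_full_df.merge(monthly_temps, on = ['YEAR', 'MONTH'], how = 'inner')
print(temps_df['AVG_TEMP_F'].corr(temps_df['VISITOR_COUNT']))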
Incorporating these additional data would involve several steps:
1. Data Collection: We would first need to collect the necessary data. For population data, we could use the U.S. Census Bureau's API to download population estimates for the relevant years and locations. For temperature data, we would need to find a source that provides historical temperature data for the locations we're interested in.
2. Data Cleaning and Preprocessing: The new data would likely require some cleaning and preprocessing. For example, we might need to handle missing values, convert data types, or merge multiple data sources.
3. Exploratory Data Analysis: Before incorporating the new data into our model, we would perform some exploratory data analysis to understand its characteristics and its relationship with library usage.
4. Modeling: We would then incorporate the new data into our regression model. This could be as simple as adding new variables to our model, or it might involve more complex procedures such as feature engineering or model selection.
5. Evaluation and Interpretation: After fitting our model, we would evaluate its performance and interpret the results. This could involve calculating error metrics, checking the model's assumptions, and interpreting the coefficients.
Remember that the availability of data can vary, and the specifics of the analysis would depend on the data that we are able to obtain.