Project Task: Clean some OpenStreetMap data for a part of the world that you care about.
Data source: https://mapzen.com/data/metro-extracts/
Chosen City: Reno, NV, USA. Reno is not my hometown. So why Reno? Because it is "The Biggest Little City in the World," because there was a silly and hilarious TV show about it, and because I wanted something nice and small to work on for this project.
Project Overview:
This Jupyter notebook represents all the steps of my actual workflow. Only one piece is not done in Python: launching Ubuntu via Vagrant and importing the cleaned-up data set into MongoDB. Please note that throughout this notebook there are comments inside the code sections; discussion appears both in the markdown sections and in the output printed by the code.
import os
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
from collections import defaultdict
import pandas as pd
#The full file is reno.new.osm; a smaller sample is kept on Google Drive for early testing:
#sample_file = r"C:\Users\rf\Google Drive\Education\Python\codes\Udacity\Mongodb\ud032-master\data\reno_sample.osm"
sample_file = r"D:\Python_progs\mongo\reno.new.osm"
CREATED = [ "version", "changeset", "timestamp", "user", "uid"]
#########################################
# Part 1: Auditing and Clean up
#Descriptive Information
print "Name of the file"
print sample_file
print "Size of the file (in megabytes)"
print os.stat(sample_file).st_size/1e6
print "\n"
tree = ET.parse(sample_file)
#the whole document is a tree #root is the very top
root = tree.getroot()
print "Root node is:",root.tag #osm
print "Root node's attributes are:"
print root.attrib
print "\n"
#Or can drill down the root to get a feel as well
print "First child node is:", root[0].tag
print "Its attribute is:", root[0].attrib
print "\n"
print "Similarly for the second child node."
print root[2].tag
print root[2].attrib
print "\n"
#If possible can drill down further like this:
#print root[3][0].attrib
The file I downloaded is named "reno.new.osm", was created on 2016-11-20, and weighs 108MB. Eyeballing a few of the nodes shows that it is structured like the course examples and will require similar restructuring and cleanup.
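For reference, a node in the raw file looks roughly like this (a representative fragment, not copied verbatim from the download):
<node id="..." lat="39.5..." lon="-119.8..." version="..." changeset="..." timestamp="..." user="..." uid="...">
  <tag k="addr:street" v="..."/>
  <tag k="addr:postcode" v="..."/>
</node>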
#Content Summary Information
#count only the first level tags
def count_1sttags(filename):
tree = ET.parse(filename)
root = tree.getroot()
tags = {}
for child in root: #creates an iterable object
tag = child.tag
if tag not in tags.keys():
tags[tag]=1
else:
tags[tag]= tags[tag]+1
return tags
tags_1st = count_1sttags(sample_file)
print "Here are the first level tags with the counts of their appearance in the file:"
#Convert to a df to show as a table
tags_1stdf = pd.DataFrame(tags_1st.items(), columns=["tag", "count"])
print tags_1stdf
print "\n"
#count all tags
def count_alltags(filename):
tags = {}
for ev, elem in ET.iterparse(filename): #creates an iterable object
tag = elem.tag
if tag not in tags.keys():
tags[tag]=1
else:
tags[tag]= tags[tag]+1
return tags
tags_all = count_alltags(sample_file)
print "Here are all the level tags with the counts of their appearance in the file:"
#Convert to a df to show as a table
tags_alldf = pd.DataFrame(tags_all.items(), columns=["tag", "count"])
print tags_alldf
print "\n"
#what are all the possible attributes of the tags?
def process_allkey(sample_file):
    dict1 = {}
    for _, element in ET.iterparse(sample_file):
        child = element.tag
        kset = dict1.setdefault(child, set()) #accumulate the union of attribute names seen for each tag type
        for key in element.attrib:
            kset.add(key)
    #convert the sets to sorted lists for printing
    return {tag: sorted(keys) for tag, keys in dict1.items()}
kdict = process_allkey(sample_file)
print "Here all the possible attributes for the tags:"
print "\n"
pprint.pprint(kdict)
print "\n"
#Noticed that <tag ...> elements carry k= and v= attributes; I want to see the address-related values of the k attribute
def process_addrkey(sample_file,mytag,key):
kset=set()
for _, element in ET.iterparse(sample_file):
for tag in element.iter(mytag):
ks = tag.attrib[key]
if ks[:4] == 'addr':
kset.add(ks)
return kset
addrkeylist = list(process_addrkey(sample_file, 'tag', 'k'))
print "Here are all the address-related attributes for the 'k' key:"
pprint.pprint(addrkeylist)
print "\n"
#A more general version of this function:
#it can be used to find the values recorded under any k= value, e.g. k="building" and k="tourism"
def process_types(sample_file,mytag,key, type1, key2):
kset=set()
for _, element in ET.iterparse(sample_file):
for tag in element.iter(mytag):
ks = tag.attrib[key]
if ks == type1:
vs = tag.attrib[key2]
kset.add(vs)
return kset
building_types = list(process_types(sample_file, 'tag', 'k', 'building', 'v'))
print "There are", len(building_types), "types of buildings; here they are:"
print building_types
print "\n"
tourism_types = list(process_types(sample_file, 'tag', 'k', 'tourism', 'v'))
print "There are", len(tourism_types), "types of entries in the field 'tourism'; here they are:"
print tourism_types
print "\n"
#Do the same for street names and postal codes - might need to add more fixes to the street-name mapping
street_names = list(process_types(sample_file, 'tag', 'k', 'addr:street', 'v'))
post_codes = list(process_types(sample_file, 'tag', 'k', 'addr:postcode', 'v'))
#Export these to csv to understand what needs to be cleaned up.
#pd.DataFrame(addrkeylist).to_csv('addrkeylist.csv')
#pd.DataFrame(building_types).to_csv('building_types.csv')
#pd.DataFrame(street_names).to_csv('street_names.csv')
#pd.DataFrame(post_codes).to_csv('post_codes.csv')
Content Summary Information
I can see that I will have to create new nestings for "pos", "node_refs", and "created". I expect most of the cleanup will have to do with the addresses, though: converting abbreviations to standard names and standardizing zip codes.
I see two kinds of problems with the data: (1) the structure of the data set and (2) the quality of the content. Regarding the structure, I would prefer to re-design the data set into a more logical, hierarchical structure for easier analysis in MongoDB.
Regarding the content of the data, I would standardize the street names and the zip codes whenever possible to avoid inconsistent abbreviations and formatting.
In order to prepare the data, I will create several mini-functions and a final main function to re-structure the data set and modify the content. After that I will export the data into a .json file for easy import into MongoDB where I will continue my analysis.
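To make the target structure concrete, here is the document shape I am aiming for (the values below are made up for illustration; "node_refs" appears only on "way" documents):
{"id": "123456",
 "type": "node",
 "pos": [39.52, -119.81],
 "created": {"version": "2", "changeset": "17206049", "timestamp": "2013-08-03T16:43:42Z", "user": "someone", "uid": "1"},
 "address": {"street": "Virginia Street", "postcode": "89501"}}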
#First, seek to understand what kind of addresses we have in the data
#use reg expression for the keys
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
second_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
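#Quick illustration of what each pattern matches (the example keys below are hypothetical):
assert lower.match("building")                #plain lowercase keys
assert lower_colon.match("addr:street")       #keys with a single colon
assert second_colon.match("addr:street:name") #keys with two colons (skipped during reshaping)
assert problemchars.search("name with space") #keys with characters that are awkward in MongoDB field names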
#find each key's type
def key_type(element, keys):
if element.tag in ['node','way','relation']:
for tag in element.iter("tag"):
if lower.match(tag.attrib["k"]):
keys["lower"] = keys["lower"]+1
elif lower_colon.match(tag.attrib["k"]):
keys["lower_colon"] = keys["lower_colon"]+1
elif problemchars.search(tag.attrib["k"]):
keys["problemchars"] = keys["problemchars"]+1
else:
keys["other"] = keys["other"]+1
return keys
#understand the diversity of the data
def process_map(filename):
keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
for _, element in ET.iterparse(filename):
keys = key_type(element, keys)
return keys
key_review = process_map(sample_file)
print "Counts of key formats:", key_review
#ok, now I know what to fix
# we want "addr:street" to turn into {"address": {"street": "Some value"}}
#Now look at the actual street values - in the v= attribute
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road",
"Trail", "Parkway", "Commons"]
mapping = { "St": "Street",
"St.": "Street",
"Ave": "Avenue",
"Ave.": "Avenue",
"Rd": "Road",
"Rd.": "Road",
"N": "North",
"N.": "North",
"Ct": "Court",
"Ln": "Lane",
"Dr": "Drive",
"W": "West",
"Blvd": "Boulevard",
"Blvd.": "Boulevard"
}
#save all the unique street names
def process_maps(sample_file):
sset=set()
for event, elem in ET.iterparse(sample_file, events=("start",)):
if elem.tag == "node" or elem.tag == "way":
for tag in elem.iter("tag"):
if tag.attrib['k'] == "addr:street":
ss = tag.attrib['v']
sset.add(ss)
return sset
slist = list(process_maps(sample_file))
#now got a unique list of street names
#Use the regular expressions to find everything that needs to be fixed
def audit_street_type(street_types, street_name):
m = street_type_re.search(street_name)
if m:
street_type = m.group()
if street_type not in expected:
street_types[street_type].add(street_name)
#check if element is a street name
def is_street_name(elem):
return (elem.attrib['k'] == "addr:street")
#the main audit fn that calls the other mini fns
def audit(osmfile):
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(osmfile, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
    return street_types
st_types = audit(sample_file)
st_typeslist = list(st_types)
#these are the street types I may still need to add to the mapping - if empty, all names are already in good format
#now will map wrong names to correct ones - by splitting and reconstructing
def update_name(name, mapping):
    words = name.split()
    neww = []
    for w in words:
        if w in mapping: #swap the abbreviation for its full form
            w = mapping[w]
        neww.append(w)
    return ' '.join(neww)
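#Spot check with a made-up street name:
print update_name("4th St", mapping) #expected: "4th Street"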
#another fn to actually map old to new format
def map_names(street_types):
new_types = {}
for st_type, ways in street_types.iteritems():
for name in ways:
better_name = update_name(name, mapping)
new_types[name]=better_name
return new_types
new_types = map_names(st_types)
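#Peek at a few of the proposed corrections (old name -> fixed name):
pprint.pprint(new_types.items()[:5])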
#Remember, we want "addr:street" to turn into {"address": {"street": "Some value"}}
#Also, some zip codes are not of proper length 5 - e.g. NV 89523-3829
#just keep the first 5 digits
def update_zip(z):
    zipcode = ''
    for one in z:
        if one.isdigit():
            zipcode += one
            if len(zipcode) == 5:
                break
    return zipcode
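#Spot check with the problem zip code mentioned above:
print update_zip("NV 89523-3829") #expected: "89523"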
#the main processing fn - transform the data before importing into MongoDB
#Shape_element - restructures and cleans the data
def shape_element(element):
    node = {}
    if element.tag in tags_1st: #only shape the first-level elements (node, way, relation, bounds)
myTempDictionary = {}
node['type'] = element.tag
#print element.tag #all the nodes and ways
posList = [None for i in range(2)]
for key in element.attrib: #to go through all the attributes - the text inside the <node and <tag
#print key #all the attributes insde the nodes
if key in CREATED: #quicker way to create this group
myTempDictionary[key]= element.attrib[key]
node['created'] = myTempDictionary
elif key == 'lat':
posList[0] = float(element.attrib[key])
node['pos'] = posList
elif key == 'lon':
posList[1] = float(element.attrib[key])
node['pos'] = posList
else:
node[key] = element.attrib[key]
# access the nested elements - under the <node and <way
nodeRefsList = [] #and this has to be here
adict = {}
        for secondLevelTag in element: #the <tag> and <nd> children under <node> and <way>
            #print secondLevelTag.tag, secondLevelTag.attrib
            if 'ref' in secondLevelTag.attrib: #<nd> children of <way> carry references to node ids
                nodeRefsList.append(secondLevelTag.attrib['ref'])
                node['node_refs'] = nodeRefsList
            elif 'k' in secondLevelTag.attrib: #<tag> children carry k/v pairs
                katt = secondLevelTag.attrib['k']
                vatt = secondLevelTag.attrib['v']
                if katt[:5] == "addr:" and not second_colon.match(katt):
                    if katt == "addr:street":
                        vatt = update_name(vatt, mapping)
                    elif katt == "addr:postcode":
                        vatt = update_zip(vatt)
                    adict[katt[5:]] = vatt
                    #print katt[0:5], katt[5:], vatt
                elif not second_colon.match(katt):
                    node[katt] = vatt
#print adict
if any(adict):
node["address"]=adict
return node
else:
return None
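#Quick sanity check of shape_element on a made-up node (assumes the cells above have been run):
demo_el = ET.fromstring('<node id="1" lat="39.52" lon="-119.81" version="1" changeset="1" '
                        'timestamp="2016-11-20T00:00:00Z" user="someone" uid="1">'
                        '<tag k="addr:street" v="Virginia St"/>'
                        '<tag k="addr:postcode" v="NV 89501-1234"/></node>')
pprint.pprint(shape_element(demo_el))
#expected: "pos" of [39.52, -119.81], a nested "created" dict, and
#"address" of {"street": "Virginia Street", "postcode": "89501"}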
#output into json for easier import into Mongodb
def export_json(file_in, pretty = False):
    #write one JSON document per line so mongoimport can read the file directly
file_out = "{0}.json".format(file_in)
data = []
with codecs.open(file_out, "w") as fo:
for _, element in ET.iterparse(file_in):
el = shape_element(element) #the main work is done here
if el:
data.append(el)
if pretty:
fo.write(json.dumps(el, indent=2)+"\n")
else:
fo.write(json.dumps(el) + "\n")
return data
#This is the main call that will produce a .json file
#data_out = export_json(sample_file, False)
#Final output is D:\Python_progs\mongo\reno.new.osm.json - load this into MongoDB
print "Name of the output file (to import into MongoDB)"
print "D:\Python_progs\mongo\reno.new.osm.json"
print "Size of the file (in megabytes)"
print os.stat(r"D:\Python_progs\mongo\reno.new.osm.json").st_size/1e6
#import into Mongo - don't forget to drop the collection first, otherwise will append
# mongoimport -d examples -c reno --type json --file /vagrant/shared/reno.new.osm.json
# Imported 543,641 documents (aka rows)
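#Alternatively (a sketch, assuming pymongo is installed and mongod is running locally),
#the line-delimited JSON file could be loaded from Python instead of using mongoimport:
#from pymongo import MongoClient
#client = MongoClient("mongodb://localhost:27017")
#db = client.examples
#db.reno.drop() #drop first, otherwise documents get appended
#batch = []
#with open(r"D:\Python_progs\mongo\reno.new.osm.json") as f:
#    for line in f:
#        batch.append(json.loads(line))
#        if len(batch) == 10000: #insert in chunks to keep memory use modest
#            db.reno.insert_many(batch)
#            batch = []
#if batch:
#    db.reno.insert_many(batch)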
The cleaned-up file is slightly larger than the original - 124MB vs. 108MB - but the data is now structured in a more intuitive fashion for use in MongoDB.
from pymongo import MongoClient
import pprint
client = MongoClient("mongodb://localhost:27017")
db = client.examples
col1=db.reno
print "How many documents in total?"
print col1.count() #543,641 documents
print "\n"
print "A few basic queries"
print "\n"
print "Output a single document"
print col1.find_one({})
print "\n"
print "Output a single document with type:bounds"
print col1.find_one({"type":"bounds"})
print "\n"
print "Output a single document with type:node"
print col1.find_one({"type":"node"})
print "\n"
print "Output a single document with type:way"
print col1.find_one({"type":"way"})
print "\n"
print "Count types of tags."
print "\n"
print "type:node"
print col1.find({"type": "node"}).count() #505783
print "type:way - locations"
print col1.find({"type": "way"}).count() #37516
print "type:relation"
print col1.find({"type": "relation"}).count() #2
print "created.user - unique users who contributed to this file"
print col1.distinct("created.user").__len__() #297
print "amenity"
print col1.distinct("amenity").__len__() #59
print "tourism"
print col1.distinct("tourism").__len__() #11
print "address.postcode - unique zip codes"
print col1.distinct("address.postcode").__len__() #21
print "address.street - unique streets"
print col1.distinct("address.street").__len__() #172
The "eno.new.osm.json" file that I imported into MongoDB was 123.9MB.
297 unique users contributed to the creation of this file.
There are 505,783 nodes and 37,516 ways in this file.
There are 59 distinct values for "amenity" and 11 for "tourism."
It will be interesting to see which values appear in the amenity and tourism tags, and who the top contributing users are.
#Create a flexible two stage fn to create aggregations
def reno_agg(db, pipeline):
result = db.reno.aggregate(pipeline)
return result
#how many amenity types
def amenity_pipe():
    pipeline = [
        {"$match" : {"amenity" : {"$exists" : True, "$ne" : None}}},
        {"$group" : {"_id" : "$amenity", "count" : {"$sum" : 1}}},
        {"$sort" : {"count" : -1}}
    ]
return pipeline
amenity_count_out = list(reno_agg(db, amenity_pipe()))
print "Top 3 amenities in Reno:"
pprint.pprint(amenity_count_out[:3])
print "\n"
#how many tourism types
def tourism_pipe():
    pipeline = [
        {"$match" : {"tourism" : {"$exists" : True, "$ne" : None}}},
        {"$group" : {"_id" : "$tourism", "count" : {"$sum" : 1}}},
        {"$sort" : {"count" : -1}}
    ]
return pipeline
tourism_count_out = list(reno_agg(db, tourism_pipe()))
print "Top 3 entries in 'tourism'"
pprint.pprint(tourism_count_out[:3])
print "\n"
#User - contributor summary
def user_pipe():
pipeline = [
{"$group" : {"_id" : "$created.user",
"count" : {"$sum" : 1 }}},
{"$sort" : { "count" : -1 }}
]
return pipeline
user_count_out = list(reno_agg(db, user_pipe()))
print "Top 3 contributing users"
pprint.pprint(user_count_out[:3])
print "\n"
Looks like the most common "amenities" in Reno are parking, school and restaurant.
The most numerous entries for "tourism" are motels, hotels and information.
By far the main contributor to this open street map is someone named 'theangrytomato', with the next two contributors lagging much further behind.
Cleaning up the address data has been fairly straightforward, with the exception of data entry errors (e.g. zip codes) that need to be cross-validated by a geocoder or corrected manually. Restructuring the data into a more intuitive hierarchical design makes it easier to work with in a NoSQL database like MongoDB. The top contributor for this file is a user named "theangrytomato" (unknown whether a bot or a real person). The second contributor, "woodpeck_fixbot", is a bot. Perhaps promoting contributions by local residents and visitors could improve the quality of the street data and add more content.
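As a rough illustration of that cross-validation idea, a first pass could flag postcodes that fall outside a plausible local range (the 89400-89599 range below is an assumption; a real check would query a geocoding service):
#flag distinct postcodes that are missing, non-numeric, or outside the assumed Reno/Sparks range
def flag_suspect_postcodes(collection, low=89400, high=89599):
    suspects = []
    for z in collection.distinct("address.postcode"):
        if not (z and z.isdigit() and low <= int(z) <= high):
            suspects.append(z)
    return suspects

print flag_suspect_postcodes(col1)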