Project Task: Clean some OpenStreetMap data for a part of the world that you care about.
Data source: https://mapzen.com/data/metro-extracts/
Chosen City: Reno, NV, USA. Reno is not my hometown. So why Reno? Because it is "The Biggest Little City in the World," because there was a silly and hilarious TV show about it, and because I wanted something nice and small to work on for this project.
Project Overview:
This Jupyter notebook represents all the steps of my actual workflow. Only one piece is not done in Python: launching Ubuntu via Vagrant and importing the cleaned-up data set into MongoDB. Please note that throughout this notebook there are comments inside the code sections; discussion appears both in the markdown sections and in the output printed by the code.
import os
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
from collections import defaultdict
import pandas as pd
#The full file is reno.new.osm; a smaller sample is kept on Google Drive for early testing:
#sample_file = r"C:\Users\rf\Google Drive\Education\Python\codes\Udacity\Mongodb\ud032-master\data\reno_sample.osm"
sample_file = r"D:\Python_progs\mongo\reno.new.osm"
CREATED = [ "version", "changeset", "timestamp", "user", "uid"]
#########################################
# Part 1: Auditing and Clean up
#Descriptive Information
print "Name of the file"
print sample_file
print "Size of the file (in megabytes)"
print os.stat(sample_file).st_size/1e6
print "\n"
tree = ET.parse(sample_file)
#the whole document is a tree #root is the very top
root = tree.getroot()
print "Root node is:",root.tag #osm
print "Root node's attributes are:"
print root.attrib
print "\n"
#Or can drill down the root to get a feel as well
print "First child node is:", root[0].tag
print "Its attribute is:", root[0].attrib
print "\n"
print "Similarly for the second child node."
print root[2].tag
print root[2].attrib
print "\n"
#If possible can drill down further like this:
#print root[3][0].attrib
The file I downloaded is named "reno.new.osm", was created on 2016-11-20, and weighs 108MB. Eyeballing a few of the nodes shows that it is structured like the course examples and will require similar restructuring and cleanup.
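For reference, a node in the raw file looks roughly like this (a representative fragment, not copied verbatim from the download):
<node id="..." lat="39.5..." lon="-119.8..." version="..." changeset="..." timestamp="..." user="..." uid="...">
  <tag k="addr:street" v="..."/>
  <tag k="addr:postcode" v="..."/>
</node>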
#Content Summary Information
#count only the first level tags
def count_1sttags(filename):
tree = ET.parse(filename)
root = tree.getroot()
tags = {}
for child in root: #creates an iterable object
tag = child.tag
if tag not in tags.keys():
tags[tag]=1
else:
tags[tag]= tags[tag]+1
return tags
tags_1st = count_1sttags(sample_file)
print "Here are the first level tags with the counts of their appearance in the file:"
#Convert to a df to show as a table
tags_1stdf = pd.DataFrame(tags_1st.items(), columns=["tag", "count"])
print tags_1stdf
print "\n"
#count all tags
def count_alltags(filename):
tags = {}
for ev, elem in ET.iterparse(filename): #creates an iterable object
tag = elem.tag
if tag not in tags.keys():
tags[tag]=1
else:
tags[tag]= tags[tag]+1
return tags
tags_all = count_alltags(sample_file)
print "Here are all the level tags with the counts of their appearance in the file:"
#Convert to a df to show as a table
tags_alldf = pd.DataFrame(tags_all.items(), columns=["tag", "count"])
print tags_alldf
print "\n"
#what are all the possible attributes of the tags?
def process_allkey(sample_file):
    dict1 = {}
    for _, element in ET.iterparse(sample_file):
        child = element.tag
        kset = dict1.setdefault(child, set()) #accumulate the union of attribute names seen for each tag type
        for key in element.attrib:
            kset.add(key)
    #convert the sets to sorted lists for printing
    return {tag: sorted(keys) for tag, keys in dict1.items()}
kdict = process_allkey(sample_file)
print "Here all the possible attributes for the tags:"
print "\n"
pprint.pprint(kdict)
print "\n"
#Noticed that <tag ...> elements carry k= and v= attributes; I want to see the address-related values of the k attribute
def process_addrkey(sample_file,mytag,key):
kset=set()
for _, element in ET.iterparse(sample_file):
for tag in element.iter(mytag):
ks = tag.attrib[key]
if ks[:4] == 'addr':
kset.add(ks)
return kset
addrkeylist = list(process_addrkey(sample_file, 'tag', 'k'))
print "Here are all the address-related attributes for the 'k' key:"
pprint.pprint(addrkeylist)
print "\n"
#A more general version of this function:
#it can be used to find the values recorded under any k= value, e.g. k="building" and k="tourism"
def process_types(sample_file,mytag,key, type1, key2):
kset=set()
for _, element in ET.iterparse(sample_file):
for tag in element.iter(mytag):
ks = tag.attrib[key]
if ks == type1:
vs = tag.attrib[key2]
kset.add(vs)
return kset
building_types = list(process_types(sample_file, 'tag', 'k', 'building', 'v'))
print "There are", len(building_types), "types of buildings; here they are:"
print building_types
print "\n"
tourism_types = list(process_types(sample_file, 'tag', 'k', 'tourism', 'v'))
print "There are", len(tourism_types), "types of entries in the field 'tourism'; here they are:"
print tourism_types
print "\n"
#Do the same for street names and postal codes - might need to add more fixes to the street-name mapping
street_names = list(process_types(sample_file, 'tag', 'k', 'addr:street', 'v'))
post_codes = list(process_types(sample_file, 'tag', 'k', 'addr:postcode', 'v'))
#Export these to csv to understand what needs to be cleaned up.
#pd.DataFrame(addrkeylist).to_csv('addrkeylist.csv')
#pd.DataFrame(building_types).to_csv('building_types.csv')
#pd.DataFrame(street_names).to_csv('street_names.csv')
#pd.DataFrame(post_codes).to_csv('post_codes.csv')
Content Summary Information
I can see that I will have to create new nestings for "pos", "node_refs", and "created". I expect most of the cleanup will have to do with the addresses, though: converting abbreviations to standard names and standardizing zip codes.
I see two kinds of problems with the data: (1) the structure of the data set and (2) the quality of the content. Regarding the structure, I would prefer to re-design the data set into a more logical, hierarchical structure for easier analysis in MongoDB.
Regarding the content of the data, I would standardize the street names and the zip codes whenever possible to avoid inconsistent abbreviations and formatting.
In order to prepare the data, I will create several mini-functions and a final main function to re-structure the data set and modify the content. After that I will export the data into a .json file for easy import into MongoDB where I will continue my analysis.
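To make the target structure concrete, here is the document shape I am aiming for (the values below are made up for illustration; "node_refs" appears only on "way" documents):
{"id": "123456",
 "type": "node",
 "pos": [39.52, -119.81],
 "created": {"version": "2", "changeset": "17206049", "timestamp": "2013-08-03T16:43:42Z", "user": "someone", "uid": "1"},
 "address": {"street": "Virginia Street", "postcode": "89501"}}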
#First, seek to understand what kind of addresses we have in the data
#use reg expression for the keys
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
second_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
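#Quick illustration of what each pattern matches (the example keys below are hypothetical):
assert lower.match("building")                #plain lowercase keys
assert lower_colon.match("addr:street")       #keys with a single colon
assert second_colon.match("addr:street:name") #keys with two colons (skipped during reshaping)
assert problemchars.search("name with space") #keys with characters that are awkward in MongoDB field names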
#find each key's type
def key_type(element, keys):
if element.tag in ['node','way','relation']:
for tag in element.iter("tag"):
if lower.match(tag.attrib["k"]):
keys["lower"] = keys["lower"]+1
elif lower_colon.match(tag.attrib["k"]):
keys["lower_colon"] = keys["lower_colon"]+1
elif problemchars.search(tag.attrib["k"]):
keys["problemchars"] = keys["problemchars"]+1
else:
keys["other"] = keys["other"]+1
return keys
#understand the diversity of the data
def process_map(filename):
keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
for _, element in ET.iterparse(filename):
keys = key_type(element, keys)
return keys
key_review = process_map(sample_file)
print "Counts of key formats:", key_review
#ok, now I know what to fix
# we want "addr:street" to turn into {"address": {"street": "Some value"}}
#Now look at the actual street values - in the v= attribute
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road",
"Trail", "Parkway", "Commons"]
mapping = { "St": "Street",
"St.": "Street",
"Ave": "Avenue",
"Ave.": "Avenue",
"Rd": "Road",
"Rd.": "Road",
"N": "North",
"N.": "North",
"Ct": "Court",
"Ln": "Lane",
"Dr": "Drive",
"W": "West",
"Blvd": "Boulevard",
"Blvd.": "Boulevard"
}
#save all the unique street names
def process_maps(sample_file):
sset=set()
for event, elem in ET.iterparse(sample_file, events=("start",)):
if elem.tag == "node" or elem.tag == "way":
for tag in elem.iter("tag"):
if tag.attrib['k'] == "addr:street":
ss = tag.attrib['v']
sset.add(ss)
return sset
slist = list(process_maps(sample_file))
#now got a unique list of street names
#Use the regular expressions to find everything that needs to be fixed
def audit_street_type(street_types, street_name):
m = street_type_re.search(street_name)
if m:
street_type = m.group()
if street_type not in expected:
street_types[street_type].add(street_name)
#check if element is a street name
def is_street_name(elem):
return (elem.attrib['k'] == "addr:street")
#the main audit fn that calls the other mini fns
def audit(osmfile):
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(osmfile, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
    return street_types
st_types = audit(sample_file)
st_typeslist = list(st_types)
#these are the street types I may still need to add to the mapping - if empty, all names are already in good format
#now will map wrong names to correct ones - by splitting and reconstructing
def update_name(name, mapping):
    words = name.split()
    neww = []
    for w in words:
        if w in mapping: #swap the abbreviation for its full form
            w = mapping[w]
        neww.append(w)
    return ' '.join(neww)
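#Spot check with a made-up street name:
print update_name("4th St", mapping) #expected: "4th Street"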
#another fn to actually map old to new format
def map_names(street_types):
new_types = {}
for st_type, ways in street_types.iteritems():
for name in ways:
better_name = update_name(name, mapping)
new_types[name]=better_name
return new_types
new_types = map_names(st_types)
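#Peek at a few of the proposed corrections (old name -> fixed name):
pprint.pprint(new_types.items()[:5])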
#Remember, we want "addr:street" to turn into {"address": {"street": "Some value"}}
#Also, some zip codes are not of proper length 5 - e.g. NV 89523-3829
#just keep the first 5 digits
def update_zip(z):
    zipcode = ''
    for one in z:
        if one.isdigit():
            zipcode += one
            if len(zipcode) == 5:
                break
    return zipcode
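#Spot check with the problem zip code mentioned above:
print update_zip("NV 89523-3829") #expected: "89523"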
#the main processing fn - transform the data before importing into MongoDB
#Shape_element - restructures and cleans the data
def shape_element(element):
    node = {}
    if element.tag in tags_1st: #only shape the first-level elements (node, way, relation, bounds)
myTempDictionary = {}
node['type'] = element.tag
#print element.tag #all the nodes and ways
posList = [None for i in range(2)]
for key in element.attrib: #to go through all the attributes - the text inside the <node and <tag
#print key #all the attributes insde the nodes
if key in CREATED: #quicker way to create this group
myTempDictionary[key]= element.attrib[key]
node['created'] = myTempDictionary
elif key == 'lat':
posList[0] = float(element.attrib[key])
node['pos'] = posList
elif key == 'lon':
posList[1] = float(element.attrib[key])
node['pos'] = posList
else:
node[key] = element.attrib[key]
# access the nested elements - under the <node and <way
nodeRefsList = [] #and this has to be here
adict = {}
        for secondLevelTag in element: #the <tag> and <nd> children under <node> and <way>
            #print secondLevelTag.tag, secondLevelTag.attrib
            if 'ref' in secondLevelTag.attrib: #<nd> children of <way> carry references to node ids
                nodeRefsList.append(secondLevelTag.attrib['ref'])
                node['node_refs'] = nodeRefsList
            elif 'k' in secondLevelTag.attrib: #<tag> children carry k/v pairs
                katt = secondLevelTag.attrib['k']
                vatt = secondLevelTag.attrib['v']
                if katt[:5] == "addr:" and not second_colon.match(katt):
                    if katt == "addr:street":
                        vatt = update_name(vatt, mapping)
                    elif katt == "addr:postcode":
                        vatt = update_zip(vatt)
                    adict[katt[5:]] = vatt
                    #print katt[0:5], katt[5:], vatt
                elif not second_colon.match(katt):
                    node[katt] = vatt
#print adict
if any(adict):
node["address"]=adict
return node
else:
return None
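#Quick sanity check of shape_element on a made-up node (assumes the cells above have been run):
demo_el = ET.fromstring('<node id="1" lat="39.52" lon="-119.81" version="1" changeset="1" '
                        'timestamp="2016-11-20T00:00:00Z" user="someone" uid="1">'
                        '<tag k="addr:street" v="Virginia St"/>'
                        '<tag k="addr:postcode" v="NV 89501-1234"/></node>')
pprint.pprint(shape_element(demo_el))
#expected: "pos" of [39.52, -119.81], a nested "created" dict, and
#"address" of {"street": "Virginia Street", "postcode": "89501"}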
#output into json for easier import into Mongodb
def export_json(file_in, pretty = False):
    #write one JSON document per line so mongoimport can read the file directly
file_out = "{0}.json".format(file_in)
data = []
with codecs.open(file_out, "w") as fo:
for _, element in ET.iterparse(file_in):
el = shape_element(element) #the main work is done here
if el:
data.append(el)
if pretty:
fo.write(json.dumps(el, indent=2)+"\n")
else:
fo.write(json.dumps(el) + "\n")
return data
#This is the main call that will produce a .json file
#data_out = export_json(sample_file, False)
#Final output is D:\Python_progs\mongo\reno.new.osm.json - load this into MongoDB
print "Name of the output file (to import into MongoDB)"
print "D:\Python_progs\mongo\reno.new.osm.json"
print "Size of the file (in megabytes)"
print os.stat(r"D:\Python_progs\mongo\reno.new.osm.json").st_size/1e6
#import into Mongo - don't forget to drop the collection first, otherwise will append
# mongoimport -d examples -c reno --type json --file /vagrant/shared/reno.new.osm.json
# Imported 543,641 documents (aka rows)
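#Alternatively (a sketch, assuming pymongo is installed and mongod is running locally),
#the line-delimited JSON file could be loaded from Python instead of using mongoimport:
#from pymongo import MongoClient
#client = MongoClient("mongodb://localhost:27017")
#db = client.examples
#db.reno.drop() #drop first, otherwise documents get appended
#batch = []
#with open(r"D:\Python_progs\mongo\reno.new.osm.json") as f:
#    for line in f:
#        batch.append(json.loads(line))
#        if len(batch) == 10000: #insert in chunks to keep memory use modest
#            db.reno.insert_many(batch)
#            batch = []
#if batch:
#    db.reno.insert_many(batch)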
The cleaned-up file is slightly larger than the original - 124MB vs. 108MB - but the data is now structured in a more intuitive fashion for use in MongoDB.
from pymongo import MongoClient
import pprint
client = MongoClient("mongodb://localhost:27017")
db = client.examples
col1=db.reno
print "How many documents in total?"
print col1.count() #543,641 documents
print "\n"
print "A few basic queries"
print "\n"
print "Output a single document"
print col1.find_one({})
print "\n"
print "Output a single document with type:bounds"
print col1.find_one({"type":"bounds"})
print "\n"
print "Output a single document with type:node"
print col1.find_one({"type":"node"})
print "\n"
print "Output a single document with type:way"
print col1.find_one({"type":"way"})
print "\n"
print "Count types of tags."
print "\n"
print "type:node"
print col1.find({"type": "node"}).count() #505783
print "type:way - locations"
print col1.find({"type": "way"}).count() #37516
print "type:relation"
print col1.find({"type": "relation"}).count() #2
print "created.user - unique users who contributed to this file"
print col1.distinct("created.user").__len__() #297
print "amenity"
print col1.distinct("amenity").__len__() #59
print "tourism"
print col1.distinct("tourism").__len__() #11
print "address.postcode - unique zip codes"
print col1.distinct("address.postcode").__len__() #21
print "address.street - unique streets"
print col1.distinct("address.street").__len__() #172
The "eno.new.osm.json" file that I imported into MongoDB was 123.9MB.
297 unique users contributed to the creation of this file.
There are 505,783 nodes and 37,516 ways in this file.
There are 59 distinct values for "amenity" and 11 for "tourism."
It will be interesting to see which values appear in the amenity and tourism tags, and who the top contributing users are.
#Create a flexible two stage fn to create aggregations
def reno_agg(db, pipeline):
result = db.reno.aggregate(pipeline)
return result
#how many amenity types
def amenity_pipe():
    pipeline = [
        {"$match" : {"amenity" : {"$exists" : True, "$ne" : None}}},
        {"$group" : {"_id" : "$amenity", "count" : {"$sum" : 1}}},
        {"$sort" : {"count" : -1}}
    ]
return pipeline
amenity_count_out = list(reno_agg(db, amenity_pipe()))
print "Top 3 amenities in Reno:"
pprint.pprint(amenity_count_out[:3])
print "\n"
#how many tourism types
def tourism_pipe():
    pipeline = [
        {"$match" : {"tourism" : {"$exists" : True, "$ne" : None}}},
        {"$group" : {"_id" : "$tourism", "count" : {"$sum" : 1}}},
        {"$sort" : {"count" : -1}}
    ]
return pipeline
tourism_count_out = list(reno_agg(db, tourism_pipe()))
print "Top 3 entries in 'tourism'"
pprint.pprint(tourism_count_out[:3])
print "\n"
#User - contributor summary
def user_pipe():
pipeline = [
{"$group" : {"_id" : "$created.user",
"count" : {"$sum" : 1 }}},
{"$sort" : { "count" : -1 }}
]
return pipeline
user_count_out = list(reno_agg(db, user_pipe()))
print "Top 3 contributing users"
pprint.pprint(user_count_out[:3])
print "\n"
Looks like the most common "amenities" in Reno are parking, school and restaurant.
The most numerous entries for "tourism" are motels, hotels and information.
By far the main contributor to this open street map is someone named 'theangrytomato', with the next two contributors lagging much further behind.
Cleaning up the address data has been fairly straightforward, with the exception of data entry errors (e.g. zip codes) that need to be cross-validated by a geocoder or corrected manually. Restructuring the data into a more intuitive hierarchical design makes it easier to work with in a NoSQL database like MongoDB. The top contributor for this file is a user named "theangrytomato" (unknown whether a bot or a real person). The second contributor, "woodpeck_fixbot", is a bot. Perhaps promoting contributions by local residents and visitors could improve the quality of the street data and add more content.
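As a rough illustration of that cross-validation idea, a first pass could flag postcodes that fall outside a plausible local range (the 89400-89599 range below is an assumption; a real check would query a geocoding service):
#flag distinct postcodes that are missing, non-numeric, or outside the assumed Reno/Sparks range
def flag_suspect_postcodes(collection, low=89400, high=89599):
    suspects = []
    for z in collection.distinct("address.postcode"):
        if not (z and z.isdigit() and low <= int(z) <= high):
            suspects.append(z)
    return suspects

print flag_suspect_postcodes(col1)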