Auto-populate fields with Google Maps data

Jen Ren 2020-05-22 14 minute read

While in quarantine, I’ve been volunteering with the Bay Area Community Resources Map team on improving their data quality. Given the economic fallout from coronavirus-related closures, the group created a map for individuals and families to access critical resources and services such as free food, meals, legal assistance, and medical care.

The map is powered by a huge volunteer workforce who manually contact each resource to ensure that its information is up-to-date. Calling each place and entering the relevant information takes a long time, so I helped simplify volunteers’ workload by auto-populating as much as we could from Google Maps, which they could then check against. This notebook shares the scripts used to pull that data via the Python client for the Google Places API. There are a number of tweaks we could make to optimize GCP costs, but I’ve left the code as-is because its current format is better suited to iterative development (which is where we still are). I expect to make changes once we productionize it in a pipeline and solidify our data structures.

I’m sharing this in case it serves as a reference for replicating the approach and pulling information from Google Maps for your own geospatial data projects. There’s a lot of rich data to explore there!


Goal

I happened to start with a dataset published by the USDA which lists retailers that accept SNAP, a federal food stamp program. Since our team is only interested in California’s Bay Area, the dataset was filtered down to the relevant state and counties. The columns were also tidied and renamed to include only:

## | new name  | old name   | type    |
## |-----------|------------|---------|
## | site_name | Store_Name | string  |
## | address   | Address    | string  |
## | city      | City       | string  |
## | long      | Longitude  | numeric |
## | lat       | Latitude   | numeric |
## | county    | County     | string  |

For every retailer in that new dataset, we wanted to find additional attributes such as:

  • Longitude and latitude
  • Zipcode
  • Operating hours
  • Business status, i.e. whether it is currently operational, temporarily closed, or permanently closed
  • Phone number
  • Website

Though the first two items are given in the USDA dataset, the team expressed concern that they weren’t entirely accurate or had been dropped from the table earlier in the data processing workflow. Therefore, they wanted to look them up in Google Maps and check them against it. The latter four items aren’t found in the table, but they are fields we can look up from Google Maps! Specifically, the Google Places API helps us look up a place given a search string, and it can return details about a particular place we’ve found.

# Constants to replace with your own values

INPUT_PATH = 'INPUT_PATH.csv'
OUTPUT_PATH = 'OUTPUT_PATH.csv'
API_KEY = 'YOUR_KEY'

Import data

In our case, our data is small enough to be saved in a csv file. However, it still has over 800 rows! Checking each one’s information manually can take up quite a bit of time. We’ll import our dataset and develop our script on a sample of the data first before running it on the whole dataset.

import pandas as pd

df = pd.read_csv(INPUT_PATH)
df_sample = df.head(10)

df.head(5)
## | site_name                 | address              | city          | long       | lat       | county      |
## |---------------------------|----------------------|---------------|------------|-----------|-------------|
## | Dollar Tree 5522          | 2222 Business Cir    | San Jose      | -121.93281 | 37.323013 | Santa Clara |
## | Joya Supermarket, Llc.    | 2512 California St   | Mountain View | -122.10679 | 37.403904 | Santa Clara |
## | Safeway 2887              | 150 E El Camino Real | Sunnyvale     | -122.03108 | 37.366657 | Santa Clara |
## | Morgans Hill Market       | 16935 Monterey St    | Morgan Hill   | -121.64956 | 37.124382 | Santa Clara |
## | Bakery/Panaderia La Mejor | 1239 E Julian St     | San Jose      | -121.87022 | 37.351727 | Santa Clara |

Get an API key

We’ll also need access to the Google Places API. In order to use the Places API, you’ll first need to get an API key, which allows you to make requests and retrieve information from the web service. Follow this guide to get your key and paste it into the API_KEY constant defined above.

We’ll also import the Python client for Google Maps. It has some handy functions that make it easier to work with the Places API.

import googlemaps

gmaps = googlemaps.Client(key = API_KEY)

Look up a place

The information we’re interested in is the details of a place. The way we retrieve those from Google Maps is through a unique ID that Google assigns to each place. However, we don’t have that ID offhand; we’ll need to look it up based on the information we do have.

When we manually search for a location on Google Maps, typically we’re able to type in the place’s name to find a result. However, for large retailers, there are multiple (if not hundreds or thousands of) locations with the same name (think: Safeway or 7-Eleven). By default, Google Maps biases to your current location and searches for matches near you (that means the first results returned when I look up 7-Eleven will be near me in Seattle, not in Singapore). If you have an address and city information, that will also help sort the results. In our case, the USDA dataset gives us a place’s:

  • name
  • address
  • city
  • latitude
  • longitude

However, we know that the lat/long coordinates may be slightly off; nonetheless, we’ll make them optional inputs to our search, since they can help bias it to the right vicinity. (The Places API expects a location bias string of the form point:lat,lng, with latitude first.)

Let’s create a few helper functions for us to be able to look up a location’s Google Place ID given its name, address, city, and optionally coordinates. We’ll take the ID of the top result. This ID is what will help us map a location to its details.

If we quickly scan the USDA dataset, we’ll notice that locations of large chains often have digits or alphanumeric codes at the end of their names. These look like unique store identifiers that each retailer defines. In our case, these codes are useless, and in fact they can throw off our search since they’re not used in Google Maps. To improve our search result success, we’ll parse the site_name to remove these troublesome codes.

# Improves search result success by removing words at the end of the site name
# that contain digits. Looking at the data, 7-Eleven produces inconsistencies,
# so we add a special case to keep it.
import re

def trim_suffix(site_name):
    if not site_name:
        return site_name
    # Replace non-word characters with spaces and split into tokens.
    clean_name = re.sub(pattern = r"\W", repl = " ", string = site_name.lower())
    bits = clean_name.split()
    # Drop trailing tokens that aren't purely alphabetic (e.g. store numbers),
    # guarding against emptying the list entirely.
    while bits and not bits[-1].isalpha() and "eleven" not in bits[-1]:
        del bits[-1]
    return " ".join(bits)

# Returns the place id of the closest match based on the input columns.
# Returns None if there isn't a match.
def get_place_id(site_name, address, city, lon = None, lat = None):
    search_string = " ".join([trim_suffix(site_name), address, city])

    params = {
        "input": search_string,
        "input_type": "textquery",
    }
    # Bias results toward the given coordinates when we have them.
    # Note the Places API expects latitude first: "point:lat,lng".
    if lon is not None and lat is not None:
        params["location_bias"] = "point:" + str(lat) + "," + str(lon)

    candidates = gmaps.find_place(**params).get('candidates')

    if not candidates:
        return None
    return candidates[0].get('place_id')
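As a quick smoke test, we can look up one of our sample rows directly. Place IDs are opaque strings, so the output below is truncated and illustrative:

get_place_id("Safeway 2887", "150 E El Camino Real", "Sunnyvale",
             lon = -122.03108, lat = 37.366657)
## 'ChIJ...'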

Now we’ve got a function to help look up a place’s ID given the information in our USDA dataset. We can use the result to look up each place’s details!

Verify a place’s latitude and longitude

We’ll start by looking up the coordinates of the top match. Comparing these against the USDA values gives us a basis for judging whether the match is likely the same place listed in the USDA dataset. We want to return the latitude and longitude values separately so we can store them in separate columns later on.

# Returns the (lat, lng) coordinates listed for a place id,
# or None if no details are available.
def get_coords(place_id):
    place_details = gmaps.place(
        place_id = place_id,
        fields = ["geometry"]
    ).get("result")
    if not place_details:
        return None
    coords = place_details.get('geometry').get('location')
    return coords.get("lat"), coords.get("lng")

# Given a row of our dataset, looks up the place and returns its coordinates.
# Always returns a pair so the results can be unpacked into two columns.
def lookup_coords(row):
    place_id = get_place_id(
        row["site_name"],
        row["address"],
        row["city"],
        row["long"],
        row["lat"]
    )
    if not place_id:
        return "No place found", "No place found"
    coords = get_coords(place_id)
    if not coords:
        return "No coordinates listed", "No coordinates listed"
    return coords
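As a quick check on the first row, the pair we get back matches the lat_gmaps and lng_gmaps values shown in the tables later in this post:

lookup_coords(df.iloc[0])
## (37.3229, -121.933)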

Look up a place’s zipcode

We also want to verify the location’s zipcode. Let’s define a few more functions as well:

# Returns the zipcode listed for a place id, or None if there isn't one.
def get_zipcode(place_id):
    place_details = gmaps.place(
        place_id = place_id,
        fields = ["address_component"]
    ).get('result')
    if not place_details:
        return None
    address_components = place_details.get('address_components')
    # Not every place lists a postal code, so default to None rather
    # than raising StopIteration.
    zipcode_details = next(
        (item for item in address_components if 'postal_code' in item['types']),
        None
    )
    if not zipcode_details:
        return None
    return zipcode_details.get('long_name')

# Given a row of our dataset, looks up the place and returns its zipcode.
def lookup_zipcode(row):
    place_id = get_place_id(
        row["site_name"],
        row["address"],
        row["city"],
        row["long"],
        row["lat"]
    )
    if not place_id:
        return "No place found"
    zipcode = get_zipcode(place_id)
    if not zipcode:
        return "No zipcode listed"
    return zipcode

Look up a place’s hours

Now, let’s retrieve attributes that weren’t in the original USDA dataset! One of the attributes we care about is a location’s hours. Exploring a sample entry, we notice that the response gives us the hours as a list of strings, one per day. Given that our final goal is to display the text on the map, I think it’s fine to flatten the list into a single string. If we ever need to parse it, it’ll be in a uniform format that should be straightforward to structure again.
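For reference, this is roughly the shape of the opening_hours object in a Place Details result (abridged and illustrative); weekday_text is the piece we keep:

## {'open_now': True,
##  'periods': [...],
##  'weekday_text': ['Monday: 8:00 AM – 8:00 PM',
##                   'Tuesday: 8:00 AM – 8:00 PM',
##                   ...]}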

Let’s define a few functions to get those:

# Retrieves hours given a valid place id, flattened into a single string.
def get_hours(place_id):
    place_details = gmaps.place(
        place_id = place_id,
        fields = ['opening_hours']
    ).get('result')
    if not place_details:
        return None
    # Places without listed hours have no opening_hours field at all.
    opening_hours = place_details.get('opening_hours')
    if not opening_hours:
        return None
    return '\n'.join(opening_hours.get('weekday_text'))

# Given a row of our dataset, tries to retrieve the opening hours
# listed on Google Maps.
def lookup_hours(row):
    place_id = get_place_id(
        row["site_name"],
        row["address"],
        row["city"],
        row["long"],
        row["lat"]
    )
    if not place_id:
        return "No place found"
    hours = get_hours(place_id)
    if not hours:
        return "No hours listed"
    return hours

Look up a place’s phone number, website, and business status

There are also some other attributes we care about: phone number, website, and business status. These have simpler formatting (i.e. less nesting), so we can generalize their retrieval into one core function:

# Returns a top-level attribute for a place id, or None if it isn't listed.
def get_attribute(place_id, attribute):
    place_details = gmaps.place(
        place_id = place_id,
        fields = [attribute]
    ).get('result')
    if not place_details:
        return None
    return place_details.get(attribute)

# Given a row of our dataset, looks up the place and returns the attribute.
def lookup_attribute(row, attribute):
    place_id = get_place_id(
        row["site_name"],
        row["address"],
        row["city"],
        row["long"],
        row["lat"]
    )
    if not place_id:
        return "No place found"
    result = get_attribute(place_id, attribute)
    if not result:
        return "No " + attribute + " listed"
    return result

def lookup_website(row):
    return lookup_attribute(row, "website")

def lookup_phone(row):
    return lookup_attribute(row, "formatted_phone_number")

def lookup_status(row):
    return lookup_attribute(row, "business_status")

Apply to our dataset

Let’s now apply each of our functions onto all rows to get their attributes!

df = pd.read_csv(INPUT_PATH)

df.head(5)
## | site_name                 | address              | city          | long       | lat       | county      |
## |---------------------------|----------------------|---------------|------------|-----------|-------------|
## | Dollar Tree 5522          | 2222 Business Cir    | San Jose      | -121.93281 | 37.323013 | Santa Clara |
## | Joya Supermarket, Llc.    | 2512 California St   | Mountain View | -122.10679 | 37.403904 | Santa Clara |
## | Safeway 2887              | 150 E El Camino Real | Sunnyvale     | -122.03108 | 37.366657 | Santa Clara |
## | Morgans Hill Market       | 16935 Monterey St    | Morgan Hill   | -121.64956 | 37.124382 | Santa Clara |
## | Bakery/Panaderia La Mejor | 1239 E Julian St     | San Jose      | -121.87022 | 37.351727 | Santa Clara |

Let’s start by saving the coordinates as separate lat_gmaps and lng_gmaps columns. Since lookup_coords returns a (lat, lng) pair for each row, zip(*...) transposes the results into two sequences we can assign to the two columns:

df["lat_gmaps"], df["lng_gmaps"] = zip(*df.apply(func = lookup_coords, axis = 1))

df.head(5)
## | site_name                 | address              | ... | lat_gmaps | lng_gmaps |
## |---------------------------|----------------------|-----|-----------|-----------|
## | Dollar Tree 5522          | 2222 Business Cir    | ... | 37.3229   | -121.933  |
## | Joya Supermarket, Llc.    | 2512 California St   | ... | 37.404    | -122.106  |
## | Safeway 2887              | 150 E El Camino Real | ... | 37.3647   | -122.031  |
## | Morgans Hill Market       | 16935 Monterey St    | ... | 37.1242   | -121.65   |
## | Bakery/Panaderia La Mejor | 1239 E Julian St     | ... | 37.3519   | -121.871  |

We’ll also get the zipcodes for each of these places:

df['zipcode'] = df.apply(func = lookup_zipcode, axis = 1)

df.head(5)
## | site_name                 | address              | ... | lat_gmaps | lng_gmaps | zipcode |
## |---------------------------|----------------------|-----|-----------|-----------|---------|
## | Dollar Tree 5522          | 2222 Business Cir    | ... | 37.3229   | -121.933  | 95128   |
## | Joya Supermarket, Llc.    | 2512 California St   | ... | 37.404    | -122.106  | 94040   |
## | Safeway 2887              | 150 E El Camino Real | ... | 37.3647   | -122.031  | 94087   |
## | Morgans Hill Market       | 16935 Monterey St    | ... | 37.1242   | -121.65   | 95037   |
## | Bakery/Panaderia La Mejor | 1239 E Julian St     | ... | 37.3519   | -121.871  | 95116   |

Now let’s retrieve their hours:

df['hours'] = df.apply(func = lookup_hours, axis = 1)

df.head(5)
## | site_name                 | address              | ... | zipcode | hours                                             |
## |---------------------------|----------------------|-----|---------|---------------------------------------------------|
## | Dollar Tree 5522          | 2222 Business Cir    | ... | 95128   | Monday: 8:00 AM – 8:00 PM\nTuesday: 8:00 AM – ... |
## | Joya Supermarket, Llc.    | 2512 California St   | ... | 94040   | Monday: 7:00 AM – 9:30 PM\nTuesday: 7:00 AM – ... |
## | Safeway 2887              | 150 E El Camino Real | ... | 94087   | Monday: 5:00 AM – 11:00 PM\nTuesday: 5:00 AM –... |
## | Morgans Hill Market       | 16935 Monterey St    | ... | 95037   | No hours listed                                   |
## | Bakery/Panaderia La Mejor | 1239 E Julian St     | ... | 95116   | Monday: 5:00 AM – 10:00 PM\nTuesday: 5:00 AM –... |

Let’s get the phone numbers:

df['phone'] = df.apply(func = lookup_phone, axis = 1)

df.head(5)
## | site_name                 | ... | hours                                             | phone                            |
## |---------------------------|-----|---------------------------------------------------|----------------------------------|
## | Dollar Tree 5522          | ... | Monday: 8:00 AM – 8:00 PM\nTuesday: 8:00 AM – ... | (669) 212-4078                   |
## | Joya Supermarket, Llc.    | ... | Monday: 7:00 AM – 9:30 PM\nTuesday: 7:00 AM – ... | (650) 935-2235                   |
## | Safeway 2887              | ... | Monday: 5:00 AM – 11:00 PM\nTuesday: 5:00 AM –... | (408) 732-1259                   |
## | Morgans Hill Market       | ... | No hours listed                                   | No formatted_phone_number listed |
## | Bakery/Panaderia La Mejor | ... | Monday: 5:00 AM – 10:00 PM\nTuesday: 5:00 AM –... | (408) 287-2447                   |

And the listed website:

df['website'] = df.apply(func = lookup_website, axis = 1)

df.head(5)
## | site_name                 | ... | phone                            | website                                           |
## |---------------------------|-----|----------------------------------|---------------------------------------------------|
## | Dollar Tree 5522          | ... | (669) 212-4078                   | https://www.dollartree.com/locations/ca/san-jo... |
## | Joya Supermarket, Llc.    | ... | (650) 935-2235                   | http://www.joyasupermarketca.com/                 |
## | Safeway 2887              | ... | (408) 732-1259                   | https://local.safeway.com/safeway/ca/sunnyvale... |
## | Morgans Hill Market       | ... | No formatted_phone_number listed | No website listed                                 |
## | Bakery/Panaderia La Mejor | ... | (408) 287-2447                   | No website listed                                 |

And finally the business’s status – whether it’s operational, temporarily closed, or permanently closed:

df['status'] = df.apply(func = lookup_status, axis = 1)

df.head(5)
## | site_name                 | ... | website                                           | status      |
## |---------------------------|-----|---------------------------------------------------|-------------|
## | Dollar Tree 5522          | ... | https://www.dollartree.com/locations/ca/san-jo... | OPERATIONAL |
## | Joya Supermarket, Llc.    | ... | http://www.joyasupermarketca.com/                 | OPERATIONAL |
## | Safeway 2887              | ... | https://local.safeway.com/safeway/ca/sunnyvale... | OPERATIONAL |
## | Morgans Hill Market       | ... | No website listed                                 | OPERATIONAL |
## | Bakery/Panaderia La Mejor | ... | No website listed                                 | OPERATIONAL |

Now let’s write this out to a csv to save the results!

df.to_csv(OUTPUT_PATH, index = False)  # index = False keeps the row index out of the csv

Next steps

Reduce redundant passes over the data

You may have noticed that each time we wanted to add an attribute (i.e. a column), we had to look up each place’s ID again. That means we made far more calls to the web service than we needed to, which costs more money. Luckily, our dataset is fairly small, so it’s not terribly important here.

We could instead restructure the script so that, for each location, we pull all the desired attributes at once. That way, we’d make one details request per place instead of n requests for n attributes. So why haven’t we done that here?

In this case, I’ve made a tradeoff to make it easier to tinker with which attributes we want. This piecemeal approach makes each attribute easy to test and debug while we’re still experimenting. Once we’re ready to productionize and put this in a pipeline that automatically runs and checks for updated information, restructuring will be the next thing to do!
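For reference, here’s a minimal sketch of what that one-pass version could look like, reusing get_place_id from above. This is an untested sketch under the same field names, not the code we ran:

# Hypothetical one-pass lookup: one details request per place,
# covering all the fields we pulled individually above.
DETAIL_FIELDS = ["geometry", "address_component", "opening_hours",
                 "formatted_phone_number", "website", "business_status"]

def lookup_all(row):
    place_id = get_place_id(
        row["site_name"], row["address"], row["city"], row["long"], row["lat"]
    )
    if not place_id:
        return pd.Series({"status": "No place found"})
    result = gmaps.place(place_id = place_id, fields = DETAIL_FIELDS).get("result") or {}
    coords = result.get("geometry", {}).get("location", {})
    hours = result.get("opening_hours", {}).get("weekday_text")
    components = result.get("address_components", [])
    zipcode = next((c["long_name"] for c in components
                    if "postal_code" in c["types"]), None)
    return pd.Series({
        "lat_gmaps": coords.get("lat"),
        "lng_gmaps": coords.get("lng"),
        "zipcode": zipcode,
        "hours": "\n".join(hours) if hours else None,
        "phone": result.get("formatted_phone_number"),
        "website": result.get("website"),
        "status": result.get("business_status"),
    })

# One pass over the dataset, joining the new columns onto the original:
# df = df.join(df.apply(lookup_all, axis = 1))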

Read and write directly from Google Sheets

Another next step will be to read from and write to Google Sheets directly, which is the interface our volunteers use to make changes. This would reduce our need to work with the data locally as csv files. Still, we’re at the stage of figuring out how best to manage data versions and establish a reconciliation process so we don’t inadvertently overwrite correct data. Once we’ve got a workflow planned, that’ll be a great way to extend this work!
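For example, one possible approach uses the gspread library; the sheet and credential file names below are hypothetical:

import gspread
import pandas as pd

# Authenticate with a service account and open the volunteers' sheet
# (both names here are placeholders).
gc = gspread.service_account(filename = "credentials.json")
ws = gc.open("Bay Area Resources").sheet1

# Read the sheet into a DataFrame for the lookups above...
df = pd.DataFrame(ws.get_all_records())

# ...and, once we've reconciled changes, write the updated table back.
ws.update([df.columns.values.tolist()] + df.values.tolist())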


I’ve found working with Google Maps APIs to be a ton of fun and full of rich data to play with. I hope this helps you also get started with making the most of this data for your own projects!