I am curious as to how the Google geocoder works.
I have been studying some implementations of open source geocoders like geocommons' geocoder or PostGIS's new Tiger Geocoder. This is roughly what I know so far (to hopefully prove that I have been doing my homework) :
I realize that at the core of the open source geocoders, there are three main elements.
1.- An address normalizer that takes an arbitrary string and normalizes it (taking the example from here):
normalize_address('address string');
e.g.: SELECT naddy.* FROM normalize_address('29645 7th Street SW Federal Way 98023') AS naddy;
address | predirabbrev | streetname | streettypeabbrev | postdirabbrev | internal | location | stateabbrev | zip | parsed
---------+-------------+-----------------------+------------------+---------------+----------+----------+-------------+-------+--------
29645 | | 7th Street SW Federal | Way | | | | | 98023 |
and:
2.- A geocoder that does some magical fuzzy matching for names where the core algorithm is the Levenshtein Distance.
A good example is the one from the Wikipedia article where it calculates the Levenshtein distance between the words kitten and sitting (the distance is 3 since that is the number of edits required to change one string into the other):
kitten → sitten (substitution of 's' for 'k')
sitten → sittin (substitution of 'i' for 'e')
sittin → sitting (insertion of 'g' at the end).
3.- Some interpolation of the street segments at the end to guess where the house is. I downloaded a chunk of the free Census Tiger street dataset to create this example.
In the example above, the street segment of interest (Schaeffer Hills Dr) has a from node that starts at 300 (so 300 Schaeffer Hills Dr) and a to node that ends on 400 (400 Schaeffer Hills Drv). If I matched to this Schaeffer Hills Drv, and the request was for street 310, then the algorithm would just interpolate to it (traverse 10% of it) to where my green arrow is.
This is what the Open Source geocoder tools do. Nevertheless, Google is clearly smarter than that and uses all kinds of non-traditional hints.
How so?
For example, I can type 680 Mission st (no city, state, county, anything at all). Most of the standard address normalizers would blow up because they would find too many matches. But since I am in SF, I am guessing google uses my ip to get some geoip-like information, does some expanding bounding as a hint with some fuzzy search, and right away finds the closest segment that matches and tells me that's my answer (which is correct!).
I am looking for answers that can shed some more light into how the Google geocoder works besides the techniques that I described above.
Update:
OK, so far we have two kinds of hints listed
Geocoding is the process of transforming a description of a location—such as a pair of coordinates, an address, or a name of a place—to a location on the earth's surface. You can geocode by entering one location description at a time or by providing many of them at once in a table.
The Geocoding API uses a pay-as-you-go pricing model. Geocoding API requests generate calls to one of two SKUs depending on the type of request: basic or advanced. Along with the overall Google Terms of Use, there are usage limits specific to the Geocoding API.
Geocoder class is used to handle geocoding and reverse geocoding. Geocoding is a process in which street address is converted into a coordinate (latitude,longitude). Reverse geocoding is a process in which a coordinate (latitude,longitude) is converted into an address.
Google Maps Geocoding Rate Limit According to Google's API Usage and Billing page, there is a limit of 50 requests per second. If you exceed this amount, subsequent requests to the API will fail with the OVER_DAILY_LIMIT or OVER_QUERY_LIMIT status code and the geocoded data will not be returned.
One of the things you can find by poking at the black box is that the Google geocoder isn't totally sensitive to the order of the tokens (there's no enforced street/city/state/country expectation, though it does better when you do follow that). Which says to me that they might be dumping everything into some kind of full text search and then seeing what comes back. Or perhaps not. Try searching "sault saint marie adams 200" and "sault saint marie 200 adams".
With respect to your Mission example, that's a great one, as you can see the map hint coming into play directly:
Query with map window over Europe: European results
Query with map window over North America: American results
There is another source of data: county property maps. These don't just include roads, but also property lines (and their street addresses). You can often see this on Google's map, it will actually show faint lines that separate adjacent properties. Sometimes they even outline buildings (county maps often include these too).
You can also do the reverse lookup, given your GPS coordinates finding your exact address can be as simple as a 2D query to find which property polygon you're in. I've seen this work properly when I was physically far from the road but still inside the property and it returned the correct street address despite the handset being closer to another street.
Note that these maps tend to be public and some counties even have their own online interface. You can even look up who owns a particular plot.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With