I am trying to create a Python script which will take an address as input and will spit out its latitude and longitude, or latitudes and longitudes in case of multiple matches, quite like Nominatim.
So, the possible input and outputs could be:-
In 6 above, New York was returned since no place was found with address 103 Alkazam, New York, USA
, but it could at least find New York, USA
.
Initially I thought of building a tree representing the hierarchy relation where siblings are sorted alphabetically. It could have been like:-
GLOBAL
|
---------------------------------------------
| | ...
USA
---------------
| | ...
CALIFORNIA NEW YORK
| |
----------- -------------
| |.. | |....
PEARL STREET PEARL STREET
But the problem was user can provide incomplete address as in 2, 4 and 5.
So, I next thought of using a search tree and store the fully qualified address in each node. But this too is quite bad since:-
I have one additional requirement. I need to detect misspellings. I guess that will have to be dealt as a separate problem and can treat each node as generic strings.
Update 1
A little elaboration. The input would be a list, where the item on lower index is parent of the item in higher index; and they of course may or may not be immediate parent or child. So for query 1, input would be ["USA", "NEW YORK"]
. So, it is perfectly fine that USA, New York
returns no result.
The user should be able to locate a building if he has the address and our data is so detailed.
Update 2 (Omission Case)
If user queries Pearl Street, USA
, so our algo should be able to locate the address since it knows Pearl Street
has New York
as parent and USA
is its parent.
Update 3 (Surplus Case)
Suppose the user queries for 101 C, Alley A, Pearl Street, New York
. Also suppose our data does know of 101 C
but not about Alley A
. According to it 101 C
is a immediate child of Pearl Street
. Even in this case it should be able to locate the address.
Geocodes are Latitude & longitude Coordinates of a street address, which can be used to place markers on a map.
You can geocode by entering one location description at a time or by providing many of them at once in a table. The resulting locations are output as geographic features with attributes, which can be used for mapping or spatial analysis. You can quickly find various kinds of locations through geocoding.
Thanks to all for their answers, their answers were helpful, but did not address everything I needed. I finally found an approach which took care of all my cases. The approach is the modified version of what I suggested in the question.
Here I will refer to something called 'node', it is a class object which will contain the geo information like a place entity's latitude, longitude, maybe dimension too, and its full address.
If the address of the entity is '101 C, Pearl Street, New York, USA', then this means our data structure will have at least four nodes for - '101 C', 'Pearl Street', 'New York' and 'USA'. Each node will have a name
and one address
part. For '101 C', name
will be '101 C' and address will be 'Pearl Street, New York, USA'.
The basic idea is to have a search tree of these nodes, where node name
will be used as the key for the search. We may get multiple matches, so later we need to rank the results on how good the node's address
match with the queried one.
EARTH
|
---------------------------------------------
| |
USA INDIA
| |
--------------------------- WEST BENGAL
| | |
NEW YORK CALIFORNIA KOLKATA
| | |
--------------- PEARL STREET BARA BAZAR
| | |
PEARL STREET TIME SQUARE 101 C
| |
101 C 101 C
Suppose we have a geographic data as above. So, a search for '101 C, NEW YORK' will not only return the '101 C' nodes in 'NEW YORK' but also the one in 'INDIA'. This is because the algo will only use the name
, i.e. '101 C' here, for searching the nodes. Later we can grade the the quality of the result by measuring the difference of the node's address
from the queried address. We are not using an exact match since the user is allowed to provide incomplete address, as in this case.
To grade the quality of the result we can use Longest Common Subsequence. The 'Omission' and 'Surplus' cases are well taken care of in this algo.
It is best if I let the code do the talking. Below is a Python implementation tailored for this purpose.
def _lcs_diff_cent(s1, s2):
"""
Calculates Longest Common Subsequence Count Difference in percentage between two strings or lists.
LCS reference: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem.
Returns an integer from 0-100. 0 means that `s1` and `s2` have 0% difference, i.e. they are same.
"""
m = len(s1)
n = len(s2)
if s1 == s2:
return 0
if m == 0: # When user given query is empty then that is like '*'' (match all)
return 0
if n == 0:
return 100
matrix = [[0] * (n + 1)] * (m + 1)
for i in range(1, m+1):
for j in range(1, n+1):
if s1[i-1] == s2[j-1]:
matrix[i][j] = matrix[i-1][j-1] + 1
else:
matrix[i][j] = max(matrix[i][j-1], matrix[i-1][j])
return int( ( 1 - float(matrix[m][n]) / m ) * 100 )
I bailed out on the above (basic) approach since it forced redundancy, and it could not cut leverage the fact that if user has provided 'USA' in his query then we need not look into nodes in 'INDIA'.
This optimized approach addresses both the above problems to quite an extent. The solution is not to have one big search tree. We can partition the search space into say 'USA' and 'INDIA'. Later we can further repartition those search spaces state-wise. This is what I call 'slicing'.
In the below diagram - SearchSlice
represents a 'slice', and SearchPool
represents a search tree.
SearchSlice()
|
---------------------------------------------
| |
SearchSlice(USA) SearchSlice(INDIA)
| |
--------------------------- SearchPool(WEST BENGAL)
| | |
SearchPool(NEW YORK) SearchPool(CALIFORNIA) |- KOLKATA
| | |- BARA BAZAR, KOLKATA
|- PEARL STREET |- PEARL STREET |- 101 C, BARA BAZAR, KOLKATA
|- TIME SQUARE
|- 101 C, PEARL STREET
|- 101 C, TIME SQUARE
Few key points to notice above...
SearchSlice(USA)
maintains a slice of states in 'USA'. So, nodes under 'NEW YORK' does not include the name 'NEW YORK' or 'USA' in their address
. Same goes for other regions too. The hierarchy relation implicitly defines the full address.address
includes its parent's name
too since they are not sliced.Where there is a bucket (pool), there is an implicit scaling possibility. We (say) divide geo data for 'USA' into two groups. Both can be on different systems. So, it is perfectly fine if 'NEW YORk' pool is on System A but 'CALIFORNIA' pool is on System B, since they do not share any data, except for the parents of course.
Here is the caveat. We need to duplicate the parents which will always be a slice. Since slices are meant to be limited in number so the hierarchy won't be too deep, so it should not be too redundant to duplicate them.
Please refer to my GitHub for a working demo Python code.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With