Objective: Send a list of addresses to an API and extract certain information (e.g. a flag indicating whether an address is in a flood zone).
Solution: A working Python script for small data.
Problem: I want to optimize my current solution for large input. How can I improve the performance of the API calls? If I have 100,000 addresses, will my current solution fail? Will this slow down the HTTP calls? Will I get a request timeout? Can the API withstand the number of calls being made?
Sample input
777 Brockton Avenue, Abington MA 2351
30 Memorial Drive, Avon MA 2322
My current solution works well for a small dataset.
# Function to geocode an address and look up its FEMA flood zone
def zonedetect(addrs):
    global geolocate
    geocode_result = geocode(address=addrs, as_featureset=True)
    # In the ArcGIS API, geometry.x is the longitude and geometry.y is the latitude
    longitude = geocode_result.features[0].geometry.x
    latitude = geocode_result.features[0].geometry.y
    # The NFHL query expects the point as "x,y", i.e. longitude first
    url = ("https://hazards.fema.gov/gis/nfhl/rest/services/public/NFHL/MapServer/28/query"
           "?where=1%3D1&geometry=" + str(longitude) + "%2C" + str(latitude) +
           "&geometryType=esriGeometryPoint&inSR=4326"
           "&spatialRel=esriSpatialRelIntersects&outFields=*"
           "&returnGeometry=true&returnTrueCurves=false"
           "&returnIdsOnly=false&returnCountOnly=false"
           "&returnZ=false&returnM=false&returnDistinctValues=false"
           "&returnExtentsOnly=false&f=json")
    response = req.get(url)
    # Only parse the response if the request succeeded
    if response.status_code == 200:
        parsed_data = json.loads(response.text)
        formatted_data = json_normalize(parsed_data["features"])
        formatted_data["Address_1"] = addrs
        geolocate = geolocate.append(formatted_data, ignore_index=True)
    else:
        print("Request for {} failed".format(addrs))

# Reading every address from the existing dataframe
for i in range(len(df.index)):
    zonedetect(df["Address"][i])
Instead of using the for loop above, is there an alternative? Can I process this logic in a batch?
Sending 100,000 requests to the hazards.fema.gov server will put some load on their side, but the main cost falls on your script: each HTTP request must be sent and its response awaited one at a time, which could take an extremely long time to process.
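To see why, here is a back-of-the-envelope estimate. The 200 ms per request figure is an assumption for illustration, not a measured value for hazards.fema.gov:

```python
# Rough estimate of total wall-clock time for sequential requests.
requests_count = 100_000
seconds_per_request = 0.2  # assumed average round-trip time (not measured)

total_hours = requests_count * seconds_per_request / 3600
# roughly 5.6 hours of pure waiting if requests are made one at a time
```

Even if the real latency is half that, you are still looking at hours of runtime before considering geocoding time or rate limiting.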
What would be better is to send one REST query for everything you need and then handle the logic afterwards. Looking at the REST API docs, the geometry URL parameter accepts an esriGeometryMultipoint. Here is an example of a multipoint:
{
  "points": [[-97.06138,32.837],[-97.06133,32.836],[-97.06124,32.834],[-97.06127,32.832]],
  "spatialReference": {"wkid": 4326}
}
So what you can do is make an object to store all the points you want to query:
multipoint = {"points": [], "spatialReference": {"wkid": 4326}}
And when you loop, append the lat/long point to the multipoint list:
for i in range(len(df.index)):
    address = df["Address"][i]
    geocode_result = geocode(address=address, as_featureset=True)
    # geometry.x is the longitude, geometry.y is the latitude
    longitude = geocode_result.features[0].geometry.x
    latitude = geocode_result.features[0].geometry.y
    # Esri points are [x, y], i.e. [longitude, latitude]
    multipoint["points"].append([longitude, latitude])
Then you can set the multipoint as the geometry in your query, which results in just one API request instead of one per point.
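To make that concrete, here is a minimal sketch of building the batched query. The parameter names come from the ArcGIS REST API query operation, but the helper function name and the choice to pass the geometry as JSON in a params dict are illustrative assumptions, not code from the original post:

```python
import json

# Endpoint from the question (layer 28 of the NFHL MapServer)
NFHL_URL = "https://hazards.fema.gov/gis/nfhl/rest/services/public/NFHL/MapServer/28/query"

def build_query_params(multipoint):
    """Build query-string parameters for one batched NFHL request."""
    return {
        "where": "1=1",
        "geometry": json.dumps(multipoint),
        "geometryType": "esriGeometryMultipoint",
        "inSR": "4326",
        "spatialRel": "esriSpatialRelIntersects",
        "outFields": "*",
        "returnGeometry": "true",
        "f": "json",
    }

multipoint = {"points": [[-97.06138, 32.837], [-97.06133, 32.836]],
              "spatialReference": {"wkid": 4326}}
params = build_query_params(multipoint)

# With the `req` alias from the question, the single request would be:
# response = req.get(NFHL_URL, params=params)
# features = response.json()["features"]
```

Note that for 100,000 points you would likely still need to split the multipoint into several chunks (and use a POST request rather than GET), since query strings have length limits — but a handful of batched requests is still far cheaper than 100,000 individual ones.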