I am still at a very early stage of learning Python. Apologies in advance if this question sounds stupid.
I have a set of data (in table format) that I want to add a few calculated columns to. Basically, I have a location lon/lat and a destination lon/lat, with their respective date-times, and I'm calculating the average velocity between each pair.
Sample data looks like this:
print(data_all.head(3))
   id    lon_evnt   lat_evnt          event_time  \
0   1 -179.942833  41.012467 2017-12-13 21:17:54
1   2 -177.552817  41.416400 2017-12-14 03:16:00
2   3 -175.096567  41.403650 2017-12-14 09:14:06

  dest_data_generate_time   lat_dest    lon_dest  \
0 2017-12-13 22:33:37.980  37.798599 -121.292193
1 2017-12-14 04:33:44.393  37.798599 -121.292193
2 2017-12-14 10:33:51.629  37.798599 -121.292193

                             address_fields_dest  \
0  {'address': 'Nestle Way', 'city': 'Lathrop...
1  {'address': 'Nestle Way', 'city': 'Lathrop...
2  {'address': 'Nestle Way', 'city': 'Lathrop...
I then zipped the lon/lat together:
data_all['ping_location'] = list(zip(data_all.lon_evnt, data_all.lat_evnt))
data_all['destination'] = list(zip(data_all.lon_dest, data_all.lat_dest))
Then I want to calculate the distance between each pair of location pings, grab some address info from a string (basically taking a substring), and then calculate the velocity:
for idx, row in data_all.iterrows():
    dist = gcd.dist(row['destination'], row['ping_location'])
    data_all.loc[idx, 'gc_distance'] = dist

    temp_idx = str(row['address_fields_dest']).find(":")
    pos_start = temp_idx + 3
    pos_end = str(row['address_fields_dest']).find(",") - 2
    data_all.loc[idx, 'destination address'] = str(row['address_fields_dest'])[pos_start:pos_end]

    ##### calculate velocity which is: v = d/t
    ## time is the difference btwn destination time and the ping creation time
    timediff = abs(row['dest_data_generate_time'] - row['event_time'])
    data_all.loc[idx, 'velocity km/hr'] = 0

    ## check if the time dif btwn destination and event ping is more than a minute long
    if timediff > datetime.timedelta(minutes=1):
        data_all.loc[idx, 'velocity km/hr'] = dist / timediff.total_seconds() * 3600.0
OK, now this program took almost 7 hours to run on 333k rows of data! :( I have a Windows 10 machine with 2 cores and 16 GB of RAM, which is not much, but 7 hours is definitely not OK :(
How can I make the program run more efficiently? One idea is that, since the rows and their calculations are independent of each other, I could take advantage of parallel processing.
I've read many posts, but most of the parallel processing methods presented seem to apply only when using one simple function, whereas here I'm adding multiple new columns.
Any help is really appreciated! Or tell me if it is impossible to make pandas do parallel processing (I believe I've read that somewhere, but I'm not sure whether it's still 100% true).
Sample posts read into:
Large Pandas Dataframe parallel processing
python pandas dataframe to dictionary
How do I parallelize a simple Python loop?
How to do parallel programming in Python
and a lot more that are not on stackoverflow....
https://medium.com/@ageitgey/quick-tip-speed-up-your-python-data-processing-scripts-with-process-pools-cf275350163a
https://homes.cs.washington.edu/~jmschr/lectures/Parallel_Processing_in_Python.html
Here is a quick solution - I didn't try to optimize your code at all, just fed it into a multiprocessing pool. This will run your function on each row individually, return a row with the new properties, and create a new dataframe from this output.
import datetime
import multiprocessing as mp
import pandas as pd

# gcd and data_all are the module and DataFrame from the question

def func(arg):
    idx, row = arg
    dist = gcd.dist(row['destination'], row['ping_location'])
    row['gc_distance'] = dist

    addr = str(row['address_fields_dest'])
    pos_start = addr.find(":") + 3
    pos_end = addr.find(",") - 2
    row['destination address'] = addr[pos_start:pos_end]

    # calculate velocity: v = d/t, where t is the difference between
    # the destination time and the ping creation time
    timediff = abs(row['dest_data_generate_time'] - row['event_time'])
    row['velocity km/hr'] = 0

    # only compute a velocity if the time difference is more than a minute
    if timediff > datetime.timedelta(minutes=1):
        row['velocity km/hr'] = dist / timediff.total_seconds() * 3600.0
    return row

# The guard is needed on Windows, where each worker re-imports this module.
if __name__ == '__main__':
    # create the pool after func is defined so the workers can find it
    with mp.Pool(processes=mp.cpu_count()) as pool:
        new_rows = pool.map(func, list(data_all.iterrows()))
    # each result is a Series; build a DataFrame with one row per Series
    data_all_new = pd.DataFrame(new_rows)