I have a "seed" GeoDataFrame (GDF)(RED) which contains a 0.5 arc minutes global grid ((180*2)*(360*2) = 259200). Each cell contains an absolute population estimate. In addition, I have a "leech" GDF (GREEN) with roughly 8250 adjoining non-regular shapes of various sizes (watersheds).
I wrote a script to allocate the population estimates to the geometries in the leech GDF based on the overlapping area between the grid cells (seed GDF) and the geometries in the leech GDF. The script works perfectly fine on my sample data (see below). However, once I run it on my actual data, it is very slow. I ran it overnight and by the next morning only 27% of the calculations had been performed. I will have to run this script many times, and waiting two days each time is simply not an option.
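Concretely, the allocation rule my script implements is: each leech shape j receives

leeched_value[j] = Σ_i seed_value[i] * area(cell_i ∩ shape_j) / Σ_k area(cell_i ∩ shape_k)

i.e. every grid cell's value is split across the watersheds it touches, in proportion to the overlap areas (the denominator is the cell's total overlap with all watersheds, so a cell that only partly falls inside any watershed still hands over its full value).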
After a bit of research, I replaced my plain Python for loops with for index, row in df.iterrows() (or is this effectively the same as a "conventional" Python for loop?), but it didn't bring the performance improvement I had hoped for.
Any suggestions on how I can speed up my code? In twelve hours, my script processed only ~30,000 out of ~200,000 rows.
My expected output is the column leech_df['leeched_value'].
import geopandas as gpd
import time
from datetime import datetime
from shapely.geometry import Polygon
# =============================================================================
# Geometries for testing
# =============================================================================
polys1 = gpd.GeoSeries([Polygon([(0.00,0.00), (0.00,0.25), (0.25,0.25), (0.25,0.00)]),
                        Polygon([(0.00,0.25), (0.00,0.50), (0.25,0.50), (0.25,0.25)]),
                        Polygon([(0.00,0.50), (0.00,0.75), (0.25,0.75), (0.25,0.50)]),
                        Polygon([(0.25,0.00), (0.25,0.25), (0.50,0.25), (0.50,0.00)]),
                        Polygon([(0.25,0.25), (0.25,0.50), (0.50,0.50), (0.50,0.25)]),
                        Polygon([(0.25,0.50), (0.25,0.75), (0.50,0.75), (0.50,0.50)]),
                        Polygon([(0.50,0.00), (0.50,0.25), (0.75,0.25), (0.75,0.00)]),
                        Polygon([(0.50,0.25), (0.50,0.50), (0.75,0.50), (0.75,0.25)]),
                        Polygon([(0.50,0.50), (0.50,0.75), (0.75,0.75), (0.75,0.50)]),
                        ])
polys2 = gpd.GeoSeries([Polygon([(0.125,0.125), (0.125,0.375), (0.375,0.375), (0.375,0.125)]),
                        Polygon([(0.050,0.550), (0.050,0.700), (0.200,0.700), (0.200,0.550)]),
                        Polygon([(0.25,0.625), (0.25,0.375), (0.750,0.375), (0.750,0.625)]),
                        ])
seed_df = gpd.GeoDataFrame({'geometry': polys1, 'seed_value':[10,10,10,10,10,10,10,10,10]})
leech_df = gpd.GeoDataFrame({'geometry': polys2})
del polys1, polys2
seed_value = 'seed_value'
# =============================================================================
# Prepare DataFrames
# =============================================================================
start = time.time()
print('\n\nPrepare DataFrames ... ')
# Create a new index for the seed DF
# The old index will be copied into the seed DF as a new column
# and transferred into the merged DF
seed_df = seed_df.reset_index()
seed_df = seed_df.rename(columns={'index': 'seed_index'})
# Create a new index for the leech DF
# The old index will be copied into the leech DF as a new column
# and transferred into the merged DF
leech_df = leech_df.reset_index()
leech_df = leech_df.rename(columns={'index': 'leech_index'})
end = time.time()
print(end - start)
# Add the area to the geometries
start = time.time()
print('Calculating the area of the leech DF geometries ...')
leech_df['leech_area'] = leech_df['geometry'].area
print('Calculating the area of the seed DF geometries ...')
seed_df['seed_area'] = seed_df['geometry'].area
leech_df['leeched_value'] = 0.0
end = time.time()
print(end - start)
# =============================================================================
# Merge seed and leech data and count overlaps
# =============================================================================
start = time.time()
print('Merging DataFrames ... ')
merged_df = gpd.overlay(leech_df, seed_df, how='union')
# Drop NaNs
merged_df = merged_df.dropna(axis='rows')
# =============================================================================
# Allocate seed values
# =============================================================================
# Count with how many leech geometries each seed geometry overlaps
print('Count overlaps ... ')
overlaps = merged_df['seed_index'].value_counts()
neglected_values = list()
one_overlaps_values = list()
more_overlaps_values = list()
no_overlaps = list()
one_overlaps = list()
more_overlaps = list()
end = time.time()
print(end - start)
start = time.time()
print('Allocate seed values ... ')
i = 0
for index, row in seed_df.iterrows():  # For each row in seed_df MAX 70123
    if row['seed_index'] % 1000 == 0:
        print(str(row['seed_index'])+'k at '+str(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
    # overlaps.get() returns the overlap count for this seed_index,
    # or 0 whenever the seed_index does not turn up in overlaps at all
    if overlaps.get(row['seed_index'], 0) == 0:  # Seed geometry has no overlap
        no_overlaps.append(1)                     # Count the cells which have not been considered
        neglected_values.append(row[seed_value])  # Track the values which have not been considered
        i = i + 1
    elif overlaps.get(row['seed_index'], 0) == 1:
        one_overlaps.append(1)
        # Which leech geometry does this seed row overlap with?
        temp_int = int(merged_df[merged_df['seed_index'] == row['seed_index']]['leech_index'])
        # For this leech index, add the seed value to the accumulated leeched_value
        seed_value_amount = int(seed_df[seed_df['seed_index'] == row['seed_index']][seed_value])
        leech_df.loc[temp_int, 'leeched_value'] = leech_df.loc[temp_int, 'leeched_value'] + seed_value_amount
        one_overlaps_values.append(row[seed_value])  # Track the values which have been allocated
        i = i + 1
    elif overlaps.get(row['seed_index'], 0) > 1:
        more_overlaps.append(1)
        # Create a DF which contains the overlaps of this seed geometry with the geometries of the leech DF
        temp_df = merged_df[merged_df['seed_index'] == row['seed_index']]
        # Calculate the overlapping area
        temp_df['overlay_area'] = temp_df['geometry'].area
        # And compare it to the total overlapping area of this grid cell
        temp_df['rel_contribution'] = 0.0
        temp_df['rel_contribution'] = temp_df['overlay_area']/sum(temp_df.area)
        # Calculate the absolute contribution of the seed value to each leech row ('leech_index')
        temp_df['abs_contribution'] = temp_df[seed_value]*temp_df['rel_contribution']
        # Reset index
        temp_df = temp_df.reset_index(drop=True)
        more_overlaps_values.append(row[seed_value])  # Track the values which have been allocated
        i = i + 1
        # For each overlap between grid cell (seed) and leech geometry:
        for j in range(len(leech_df)):
            # Keep only the temp_df rows for this specific leech geometry
            contribution = temp_df[temp_df['leech_index'] == j]
            # Add the contribution to the leech_df
            leech_df.loc[j, 'leeched_value'] = leech_df.loc[j, 'leeched_value'] + sum(contribution['abs_contribution'])
end = time.time()
print(end - start)
print('Done')
# =============================================================================
# Data check
# =============================================================================
print('>1 Overlaps: ' + str(sum(more_overlaps)) + ' (sum allocated values ' + str(sum(more_overlaps_values)) + ')')
print('=1 Overlaps: ' + str(sum(one_overlaps)) + ' (sum allocated values ' + str(sum(one_overlaps_values)) + ')')
print('No Overlaps: ' + str(sum(no_overlaps)) + ' (sum neglected values ' + str(sum(neglected_values)) + ')')
print('Unconsidered: ' + str(sum(seed_df[seed_value]) - (sum(more_overlaps_values) + sum(one_overlaps_values) + sum(neglected_values))))
# =============================================================================
# Plot
# =============================================================================
base = seed_df.plot(color='red', edgecolor='black');
leech_df.plot(ax=base, color='green', alpha=0.5);
EDIT: pip freeze > requirements_output.txt returns:
affine==2.3.0
alabaster==0.7.12
appdirs==1.4.3
argh==0.26.2
asn1crypto==1.3.0
astroid==2.3.3
atomicwrites==1.3.0
attrs==19.3.0
autopep8==1.4.4
Babel==2.8.0
backcall==0.1.0
bcrypt==3.1.7
bleach==3.1.0
cachetools==4.0.0
Cartopy==0.17.0
certifi==2019.11.28
cffi==1.13.2
cftime==1.0.4.2
chardet==3.0.4
Click==7.0
click-plugins==1.1.1
cligj==0.5.0
cloudpickle==1.2.2
colorama==0.4.3
country-converter==0.6.6
cryptography==2.8
cycler==0.10.0
dask==2.10.0
datacube==1.7
decorator==4.4.1
defusedxml==0.6.0
Deprecated==1.2.7
descartes==1.1.0
diff-match-patch==20181111
docutils==0.16
entrypoints==0.3
Fiona==1.8.13
flake8==3.7.9
future==0.18.2
GDAL==3.0.2
geocube==0.0.10
geopandas==0.6.2
h5py==2.9.0
idna==2.8
imageio==2.6.1
imagesize==1.2.0
importlib-metadata==1.4.0
intervaltree==3.0.2
ipykernel==5.1.4
ipython==7.11.1
ipython-genutils==0.2.0
isort==4.3.21
jedi==0.14.1
Jinja2==2.10.3
joblib==0.14.1
jsonschema==3.2.0
jupyter-client==5.3.4
jupyter-core==4.6.1
keyring==21.1.0
kiwisolver==1.1.0
lark-parser==0.8.1
lazy-object-proxy==1.4.3
lxml==4.4.2
mapclassify==2.2.0
MarkupSafe==1.1.1
matplotlib==3.1.1
mccabe==0.6.1
mistune==0.8.4
mkl-fft==1.0.15
mkl-random==1.1.0
mkl-service==2.3.0
more-itertools==8.0.2
munch==2.5.0
nbconvert==5.6.1
nbformat==5.0.4
netCDF4==1.5.3
notebook==6.0.0
numpy==1.16.4
numpydoc==0.9.2
olefile==0.46
OWSLib==0.19.1
packaging==20.1
pandas==0.25.0
pandocfilters==1.4.2
paramiko==2.6.0
parso==0.6.0
pathtools==0.1.2
pexpect==4.8.0
pickleshare==0.7.5
Pillow==7.0.0
pluggy==0.13.1
prometheus-client==0.7.1
prompt-toolkit==3.0.3
psutil==5.6.7
psycopg2==2.8.4
pycodestyle==2.5.0
pycparser==2.19
pydocstyle==4.0.1
pyepsg==0.4.0
pyflakes==2.1.1
Pygments==2.5.2
pykdtree==1.3.1
pylint==2.4.4
pymatreader==0.0.20
Pympler==0.7
pymrio==0.4.0
PyNaCl==1.3.0
pyOpenSSL==19.1.0
pyparsing==2.4.6
pyPEG2==2.15.2
pyproj==2.4.2.post1
PyQt5-sip==4.19.19
pyrsistent==0.15.7
pyshp==2.1.0
PySocks==1.7.1
python-dateutil==2.8.1
python-jsonrpc-server==0.3.4
python-language-server==0.31.7
pytz==2019.3
pywin32==227
pywin32-ctypes==0.2.0
pywinpty==0.5.7
PyYAML==5.2
pyzmq==18.1.0
QDarkStyle==2.8
QtAwesome==0.6.1
qtconsole==4.6.0
QtPy==1.9.0
rasterio==1.1.2
requests==2.22.0
rioxarray==0.0.19
rope==0.16.0
Rtree==0.9.3
scikit-learn==0.22.1
scipy==1.3.2
Send2Trash==1.5.0
Shapely==1.7.0
singledispatch==3.4.0.3
six==1.14.0
snowballstemmer==2.0.0
snuggs==1.4.7
sortedcontainers==2.1.0
Sphinx==2.3.1
sphinxcontrib-applehelp==1.0.1
sphinxcontrib-devhelp==1.0.1
sphinxcontrib-htmlhelp==1.0.2
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.2
sphinxcontrib-serializinghtml==1.1.3
spyder==4.0.0
spyder-kernels==1.8.1
spyder-notebook==0.1.4
SQLAlchemy==1.3.13
terminado==0.8.3
testpath==0.4.4
toolz==0.10.0
tornado==6.0.3
tqdm==4.43.0
traitlets==4.3.3
ujson==1.35
urllib3==1.25.8
watchdog==0.9.0
wcwidth==0.1.7
webencodings==0.5.1
win-inet-pton==1.1.0
wincertstore==0.2
wrapt==1.11.2
xarray==0.14.1
xlrd==1.2.0
xmltodict==0.12.0
yapf==0.28.0
zipp==0.6.0
It might be worthwhile to profile your code in detail to get precise insight into your bottleneck.
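For instance, a minimal cProfile sketch (allocate_seed_values is a hypothetical wrapper around your allocation loop; adapt the name to your script):
import cProfile, pstats
profiler = cProfile.Profile()
profiler.enable()
allocate_seed_values()  # hypothetical wrapper around the slow part
profiler.disable()
# Print the 20 most expensive calls by cumulative time:
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)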
Below are some suggestions to improve your script's performance (a sketch applying the first two follows this list):
- Instead of appending 1 to a list (list.append(1)) and summing it to count occurrences, use collections.Counter instead;
- Instead of pandas.DataFrame.iterrows, use pandas.DataFrame.itertuples instead;
- Instead of initializing a column to 0.0 and immediately overwriting it, use pandas.DataFrame.fillna where you actually need a default. Eg. these lines:
temp_df['rel_contribution'] = 0.0
temp_df['rel_contribution'] = temp_df['overlay_area']/sum(temp_df.area)
- Avoid boolean indexing inside a loop; it scans the whole frame on every iteration. Eg. this line:
temp_df = merged_df[merged_df['seed_index'] == row['seed_index']]
Use a spatial index such as rtree instead (as suggested in a comment; choosing the best index requires experience);
- Also notice that GeoPandas has a spatial join method, geopandas.sjoin (which relies on rtree), that can join rows on spatial operations (default is intersection).
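As announced above, a quick sketch of the first two suggestions applied to the counting part of your loop (attribute names follow your column names; this only covers the counting, not the allocation):
from collections import Counter
overlap_counts = Counter(merged_df['seed_index'])  # replaces value_counts + .get in the loop
no_overlap_cells = 0
for row in seed_df.itertuples(index=False):        # noticeably faster than iterrows
    if overlap_counts[row.seed_index] == 0:        # Counter returns 0 for missing keys
        no_overlap_cells += 1                      # replaces list.append(1) + sum(...)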
I have the feeling that you can solve your problem with a left intersect join followed by a group-by. Then you could apply different logic to different bunches of rows and aggregate.
For the sake of the example, let's say we want to count the number of intersections and the area covered for all polygons in seed_df, and distribute seed_value based on the ratio of intersecting areas. We could achieve it like this:
# Merge datafarmes on geometry intersection
# keep geometries in seed_df which does not intersects at all:
df = gpd.sjoin(seed_df, leech_df, how='left').reset_index()
df = df.rename(columns={'index': 'seed_id', 'geometry': 'seed_geom', 'index_right': 'leech_id'})
# Add leech_df geometry to merged dataframe:
df = df.merge(leech_df, left_on='leech_id', right_index=True, how='left')
df = df.rename(columns={'geometry': 'leech_geom'})
# Create a function computing the intersection area (fault tolerant against missing leech geometry):
def func(x):
    if x['leech_geom']:
        return x['seed_geom'].intersection(x['leech_geom']).area
# Apply function (name the column 'int_area'; the group-by below relies on it):
df['int_area'] = df.apply(func, axis=1)
The result looks like this (df.tail(4)):
    seed_id                                          seed_geom  seed_value  \
8         5  POLYGON ((0.25000 0.50000, 0.25000 0.75000, 0....          10
9         6  POLYGON ((0.50000 0.00000, 0.50000 0.25000, 0....          10
10        7  POLYGON ((0.50000 0.25000, 0.50000 0.50000, 0....          10
11        8  POLYGON ((0.50000 0.50000, 0.50000 0.75000, 0....          10

    leech_id                                         leech_geom  int_area
8        2.0  POLYGON ((0.25000 0.62500, 0.25000 0.37500, 0....   0.03125
9        NaN                                               None       NaN
10       2.0  POLYGON ((0.25000 0.62500, 0.25000 0.37500, 0....   0.03125
11       2.0  POLYGON ((0.25000 0.62500, 0.25000 0.37500, 0....   0.03125
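As an aside, the row-wise apply can itself be vectorized; a sketch, assuming you first drop the rows with no matching leech geometry:
matched = df.dropna(subset=['leech_geom'])
seed_geoms = gpd.GeoSeries(matched['seed_geom'])
leech_geoms = gpd.GeoSeries(matched['leech_geom'])
# Element-wise intersection, aligned on the shared index:
df.loc[matched.index, 'int_area'] = seed_geoms.intersection(leech_geoms).area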
At this point, we group by and aggregate:
# Group by and aggregate:
agg = df.groupby('seed_id')['int_area'].agg(['count', 'sum']).reset_index()
agg = agg.rename(columns={'count': 'int_count', 'sum': 'int_sum'})
It leads to:
   seed_id  int_count   int_sum
0        0          1  0.015625
1        1          2  0.015625
2        2          2  0.022500
3        3          1  0.015625
4        4          2  0.046875
5        5          1  0.031250
6        6          0  0.000000
7        7          1  0.031250
8        8          1  0.031250
Then, we merge the aggregate against our original dataframe and perform the final computations:
final = df.merge(agg)
final['leech_value'] = final['seed_value']*final['int_area']/final['int_sum']
The final result is:
    seed_id  seed_value  leech_id  int_area  int_count   int_sum  leech_value
0         0          10       0.0  0.015625          1  0.015625    10.000000
1         1          10       0.0  0.015625          2  0.015625    10.000000
2         1          10       2.0  0.000000          2  0.015625     0.000000
3         2          10       1.0  0.022500          2  0.022500    10.000000
4         2          10       2.0  0.000000          2  0.022500     0.000000
5         3          10       0.0  0.015625          1  0.015625    10.000000
6         4          10       0.0  0.015625          2  0.046875     3.333333
7         4          10       2.0  0.031250          2  0.046875     6.666667
8         5          10       2.0  0.031250          1  0.031250    10.000000
9         6          10       NaN       NaN          0  0.000000          NaN
10        7          10       2.0  0.031250          1  0.031250    10.000000
11        8          10       2.0  0.031250          1  0.031250    10.000000
This allocates seed_value to the polygons in leech_df based on the ratio of the overlapping areas.
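Note the NaN leech_value for seed cell 6, which intersects nothing; if that gets in the way downstream, the fillna suggestion from the list above applies in one vectorized call:
final['leech_value'] = final['leech_value'].fillna(0.0)  # cells with no intersection contribute nothing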
If you wish to know how leech_value is distributed to the leech polygons, then aggregate once more:
final.groupby('leech_id')['leech_value'].sum()
It returns:
leech_id
0.0 33.333333
1.0 10.000000
2.0 36.666667
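To write these totals back into leech_df as the leeched_value column you expect (a sketch; the int cast handles the float leech_id produced by the left join):
allocated = final.groupby('leech_id')['leech_value'].sum()
allocated.index = allocated.index.astype(int)
# Shapes that received no contribution get 0.0:
leech_df['leeched_value'] = allocated.reindex(leech_df.index, fill_value=0.0)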
Graphically: (figure omitted)
To make sjoin work you need to install rtree, which in turn requires the libspatialindex-dev library. On Debian it boils down to:
apt-get update && apt-get install libspatialindex-dev
python3 -m pip install --upgrade geopandas