 

Pandas groupby apply performing slow

I am working on a program that involves large amounts of data. I am using the Python pandas module to look for errors in my data. This usually works very fast. However, this particular piece of code seems to be way slower than it should be, and I am looking for a way to speed it up.

So that you can properly test it, I have included a rather large piece of code below. You should be able to run it as is. The comments in the code explain what I am trying to do. Any help would be greatly appreciated.

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np

# Filling dataframe with data
# Just ignore this part for now, real data comes from csv files, this is an example of how it looks
TimeOfDay_options = ['Day','Evening','Night']
TypeOfCargo_options = ['Goods','Passengers']
np.random.seed(1234)
n = 10000

df = pd.DataFrame()
df['ID_number'] = np.random.randint(3, size=n)
df['TimeOfDay'] = np.random.choice(TimeOfDay_options, size=n)
df['TypeOfCargo'] = np.random.choice(TypeOfCargo_options, size=n)
df['TrackStart'] = np.random.randint(400, size=n) * 900
df['SectionStart'] = np.nan
df['SectionStop'] = np.nan

grouped_df = df.groupby(['ID_number','TimeOfDay','TypeOfCargo','TrackStart'])
for index, group in grouped_df:
    if len(group) == 1:
        df.loc[group.index,['SectionStart']] = group['TrackStart']
        df.loc[group.index,['SectionStop']] = group['TrackStart'] + 899

    if len(group) > 1:
        track_start = group.loc[group.index[0],'TrackStart']
        track_end = track_start + 899
        section_stops = np.random.randint(track_start, track_end, size=len(group))
        section_stops[-1] = track_end
        section_stops = np.sort(section_stops)
        section_starts = np.insert(section_stops, 0, track_start)

        for i,start,stop in zip(group.index,section_starts,section_stops):
            df.loc[i,['SectionStart']] = start
            df.loc[i,['SectionStop']] = stop

#%% This is what a random group looks like without errors
#Note that each section neatly starts where the previous section ended
#There are no gaps (The whole track is defined)
grouped_df.get_group((2, 'Night', 'Passengers', 323100))

#%% Introducing errors to the data
df.loc[2640,'SectionStart'] += 100
df.loc[5390,'SectionStart'] += 7

#%% This is what the same group looks like after introducing errors
#Note that the 'SectionStop' of row 1525 is no longer similar to the 'SectionStart' of row 2640
#This track now has a gap of 100, it is not completely defined from start to end
grouped_df.get_group((2, 'Night', 'Passengers', 323100))

#%% Try to locate the errors
#This is the part of the code I need to speed up

def Full_coverage(group):
    if len(group) > 1:
        #Sort the grouped data by column 'SectionStart' from low to high

        #Updated for newer pandas version
        #group.sort('SectionStart', ascending=True, inplace=True)
        group.sort_values('SectionStart', ascending=True, inplace=True)

        #Some initial values, overwritten at the end of each loop
        #These variables correspond to the first row of the group
        start_km = group.iloc[0,4]
        end_km = group.iloc[0,5]
        end_km_index = group.index[0]

        #Loop through all the rows in the group
        #index is the index of the row
        #i is the 'SectionStart' of the row
        #j is the 'SectionStop' of the row
        #The loop starts from the 2nd row in the group
        for index, (i, j) in group.iloc[1:,[4,5]].iterrows():

            #The start of the next row must be equal to the end of the previous row in the group
            if i != end_km:

                #Add the faulty data to the error list
                incomplete_coverage.append(('Expected startpoint: '+str(end_km)+' (row '+str(end_km_index)+')', \
                                    'Found startpoint: '+str(i)+' (row '+str(index)+')'))

            #Overwrite these values for the next loop
            start_km = i
            end_km = j
            end_km_index = index

    return group

#Check if the complete track is completely defined (from start to end) for each combination of:
#'ID_number','TimeOfDay','TypeOfCargo','TrackStart'
incomplete_coverage = [] #Create empty list for storing the error messages
df_grouped = df.groupby(['ID_number','TimeOfDay','TypeOfCargo','TrackStart']).apply(lambda x: Full_coverage(x))

#Print the error list
print('\nFound incomplete coverage in the following rows:')
for i,j in incomplete_coverage:
    print(i)
    print(j)
    print()

#%% Time the procedure -- It is very slow, taking about 6.6 seconds on my pc
%timeit df.groupby(['ID_number','TimeOfDay','TypeOfCargo','TrackStart']).apply(lambda x: Full_coverage(x))
Asked by Alex on Nov 03 '15


People also ask

Is apply slow in pandas?

Row-wise applies (axis=1) are common, typically slow, and often straightforward to vectorise. With axis=1, the applied function f is called with a Series for each row in the dataframe, so the Python-level call overhead grows with the number of rows.
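For instance, here is a minimal sketch (with made-up columns a and b) of replacing a row-wise apply with the equivalent column-wise operation:

import pandas as pd
import numpy as np

# Hypothetical data for illustration
df = pd.DataFrame({'a': np.random.rand(100000), 'b': np.random.rand(100000)})

# Row-wise apply: the function is called once per row, which is slow
slow = df.apply(lambda row: row['a'] + row['b'], axis=1)

# Vectorised: the same result computed on whole columns at once
fast = df['a'] + df['b']

assert slow.equals(fast)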

How do you make apply faster in pandas?

You can speed up the execution even further by using another trick: making your pandas dataframes lighter by using more efficient data types. As we know that df only contains integers from 1 to 10, we can reduce the data type from 64 bits to 16 bits. See how we reduced the size of our dataframe from 38MB to 9.5MB.
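As a rough sketch (the column name x and the sizes below are illustrative only and depend on your machine and pandas version), downcasting a column that only holds small integers looks like this:

import pandas as pd
import numpy as np

# Hypothetical example: a column that only ever holds the integers 1..10
df = pd.DataFrame({'x': np.random.randint(1, 11, size=1000000)})
print(df.memory_usage(deep=True).sum())  # roughly 8 MB with the default int64

df['x'] = df['x'].astype(np.int16)       # values 1..10 easily fit in 16 bits
print(df.memory_usage(deep=True).sum())  # roughly 2 MB after downcasting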

Why is pandas so slow?

Pandas keeps track of data types, indexes and performs error checking — all of which are very useful, but also slow down the calculations. NumPy doesn't do any of that, so it can perform the same calculations significantly faster.
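A small illustration of that overhead (the Series s is made up for this sketch, and the timings are indicative only):

import pandas as pd
import numpy as np

s = pd.Series(np.random.rand(1000000))
arr = s.to_numpy()  # the underlying NumPy array, without index or dtype bookkeeping

%timeit s * 2    # pandas Series: index handling and checks add overhead
%timeit arr * 2  # plain NumPy array: skips the bookkeeping, so usually faster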

How does pandas GroupBy apply work?

The GroupBy function in Pandas employs the split-apply-combine strategy: it splits an object into groups, applies a function to each group, and combines the results.
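A tiny, self-contained illustration of that strategy (the cargo/weight data here is made up):

import pandas as pd

df = pd.DataFrame({'cargo': ['Goods', 'Goods', 'Passengers'],
                   'weight': [10, 20, 5]})

# split on 'cargo', apply sum to each group, combine the results into one Series
print(df.groupby('cargo')['weight'].sum())
# cargo
# Goods         30
# Passengers     5
# Name: weight, dtype: int64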


1 Answer

The problem, I believe, is that your data has 5300 distinct groups. Due to this, anything slow within your function will be magnified. You could probably use a vectorized operation rather than a for loop in your function to save time, but a much easier way to shave off a few seconds is to return 0 rather than return group. When you return group, pandas will actually create a new data object combining your sorted groups, which you don't appear to use. When you return 0, pandas will combine 5300 zeros instead, which is much faster.

For example:

cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
groups = df.groupby(cols)
print(len(groups))  # 5353

%timeit df.groupby(cols).apply(lambda group: group)
# 1 loops, best of 3: 2.41 s per loop

%timeit df.groupby(cols).apply(lambda group: 0)
# 10 loops, best of 3: 64.3 ms per loop

Just combining the results you don't use is taking about 2.4 seconds; the rest of the time is actual computation in your loop which you should attempt to vectorize.


Edit:

With a quick additional vectorized check before the for loop and returning 0 instead of group, I got the time down to about 2 seconds, which is basically the cost of sorting each group. Try this function:

def Full_coverage(group):
    if len(group) > 1:
        # use sort_values (DataFrame.sort was removed in newer pandas)
        group = group.sort_values('SectionStart', ascending=True)

        # this condition is sufficient to find when the loop
        # will add to the list
        if np.any(group.values[1:, 4] != group.values[:-1, 5]):
            start_km = group.iloc[0,4]
            end_km = group.iloc[0,5]
            end_km_index = group.index[0]

            for index, (i, j) in group.iloc[1:,[4,5]].iterrows():
                if i != end_km:
                    incomplete_coverage.append(('Expected startpoint: '+str(end_km)+' (row '+str(end_km_index)+')', \
                                        'Found startpoint: '+str(i)+' (row '+str(index)+')'))

                start_km = i
                end_km = j
                end_km_index = index

    return 0

cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
%timeit df.groupby(cols).apply(Full_coverage)
# 1 loops, best of 3: 1.74 s per loop

Edit 2: Here's an example which incorporates my suggestion to move the sort outside the groupby and to remove the unnecessary loops. Removing the loops is not much faster for the given example, but will be faster if there are a lot of incomplete sections:

def Full_coverage_new(group):
    if len(group) > 1:
        mask = group.values[1:, 4] != group.values[:-1, 5]
        if np.any(mask):
            err = ('Expected startpoint: {0} (row {1}) '
                   'Found startpoint: {2} (row {3})')
            incomplete_coverage.extend([err.format(group.iloc[i, 5],
                                                   group.index[i],
                                                   group.iloc[i + 1, 4],
                                                   group.index[i + 1])
                                        for i in np.where(mask)[0]])
    return 0

incomplete_coverage = []
cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
df_s = df.sort_values(['SectionStart','SectionStop'])
df_s.groupby(cols).apply(Full_coverage_new)
Answered by jakevdp on Sep 21 '22