This is my first time asking a question.
I'm working with a large CSV dataset (it contains over 15 million rows and is over 1.5 GB in size).
I'm loading extracts of the data into Pandas DataFrames in Jupyter notebooks to derive an algorithm from the dataset. I group the data by MAC address, which results in over a million groups.
Core to my algorithm development is running this operation:
pandas.core.groupby.DataFrameGroupBy.filter
Running this operation takes 3 to 5 minutes, depending on the dataset. To develop this algorithm, I must execute it hundreds, perhaps thousands, of times.
This operation appears to be CPU-bound and uses only one of the several cores available on my machine. I spent a few hours researching potential solutions online. I've tried using both numba and dask to accelerate this operation, and both attempts resulted in exceptions. Numba gave a message to the effect of "this should not have happened, thank you for helping improve the product", and Dask, it appears, may not implement the DataFrameGroupBy.filter operation. I could not determine how to rewrite my code to use pool/map.
I'm looking for suggestions on how to accelerate this operation:
pandas.core.groupby.DataFrameGroupBy.filter
Here is an example of this operation in my code. There are other examples, all of which seem to have about the same execution time.
import pandas as pd

def import_data(_file, _columns):
    df = pd.read_csv(_file, low_memory=False)
    df[_columns] = df[_columns].apply(pd.to_numeric, errors='coerce')
    df = df.sort_values(by=['mac', 'time'])
    # The line below takes ~3 to 5 minutes to run
    df = df.groupby(['mac']).filter(lambda x: x['latency'].count() > 1)
    return df
How can I speed this up?
Groupby is a very popular function in Pandas. It is very good at summarising, transforming, and filtering data, along with a few other essential data analysis tasks.
GroupBy follows the split-apply-combine pattern: (1) splitting the data into groups, (2) applying a function to each group independently, and (3) combining the results into a data structure. Of these steps, Pandas groupby() handles the split, and it is the most straightforward part.
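To make the three steps concrete, here is a minimal sketch on a toy frame (the data below is made up purely for illustration):

import pandas as pd

# Toy data, purely illustrative
df = pd.DataFrame({'mac': ['aa:01', 'aa:01', 'bb:02'],
                   'latency': [10.0, 20.0, 30.0]})

grouped = df.groupby('mac')          # (1) split the rows into groups by 'mac'
counts = grouped['latency'].count()  # (2) apply a function to each group...
print(counts)                        # (3) ...and combine the results into a Series
# mac
# aa:01    2
# bb:02    1
# Name: latency, dtype: int64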
Note that the Pandas GroupBy object is lazy: it delays almost every part of the split-apply-combine process until you call a method on it.
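You can see this laziness directly. Continuing with the toy df above, constructing the GroupBy object is cheap, and nothing is computed until a method is called on it:

grouped = df.groupby('mac')   # near-instant: returns a DataFrameGroupBy object,
                              # no grouping work has been done yet
print(type(grouped))          # <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
grouped['latency'].count()    # only here does the split-apply-combine actually run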
filter is generally known to be slow when used with GroupBy. If you are trying to filter a DataFrame based on a condition computed per group, a better alternative is to use transform or map:
# Count non-null 'latency' values per 'mac' group, broadcast the count
# back to every row, and keep rows whose group count exceeds 1
df[df.groupby('mac')['latency'].transform('count').gt(1)]

# Equivalent: map each 'mac' to its group's count, then build the mask
df[df['mac'].map(df.groupby('mac')['latency'].count()).gt(1)]
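As a rough sanity check that the rewrite is faster and returns the same rows, you could time both versions on your own data, something like the sketch below (it assumes a DataFrame df with the 'mac' and 'latency' columns from the question):

import time

start = time.perf_counter()
slow = df.groupby('mac').filter(lambda x: x['latency'].count() > 1)
print('filter:    %.1f s' % (time.perf_counter() - start))

start = time.perf_counter()
fast = df[df.groupby('mac')['latency'].transform('count').gt(1)]
print('transform: %.1f s' % (time.perf_counter() - start))

# Both approaches preserve the original row order, so the results should match
assert slow.equals(fast)

The speedup comes from transform('count') computing the per-group counts as a single vectorized operation, whereas filter invokes a Python lambda once for each of your million-plus groups.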