Is there a way to speed up the following pandas for loop?

My DataFrame contains 10,000,000 rows. After the group-by, about 9,000,000 sub-frames remain to loop through.

The code is:

import pandas as pd

data = pd.read_csv('big.csv', index_col=0).set_index(['id', 'date'])
for id, new_df in data.groupby(level=0):  # look at each mini df and do some analysis
    pass  # some code for each of the small data frames

This is super inefficient, and the code has been running for 10+ hours now.

Is there a way to speed it up?

Full Code:

d = pd.DataFrame() # new df to populate
print('Start of the loop')
for id, new_df in data.groupby(level=0):
    c = [new_df.iloc[i:] for i in range(len(new_df.index))]
    x = pd.concat(c, keys=new_df.index).reset_index(level=(2,3), drop=True).reset_index()
    x = x.set_index(['level_0','level_1', x.groupby(['level_0','level_1']).cumcount()])
    d = pd.concat([d, x])

To get the data:

data = pd.read_csv('https://raw.githubusercontent.com/skiler07/data/master/so_data.csv', index_col=0).set_index(['id','date'])

Note:

Most ids have only one date, which indicates a single visit. For ids with more visits, I would like to structure the data in a 3D format, e.g. store all of their visits in the second of the three dimensions. The output shape is (id, visits, features).
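To make the target layout concrete, here is a minimal sketch on made-up data. The `toy` frame and the `expand_visits` helper are illustrative assumptions, not part of the original code. For every visit of an id, it keeps that visit plus all later ones, stamps them with the date of the first visit in the window, and numbers them with a `depth` column.

```python
import pandas as pd

# Hypothetical toy data: id 1 has one visit, id 2 has three visits.
toy = pd.DataFrame(
    {
        "id": [1, 2, 2, 2],
        "date": [20180101, 20180104, 20180106, 20180109],
        "feature0": [0.1, 0.2, 0.3, 0.4],
    }
).set_index(["id", "date"])


def expand_visits(group):
    """Build one window per visit: the visit itself plus all later visits."""
    dates = group.index.get_level_values("date")
    windows = []
    for i in range(len(group)):
        window = group.iloc[i:].reset_index()
        window["date"] = dates[i]                  # date of the window's first visit
        window["depth"] = list(range(len(window)))  # position inside the window
        windows.append(window)
    return pd.concat(windows, ignore_index=True)


expanded = pd.concat(
    [expand_visits(g) for _, g in toy.groupby(level="id")],
    ignore_index=True,
)
print(expanded)
```

An id with n visits contributes n + (n - 1) + ... + 1 rows, so ids with many visits dominate the size of the result.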

asked Mar 16 '18 10:03

GRS

2 Answers

Here is one way to speed that up. The code below generates the desired new rows by processing the rows directly as plain lists, which saves the overhead of constantly constructing small DataFrames. Your sample of 100,000 rows runs in a couple of seconds on my machine, while your code takes more than 100 seconds on only 10,000 rows of the same sample. That is a couple of orders of magnitude improvement.

Code:

import pandas as pd


def make_3d(csv_filename):

    def expand(accum):
        # accum holds all rows of one id, sorted by date. For each visit i,
        # emit visits i, i+1, ... stamped with the date of visit i and
        # numbered by depth (0 for visit i itself).
        for i in range(len(accum)):
            date = accum[i][2]
            for j, r in enumerate(accum[i:]):
                new_row = list(r)
                new_row[0] = j     # column 0 is 'depth'
                new_row[2] = date  # column 2 is 'date'
                yield new_row

    def make_3d_lines(a_df):
        accum = []  # rows of the id currently being read
        for row in a_df.values.tolist():
            # column 1 is 'id'; a new id closes the previous group
            if accum and row[1] != accum[0][1]:
                yield from expand(accum)
                accum = []
            accum.append(row)
        if accum:
            # flush the last id, which no later row closes
            yield from expand(accum)

    df_data = pd.read_csv(csv_filename)
    # the first CSV column is the old row index; reuse it as 'depth'
    df_data.columns = ['depth'] + list(df_data.columns)[1:]

    new_df = pd.DataFrame(
        make_3d_lines(df_data.sort_values('id date'.split())),
        columns=df_data.columns
    ).astype(dtype=df_data.dtypes.to_dict())

    return new_df.set_index('id date'.split())

Test Code:

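import time

start_time = time.time()
df = make_3d('big-data.csv')
print(time.time() - start_time)

# drop most feature columns so the printed frame fits on screen
df = df.drop(columns=['feature%d' % i for i in range(3, 25)])
# show only the added rows (depth > 0)
print(df[df['depth'] != 0].head(10))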

Results:

1.7390995025634766

                          depth  feature0  feature1  feature2
id              date                                         
207555809644681 20180104      1   0.03125  0.038623  0.008130
247833985674646 20180106      1   0.03125  0.004378  0.004065
252945024181083 20180107      1   0.03125  0.062836  0.065041
                20180107      2   0.00000  0.001870  0.008130
                20180109      1   0.00000  0.001870  0.008130
329567241731951 20180117      1   0.00000  0.041952  0.004065
                20180117      2   0.03125  0.003101  0.004065
                20180117      3   0.00000  0.030780  0.004065
                20180118      1   0.03125  0.003101  0.004065
                20180118      2   0.00000  0.030780  0.004065

answered Sep 22 '22 04:09

Stephen Rauch

I believe your approach for feature engineering could be done better, but I will stick to answering your question.

In Python, iterating over a dictionary is much faster than iterating over a DataFrame.

Here is how I managed to process a huge pandas DataFrame (~100,000,000 rows):

# reset the DataFrame index to get id and date back as columns
df = data.reset_index()  # the index was (id, date)

# split the DataFrame by id
# and store the splits as DataFrames in a dictionary keyed by id
d = dict(tuple(df.groupby('id')))

# iterate over the dictionary and process the values
for key, value in d.items():
    pass  # each value is a DataFrame

# concatenate the values to get the original (processed) DataFrame back
df2 = pd.concat(d.values(), ignore_index=True)
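As a minimal sketch of the same pattern with the `pass` replaced by an actual step: the toy data and the `n_visits` column below are hypothetical placeholders for whatever per-id processing you need. Each processed sub-frame is written back under its key, and everything is concatenated once at the end rather than inside the loop.

```python
import pandas as pd

# Hypothetical toy data standing in for data.reset_index()
df = pd.DataFrame(
    {
        "id": [1, 2, 2],
        "date": [20180101, 20180104, 20180106],
        "feature0": [0.1, 0.2, 0.3],
    }
)

# split once into {id: sub-frame}
d = dict(tuple(df.groupby("id")))

# placeholder processing step: record how many visits each id has
for key, value in d.items():
    d[key] = value.assign(n_visits=len(value))

# a single concat at the end instead of growing a frame inside the loop
df2 = pd.concat(list(d.values()), ignore_index=True)
print(df2)
```

Replacing the value of an existing key while iterating is safe because the dictionary does not change size.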

answered Sep 24 '22 04:09

Abdulrahman Bres

Tags: python, pandas, numpy