
Selecting 1.6M rows of a pandas dataframe [duplicate]

I have a csv file with ~2.3M rows. I'd like to save the subset (~1.6M) of rows that have non-NaN values in two of the columns. I'd like to keep using pandas to do this. Right now, my code looks like:

import pandas as pd
catalog = pd.read_csv('catalog.txt')
slim_list = []
for i in range(len(catalog)):
    if (pd.isna(catalog['z'][i]) == False and pd.isna(catalog['B'][i]) == False):
        slim_list.append(i)

which collects the indices of the rows of catalog that have non-NaN values in both columns. I then make a new catalog with those rows as entries:

slim_catalog = pd.DataFrame(columns = catalog.columns)
for j in range(len(slim_list)):
    data = (catalog.iloc[slim_list[j]]).to_dict()
    slim_catalog = slim_catalog.append(data, ignore_index = True)
slim_catalog.to_csv('slim_catalog.csv')

This should, in principle, work. It's sped up a little by reading each row into a dict. However, it takes far, far too long to execute for all 2.3M rows. What is a better way to solve this problem?

asked Nov 06 '22 by user3517167

1 Answer

This is the completely wrong way of doing this in pandas.

Firstly, never iterate over some range, i.e. for i in range(len(catalog)):, and then individually index into the row with catalog['z'][i]; that is incredibly inefficient. Every scalar lookup goes through Python-level indexing machinery, whereas a vectorized call like pd.isna(catalog['z']) checks the entire column in a single pass.

Second, do not create a pandas.DataFrame using pd.DataFrame.append in a loop: each append copies the entire frame, so a single call is a linear operation and the whole loop takes quadratic time. (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0.)
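If you genuinely needed to build a frame row by row, the usual pattern is to accumulate plain Python objects in a list, where appends are amortized O(1), and construct the DataFrame once at the end. A minimal sketch of that pattern, reusing the 'z' and 'B' column names and file names from the question:

import pandas as pd

catalog = pd.read_csv('catalog.txt')

# Appending to a Python list is cheap; only one DataFrame is ever built.
rows = []
for row in catalog.to_dict('records'):
    if pd.notna(row['z']) and pd.notna(row['B']):
        rows.append(row)

# Single O(n) construction instead of quadratic repeated appends.
slim_catalog = pd.DataFrame(rows)
slim_catalog.to_csv('slim_catalog.csv')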

But you shouldn't be looping here to begin with. All you need is something like

catalog[catalog.loc[:, ['z', 'B']].notna().all(axis=1)].to_csv('slim_catalog.csv')

Or broken up to perhaps be more readable:

not_nan_zB = catalog.loc[:, ['z', 'B']].notna().all(axis=1)
catalog[not_nan_zB].to_csv('slim_catalog.csv')
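Equivalently, dropna with its subset argument does the same row filtering, since the default how='any' drops a row if either of the listed columns is NaN:

catalog.dropna(subset=['z', 'B']).to_csv('slim_catalog.csv')

Which spelling to use is a matter of taste; both do the filtering in vectorized code rather than a Python loop.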
answered Nov 14 '22 by juanpa.arrivillaga