I have a csv file with ~2.3M rows. I'd like to save the subset (~1.6M rows) that have non-nan values in two particular columns of the dataframe. I'd like to keep using pandas to do this. Right now, my code looks like:
import pandas as pd
catalog = pd.read_csv('catalog.txt')
slim_list = []
for i in range(len(catalog)):
    if (pd.isna(catalog['z'][i]) == False and pd.isna(catalog['B'][i]) == False):
        slim_list.append(i)
where slim_list holds the indices of the rows of catalog which have non-nan values in both columns. I then make a new catalog with those rows as entries:
slim_catalog = pd.DataFrame(columns = catalog.columns)
for j in range(len(slim_list)):
    data = catalog.iloc[slim_list[j]].to_dict()
    slim_catalog = slim_catalog.append(data, ignore_index = True)
slim_catalog.to_csv('slim_catalog.csv')
This should, in principle, work. It's sped up a little by reading each row into a dict. However, it takes far, far too long to execute for all 2.3M rows. What is a better way to solve this problem?
This is the completely wrong way of doing this in pandas.
Firstly, never iterate over some range, i.e. for i in range(len(catalog)):, and then individually index into each row with catalog['z'][i]; that is incredibly inefficient.
Second, do not build a pandas.DataFrame by calling pd.DataFrame.append in a loop (a method that has since been deprecated and removed in pandas 2.0): each append copies the entire frame, so every iteration is a linear operation and the whole loop ends up taking quadratic time.
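If a row-by-row pass really were unavoidable, the usual fix for that second problem is to accumulate plain records in a Python list and construct the DataFrame once at the end. A minimal sketch, assuming the relevant column names (here 'z' and 'B') are valid Python identifiers so that itertuples attribute access works:

rows = []
for row in catalog.itertuples(index=False):
    # skip rows where either column is missing
    if not (pd.isna(row.z) or pd.isna(row.B)):
        rows.append(row._asdict())
slim_catalog = pd.DataFrame(rows)  # constructed once, so the cost stays linear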
But you shouldn't be looping here to begin with. All you need is something like
catalog[catalog.loc[:, ['z', 'B']].notna().all(axis=1)].to_csv('slim_catalog.csv')
Or broken up to perhaps be more readable:
not_nan_zB = catalog.loc[:, ['z', 'B']].notna().all(axis=1)
catalog[not_nan_zB].to_csv('slim_catalog.csv')
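For the same result you could also lean on pandas' built-in dropna with a column subset, which should give the identical set of rows; a sketch:

catalog.dropna(subset=['z', 'B']).to_csv('slim_catalog.csv', index=False)

The index=False keyword just keeps the DataFrame's integer index out of the output file; omit it if you want that column preserved.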