
.dropna() increases memory usage

Tags: python, pandas

First I import the whole file and get a memory consumption of 1002.0+ KB

import pandas as pd

df = pd.read_csv(
    filepath_or_buffer="./dataset/chicago.csv"
)
print(df.info())

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 32063 entries, 0 to 32062
# Data columns (total 4 columns):
# Name                      32062 non-null object
# Position Title            32062 non-null object
# Department                32062 non-null object
# Employee Annual Salary    32062 non-null object
# dtypes: object(4)
# memory usage: 1002.0+ KB
# None

Then I drop the rows that are entirely NaN, run the script again, and get a memory consumption of 1.2+ MB:

df = pd.read_csv(
    filepath_or_buffer="./dataset/chicago.csv"
).dropna(how="all")
print(df.info())

# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 32062 entries, 0 to 32061
# Data columns (total 4 columns):
# Name                      32062 non-null object
# Position Title            32062 non-null object
# Department                32062 non-null object
# Employee Annual Salary    32062 non-null object
# dtypes: object(4)
# memory usage: 1.2+ MB
# None

Since I'm dropping one row, I would expect memory consumption to go down or at least stay the same, not go up.

Does anybody know why this is happening, how to fix it, or whether it is a bug?

EDIT: chicago.csv

juanp_1982 asked Jan 26 '23 12:01


1 Answer

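The change comes from the fact that your index changed from a RangeIndex to an Int64Index, which takes more memory. A RangeIndex only stores its start, stop and step, while an Int64Index stores one 8-byte integer per row. For your 32062 rows that is about 250 KB, which accounts for the jump from 1002.0+ KB to 1.2+ MB.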
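To see the index cost in isolation, you can measure the index alone with Index.memory_usage(). The following is a minimal sketch on a made-up frame (not the chicago.csv data), with the missing value placed in the middle so the surviving labels cannot form a range. The exact byte count for the RangeIndex varies by pandas version, and in recent pandas the integer index is reported as a plain Index with dtype int64 rather than Int64Index.

```python
import pandas as pd

# Hypothetical sample data: 10000 floats with one NaN in the middle.
values = [float(i) for i in range(10000)]
values[1000] = float("nan")
df = pd.DataFrame({"a": values})

before = df.index           # RangeIndex: only start/stop/step are stored
after = df.dropna().index   # labels 0..999, 1001..9999: one integer per row

range_bytes = before.memory_usage()
int_bytes = after.memory_usage()

print(type(before).__name__, range_bytes)
print(type(after).__name__, int_bytes)
```

The second number should be roughly 8 bytes times the number of remaining rows, while the first stays tiny no matter how long the frame is.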

You can "fix" this by resetting the index after the dropna(), but this will have the side effect of changing the row index (which you may not care about).

Here is an illustrative example:

First create a sample DataFrame:

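import pandas as pd

# float64 from the start, so assigning a missing value needs no upcast
df = pd.DataFrame({"a": range(10000)}, dtype="float64")
df.loc[1000, "a"] = None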

Print the info:

print(df.info())
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 10000 entries, 0 to 9999
#Data columns (total 1 columns):
#a    9999 non-null float64
#dtypes: float64(1)
#memory usage: 78.2 KB

Drop the na values:

print(df.dropna().info())
#<class 'pandas.core.frame.DataFrame'>
#Int64Index: 9999 entries, 0 to 9999
#Data columns (total 1 columns):
#a    9999 non-null float64
#dtypes: float64(1)
#memory usage: 156.2 KB

Reset (and drop) the index:

df.dropna().reset_index(drop=True).info()
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 9999 entries, 0 to 9998
#Data columns (total 1 columns):
#a    9999 non-null float64
#dtypes: float64(1)
#memory usage: 78.2 KB
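
The side effect mentioned above is that the row labels are renumbered, so a label that pointed at one row before the reset can point at a different row afterwards. A small sketch with made-up values shows what that looks like:

```python
import pandas as pd

# Hypothetical four-row frame with one missing value at label 1.
df = pd.DataFrame({"a": [10.0, float("nan"), 30.0, 40.0]})

dropped = df.dropna()                        # keeps the original labels
renumbered = dropped.reset_index(drop=True)  # relabels the rows from 0

print(list(dropped.index))      # [0, 2, 3]
print(list(renumbered.index))   # [0, 1, 2]

# The same label now refers to a different row.
print(dropped.loc[2, "a"])      # 30.0
print(renumbered.loc[2, "a"])   # 40.0
```

If other code looks rows up by their original labels (for example to join back to the source file), keep the Int64Index and accept the extra memory.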
pault answered Jan 29 '23 01:01