First I import the whole file and get a memory consumption of 1002.0+ KB
df = pd.read_csv(
filepath_or_buffer="./dataset/chicago.csv"
)
print(df.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 32063 entries, 0 to 32062
# Data columns (total 4 columns):
# Name 32062 non-null object
# Position Title 32062 non-null object
# Department 32062 non-null object
# Employee Annual Salary 32062 non-null object
# dtypes: object(4)
# memory usage: 1002.0+ KB
# None
then I drop NaN, run the script again and get a memory consumption of 1.2+ MB
df = pd.read_csv(
filepath_or_buffer="./dataset/chicago.csv"
).dropna(how="all")
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 32062 entries, 0 to 32061
# Data columns (total 4 columns):
# Name 32062 non-null object
# Position Title 32062 non-null object
# Department 32062 non-null object
# Employee Annual Salary 32062 non-null object
# dtypes: object(4)
# memory usage: 1.2+ MB
# None
since I'm dropping one row I would expect that memory consumption goes down or at least remain the same no this.
Does any body know why is this happening? or how to fix it? or if this is a bug?
EDIT: chicago.csv
The change comes from the fact that your index changed from a RangeIndex
to an Int64Index
, which takes more memory.
You can "fix" this by resetting the index after the dropna()
, but this will have the side effect of changing the row index (which you may not care about).
Here is an illustrative example:
First create a sample DataFrame:
df = pd.DataFrame({"a": range(10000)})
df.loc[1000, "a"] = None
Print the info:
print(df.info())
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 10000 entries, 0 to 9999
#Data columns (total 1 columns):
#a 9999 non-null float64
#dtypes: float64(1)
#memory usage: 78.2 KB
Drop the na values:
print(df.dropna().info())
#<class 'pandas.core.frame.DataFrame'>
#Int64Index: 9999 entries, 0 to 9999
#Data columns (total 1 columns):
#a 9999 non-null float64
#dtypes: float64(1)
#memory usage: 156.2 KB
Reset (and drop) the index:
df.dropna().reset_index(drop=True).info()
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 9999 entries, 0 to 9998
#Data columns (total 1 columns):
#a 9999 non-null float64
#dtypes: float64(1)
#memory usage: 78.2 KB
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With