
Pandas DataFrame slow to show shape or dtypes

I'm very new to Python and pandas. Any guidance, comments, or suggestions are appreciated!

Here is my issue: it takes a couple of minutes to get a result after I call df.shape or df.dtypes. The DataFrame has 1,610,658 rows and 5 columns. Three columns are stored as int64, one as float64, and one as datetime64.

I used the following code to practice loading and transforming data in Python. Both the load and the transform perform well, but I ran into this issue when I checked the output.

Update 1:

After setting some columns as the index, the time for df.shape dropped from 80+ s to 1.7 s, but df.dtypes still takes 80+ s.

import pandas as pd

###############
# Load
###############
raw = pd.read_csv("data.zip", compression='zip')

###############
# Transform
###############

# Map payment type labels to integer codes
payment_method = {
    "Cash": 1,
    "Card": 2,
}

df = raw.assign(
    # Encode site ids to int. Only two sites in this data
    site=(raw.site == "A").astype(int),
    # Encode payment types to int (unknown types become 0)
    payment=[payment_method.get(k, 0) for k in raw.payment],
    # Rescale values
    amount=raw.amount / 1e6,
    # Convert integer date key to datetime
    sold_date=pd.to_datetime(
        [str(dt) for dt in raw.sold_date],
        format="%Y%m%d"),
)

###############
# Check point
###############

df.shape # pain point I. Took minutes to return
# Out[9]: (1610658, 5)

df.dtypes # pain point II
# Out[10]: 
# site                       int64
# acct_key                   int64
# sold_date         datetime64[ns]
# amount                   float64
# payment                    int64

If I convert the DataFrame to a numpy.ndarray, I get the result instantly. I think I must be missing something. Please point me in the right direction.
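One way to narrow this down might be to time the attribute access in a plain script, away from the interactive console. The sketch below is only an illustration: it builds a smaller synthetic frame with the same five column types (the values are made up, not the real data), and `time_call` and `demo` are hypothetical names. If the timings here are tiny, the delay is more likely coming from the environment or from how the real frame was built than from `shape` or `dtypes` as such.

```python
import time

import numpy as np
import pandas as pd


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Synthetic stand-in for the real data (assumption: same column names and
# dtypes as in the question, but random values and fewer rows).
n_rows = 100_000
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "site": rng.integers(0, 2, n_rows),
    "acct_key": rng.integers(0, 10_000, n_rows),
    "sold_date": pd.Timestamp("2017-01-01")
    + pd.to_timedelta(rng.integers(0, 365, n_rows), unit="D"),
    "amount": rng.random(n_rows),
    "payment": rng.integers(0, 3, n_rows),
})

# Time only the attribute access, not the printing of the result.
shape_seconds = time_call(lambda: demo.shape)
dtypes_seconds = time_call(lambda: demo.dtypes)

print(f"df.shape : {shape_seconds:.6f} s")
print(f"df.dtypes: {dtypes_seconds:.6f} s")
```

Running the same two `time_call` lines against the real `df` should show whether the slowness can be reproduced outside the console.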

Thanks a lot!

System: OS X 10.12
Python: 3.6.1
Numpy: 1.12
Pandas: 0.20.2
Jupyter console: 5.1.0

Richard H asked Jun 10 '17 18:06



1 Answer

Try to reduce the size of your DataFrame:

int_columns = df.select_dtypes(include=["int"]).columns
df[int_columns] = df[int_columns].apply(pd.to_numeric, downcast='unsigned')
float_columns = df.select_dtypes(include=["float"]).columns
df[float_columns] = df[float_columns].apply(pd.to_numeric, downcast='float')
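
To see how much such a downcast might save, the following sketch applies the same `pd.to_numeric(..., downcast=...)` idea column by column to a synthetic frame and compares memory usage before and after. The frame `demo` and its values are assumptions made up for illustration, not the asker's data. Note that downcasting floats to float32 reduces precision, so it may not be appropriate for monetary amounts.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption: random values only).
n_rows = 100_000
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "site": rng.integers(0, 2, n_rows),
    "acct_key": rng.integers(0, 10_000, n_rows),
    "amount": rng.random(n_rows),
    "payment": rng.integers(0, 3, n_rows),
})

bytes_before = demo.memory_usage(deep=True).sum()

# Downcast each integer column to the smallest unsigned type that fits.
for col in demo.select_dtypes(include=["integer"]).columns:
    demo[col] = pd.to_numeric(demo[col], downcast="unsigned")

# Downcast each float column to the smallest float type (float32 at minimum).
for col in demo.select_dtypes(include=["floating"]).columns:
    demo[col] = pd.to_numeric(demo[col], downcast="float")

bytes_after = demo.memory_usage(deep=True).sum()

print(demo.dtypes)
print(f"memory before: {bytes_before:,} bytes")
print(f"memory after : {bytes_after:,} bytes")
```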
Hai answered Dec 08 '22 16:12
