
Best way to cache a pandas dataframe?

Yesterday I learned the hard way that saving a pandas dataframe to CSV for later use is a bad idea. I have a dataframe of roughly 130k tweets, in which each row holds a list of tweets. When I saved the data to CSV and then loaded the dataframe back in, the values in those rows were of type string instead of list. This led to all kinds of errors and a lot of debugging. Of course, it was a stupid mistake to assume that CSV would preserve information about the data structures stored in my dataframe.
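
The behaviour described above can be reproduced with a small sketch. The sample data and the `tweets` column name are made up for illustration, and an in-memory buffer stands in for the CSV file on disk:

```python
import io

import pandas as pd

# Hypothetical stand-in for the tweet data: each cell holds a list of strings.
df = pd.DataFrame({"tweets": [["first tweet", "second tweet"], ["third tweet"]]})

# Round-trip through CSV, using an in-memory buffer instead of a real file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

original_type = type(df.loc[0, "tweets"])
restored_type = type(restored.loc[0, "tweets"])
print(original_type, restored_type)  # <class 'list'> <class 'str'>
```

CSV stores only text, so each list is written out as its string representation and read back as a plain string.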

My question now is: How do I save a dataframe for later use so that the data types of its columns and cell values are preserved?

Psychotechnopath asked Oct 21 '25 12:10



1 Answer

I hope you found the solution you were looking for.
To answer the question: you can use the DataFrame.to_pickle() method to serialize the dataframe, that is, to convert the Python objects into a byte stream. When you deserialize the pickle file with pd.read_pickle(), you get the data back as they were, including their types. Keep in mind that pickle files can pose a security threat when they come from untrusted sources, because loading a pickle can execute arbitrary code.

Here's an example from the pandas documentation on how to use pickle:

>>> import pandas as pd
>>> original_df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
>>> original_df
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9

>>> pd.to_pickle(original_df, "./dummy.pkl")
>>> unpickled_df = pd.read_pickle("./dummy.pkl")
>>> unpickled_df
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
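
The documentation example only uses integer columns. As a minimal sketch of the case from the question, the following round-trips a column of lists through a pickle file. The sample data, the `tweets` column name, and the `tweets.pkl` file name are assumptions made for illustration:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the tweet data: each cell holds a list of strings.
df = pd.DataFrame({"tweets": [["first tweet", "second tweet"], ["third tweet"]]})

# Write the pickle into a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tweets.pkl")
    df.to_pickle(path)
    restored = pd.read_pickle(path)

first_cell = restored.loc[0, "tweets"]
print(type(first_cell))  # <class 'list'>
```

Unlike the CSV round trip, the cells come back as Python lists, so code that iterates over the tweets keeps working after reloading.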
Singh answered Oct 23 '25 00:10



