I have a large CSV file, about 600 MB with 11 million rows, and I want to create statistical data like pivots, histograms, graphs, etc. Obviously, trying to just read it normally:
df = pd.read_csv('Check400_900.csv', sep='\t')
doesn't work, so I found iterator and chunksize in a similar post, and I used
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
All good, I can for example print(df.get_chunk(5))
and search the whole file with just
for chunk in df: print(chunk)
My problem is that I don't know how to use things like the ones below on the whole df, not just on one chunk:
plt.plot()
print(df.head())
print(df.describe())
print(df.dtypes)
customer_group3 = df.groupby('UserID')
y3 = customer_group3.size()
I hope my question is not too confusing.
One way to process large files with pandas is to read the entries in chunks of a reasonable size: each chunk is read into memory and processed before the next one is read. The chunksize parameter specifies the size of each chunk, i.e. the number of rows.
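For example, here is a minimal sketch (assuming the same tab-separated file and the 'UserID' column from the question) that builds per-user counts across all chunks without ever loading the whole file:

import pandas as pd

# read the file lazily in chunks of 1,000,000 rows
chunks = pd.read_csv('Check1_900.csv', sep='\t', chunksize=1000000)

# accumulate per-user counts chunk by chunk
counts = pd.Series(dtype='int64')
for chunk in chunks:
    counts = counts.add(chunk.groupby('UserID').size(), fill_value=0)

print(counts.sort_values(ascending=False).head())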
Use efficient data types. The default pandas data types are not the most memory-efficient. This is especially true for text columns with relatively few unique values (commonly referred to as "low-cardinality" data). By using more efficient data types, you can store larger datasets in memory.
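A hedged sketch of the idea (the column names and types here are made up for illustration, adjust them to the real file): declare smaller numeric types and the category dtype directly in read_csv, then check the result.

import pandas as pd

# hypothetical column names and types, just to illustrate the idea
dtypes = {
    'UserID': 'category',   # low-cardinality ID/text column
    'Amount': 'float32',    # half the memory of the default float64
    'Flag': 'int8',         # small integer instead of int64
}
df = pd.read_csv('Check1_900.csv', sep='\t', dtype=dtypes)

# check how much memory the DataFrame actually uses
print(df.memory_usage(deep=True).sum())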
Vectorization is always the first and best choice. If you really must iterate, converting the DataFrame to a NumPy array or to dictionaries can speed up the loop considerably; in one benchmark on about 20 million records, iterating over dictionary records was roughly 280x faster than iterating over DataFrame rows.
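A minimal sketch of that contrast (the column names are assumptions, not from the original file): prefer a single vectorized expression, and if a loop is unavoidable, iterate over plain Python structures rather than over DataFrame rows.

import numpy as np
import pandas as pd

df = pd.DataFrame({'price': np.random.rand(1000),
                   'qty': np.random.randint(1, 10, 1000)})

# vectorized: one operation on whole columns
df['total'] = df['price'] * df['qty']

# if a loop is unavoidable, convert to plain dicts first
records = df.to_dict('records')          # list of {column: value} dicts
totals = [r['price'] * r['qty'] for r in records]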
The longer answer is that there is no set limit on the number of cells in a pandas DataFrame; the practical limit is the amount of memory available, since the entire DataFrame must fit in RAM.
Solution, if you need to create one big DataFrame:
If you need to process all the data at once (which is possible, but not recommended), then use concat to combine all the chunks into one df, because the output of:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
isn't a DataFrame, but a pandas.io.parsers.TextFileReader (source).
tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(tp)
# <pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
df = pd.concat(tp, ignore_index=True)
I think it is necessary to add the parameter ignore_index to the concat function, to avoid duplicate indexes; a tiny illustration follows.
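Without ignore_index each chunk keeps its own 0-based index, so the concatenated result contains repeated index labels (small made-up frames, just to show the effect):

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})          # index 0, 1
b = pd.DataFrame({'x': [3, 4]})          # index 0, 1 again

print(pd.concat([a, b]).index.tolist())                     # [0, 1, 0, 1]
print(pd.concat([a, b], ignore_index=True).index.tolist())  # [0, 1, 2, 3]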
EDIT:
But if you want to work with large data, e.g. aggregating, it is much better to use dask, because it provides advanced parallelism.
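A minimal sketch with dask.dataframe (assuming the same file and the 'UserID' column from the question; dask has to be installed separately):

import dask.dataframe as dd

# lazily partitions the file; nothing is read until .compute()
ddf = dd.read_csv('Check1_900.csv', sep='\t')

# the same pandas-style API, executed in parallel across partitions
y3 = ddf.groupby('UserID').size().compute()
print(y3.head())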