python - Using pandas structures with large csv (iterate and chunksize)

I have a large csv file, about 600 MB with 11 million rows, and I want to create statistical data like pivots, histograms, graphs, etc. Obviously, trying to just read it normally:

df = pd.read_csv('Check400_900.csv', sep='\t') 

doesn't work, so I found iterator and chunksize in a similar post and used

df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000) 

All good; I can, for example, print df.get_chunk(5) and go through the whole file with just

for chunk in df:
    print(chunk)

My problem is that I don't know how to use things like those below on the whole df and not just on one chunk:

plt.plot()
print(df.head())
print(df.describe())
print(df.dtypes)
customer_group3 = df.groupby('UserID')
y3 = customer_group3.size()

I hope my question is not too confusing.

asked Nov 11 '15 by Thodoris P


People also ask

How does pandas work with large CSV files?

Using pandas. One way to process large files is to read the entries in chunks of reasonable size, which are read into memory and processed before reading the next chunk. We can use the chunksize parameter to specify the size of the chunk, which is the number of lines.
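
A minimal sketch of that pattern, reusing the file name and UserID column from the question; the chunk size and the groupby aggregation are only illustrative:

import pandas as pd

# process the file chunk by chunk so only one chunk is in memory at a time
partial_counts = []
for chunk in pd.read_csv('Check1_900.csv', sep='\t', chunksize=1000000):
    # partial result for this chunk: number of rows per UserID
    partial_counts.append(chunk.groupby('UserID').size())

# combine the partial results into one Series of total counts per UserID
total_counts = pd.concat(partial_counts).groupby(level=0).sum()
print(total_counts.head())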

Is pandas efficient for large data sets?

Use efficient datatypes. The default pandas data types are not the most memory-efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.
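
A short sketch of that idea; apart from UserID, the column names and dtypes below are hypothetical and only illustrate the technique:

import pandas as pd

# declare compact dtypes up front instead of the int64/float64/object defaults
dtypes = {
    'UserID': 'int32',      # smaller integer type
    'Country': 'category',  # hypothetical low-cardinality text column
    'Amount': 'float32',    # hypothetical numeric column
}
df = pd.read_csv('Check1_900.csv', sep='\t', dtype=dtypes)
print(df.memory_usage(deep=True))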

What is the fastest way to iterate over pandas DataFrame?

Vectorization is always the first and best choice. You can convert the data frame to a NumPy array or into dictionary format to speed up the iteration workflow. Iterating through the key-value pairs of dictionaries turns out to be the fastest approach, with around a 280x speedup for 20 million records.
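
A small sketch comparing the three approaches on a synthetic frame; the column names and the multiply-and-sum operation are just placeholders:

import pandas as pd

df = pd.DataFrame({'a': range(100000), 'b': range(100000)})

# slow: Python-level loop over pandas rows
total = sum(row['a'] * row['b'] for _, row in df.iterrows())

# fast: vectorized, the arithmetic runs in NumPy on whole columns
total = (df['a'] * df['b']).sum()

# also fast: iterate over plain Python dicts instead of pandas rows
total = sum(rec['a'] * rec['b'] for rec in df.to_dict('records'))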

How big is too big for a pandas DataFrame?

The size limit for pandas DataFrames is about 100 gigabytes (GB) of memory rather than a set number of cells.


1 Answer

Solution, if you need to create one big DataFrame to process all the data at once (which is possible, but not recommended):

Use concat on all the chunks to build df, because the output of

df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000) 

isn't a DataFrame, but a pandas.io.parsers.TextFileReader - source.

tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(tp)
# <pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
df = pd.concat(tp, ignore_index=True)

I think it is necessary to add the parameter ignore_index to the concat function, to avoid duplicate indices.
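
Once df is a single DataFrame, the operations from the question can run on the whole dataset, for example:

print(df.head())
print(df.describe())
print(df.dtypes)

customer_group3 = df.groupby('UserID')
y3 = customer_group3.size()
print(y3)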

EDIT:

But if you want to work with large data, e.g. aggregating it, it is much better to use dask, because it provides advanced parallelism.
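
A minimal sketch with dask, assuming the same tab-separated file and UserID column as in the question:

import dask.dataframe as dd

# dask reads the csv lazily in partitions and parallelizes the work
df = dd.read_csv('Check1_900.csv', sep='\t')

# same groupby as in the question; compute() triggers the actual computation
y3 = df.groupby('UserID').size().compute()
print(y3.head())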

answered Nov 15 '22 by jezrael