I have a large CSV file, about 600 MB with 11 million rows, and I want to create statistical data like pivots, histograms, graphs, etc. Obviously, trying to just read it normally:
df = pd.read_csv('Check400_900.csv', sep='\t')
doesn't work, so I found iterator and chunksize in a similar post, and I used
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
All good, I can for example print(df.get_chunk(5))
and search the whole file with just
for chunk in df: print(chunk)
My problem is that I don't know how to use things like the ones below on the whole df, not just on one chunk:
plt.plot()
print(df.head())
print(df.describe())
print(df.dtypes)
customer_group3 = df.groupby('UserID')
y3 = customer_group3.size()
I hope my question is not too confusing.
One way to process large files with pandas is to read the entries in chunks of a reasonable size: each chunk is read into memory and processed before the next one is read. The chunksize parameter specifies the size of each chunk, i.e. the number of rows.
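For example, here is a minimal sketch (assuming the same tab-separated file and the 'UserID' column from the question) that builds per-user counts across all chunks without ever loading the whole file:

import pandas as pd

# read the file lazily in chunks of 1,000,000 rows
chunks = pd.read_csv('Check1_900.csv', sep='\t', chunksize=1000000)

# accumulate per-user counts chunk by chunk
counts = pd.Series(dtype='int64')
for chunk in chunks:
    counts = counts.add(chunk.groupby('UserID').size(), fill_value=0)

print(counts.sort_values(ascending=False).head())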
Use efficient data types. The default pandas data types are not the most memory-efficient. This is especially true for text columns with relatively few unique values (commonly referred to as "low-cardinality" data). By using more efficient data types, you can store larger datasets in memory.
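A hedged sketch of the idea (the column names and types here are made up for illustration, adjust them to the real file): declare smaller numeric types and the category dtype directly in read_csv, then check the result.

import pandas as pd

# hypothetical column names and types, just to illustrate the idea
dtypes = {
    'UserID': 'category',   # low-cardinality ID/text column
    'Amount': 'float32',    # half the memory of the default float64
    'Flag': 'int8',         # small integer instead of int64
}
df = pd.read_csv('Check1_900.csv', sep='\t', dtype=dtypes)

# check how much memory the DataFrame actually uses
print(df.memory_usage(deep=True).sum())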
Vectorization is always the first and best choice. If you really must iterate, converting the DataFrame to a NumPy array or to dictionaries can speed up the loop considerably; in one benchmark on about 20 million records, iterating over dictionary records was roughly 280x faster than iterating over DataFrame rows.
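A minimal sketch of that contrast (the column names are assumptions, not from the original file): prefer a single vectorized expression, and if a loop is unavoidable, iterate over plain Python structures rather than over DataFrame rows.

import numpy as np
import pandas as pd

df = pd.DataFrame({'price': np.random.rand(1000),
                   'qty': np.random.randint(1, 10, 1000)})

# vectorized: one operation on whole columns
df['total'] = df['price'] * df['qty']

# if a loop is unavoidable, convert to plain dicts first
records = df.to_dict('records')          # list of {column: value} dicts
totals = [r['price'] * r['qty'] for r in records]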
The longer answer is that there is no set limit on the number of cells in a pandas DataFrame; the practical limit is the amount of memory available, since the entire DataFrame must fit in RAM.
Solution, if you need to create one big DataFrame:
If you need to process all the data at once (which is possible, but not recommended), then use concat to combine all the chunks into one df, because the output of:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
isn't a DataFrame, but a pandas.io.parsers.TextFileReader (source).
tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(tp)
# <pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
df = pd.concat(tp, ignore_index=True)
I think it is necessary to add the parameter ignore_index to the concat function, to avoid duplicate indexes; a tiny illustration follows.
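Without ignore_index each chunk keeps its own 0-based index, so the concatenated result contains repeated index labels (small made-up frames, just to show the effect):

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})          # index 0, 1
b = pd.DataFrame({'x': [3, 4]})          # index 0, 1 again

print(pd.concat([a, b]).index.tolist())                     # [0, 1, 0, 1]
print(pd.concat([a, b], ignore_index=True).index.tolist())  # [0, 1, 2, 3]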
EDIT:
But if you want to work with large data, e.g. aggregating, it is much better to use dask, because it provides advanced parallelism.
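A minimal sketch with dask.dataframe (assuming the same file and the 'UserID' column from the question; dask has to be installed separately):

import dask.dataframe as dd

# lazily partitions the file; nothing is read until .compute()
ddf = dd.read_csv('Check1_900.csv', sep='\t')

# the same pandas-style API, executed in parallel across partitions
y3 = ddf.groupby('UserID').size().compute()
print(y3.head())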