I have a basic script with the following process.
import csv

reader = csv.reader(open('huge_file.csv', 'rb'))
for line in reader:
    process_line(line)
See this related question. I want to send the rows to process_line in batches of 100, to implement batch sharding.
The problem with implementing the related answer is that the csv reader object is not subscriptable and does not support len().
>>> import csv
>>> reader = csv.reader(open('dataimport/tests/financial_sample.csv', 'rb'))
>>> len(reader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type '_csv.reader' has no len()
>>> reader[10:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable
>>> reader[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable
How can I solve this?
Step 1 (using pandas or traditional Python): find the number of rows in the file. Step 2: let the user input the number of lines per file as a range (minimum and maximum) and generate a random number within it; if you want an equal split, pass the same number for both. A rough sketch of these steps follows below.
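Here is a minimal sketch of those two steps, reusing the huge_file.csv from the question; the prompts and variable names are only illustrative:

import csv
import random

# Step 1: count the rows (plain Python; pandas can do the same with
# len(pd.read_csv('huge_file.csv')), at the cost of loading the whole file)
with open('huge_file.csv', 'rb') as f:
    row_count = sum(1 for _ in csv.reader(f))

# Step 2: pick the number of lines per file from a user-supplied range;
# use the same value for min and max to get an equal split
low = int(raw_input('min lines per file: '))
high = int(raw_input('max lines per file: '))
lines_per_file = random.randint(low, high)
print row_count, lines_per_file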
Just make your reader subscriptable by wrapping it into a list. Obviously this will break on really large files (see alternatives in the Updates below):
>>> reader = csv.reader(open('big.csv', 'rb'))
>>> lines = list(reader)
>>> print lines[:100]
...
Further reading: How do you split a list into evenly sized chunks in Python?
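For reference, a minimal sketch of the slicing approach discussed in that question (the function name is just illustrative):

def split_into_chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

# e.g. list(split_into_chunks(range(10), 3)) => [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]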
Update 1 (list version): Another possibility would be to just process each chunk as it arrives, while iterating over the lines:
#!/usr/bin/env python
import csv

reader = csv.reader(open('4956984.csv', 'rb'))
chunk, chunksize = [], 100

def process_chunk(chunk):
    print len(chunk)  # do something useful ...

for i, line in enumerate(reader):
    if (i % chunksize == 0 and i > 0):
        process_chunk(chunk)
        del chunk[:]  # or: chunk = []
    chunk.append(line)

# process the remainder
process_chunk(chunk)
Update 2 (generator version): I haven't benchmarked it, but maybe you can increase performance by using a chunk generator:
#!/usr/bin/env python
import csv

reader = csv.reader(open('4956984.csv', 'rb'))

def gen_chunks(reader, chunksize=100):
    """
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices.
    """
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            del chunk[:]  # or: chunk = []
        chunk.append(line)
    yield chunk

for chunk in gen_chunks(reader):
    print chunk  # process chunk

# test gen_chunks on some dummy sequence:
for chunk in gen_chunks(range(10), chunksize=3):
    print chunk  # process chunk

# => yields
# [0, 1, 2]
# [3, 4, 5]
# [6, 7, 8]
# [9]
There is a minor gotcha, as @totalhack points out:
Be aware that this yields the same object over and over with different contents. This works fine if you plan on doing everything you need to with the chunk between each iteration.
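If you need to keep the chunks around after the loop has advanced, a minimal variation (not benchmarked) is to rebind chunk instead of clearing it in place, so every yield hands out a fresh list:

def gen_chunks(reader, chunksize=100):
    """Like above, but each yielded chunk is a new list object."""
    chunk = []
    for i, line in enumerate(reader):
        if i % chunksize == 0 and i > 0:
            yield chunk
            chunk = []  # new list instead of del chunk[:]
        chunk.append(line)
    if chunk:
        yield chunk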
We can use the pandas module to handle these big CSV files.
import pandas as pd

# read the file in chunks of 1000 rows, then concatenate them into one DataFrame
df = pd.DataFrame()
temp = pd.read_csv('BIG_File.csv', iterator=True, chunksize=1000)
df = pd.concat(temp, ignore_index=True)
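Concatenating the chunks still ends up with the whole file in memory, so if the goal is batch processing, a possible sketch (using a hypothetical process_chunk handler) is to work on each chunk as it is read:

import pandas as pd

# read the CSV 100 rows at a time and hand each batch to your own handler
for chunk in pd.read_csv('BIG_File.csv', chunksize=100):
    process_chunk(chunk)  # e.g. send the batch off for sharded processing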