How to read specific lines of a large csv file

Tags:

python

csv

large-files

I am trying to read some specific rows of a large CSV file, and I don't want to load the whole file into memory. The indices of the specific rows are given in a list L = [2, 5, 15, 98, ...] and my CSV file looks like this:

Col 1, Col 2, Col3
row11, row12, row13
row21, row22, row23
row31, row32, row33
...

Using the ideas mentioned here, I use the following code to read the rows:

with open('~/file.csv') as f:
    r = csv.DictReader(f) # I need to read it as a dictionary for my purpose

    for i in L:
        for row in enumerate(r):
            print row[i]

I immediately get the following error:

IndexError                                Traceback (most recent call last)
<ipython-input-25-78951a0d4937> in <module>()
      6     for i in L:
      7         for row in enumerate(r):
----> 8             print row[i]
IndexError: tuple index out of range

Question 1. It seems like my use of the for loops here is obviously wrong. Any ideas on how to fix this?

On the other hand, the following gets the job done, but it's too slow:

def read_csv_line(line_number):
    with open("~/file.csv") as f:
        r = csv.DictReader(f)
        for i, line in enumerate(r):
            if i == (line_number - 2):
                return line
    return None

for i in L:
    print read_csv_line(i)

Question 2. Any idea on how to improve this basic method of going through the file until I reach row i and then printing it?

asked Apr 10 '15 17:04

Keivan

2 Answers

Assuming L is a list containing the line numbers you want, you could do:

import csv
import os

with open(os.path.expanduser("~/file.csv")) as f:  # open() does not expand "~" by itself
    r = csv.DictReader(f)
    for i, line in enumerate(r):
        if i in L:    # or (i+2) in L: from your second example
            print(line)

That way:

- you read the file only once
- you do not load the whole file into memory
- you only get the lines you are interested in

The only caveat is that you read the whole file even if L = [3].
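
If that caveat matters, one possible refinement is to stop as soon as the largest requested index has been passed. Below is a minimal Python 3 sketch of that idea. The function name `read_rows` and the in-memory sample are hypothetical stand-ins, not part of the question, and indices are taken to be 0-based positions among the data rows (the header is not counted).

```python
import csv
import io

def read_rows(f, wanted):
    """Return the rows of CSV file object f whose 0-based data-row index is in wanted."""
    wanted = set(wanted)          # set membership tests are O(1)
    if not wanted:
        return []
    last = max(wanted)
    rows = []
    for i, row in enumerate(csv.DictReader(f)):
        if i in wanted:
            rows.append(row)
        if i >= last:             # nothing left to find, so stop reading
            break
    return rows

# Hypothetical sample standing in for the real file
sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")
selected = read_rows(sample, [0, 2])
print(selected)
```

With a real file you would pass the open file object instead of the `io.StringIO` sample.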

answered Nov 04 '22 22:11

Serge Ballesta

A file doesn't have "lines" or "rows". What you consider a "line" is "what is found between two newline characters". As such, you cannot read the nth line without reading the lines before it, because otherwise you would have no way to count the newline characters.

Answer 1: if you consider your example, but with L=[9], unrolling your loops would give:

i=9
row = (0, {'Col 2': 'row12', 'Col 3': 'row13', 'Col 1': 'row11'})

As you can see, row is a tuple with two members, so row[i] means row[9], hence the IndexError.
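
To see this concretely, here is a small Python 3 sketch. The in-memory sample data is made up for illustration. `enumerate` wraps each row in an `(index, row)` tuple, so unpacking it into two names gives direct access to both parts:

```python
import csv
import io

# Hypothetical two-row sample standing in for the real file
sample = io.StringIO("Col 1,Col 2,Col 3\nrow11,row12,row13\nrow21,row22,row23\n")
reader = csv.DictReader(sample)

pairs = list(enumerate(reader))
first = pairs[0]

print(len(first))      # every item produced by enumerate() is a 2-tuple
index, row = first     # unpack instead of indexing the tuple with a line number
print(index, row["Col 2"])
```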

Answer 2: This is very slow because you re-read the file from the start up to the requested line on every call. In your example, you read the first 2 lines, then the first 5, then the first 15, then the first 98, etc. So you've read the first 5 lines 3 times. You could create a generator that only returns the lines you want (beware, line numbers would be 0-indexed):

def read_my_lines(csv_reader, lines_list):
    for line_number, row in enumerate(csv_reader):
        if line_number in lines_list:
            yield line_number, row

So when you want to process the lines, you would do:

import csv
import os

L = [2, 5, 15, 98, ...]
with open(os.path.expanduser('~/file.csv')) as f:
    r = csv.DictReader(f)
    for line_number, line in read_my_lines(r, L):
        do_something_with_line(line)

* Edit *

This could further be improved to stop reading the file when you've read all the lines you wanted:

def read_my_lines(csv_reader, lines_list):
    # make sure every line number shows up only once:
    lines_set = set(lines_list)
    for line_number, row in enumerate(csv_reader):
        if line_number in lines_set:
            yield line_number, row
            lines_set.remove(line_number)
            # Stop when the set is empty. Use return, not raise StopIteration:
            # since Python 3.7 (PEP 479) the latter becomes a RuntimeError.
            if not lines_set:
                return
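
As a rough check that the early stop works, the following self-contained Python 3 sketch runs the same idea on generated data and counts how many rows are actually pulled from the reader. The `counting` helper and the sample data are hypothetical and exist only for this illustration. The generator here ends with a plain `return`.

```python
import csv
import io

def read_my_lines(csv_reader, lines_list):
    lines_set = set(lines_list)
    if not lines_set:
        return
    for line_number, row in enumerate(csv_reader):
        if line_number in lines_set:
            yield line_number, row
            lines_set.remove(line_number)
            if not lines_set:
                return   # ends the generator cleanly

def counting(iterable, counter):
    """Hypothetical helper: pass items through and count how many were pulled."""
    for item in iterable:
        counter[0] += 1
        yield item

# Hypothetical sample: a header plus 100 data rows
text = "n,square\n" + "".join(f"{i},{i * i}\n" for i in range(100))
pulled = [0]
reader = counting(csv.DictReader(io.StringIO(text)), pulled)

found = [(n, row["square"]) for n, row in read_my_lines(reader, [2, 5, 15])]
print(found)
print(pulled[0])   # rows 0 through 15 were read, the remaining 84 were not
```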

answered Nov 05 '22 00:11

vlad
