
Get the last 10000 lines of a csv file

In pandas, I can just use pandas.io.parsers.read_csv("file.csv", nrows=10000) to get the first 10000 lines of a csv file.

But my csv file is huge, and the last lines are more relevant than the first ones, so I would like to read the last 10000 lines instead. This is not easy even if I know the length of the file: if I skip the first 990000 lines of a 1000000-line csv file using pandas.io.parsers.read_csv("file.csv", nrows=10000, skiprows=990000), the first line, which contains the file header, is skipped as well. (header=0 is applied after skiprows, so it does not help either.)

How do I get the last 10000 lines from a csv file with a header in line 0, preferably without knowing the length of the file in lines?

Anaphory asked Mar 14 '16 04:03


1 Answer

You could first count the lines in the file:

with open('file.csv') as f:
    size = sum(1 for l in f)

Then use skiprows with a range that starts at 1, so the header row (line 0) is kept:

df = pd.read_csv('file.csv', skiprows=range(1, size - 10000))
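A variant of the same idea, in case the range object is awkward to build: newer pandas versions (0.19+) also accept a callable for skiprows, which receives each row index and returns True to skip that row, so row 0 (the header) is never dropped. A minimal sketch, wrapped in a helper whose name (tail_via_skiprows) is my own:

```python
import pandas as pd

def tail_via_skiprows(path, n):
    """Read the last n data rows of a CSV, keeping the header row."""
    with open(path) as f:
        size = sum(1 for _ in f)  # total line count, header included
    # Skip every row index in [1, size - n); row 0 (the header) and the
    # final n data rows survive.
    return pd.read_csv(path, skiprows=lambda i: 1 <= i < size - n)
```

This still makes two passes over the file, exactly like the range-based version; it only trades the explicit range for a predicate.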

EDIT

As @ivan_pozdeev mentioned, that solution requires going through the file twice. I also tried reading the whole file with pandas and then using the tail method, but that turned out slower than the suggested approach.
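Since the drawback is the double pass, a single-pass alternative is to keep only the final lines in a bounded collections.deque and hand just those to pandas. A minimal sketch (the helper name tail_csv is my own, not from the answer):

```python
from collections import deque
from io import StringIO

import pandas as pd

def tail_csv(path, n):
    """Read the last n data rows of a CSV (plus its header) in one pass."""
    with open(path) as f:
        header = f.readline()      # keep the header line aside
        last = deque(f, maxlen=n)  # the deque retains only the final n lines
    # Re-attach the header and let pandas parse the small in-memory buffer.
    return pd.read_csv(StringIO(header + ''.join(last)))
```

A deque with maxlen discards old lines as new ones arrive, so memory stays bounded by n lines while the file is streamed exactly once.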

Example dataframe:

pd.DataFrame(np.random.randn(1000000,3), columns=list('abc')).to_csv('file.csv')

Timing

def f1():
    size = sum(1 for l in open('file.csv'))
    return pd.read_csv('file.csv', skiprows=range(1, size - 10000))

def f2():
    return pd.read_csv('file.csv').tail(10000)

In [10]: %timeit f1()
1 loop, best of 3: 1.8 s per loop

In [11]: %timeit f2()
1 loop, best of 3: 1.94 s per loop
Anton Protopopov answered Sep 21 '22 12:09