
Get the last 10000 lines of a csv file

In pandas, I can just use pandas.io.parsers.read_csv("file.csv", nrows=10000) to get the first 10000 lines of a csv file.

But my csv file is huge, and the last lines are more relevant than the first ones, so I would like to read the last 10000 lines instead. This is not easy even if I know the length of the file: if I skip the first 990000 lines of a 1000000-line csv file using pandas.io.parsers.read_csv("file.csv", nrows=10000, skiprows=990000), the first line, which contains the file header, is skipped as well. (header=0 is applied after skiprows, so it does not help either.)

How do I get the last 10000 lines from a csv file with a header in line 0, preferably without knowing the length of the file in lines?

Anaphory asked Mar 14 '16 04:03


1 Answer

You could first count the lines in the file:

with open('file.csv') as f:
    size = sum(1 for l in f)

Then use skiprows with a range that starts at 1, so the header row (line 0) is kept:

df = pd.read_csv('file.csv', skiprows=range(1, size - 10000))
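A variant of the same idea, in case the range object is awkward to build: newer pandas versions (0.19+) also accept a callable for skiprows, which receives each row index and returns True to skip that row, so row 0 (the header) is never dropped. A minimal sketch, wrapped in a helper whose name (tail_via_skiprows) is my own:

```python
import pandas as pd

def tail_via_skiprows(path, n):
    """Read the last n data rows of a CSV, keeping the header row."""
    with open(path) as f:
        size = sum(1 for _ in f)  # total line count, header included
    # Skip every row index in [1, size - n); row 0 (the header) and the
    # final n data rows survive.
    return pd.read_csv(path, skiprows=lambda i: 1 <= i < size - n)
```

This still makes two passes over the file, exactly like the range-based version; it only trades the explicit range for a predicate.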

EDIT

As @ivan_pozdeev mentioned, that solution requires going through the file twice. I also tried reading the whole file with pandas and then using the tail method, but that turned out slower than the suggested approach.
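Since the drawback is the double pass, a single-pass alternative is to keep only the final lines in a bounded collections.deque and hand just those to pandas. A minimal sketch (the helper name tail_csv is my own, not from the answer):

```python
from collections import deque
from io import StringIO

import pandas as pd

def tail_csv(path, n):
    """Read the last n data rows of a CSV (plus its header) in one pass."""
    with open(path) as f:
        header = f.readline()      # keep the header line aside
        last = deque(f, maxlen=n)  # the deque retains only the final n lines
    # Re-attach the header and let pandas parse the small in-memory buffer.
    return pd.read_csv(StringIO(header + ''.join(last)))
```

A deque with maxlen discards old lines as new ones arrive, so memory stays bounded by n lines while the file is streamed exactly once.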

Example dataframe:

pd.DataFrame(np.random.randn(1000000,3), columns=list('abc')).to_csv('file.csv')

Timing

def f1():
    size = sum(1 for l in open('file.csv'))
    return pd.read_csv('file.csv', skiprows=range(1, size - 10000))

def f2():
    return pd.read_csv('file.csv').tail(10000)

In [10]: %timeit f1()
1 loop, best of 3: 1.8 s per loop

In [11]: %timeit f2()
1 loop, best of 3: 1.94 s per loop
Anton Protopopov answered Sep 21 '22 12:09