In pandas, I can just use pandas.read_csv("file.csv", nrows=10000) to get the first 10000 lines of a csv file.
But because my csv file is huge, and the last lines are more relevant than the first ones, I would like to read the last 10000 lines instead. This is not easy even if I know the length of the file: if I skip the first 990000 lines of a 1000000-line csv file using pandas.read_csv("file.csv", nrows=10000, skiprows=990000), the first line, which contains the file header, is skipped as well. (header=0 is measured after skiprows is applied, so it does not help either.)
How do I get the last 10000 lines from a csv file with a header in line 0, preferably without knowing the length of the file in lines?
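For reference, before the accepted approach below: one single-pass sketch (not from the original thread, just an illustration) is to keep the header line plus a bounded buffer of the most recent lines with collections.deque, then hand the reassembled text to pandas. This assumes no quoted fields contain embedded newlines:

import io
from collections import deque

import pandas as pd

with open('file.csv') as f:
    header = f.readline()                 # line 0: the column names
    last_lines = deque(f, maxlen=10000)   # keeps only the final 10000 lines

# re-assemble header + tail and parse it as a normal csv
df = pd.read_csv(io.StringIO(header + ''.join(last_lines)))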
You could first calculate the size of the file with:

size = sum(1 for l in open('file.csv'))

Then use skiprows with range:

df = pd.read_csv('file.csv', skiprows=range(1, size - 10000))

Because the range starts at 1, row 0 (the header) is never skipped, so the column names are preserved.
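As a quick sanity check (a hypothetical snippet, assuming 'file.csv' is the million-row example file generated below):

print(len(df))           # 10000 -- only the last 10000 data rows were kept
print(list(df.columns))  # includes 'a', 'b', 'c' from the header line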
EDIT

As @ivan_pozdeev mentioned, with that solution you need to go through the file twice. I tried reading the whole file with pandas and then using the tail method, but that is slower than the approach suggested above.
Example dataframe:

import numpy as np
import pandas as pd

pd.DataFrame(np.random.randn(1000000, 3), columns=list('abc')).to_csv('file.csv')
Timing
def f1():
    size = sum(1 for l in open('file.csv'))
    return pd.read_csv('file.csv', skiprows=range(1, size - 10000))

def f2():
    return pd.read_csv('file.csv').tail(10000)
In [10]: %timeit f1()
1 loop, best of 3: 1.8 s per loop
In [11]: %timeit f2()
1 loop, best of 3: 1.94 s per loop
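Note that even though the timings are close, the two approaches differ in peak memory: f1 only ever materializes the header plus the last 10000 rows in the resulting frame, while f2 parses the full million-row frame before discarding everything but the tail.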