I am looking for a way to read just the header row of a large number of large CSV files.
Using Pandas, I have this method available, for each csv file:
>>> df = pd.read_csv(PATH_TO_CSV)
>>> df.columns
I could do this with just the csv module:
>>> reader = csv.DictReader(open(PATH_TO_CSV))
>>> reader.fieldnames
The problem with these is that each CSV file is 500MB+ in size, and it seems to be a gigantic waste to read in the entire file of each just to pull the header lines.
My end goal of all of this is to pull out unique column names. I can do that once I have a list of column headers that are in each of these files.
How can I extract only the header row of a CSV file, quickly?
To read only specific columns, use the pandas.read_csv() method: pass the CSV file as the first argument and a list of the wanted column names in the usecols keyword. It returns the data of only those columns.
To read a CSV file without a header, set the header parameter to None in read_csv().
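A minimal sketch of those two parameters, assuming pandas is installed. The in-memory `csv_text` is a made-up stand-in for a real file path and is not from the original post:

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a real CSV file.
csv_text = "a,b,c\n1,2,3\n4,5,6\n"

# usecols keeps only the named columns.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["a", "c"])
print(subset.columns.tolist())

# header=None treats the first line as ordinary data,
# so the columns are numbered 0, 1, 2 and the header text lands in row 0.
raw = pd.read_csv(io.StringIO(csv_text), header=None)
print(raw.iloc[0].tolist())
```

Neither option limits how many rows are read, so on their own they do not avoid loading a large file.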
Expanding on the answer given by Jeff: it is now possible to use pandas without actually reading any rows.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.DataFrame(np.random.randn(10, 4), columns=list('abcd')).to_csv('test.csv', mode='w')
In [4]: pd.read_csv('test.csv', index_col=0, nrows=0).columns.tolist()
Out[4]: ['a', 'b', 'c', 'd']
pandas can have the advantage that it deals more gracefully with CSV encodings.
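For the stated end goal (unique column names across many files), here is a rough sketch using only the standard csv module. The helper names `read_header` and `unique_columns` are made up for illustration, and the sketch assumes the header is the first row of each file and that the files are UTF-8 unless another encoding is passed. Because the csv reader is lazy, only the first row is parsed:

```python
import csv


def read_header(path, encoding="utf-8"):
    """Return the first row of a CSV file without parsing the rest."""
    with open(path, newline="", encoding=encoding) as f:
        # next() pulls a single row; an empty file yields an empty list.
        return next(csv.reader(f), [])


def unique_columns(paths):
    """Collect the distinct column names found across several CSV files."""
    seen = set()
    for path in paths:
        seen.update(read_header(path))
    return sorted(seen)
```

Calling `unique_columns(list_of_paths)` returns a sorted list of names; swap `read_header` for the `nrows=0` pandas call above if pandas' encoding handling is preferred.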