I've got some junk at the start of my CSV file that prevents me from selecting the first column of my dataframe by name.
Example:
In[1]: df = pd.read_csv('file:inputdata.csv', usecols=[0], nrows=1)
In[2]: df
Out[2]:
TAB
0 10-LV_Non
In[3]: df['TAB']
Out[3]: <snip> KeyError: 'TAB'
I found the junk by reading the file with open():
In[4]: with open('inputdata.csv', 'rb') as f:
print(f.read(7))
Out[4]: b'\xef\xbb\xbfTAB,'
EDIT: '\xef\xbb\xbf' is three bytes of junk (the UTF-8 byte-order mark, or BOM). 'TAB' is the name of the first column.
Is there a way to make pandas.read_csv() ignore junk like this (if present) at the start of the CSV file?
NB The csv files are exported from a proprietary system, so I can't control their format.
UPDATE: Here's my solution, based on Mike Müller's answer:
import re
import pandas as pd

with open('inputdata.csv', 'r') as f:
    # Skip past any characters that aren't text
    while re.match('[a-zA-Z0-9_]', f.read(1)) is None:
        pass
    # Seek back one character
    f.seek(f.tell() - 1)
    # Read the file
    df = pd.read_csv(f, usecols=['TAB'])
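Since the three junk bytes are the UTF-8 byte-order mark, another option is to let pandas strip them via the encoding parameter. A minimal sketch, using an in-memory buffer with a made-up second column in place of the real file:

```python
import io
import pandas as pd

# A BOM-prefixed payload mimicking the bytes seen in the question;
# the 'OTHER' column is made up for illustration.
raw = b'\xef\xbb\xbfTAB,OTHER\n10-LV_Non,1\n'

# encoding='utf-8-sig' strips a leading UTF-8 BOM if one is present,
# and behaves like plain 'utf-8' if there is none.
df = pd.read_csv(io.BytesIO(raw), encoding='utf-8-sig')
print(df.columns[0])  # → TAB
```

Because 'utf-8-sig' is harmless on BOM-free input, it is safe to use unconditionally on files from the proprietary exporter.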
It's unclear to me what exactly the format of the "junk" is, but there are a number of options to use.
pandas.read_csv takes a filepath_or_buffer
filepath_or_buffer : string or file handle / StringIO
It follows that if you open a file object, read past the junk, and then pass the file object to read_csv, it should work.
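A minimal sketch of that idea, using an in-memory buffer (an assumed reconstruction of the file's contents) in place of the real file handle:

```python
import io
import pandas as pd

# BytesIO stands in for open('inputdata.csv', 'rb'); the payload is
# an assumed reconstruction of the file from the question.
f = io.BytesIO(b'\xef\xbb\xbfTAB,OTHER\n10-LV_Non,1\n')
f.read(3)            # consume the three junk bytes
df = pd.read_csv(f)  # read_csv picks up at the current position
print(df.columns[0])  # → TAB
```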
The skiprows argument skips rows:
skiprows : list-like or integer, default None
Thus, if the junk occupies its own row(s), you can skip them.
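A sketch of that case, with hypothetical data: skiprows only helps when the junk sits on its own lines before the header, not when it is glued to the header row as in the question.

```python
import io
import pandas as pd

# Hypothetical variant where the junk occupies two whole lines
# *before* the header row.
buf = io.StringIO('junk line 1\njunk line 2\nTAB,OTHER\n10-LV_Non,1\n')

# Skip the two junk lines; the header is then read normally.
df = pd.read_csv(buf, skiprows=2)
print(df.columns[0])  # → TAB
```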
Something like this could work:
import pandas as pd

with open('inputdata.csv', 'rb') as f:
    # Compare only the three BOM bytes, so the 'TAB,' header survives
    if f.read(3) != b'\xef\xbb\xbf':
        f.seek(0)
    df = pd.read_csv(f, usecols=[0], nrows=1)
Just read the first three bytes. If they are good, i.e. not equal to the bytes you don't want, go back to the beginning of the file with seek(0); otherwise start reading at position 3, skipping the offending bytes while leaving the header row intact. (Comparing all seven bytes b'\xef\xbb\xbfTAB,' and leaving the file at position 7 would skip the column name itself, so only the BOM is checked here.)