I've got some junk at the start of my CSV file that prevents me from selecting the first column of my dataframe by name.
Example:
In[1]: df = pd.read_csv('file:inputdata.csv', usecols=[0], nrows=1)
In[2]: df
Out[2]:
TAB
0 10-LV_Non
In[3]: df['TAB']
Out[3]: <snip> KeyError: 'TAB'
I found the junk by reading the file with open():
In[4]: with open('inputdata.csv', 'rb') as f:
print(f.read(7))
Out[4]: b'\xef\xbb\xbfTAB,'
EDIT: '\xef\xbb\xbf' is three bytes of junk (the UTF-8 byte-order mark, or BOM). 'TAB' is the name of the first column.
Is there a way to make pandas.read_csv() ignore junk like this (if present) at the start of the CSV file?
NB The csv files are exported from a proprietary system, so I can't control their format.
UPDATE: Here's my solution, based on Mike Müller's answer:
import re
import pandas as pd

with open('inputdata.csv', 'r') as f:
    # Skip past any characters that aren't text
    while re.match('[a-zA-Z0-9_]', f.read(1)) is None:
        pass
    # Seek back one character
    f.seek(f.tell() - 1)
    # Read the file
    df = pd.read_csv(f, usecols=['TAB'])
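Since the three junk bytes are the UTF-8 byte-order mark, another option is to let pandas strip them via the encoding parameter. A minimal sketch, using an in-memory buffer with a made-up second column in place of the real file:

```python
import io
import pandas as pd

# A BOM-prefixed payload mimicking the bytes seen in the question;
# the 'OTHER' column is made up for illustration.
raw = b'\xef\xbb\xbfTAB,OTHER\n10-LV_Non,1\n'

# encoding='utf-8-sig' strips a leading UTF-8 BOM if one is present,
# and behaves like plain 'utf-8' if there is none.
df = pd.read_csv(io.BytesIO(raw), encoding='utf-8-sig')
print(df.columns[0])  # → TAB
```

Because 'utf-8-sig' is harmless on BOM-free input, it is safe to use unconditionally on files from the proprietary exporter.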
It's unclear to me what exactly the format of the "junk" is, but there are a number of options to use.
pandas.read_csv takes a filepath_or_buffer
filepath_or_buffer : string or file handle / StringIO
It follows that if you open a file object, read past the junk, and then pass the file object to read_csv, it should work.
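A minimal sketch of that idea, using an in-memory buffer (an assumed reconstruction of the file's contents) in place of the real file handle:

```python
import io
import pandas as pd

# BytesIO stands in for open('inputdata.csv', 'rb'); the payload is
# an assumed reconstruction of the file from the question.
f = io.BytesIO(b'\xef\xbb\xbfTAB,OTHER\n10-LV_Non,1\n')
f.read(3)            # consume the three junk bytes
df = pd.read_csv(f)  # read_csv picks up at the current position
print(df.columns[0])  # → TAB
```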
The skiprows argument skips rows:
skiprows : list-like or integer, default None
Thus, if the junk occupies its own row(s), you can skip them.
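A sketch of that case, with hypothetical data: skiprows only helps when the junk sits on its own lines before the header, not when it is glued to the header row as in the question.

```python
import io
import pandas as pd

# Hypothetical variant where the junk occupies two whole lines
# *before* the header row.
buf = io.StringIO('junk line 1\njunk line 2\nTAB,OTHER\n10-LV_Non,1\n')

# Skip the two junk lines; the header is then read normally.
df = pd.read_csv(buf, skiprows=2)
print(df.columns[0])  # → TAB
```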
Something like this could work:
import pandas as pd

with open('inputdata.csv', 'rb') as f:
    # Compare only the three BOM bytes, so the 'TAB,' header survives
    if f.read(3) != b'\xef\xbb\xbf':
        f.seek(0)
    df = pd.read_csv(f, usecols=[0], nrows=1)
Just read the first three bytes. If they are good, i.e. not equal to the bytes you don't want, go back to the beginning of the file with seek(0); otherwise start reading at position 3, skipping the offending bytes while leaving the header row intact. (Comparing all seven bytes b'\xef\xbb\xbfTAB,' and leaving the file at position 7 would skip the column name itself, so only the BOM is checked here.)