
Pandas read_csv without knowing whether header is present

Tags: python, pandas, csv

I have an input file with known columns, let's say two columns Name and Sex. Sometimes it has the header line Name,Sex, and sometimes it doesn't:

1.csv:

Name,Sex
John,M
Leslie,F

2.csv:

John,M
Leslie,F

Knowing the identity of the columns beforehand, is there a nice way to handle both cases with the same read_csv command? Basically, I want to specify names=['Name', 'Sex'] and then have it infer header=0 only when the header is there. Best I can come up with is:

  1) Read the first line of the file before calling read_csv, and set the parameters accordingly.

  2) Just do df = pd.read_csv(input_file, names=['Name', 'Sex']), then check whether the zeroth row is identical to the header, and if so drop it (and then maybe renumber the rows).
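Option 1 could be sketched roughly as follows. The helper name `read_csv_optional_header` is made up for illustration and is not part of pandas; the sketch assumes that a header, when present, matches the expected names exactly (apart from surrounding whitespace). In-memory buffers stand in for 1.csv and 2.csv.

```python
import csv
import io

import pandas as pd


def read_csv_optional_header(source, names):
    """Read a CSV whose header row may or may not be present.

    `source` is a file path or a seekable text buffer.
    Hypothetical helper, not a pandas API.
    """
    if hasattr(source, "read"):
        first_line = source.readline()
        source.seek(0)  # rewind so read_csv sees the whole buffer
    else:
        with open(source, newline="") as f:
            first_line = f.readline()

    # Parse the first line with the csv module so quoted fields are handled.
    first_row = next(csv.reader([first_line]), [])
    has_header = [cell.strip() for cell in first_row] == list(names)

    # With names given, header=0 tells pandas to discard the file's own header row.
    return pd.read_csv(source, names=names, header=0 if has_header else None)


# Stand-ins for 1.csv and 2.csv from the question.
with_header = io.StringIO("Name,Sex\nJohn,M\nLeslie,F\n")
without_header = io.StringIO("John,M\nLeslie,F\n")

df1 = read_csv_optional_header(with_header, ["Name", "Sex"])
df2 = read_csv_optional_header(without_header, ["Name", "Sex"])
```

Both calls should yield the same two-row frame, with no renumbering needed because the header row is never loaded as data.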

But this doesn't seem like that unusual of a use case to me. Is there a built-in way of doing this with read_csv that I haven't thought of?

leekaiinthesky asked Jul 13 '16 19:07


2 Answers

Using a newer feature, selection by callable:

import numpy as np
import pandas as pd

cols = ['Name', 'Sex']

df = (pd.read_csv(filename, header=None, names=cols)
      [lambda x: np.ones(len(x)).astype(bool)
                 if (x.iloc[0] != cols).all()
                 else np.concatenate([[False], np.ones(len(x)-1).astype(bool)])]
)

Using the .query() method:

df = (pd.read_csv(filename, header=None, names=cols)
        .query('Name != "Name" and Sex != "Sex"'))

I'm not sure that this is the most elegant way, but this should work as well:

df = pd.read_csv(filename, header=None, names=cols)

if (df.iloc[0] == cols).all():
    df = df[1:].reset_index(drop=True)
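
One way to compare the variants is to run them on the two sample files from the question. The sketch below uses in-memory buffers in place of real files, and the function names are made up for illustration. It also shows a difference between the variants: the .query() version keeps the original row labels, so the index starts at 1 when a header row was dropped, and it filters every row rather than only the first.

```python
import io

import pandas as pd

cols = ["Name", "Sex"]

# Stand-ins for 1.csv (with header) and 2.csv (without header).
samples = {
    "1.csv": "Name,Sex\nJohn,M\nLeslie,F\n",
    "2.csv": "John,M\nLeslie,F\n",
}


def drop_header_if_present(text):
    """Read with header=None, then drop row 0 if it equals the expected names."""
    df = pd.read_csv(io.StringIO(text), header=None, names=cols)
    if (df.iloc[0] == cols).all():
        df = df[1:].reset_index(drop=True)
    return df


def drop_header_with_query(text):
    """Same idea via .query(); row labels are left as they were."""
    df = pd.read_csv(io.StringIO(text), header=None, names=cols)
    return df.query('Name != "Name" and Sex != "Sex"')


results = {name: drop_header_if_present(text) for name, text in samples.items()}
queried = {name: drop_header_with_query(text) for name, text in samples.items()}
```

Appending .reset_index(drop=True) to the .query() result should make the two variants produce identical frames for these samples.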

MaxU - stop WAR against UA answered Oct 18 '22 23:10


I've come up with a way of detecting the header without prior knowledge of its names (this assumes df was read with header=None, so that a header line, if present, ends up in row 0):

if any(df.iloc[0].apply(lambda x: isinstance(x, str))):
    df = df[1:].reset_index(drop=True)

With a slight change, it can also replace the current column names with the detected header:

if any(df.iloc[0].apply(lambda x: isinstance(x, str))):
    df = df[1:].reset_index(drop=True).rename(columns=df.iloc[0])

This would allow easily selecting the desired behavior:

update_header = True

if any(df.iloc[0].apply(lambda x: isinstance(x, str))):
    new_header = df.iloc[0]

    df = df[1:].reset_index(drop=True)

    if update_header:
        df.rename(columns=new_header, inplace=True)

Pros:

  • Doesn't require prior knowledge of the header's names.
  • Can be used to update the header automatically if an existing one is detected.

Cons:

  • Won't work well if the data itself contains strings. Replacing any() with all(), so that every element of the first row must be a string, might help, unless the data also contains rows made up entirely of strings.
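
The all() variant mentioned in the cons might look like the sketch below. The helper names and the Name/Age sample data are assumptions made for illustration; the numeric Age column is there so that a data row is not mistaken for a header.

```python
import io

import pandas as pd


def first_row_is_all_strings(df):
    """Heuristic: treat row 0 as a header only if every cell is a str."""
    return all(isinstance(value, str) for value in df.iloc[0])


def promote_header_if_detected(df):
    """Hypothetical helper: use row 0 as column names when it looks like a header."""
    if not first_row_is_all_strings(df):
        return df
    header = df.iloc[0]
    return df[1:].reset_index(drop=True).rename(columns=header)


# Illustrative data with one numeric column, read without assuming a header.
with_header = pd.read_csv(io.StringIO("Name,Age\nJohn,30\nLeslie,25\n"), header=None)
without_header = pd.read_csv(io.StringIO("John,30\nLeslie,25\n"), header=None)

df1 = promote_header_if_detected(with_header)     # row 0 becomes the column names
df2 = promote_header_if_detected(without_header)  # returned unchanged

# Age was parsed together with the header text, so convert it explicitly.
df1["Age"] = pd.to_numeric(df1["Age"])
```

With any() instead of all(), the header-less frame would lose its first row, because "John" is a string. Note also that a column read together with its header text is not parsed as numeric, hence the explicit pd.to_numeric call.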

Micael Jarniac answered Oct 18 '22 21:10