
Automatically determine header row when reading csv in pandas

Tags: python, pandas, csv

I am trying to collect data from several .csv files that share the same column names. However, in some of the files the header is located in a different row.

Is there a way to determine the header row dynamically, based on the first row that contains the "most" values (the actual header names)?

I tried the following:

def process_file(file, path, col_source, col_target):
    global df_master
    print(file)
    df = pd.read_csv(path + file, encoding = "ISO-8859-1", header=None)
    df = df.dropna(thresh=2) ## Drop the rows that contain less than 2 non-NaN values. E.g. metadata
    df.columns = df.iloc[0,:].values
    df = df.drop(df.index[0])

However, when using pandas.read_csv(), it seems that the very first line determines the number of columns of the dataframe, as I receive the following error message:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 162

As you can see, in this case the header row is located in row 4. When adding error_bad_lines=False to read_csv, only the metadata is read into the dataframe.

The files can have either the structure of:

a "Normal" File:

row1    col1   col2    col3    col4   col5   
row2    val1   val1    val1    val1   val1
row3    val2   val2    val2    val2   val2   
row4

or a structure with metadata before the header:

row1   metadata1    
row2   metadata2
row3   col1   col2    col3    col4   col5
row4   val1   val1    val1    val1   val1

Any help much appreciated!

asked Feb 27 '20 13:02 by Maeaex1




3 Answers

IMHO the simplest way is to forget pandas for a while:

  • you open the file as a text file for reading
  • you start parsing it line by line, guessing whether the line is
    • metadata header
    • the true header line
    • data lines

A simple way is to concatenate all the lines, starting from the true header line, into a single string (let us call it buffer), and then use pd.read_csv(io.StringIO(buffer), ...).
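
The steps above could be sketched roughly as follows. This is only an illustration, not necessarily what the answer had in mind: the function name text_from_header and the min_fields threshold are hypothetical, and the sketch assumes that metadata lines contain fewer separated values than the header does.

```python
def text_from_header(lines, sep=",", min_fields=2):
    """Return the text starting at the first line that looks like a header.

    Assumption (not from the original answer): metadata lines hold fewer
    than `min_fields` values, so the first line with at least that many
    values is treated as the true header line.
    """
    lines = list(lines)
    for i, line in enumerate(lines):
        if len(line.rstrip("\r\n").split(sep)) >= min_fields:
            # Concatenate the header line and everything after it
            return "".join(lines[i:])
    raise ValueError(f"no line with at least {min_fields} fields found")


# Usage sketch (assumes pandas is installed; "data.csv" is a placeholder):
#   import io
#   import pandas as pd
#   with open("data.csv", encoding="ISO-8859-1") as f:
#       buffer = text_from_header(f)
#   df = pd.read_csv(io.StringIO(buffer))
```

Splitting on the separator is a crude field count that ignores quoting, so files with quoted separators in their metadata would need the csv module instead.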

answered Sep 29 '22 21:09 by Serge Ballesta


A bit dirty, but this works. It tries to read the file while skipping an increasing number of top rows, from 0 up to the length of the file, and returns the first result that pandas manages to parse as a CSV. Adapt custom_csv to your needs.

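import pandas as pd

def file_len(fname):
    # Count the lines in the file (0 for an empty file)
    with open(fname) as f:
        return sum(1 for _ in f)

def custom_csv(fname):
    # Try skipping 0, 1, 2, ... top rows until pandas can parse the rest
    for i in range(file_len(fname)):
        try:
            return pd.read_csv(fname, skiprows=i)
        except Exception:
            print(i)  # parsing failed when skipping i rows; try the next offset
    return None  # no offset produced a parseable CSV

print(custom_csv('pollution.csv'))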

answered Sep 29 '22 21:09 by Jano


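A better way is to find where the table starts using csv sniffing: read the file line by line until the sniffed delimiter matches the expected one. That line is taken as the CSV column header, and every line above it is skipped.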

import csv
import pandas as pd

path = "data.csv"  # placeholder: path to your CSV file
expected_delimiter = ","
skiprows = 0  # number of lines above the header
sniffer = csv.Sniffer()

with open(path, "r", encoding="ISO-8859-1") as f:
    for line in f:
        try:
            # Stop at the first line whose sniffed delimiter is the expected one
            if sniffer.sniff(line).delimiter == expected_delimiter:
                break
        except csv.Error:
            pass  # no delimiter could be determined (e.g. a blank line)
        skiprows += 1

csv_data = pd.read_csv(path, sep=expected_delimiter, skiprows=skiprows, encoding="ISO-8859-1")
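
As a variant of the same skiprows idea, the header could also be located by field count instead of by sniffing, which is closer to the "first row with the most values" rule described in the question. The sketch below is hypothetical (find_header_row is not part of pandas or of the csv module) and assumes that no data row has more non-empty fields than the header and that no quoted field spans several lines before the header.

```python
import csv


def find_header_row(lines, sep=","):
    """Return the 0-based index of the first row with the most non-empty fields.

    `lines` can be an open text file or any iterable of strings.
    """
    best_index, best_count = 0, -1
    for index, row in enumerate(csv.reader(lines, delimiter=sep)):
        count = sum(1 for cell in row if cell.strip())
        if count > best_count:  # strict '>' keeps the earliest such row
            best_index, best_count = index, count
    return best_index


# Usage sketch (assumes pandas is installed; "data.csv" is a placeholder):
#   import pandas as pd
#   with open("data.csv", encoding="ISO-8859-1", newline="") as f:
#       skiprows = find_header_row(f)
#   df = pd.read_csv("data.csv", skiprows=skiprows, encoding="ISO-8859-1")
```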

answered Sep 29 '22 22:09 by Rajesh Kumar