Automatically determine header row when reading csv in pandas

I am trying to collect data from different .csv files that share the same column names. However, in some of the files the header is located in a different row.

Is there a way to determine the header row dynamically based on the first row that contains "most" values (the actual header names)?

I tried the following:

def process_file(file, path, col_source, col_target):
    global df_master
    print(file)
    df = pd.read_csv(path + file, encoding = "ISO-8859-1", header=None)
    df = df.dropna(thresh=2) ## Drop the rows that contain less than 2 non-NaN values. E.g. metadata
    df.columns = df.iloc[0,:].values
    df = df.drop(df.index[0])

However, when using pandas.read_csv(), it seems like the very first value determines the size of the actual dataframe as I receive the following error message:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 162

As you can see in this case the header row would have been located in row 4. When adding error_bad_lines=False to read_csv, only the metadata will be read into the dataframe.

The files can have either the structure of:

a "Normal" File:

row1    col1   col2    col3    col4   col5   
row2    val1   val1    val1    val1   val1
row3    val2   val2    val2    val2   val2   
row4

or a structure with metadata before the header:

row1   metadata1    
row2   metadata2
row3   col1   col2    col3    col4   col5
row4   val1   val1    val1    val1   val1

Any help much appreciated!

asked Feb 27 '20 13:02

3 Answers

IMHO the simplest way is to forget pandas for a while:

- open the file as a text file for reading
- parse it line by line, guessing whether each line is
  - a metadata header
  - the true header line
  - a data line

A simple way is to concatenate all the lines starting from the true header line in a single string (let us call it buffer), and then use pd.read_csv(io.StringIO(buffer), ...)
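The steps above can be sketched as follows. This is only a minimal sketch under one assumption taken from the question: the header is the first line with the largest number of delimiter-separated fields. The names find_header_index and read_csv_auto_header are hypothetical helpers, not part of pandas, and the default delimiter and encoding are guesses to adapt.

```python
import csv
import io

import pandas as pd


def find_header_index(lines, sep=","):
    """Return the index of the first line with the most fields.

    Assumption (not from pandas): metadata lines have fewer fields
    than the true header line.
    """
    # Parse each line on its own so quoted delimiters are not miscounted;
    # blank lines count as zero fields.
    widths = [len(next(csv.reader([line], delimiter=sep), [])) for line in lines]
    if not widths:
        raise ValueError("file is empty")
    return widths.index(max(widths))


def read_csv_auto_header(path, sep=",", encoding="ISO-8859-1"):
    """Read a CSV whose header may be preceded by metadata lines."""
    with open(path, encoding=encoding) as f:
        lines = f.readlines()
    start = find_header_index(lines, sep=sep)
    # Everything from the header line onwards goes into one buffer.
    buffer = "".join(lines[start:])
    return pd.read_csv(io.StringIO(buffer), sep=sep)
```

Calling read_csv_auto_header("some_file.csv") would then return a dataframe whose columns come from the detected header line. The heuristic breaks if a metadata line happens to contain as many fields as the header, so a stricter test (for example, checking for an expected column name) may be safer for real data.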

answered Sep 29 '22 21:09

A bit dirty, but this works. It tries to read the file while skipping an increasing number of top rows, from 0 up to the length of the file, and returns the first result that parses as a CSV. Adapt custom_csv to your needs.

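import pandas as pd

def file_len(fname):
    # Count the lines in the file (0 for an empty file).
    with open(fname) as f:
        return sum(1 for _ in f)

def custom_csv(fname):
    # Try skipping 0, 1, 2, ... leading rows until read_csv succeeds.
    for i in range(file_len(fname)):
        try:
            return pd.read_csv(fname, skiprows=i)
        except (pd.errors.ParserError, pd.errors.EmptyDataError):
            # Row i cannot serve as the header row; skip one more line.
            continue
    # No starting row produced a parsable CSV.
    return None

print(custom_csv('pollution.csv'))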

answered Sep 29 '22 21:09

Jano

A better way is to find where the data starts using csv sniffing: the first line whose detected delimiter matches the expected one is taken as the CSV column header.

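import csv
import pandas as pd

path = "data.csv"  # placeholder: set this to the path of your CSV file
Expected_Delimiter = ","
count = 0
sniffer = csv.Sniffer()

# Read-only access is enough; use the same encoding as read_csv below
with open(path, "r", encoding="ISO-8859-1") as f:
    while True:
        line = f.readline()
        count = count + 1
        # Break the loop if the file reaches EOF
        if not line:
            break
        try:
            Dialect = sniffer.sniff(line)
        except csv.Error:
            # No delimiter could be determined (e.g. a blank line)
            continue
        # Break the loop if the expected delimiter is found
        if Dialect.delimiter == Expected_Delimiter:
            break

# The last line read is the header, so skip every line before it
skiprows = count - 1
CSV_data = pd.read_csv(path, sep=Expected_Delimiter, skiprows=skiprows, encoding="ISO-8859-1")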

answered Sep 29 '22 22:09

Rajesh Kumar

Tags: python, pandas, csv

Asked by Maeaex1. Answers by Serge Ballesta, Jano, and Rajesh Kumar.