I am trying to collect data from different .csv files that share the same column names. However, some of the files have their headers located in different rows.
Is there a way to determine the header row dynamically based on the first row that contains "most" values (the actual header names)?
I tried the following:
def process_file(file, path, col_source, col_target):
    global df_master
    print(file)
    df = pd.read_csv(path + file, encoding = "ISO-8859-1", header=None)
    df = df.dropna(thresh=2) ## Drop the rows that contain less than 2 non-NaN values. E.g. metadata
    df.columns = df.iloc[0,:].values
    df = df.drop(df.index[0])
However, when using pandas.read_csv(), it seems like the very first value determines the size of the actual dataframe, as I receive the following error message:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 162
As you can see in this case the header row would have been located in row 4.
When adding error_bad_lines=False to read_csv, only the metadata will be read into the dataframe.
The files can have either the structure of:
a "Normal" File:
row1 col1 col2 col3 col4 col5
row2 val1 val1 val1 val1 val1
row3 val2 val2 val2 val2 val2
row4
or a structure with metadata before the header:
row1 metadata1
row2 metadata2
row3 col1 col2 col3 col4 col5
row4 val1 val1 val1 val1 val1
Any help much appreciated!
IMHO the simplest way is to forget pandas for a while: read the file as plain text, concatenate all the lines starting from the true header line into a single string (let us call it buffer), and then use pd.read_csv(io.StringIO(buffer), ...).
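The buffer idea could look roughly like the sketch below. The function name read_csv_from_header and the min_fields parameter are made up for illustration, and the rule for spotting the "true header line" (the first line with at least min_fields comma-separated fields) is only an assumption. Adjust it to whatever distinguishes your metadata from the header.

```python
import csv
import io

import pandas as pd


def read_csv_from_header(path, min_fields=2, encoding="ISO-8859-1"):
    """Read a CSV whose header may be preceded by metadata lines.

    Assumption: the header is the first line with at least `min_fields`
    comma-separated fields; everything before it is treated as metadata.
    """
    with open(path, encoding=encoding, newline="") as f:
        lines = f.readlines()

    # Find the first line that looks like a header row.
    for start, line in enumerate(lines):
        fields = next(csv.reader([line]), [])
        if len(fields) >= min_fields:
            break
    else:
        raise ValueError(f"no line with at least {min_fields} fields in {path}")

    # Concatenate the header line and everything after it into one string,
    # then let pandas parse that string as if it were a file.
    buffer = "".join(lines[start:])
    return pd.read_csv(io.StringIO(buffer))
```

With a file that starts with two single-value metadata lines, read_csv_from_header(path) should return a dataframe whose columns come from the third line. Files with no metadata are read from their first line as usual.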
A bit dirty, but this works. Basically, it tries to read the file while skipping an increasing number of top rows, from 0 up to the whole file. As soon as pandas manages to parse the remainder as a CSV, the resulting dataframe is returned. Adapt custom_csv to your needs.
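import pandas as pd

def file_len(fname):
    # Count the lines in the file.
    with open(fname) as f:
        return sum(1 for _ in f)

def custom_csv(fname):
    _file_len = file_len(fname)
    # Try skipping 0, 1, 2, ... leading rows until pandas can parse the rest.
    for i in range(_file_len):
        try:
            df = pd.read_csv(fname, skiprows=i)
            return df
        except Exception:
            print(i)  # parsing failed with i rows skipped; try the next offset
    return  # None if no offset worked

print(custom_csv('pollution.csv'))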
A better way is to use csv sniffing to find the first line that uses the expected delimiter. That line is the CSV column header, and every row above it can be skipped.
import csv
import pandas as pd

# path: location of the CSV file (defined elsewhere)
Expected_Delimiter = ","
count = 0
sniffer = csv.Sniffer()
with open(path, "r", encoding="ISO-8859-1") as f:
    while True:
        line = f.readline()
        count = count + 1
        # Break the loop if the file reaches EOF
        if not line:
            break
        try:
            Dialect = sniffer.sniff(line)
        except csv.Error:
            # sniff() could not determine a delimiter (e.g. a blank line); keep looking
            continue
        # Break the loop once the expected delimiter is found
        if Dialect.delimiter == Expected_Delimiter:
            break
# Skip every line before the one where the delimiter was found
skiprows = count - 1
CSV_data = pd.read_csv(path, sep=Expected_Delimiter, skiprows=skiprows, encoding="ISO-8859-1")