
Error tokenizing data during Pandas read_csv. How to actually see the bad lines?

Tags:

python

pandas

csv

I have a large CSV that I load as follows:

df=pd.read_csv('my_data.tsv',sep='\t',header=0, skiprows=[1,2,3])

I get several errors during the loading process.

  1. First, if I don't specify warn_bad_lines=True, error_bad_lines=False I get:

    Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24

  2. Second, if I use the options above, I now get:

    CParserError: Error tokenizing data. C error: EOF inside string starting at line 32357585

The question is: how can I have a look at these bad lines to understand what's going on? Is it possible to have read_csv return these bogus lines?

I tried the following hint (Pandas ParserError EOF character when reading multiple csv files to HDF5):

from pandas import parser

try:
    df = pd.read_csv('mydata.tsv', sep='\t', header=0, skiprows=[1,2,3])
except parser.CParserError as detail:
    print(detail)

but I still get

Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24

ℕʘʘḆḽḘ asked Aug 11 '16 at 17:08
People also ask

What does error Tokenizing data mean?

The "Error tokenizing data" error may arise when you're using a separator (for example a comma ',') as a delimiter and a row contains more separators than expected (more fields in the offending row than defined in the header). You need to either remove the additional field or remove the extra separator if it's there by mistake.
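For instance, a tiny made-up CSV where one data row carries an extra comma reproduces exactly this error (a sketch, not the OP's data):

import io
import pandas as pd

bad_csv = io.StringIO(
    'a,b,c\n'      # header defines 3 fields
    '1,2,3\n'
    '4,5,6,7\n'    # 4 fields here -> "Expected 3 fields in line 3, saw 4"
)

pd.read_csv(bad_csv)  # raises ParserError: Error tokenizing data. C error: ...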

What is parse error in pandas?

ParserError: Exception raised when an error is encountered while parsing file contents. This is a generic error raised when functions like read_csv or read_html are parsing the contents of a file. See also read_csv, which reads a CSV (comma-separated) file into a DataFrame.

What is error Tokenizing data in python?

While reading a CSV file, you may get the "Pandas Error Tokenizing Data" error. This mostly occurs due to incorrect data in the CSV file. You can work around it by ignoring the offending lines using error_bad_lines=False.
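As a quick sketch using the OP's file name (note that in pandas 1.3+ these flags were deprecated in favour of on_bad_lines):

import pandas as pd

# Older pandas, as used in the question: skip malformed rows and warn about each one.
df = pd.read_csv('my_data.tsv', sep='\t', error_bad_lines=False, warn_bad_lines=True)

# Newer pandas (1.3+): the rough equivalent is
# df = pd.read_csv('my_data.tsv', sep='\t', on_bad_lines='warn')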

How do I read a pandas TSV file?

TSV stands for Tab Separated Values: a text file where each field is separated by a tab (\t). In pandas, you can read a TSV file into a DataFrame by using the read_table() function.
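For example (a minimal sketch with the OP's file name):

import pandas as pd

# read_table uses a tab as the default separator, so a TSV loads directly:
df = pd.read_table('my_data.tsv')

# Equivalent call with read_csv:
df = pd.read_csv('my_data.tsv', sep='\t')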


1 Answer

I'll give my answer in two parts:

Part 1: The OP asked how to output these bad lines. To answer this, we can use Python's csv module in simple code like this:

import csv

file = 'your_filename.csv'   # use your filename
lines_set = set([100, 200])  # put your bad line numbers here

with open(file) as f_obj:
    for line_number, row in enumerate(csv.reader(f_obj)):
        if line_number > max(lines_set):
            break  # stop once we are past the last bad line
        elif line_number in lines_set:
            # note: enumerate counts from 0, while the error message may count lines from 1
            print(line_number, row)

We can also put it in a more general function like this:

import csv


def read_my_lines(file, lines_list, reader=csv.reader):
    lines_set = set(lines_list)
    with open(file) as f_obj:
        # use the passed-in reader so a pre-configured one can be supplied
        for line_number, row in enumerate(reader(f_obj)):
            if line_number > max(lines_set):
                break
            elif line_number in lines_set:
                print(line_number, row)


if __name__ == '__main__':
    read_my_lines(file='your_filename.csv', lines_list=[100, 200])
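Since the OP's file is tab separated, you can also pass a reader pre-configured for tabs through the reader argument, e.g. reader=lambda f: csv.reader(f, delimiter='\t'). With the default comma reader, each row of a TSV typically comes back as a single string, which is usually still enough to spot the extra fields.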

Part 2: The cause of the errors you get:

It's hard to diagnose a problem like this without a sample of the file, but you should try this:

pd.read_csv(filename)

Does it parse the file with no errors? If so, I'll explain why.

The number of columns is inferred from the first line.

By using skiprows and header=0 you skipped the first 3 rows, which I guess contain the column names or the header that should contain the correct number of columns.

Basically, you are constraining what the parser is doing.

So parse without skiprows or header=0, then reindex to what you need later.
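A minimal sketch of that approach, assuming (as the OP's call suggests) that the first line holds the header and the next three lines are the rows being skipped:

import pandas as pd

# Let read_csv infer the header and the column count from the first line,
# then drop the unwanted rows afterwards instead of using skiprows.
df = pd.read_csv('my_data.tsv', sep='\t')

# Drop the three data rows the OP was skipping and renumber the index.
df = df.drop(index=df.index[:3]).reset_index(drop=True)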

Note:

If you are unsure which delimiter is used in the file, use sep=None, but it will be slower.

From the pandas.read_csv docs:

sep : str, default ‘,’ Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'
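For example, a sketch of that call (engine='python' is spelled out because the sniffer only works with the Python engine):

import pandas as pd

# Let csv.Sniffer guess the delimiter; slower, but handy when the separator is unknown.
df = pd.read_csv('my_data.tsv', sep=None, engine='python')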


Sameh Farouk answered Oct 06 '22 at 10:10