I have a large CSV that I load as follows:

df = pd.read_csv('my_data.tsv', sep='\t', header=0, skiprows=[1, 2, 3])

I get several errors during the loading process.

First, if I don't specify warn_bad_lines=True, error_bad_lines=False, I get:
Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24
Second, if I use the options above, I now get:
CParserError: Error tokenizing data. C error: EOF inside string starting at line 32357585
The question is: how can I have a look at these bad lines to understand what's going on? Is it possible to have read_csv return these bogus lines?
I tried the following hint (Pandas ParserError EOF character when reading multiple csv files to HDF5), updated for current pandas where the exception lives in pd.errors:

import pandas as pd

try:
    df = pd.read_csv('mydata.tsv', sep='\t', header=0, skiprows=[1, 2, 3])
except pd.errors.ParserError as detail:
    print(detail)

but I still get:

Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24
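As for the second error, "EOF inside string" usually means an unbalanced quote character somewhere in the file. A minimal sketch of a common workaround, disabling quote handling entirely so the file at least loads for inspection (the data here is made up for illustration):

```python
import csv
import io

import pandas as pd

# Hypothetical data with an unbalanced quote, the usual cause
# of "EOF inside string":
data = 'a\tb\n1\t"unclosed\n2\t3\n'

# quoting=csv.QUOTE_NONE tells the parser to treat quote
# characters as ordinary data:
df = pd.read_csv(io.StringIO(data), sep="\t", quoting=csv.QUOTE_NONE)
print(df.shape)  # (2, 2)
```

Once the file loads, the suspect rows can be examined directly.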
The "Error tokenizing data" error may arise when you use a separator (e.g. a comma ',') as the delimiter and a row contains more separators than expected, i.e. more fields in the offending row than defined by the header. You need to either remove the additional field or remove the extra separator if it is there by mistake.
From the pandas API reference: pandas.errors.ParserError is the exception raised by an error encountered in parsing file contents. This is a generic error raised when functions like read_csv or read_html are parsing the contents of a file. See also read_csv, which reads a CSV (comma-separated) file into a DataFrame.
While reading a CSV file, you may get the "Pandas Error Tokenizing Data" error. This mostly occurs due to malformed rows in the CSV file. You can work around it by skipping the offending lines with error_bad_lines=False (deprecated since pandas 1.3 and removed in 2.0; use on_bad_lines='skip' instead).
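Since pandas 1.4, on_bad_lines can also be a callable (python engine only) that receives each bad line already split into fields, which is the closest thing to having read_csv "return the bogus lines". A sketch with made-up data:

```python
import io

import pandas as pd

# Toy data: the header defines 3 fields, but the second data row has 4.
data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

bad_lines = []  # the offending rows end up here


def capture(line):
    # pandas calls this once per bad line, with the row split into
    # fields; returning None tells it to skip the row.
    bad_lines.append(line)
    return None


df = pd.read_csv(io.StringIO(data), on_bad_lines=capture, engine="python")
print(bad_lines)  # [['4', '5', '6', '7']]
```

After the call, df contains only the well-formed rows and bad_lines holds everything the parser rejected.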
How to read a TSV file in pandas? TSV stands for Tab-Separated Values: a text file where each field is separated by a tab (\t). In pandas, you can read a TSV file into a DataFrame using the read_table() function, which defaults to a tab separator.
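To illustrate the equivalence (with made-up data), read_table with its defaults produces the same DataFrame as read_csv with sep='\t':

```python
import io

import pandas as pd

data = "name\tage\nalice\t30\nbob\t25\n"

# read_table defaults to sep='\t', so these two calls are equivalent:
df1 = pd.read_table(io.StringIO(data))
df2 = pd.read_csv(io.StringIO(data), sep="\t")
print(df1.equals(df2))  # True
```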
I'll give my answer in two parts.

Part 1: the OP asked how to output these bad lines. To answer this, we can use Python's csv module in a simple snippet like this:
import csv

file = 'your_filename.csv'   # use your filename
lines_set = {100, 200}       # put your bad line numbers here

with open(file) as f_obj:
    for line_number, row in enumerate(csv.reader(f_obj)):
        if line_number > max(lines_set):
            break
        elif line_number in lines_set:
            print(line_number, row)

(Note: enumerate() counts from 0, while the line numbers in pandas error messages are 1-based, so you may need to subtract 1.)
We can also put it in a more general function, like this:
import csv

def read_my_lines(file, lines_list, reader=csv.reader):
    """Print the rows found at the given (0-based) line numbers."""
    lines_set = set(lines_list)
    with open(file) as f_obj:
        for line_number, row in enumerate(reader(f_obj)):
            if line_number > max(lines_set):
                break
            elif line_number in lines_set:
                print(line_number, row)

if __name__ == '__main__':
    read_my_lines(file='your_filename.csv', lines_list=[100, 200])
Part 2: the cause of the errors you get.

It's hard to diagnose a problem like this without a sample of the file, but you should try this:

pd.read_csv(filename)

Does it parse the file without error? If so, I will explain why.

The number of columns is inferred from the first line. By using skiprows and header=0 you skipped the first three rows, which I guess contain the column names or a header with the correct number of columns. Basically, you are constraining what the parser does.

So parse without skiprows or header=0, then reindex to what you need later.
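A minimal sketch of that "parse first, drop later" idea, with toy data standing in for the OP's file (the row counts are illustrative only):

```python
import io

import pandas as pd

# Toy stand-in for the OP's file: a header line followed by three
# preamble rows that skiprows=[1, 2, 3] would otherwise discard.
data = (
    "col1\tcol2\n"
    "junk\tjunk\n"
    "junk\tjunk\n"
    "junk\tjunk\n"
    "real1\treal2\n"
)

# Let the parser see the whole file first...
df = pd.read_csv(io.StringIO(data), sep="\t", header=0)

# ...then drop the unwanted rows afterwards:
df = df.drop(index=[0, 1, 2]).reset_index(drop=True)
print(df.iloc[0, 0])  # real1
```

This way the inferred column count is not pinned to whichever line happens to come first after the skipped rows.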
Note: if you are unsure which delimiter is used in the file, use sep=None, but it will be slower.
from pandas.read_csv docs:
sep : str, default ‘,’ Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'
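A quick sketch of that sniffing behaviour with made-up tab-separated data (engine='python' is passed explicitly, since sep=None forces the python engine anyway):

```python
import io

import pandas as pd

# With sep=None, pandas falls back to the python engine and lets
# csv.Sniffer detect the delimiter; here it finds the tab.
data = "a\tb\n1\t2\n3\t4\n"
df = pd.read_csv(io.StringIO(data), sep=None, engine="python")
print(list(df.columns))  # ['a', 'b']
```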