I am trying to read a csv file using pandas
df1 = pd.read_csv('panda_error.csv', header=None, sep=',')
But I am getting this error:
ParserError: Error tokenizing data. C error: Expected 7 fields in line 4, saw 10
For reproducibility, here is the csv file panda_error.csv
superkingdom:Bacteria , phylum:Actinobacteria , class:Actinobacteria , order:Corynebacteriales , family:Corynebacteriaceae , genus:Corynebacterium , species:Corynebacterium efficiens 1
superkingdom:Bacteria , phylum:Proteobacteria , class:Alphaproteobacteria , order:Rhizobiales , family:Aurantimonadaceae , genus:Aurantimonas , species:Aurantimonas manganoxydans 1
superkingdom:Bacteria , phylum:Proteobacteria , subphylum:delta/epsilon subdivisions , class:Deltaproteobacteria , no rank:unclassified Deltaproteobacteria , genus:Candidatus Entotheonella 1
superkingdom:Bacteria , phylum:Proteobacteria , class:Gammaproteobacteria , order:Pseudomonadales , family:Pseudomonadaceae , genus:Pseudomonas , species group:Pseudomonas syringae group , species subgroup:Pseudomonas syringae group genomosp. 2 , species:Pseudomonas amygdali , no rank:Pseudomonas amygdali pv. tabaci 1
superkingdom:Bacteria , phylum:Actinobacteria , class:Actinobacteria , order:Corynebacteriales , family:Nocardiaceae , genus:Rhodococcus , species:Rhodococcus wratislaviensis 1
superkingdom:Bacteria , phylum:Firmicutes , class:Clostridia , order:Clostridiales , family:Peptostreptococcaceae , genus:Peptoclostridium , species:Peptoclostridium difficile1
I am not really sure why this is happening and how to resolve this. The other answers just suggest to 1. ignore the troubling lines using error_bad_lines=False which I do not want to do, or 2. specific to certain scenario.
Here is the full error message if this helps:
---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
<ipython-input-34-72c0ecaf0513> in <module>
----> 1 df1 = pd.read_csv('panda_error.csv', header=None, sep=',')
/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
683 )
684
--> 685 return _read(filepath_or_buffer, kwds)
686
687 parser_f.__name__ = name
/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
461
462 try:
--> 463 data = parser.read(nrows)
464 finally:
465 parser.close()
/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
1152 def read(self, nrows=None):
1153 nrows = _validate_integer("nrows", nrows)
-> 1154 ret = self._engine.read(nrows)
1155
1156 # May alter columns / col_dict
/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
2057 def read(self, nrows=None):
2058 try:
-> 2059 data = self._reader.read(nrows)
2060 except StopIteration:
2061 if self._first_chunk:
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
ParserError: Error tokenizing data. C error: Expected 7 fields in line 4, saw 10
This solution worked for me
### Loop the data lines
with open("panda_error.csv", 'r') as temp_f:
# get No of columns in each line
col_count = [ len(l.split(",")) for l in temp_f.readlines() ]
### Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = [i for i in range(0, max(col_count))]
### Read csv
df = pd.read_csv("panda_error.csv", header=None, delimiter=",", names=column_names)
Pandas is a tool to process tabular data. It means that each row should contain the same number of fields. In case of CSV input there is one more requirement, namely fields in each row should be in the same order.
But your input file actually fails to meet both these requirements.
First 2 rows (and probably most others) have 7 fields: superkingdom, phylum, class, order, family, genus and species.
The third row contains: superkingdom, phylum, subphylum, class, no rank and genus. So:
This does not cause the read_csv to fail, only because the number of fields does not exceed the number of fields in previous rows (there are total 6 fields).
But the real problem is in row 4, where you have 10 fields.
So "ordinary" read_csv is by no means any good choice here. Even if you set the number of columns enough to read all rows, the attributes will be "scattered" among columns in a way difficult to read.
Any attempt to analyze such data based on column names will also fail, because each column will have different information in different rows.
Yet another problem is that data divided on commas will contain e.g. superkingdom:Bacteria, i.e.:
To overcome these problems, try another approach to read your input file:
Read your input file, using read_csv, but as a single column (sep set to a non-used char).
df = pd.read_csv('input.csv', sep='|', names=['col1'])
The next step, leading to a DataFrame which can be analyzed by a program is extractall (import re is needed):
df2 = df.col1.str.extractall(
r'(?P<name>[A-Z ]+[A-Z]):(?P<value>[A-Z /]+[A-Z])', flags=re.I)\
.reset_index(level=1, drop=True)
If you are not fluent in regular expressions, read a little about them.
The result is a DataFrame with 2 columns:
The index is just same as in df - it is the source row number, starting from 0.
For your sample data the result is as follows:
name value
0 superkingdom Bacteria
0 phylum Actinobacteria
0 class Actinobacteria
0 order Corynebacteriales
0 family Corynebacteriaceae
0 genus Corynebacterium
0 species Corynebacterium efficiens
1 superkingdom Bacteria
1 phylum Proteobacteria
1 class Alphaproteobacteria
1 order Rhizobiales
1 family Aurantimonadaceae
1 genus Aurantimonas
1 species Aurantimonas manganoxydans
2 superkingdom Bacteria
2 phylum Proteobacteria
2 subphylum delta/epsilon subdivisions
2 class Deltaproteobacteria
2 no rank unclassified Deltaproteobacteria
2 genus Candidatus Entotheonella
3 superkingdom Bacteria
3 phylum Proteobacteria
3 class Gammaproteobacteria
3 order Pseudomonadales
3 family Pseudomonadaceae
3 genus Pseudomonas
3 species group Pseudomonas syringae group
3 species subgroup Pseudomonas syringae group genomosp
3 species Pseudomonas amygdali
3 no rank Pseudomonas amygdali pv
4 superkingdom Bacteria
4 phylum Actinobacteria
4 class Actinobacteria
4 order Corynebacteriales
4 family Nocardiaceae
4 genus Rhodococcus
4 species Rhodococcus wratislaviensis
5 superkingdom Bacteria
5 phylum Firmicutes
5 class Clostridia
5 order Clostridiales
5 family Peptostreptococcaceae
5 genus Peptoclostridium
5 species Peptoclostridium difficile
If you want to have these data as a table with each name converted to respective column, run:
df3 = df2.set_index('name', append=True).unstack(fill_value='')
df3.columns = df3.columns.droplevel()
Take a look at the result and I think it will be more readable than any other attempt.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With