Pandas: How to work around "error tokenizing data"?

A lot of questions about this topic have already been asked on SO (and elsewhere). None of the numerous answers has really helped me so far. If I missed a useful one, please let me know.

I simply would like to read a CSV file with pandas into a dataframe. Sounds like a simple task.

My file Test.csv

1,2,3,4,5
1,2,3,4,5,6
,,3,4,5
1,2,3,4,5,6,7
,2,,4

My code:

import pandas as pd
df = pd.read_csv('Test.csv',header=None)

My error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 6

My guess is that pandas looks at the first line and expects the same number of tokens in all following rows. If this is not the case, it stops with an error.

The numerous answers suggest options such as error_bad_lines=False, header=None, or skiprows=3, along with other suggestions that did not help.

However, I don't want to ignore or skip any lines. And I don't know in advance how many columns and rows the datafile has.

So it basically boils down to how to find the maximum number of columns in the datafile. Is this the way to go? I hoped that there was an easy way to simply read a CSV file which does not have the maximum column number in the first line. Thank you for any hints. I'm using Python 3.6.3, Pandas 0.24.1 on Win7.

asked Mar 15 '19 18:03

theozh

4 Answers

Thank you @ALollz for the "very fresh" link (lucky coincidence) and @Rich Andrews for pointing out that my example actually is not "strictly correct" CSV data.

So, the way it works for me for the time being is adapted from @ALollz' compact solution (https://stackoverflow.com/a/55129746/7295599)

### reading an "incorrect" CSV to dataframe having a variable number of columns/tokens 
import pandas as pd

df = pd.read_csv('Test.csv', header=None, sep='\n')
df = df[0].str.split(',', expand=True)
# ... do some modifications with df
### end of code

df contains empty string '' for the missing entries at the beginning and the middle, and None for the missing tokens at the end.

   0  1  2  3     4     5     6
0  1  2  3  4     5  None  None
1  1  2  3  4     5     6  None
2        3  4     5  None  None
3  1  2  3  4     5     6     7
4     2     4  None  None  None

If you write this again to a file via:

df.to_csv("Test.tab",sep="\t",header=False,index=False)

1   2   3   4   5       
1   2   3   4   5   6   
        3   4   5       
1   2   3   4   5   6   7
    2       4

None will be converted to empty string '' and everything is fine.
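
As far as I know, newer pandas releases reject sep='\n' with a ValueError, so the one-column read above may not work on every version. A minimal sketch of the same idea that reads the lines with plain Python first and only uses pandas for the split. The helper name read_ragged_csv is made up for illustration, and the sample file is written to a temporary directory:

```python
import os
import tempfile

import pandas as pd


def read_ragged_csv(path, sep=','):
    """Hypothetical helper: split every line of a text file on `sep`."""
    with open(path) as fh:
        # strip only the line terminator so empty fields survive
        lines = [line.rstrip('\r\n') for line in fh]
    # expand=True spreads the tokens over columns and pads short rows
    return pd.Series(lines, dtype=object).str.split(sep, expand=True)


# recreate the sample file from the question in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'Test.csv')
with open(path, 'w') as fh:
    fh.write('1,2,3,4,5\n1,2,3,4,5,6\n,,3,4,5\n1,2,3,4,5,6,7\n,2,,4\n')

df = read_ragged_csv(path)
print(df)
```

Like the version above, this splits on every comma, so it does not handle quoted fields that contain the separator.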

The next level would be to account for data strings in quotes which contain the separator, but that's another topic.

1,2,3,4,5
,,3,"Hello, World!",5,6
1,2,3,4,5,6,7
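
For this quoted case, one possible approach (a sketch, not part of the solution above) is to let Python's csv module do the tokenizing, since it understands quoted fields, and then pad the rows to a common length. The function name read_padded_rows and the fill value are assumptions chosen for illustration:

```python
import csv
import io

# sample data from above: one field contains the separator inside quotes
sample = '1,2,3,4,5\n,,3,"Hello, World!",5,6\n1,2,3,4,5,6,7\n'


def read_padded_rows(text_stream, fill=''):
    """Parse CSV text with csv.reader and pad every row to the widest one."""
    rows = list(csv.reader(text_stream))
    width = max((len(row) for row in rows), default=0)
    return [row + [fill] * (width - len(row)) for row in rows]


rows = read_padded_rows(io.StringIO(sample))
for row in rows:
    print(row)
```

The resulting list of equally long rows can be passed to pd.DataFrame(rows).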

answered Nov 10 '22 01:11

theozh

Read the csv using the tolerant Python csv module, and fix the loaded data before handing it off to pandas, which will fail on the otherwise malformed csv data regardless of the csv engine pandas uses.

import pandas as pd
import csv

not_csv = """1,2,3,4,5
1,2,3,4,5,6
,,3,4,5
1,2,3,4,5,6,7
,2,,4
"""

with open('not_a.csv', 'w') as csvfile:
    csvfile.write(not_csv)

d = []
with open('not_a.csv', newline='') as csvfile:
    areader = csv.reader(csvfile)
    # first pass: find the length of the widest row
    max_elems = 0
    for row in areader:
        max_elems = max(max_elems, len(row))
    # rewind the file for the second pass
    csvfile.seek(0)
    for row in areader:
        # fix my csv by padding the short rows with empty strings
        d.append(row + [""] * (max_elems - len(row)))

df = pd.DataFrame(d)
print(df)

# the default engine
# provides "pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 6 "
#df = pd.read_csv('Test.csv',header=None, engine='c')

# the python csv engine
# provides "pandas.errors.ParserError: Expected 6 fields in line 4, saw 7 "
#df = pd.read_csv('Test.csv',header=None, engine='python')
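
The maximum row length found in the first pass above can also be handed to read_csv through its names parameter, so that pandas does the padding itself. A sketch of that variant, assuming the widest row determines the column count. The helper name count_max_fields is made up for illustration, and the sample data is written to a temporary directory:

```python
import csv
import os
import tempfile

import pandas as pd


def count_max_fields(path):
    """Return the largest number of fields found in any row of the file."""
    with open(path, newline='') as fh:
        return max((len(row) for row in csv.reader(fh)), default=0)


# write the sample data from the question to a temporary file
path = os.path.join(tempfile.mkdtemp(), 'Test.csv')
with open(path, 'w') as fh:
    fh.write('1,2,3,4,5\n1,2,3,4,5,6\n,,3,4,5\n1,2,3,4,5,6,7\n,2,,4\n')

n_cols = count_max_fields(path)
# with explicit column names, rows shorter than n_cols are filled with NaN
df = pd.read_csv(path, header=None, names=list(range(n_cols)))
print(df)
```

Unlike the padded list above, the missing entries come back as NaN here, so the columns are numeric rather than strings.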

Alternatively, preprocess the file outside of Python if you are concerned about adding too much extra Python code.

Richs-MBP:tmp randrews$ cat test.csv
1,2,3
1,
2
1,2,
,,,
Richs-MBP:tmp randrews$ awk 'BEGIN {FS=","}; {print $1","$2","$3","$4","$5}' < test.csv
1,2,3,,
1,,,,
2,,,,
1,2,,,
,,,,

answered Nov 10 '22 01:11

Rich Andrews

I have a different take on the solution. Let pandas take care of creating the table and deleting None values and let us take care of writing a proper tokenizer.

Tokenizer

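def tokenize(line):
    # positions of all double quotes in the line
    idx = [x for x, v in enumerate(line) if v == '\"']
    # an unmatched trailing quote is ignored
    if len(idx) % 2 != 0:
        idx = idx[:-1]
    memory = {}
    for i in range(0, len(idx), 2):
        # the quoted field, including both quotes
        val = line[idx[i]:idx[i+1]+1]
        # comma-free placeholder; same length as val while i has one digit
        key = "_"*(len(val)-1)+"{0}".format(i)
        memory[key] = val
        line = line.replace(memory[key], key, 1)
    # split on commas, then swap the placeholders back for the quoted fields
    return [memory.get(token, token) for token in line.split(",")]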

Test cases for Tokenizer

print (tokenize("1,2,3,4,5"))
print (tokenize(",,3,\"Hello, World!\",5,6"))
print (tokenize(",,3,\"Hello,,,, World!\",5,6"))
print (tokenize(",,3,\"Hello, World!\",5,6,,3,\"Hello, World!\",5,6"))
print (tokenize(",,3,\"Hello, World!\",5,6,,3,\"Hello,,5,6"))

Output

['1', '2', '3', '4', '5']
['', '', '3', '"Hello, World!"', '5', '6']
['', '', '3', '"Hello,,,, World!"', '5', '6']
['', '', '3', '"Hello, World!"', '5', '6', '', '3', '"Hello, World!"', '5', '6']
['', '', '3', '"Hello, World!"', '5', '6', '', '3', '"Hello', '', '5', '6']

Putting the tokenizer into action

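import pandas as pd

with open("test1.csv", "r") as fp:
    lines = fp.readlines()

# tokenize every line; the rows may differ in length
lines = [tokenize(x.strip()) for x in lines]
# pandas pads the short rows with None, which fillna turns into ''
df = pd.DataFrame(lines).fillna('')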

Advantage:

Now we can tweak the tokenizer function as per our needs.

answered Nov 09 '22 23:11

mujjiga

In my case:

1. I opened the *.csv in Excel.
2. I saved the *.csv as CSV (comma-delimited).
3. I loaded the file in Python via:

import pandas as pd
df = pd.read_csv('yourcsvfile.csv', sep=',')

Hope it helps!

answered Nov 10 '22 00:11

Corina Roca

Tags: python, pandas, csv
