
Error in reading a CSV file in pandas [CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.]

Tags: python, pandas, csv

So I tried reading all the CSV files from a folder, concatenating them into one big CSV (the structure of all the files was the same), saving it, and reading it back. All of this was done using pandas. The error occurs while reading it back. I am attaching the code and the error below.

import pandas as pd
import numpy as np
import glob

path = r'somePath'  # use your path
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    list_.append(df)
store = pd.concat(list_)
store.to_csv("C:\work\DATA\Raw_data\\store.csv", sep=',', index=False)
store1 = pd.read_csv("C:\work\DATA\Raw_data\\store.csv", sep=',')

Error:-

CParserError                              Traceback (most recent call last)
<ipython-input-48-2983d97ccca6> in <module>()
----> 1 store1 = pd.read_csv("C:\work\DATA\Raw_data\\store.csv", sep=',')

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, ...)
    472                     skip_blank_lines=skip_blank_lines)
    473
--> 474         return _read(filepath_or_buffer, kwds)
    475
    476     parser_f.__name__ = name

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    258         return parser
    259
--> 260     return parser.read()
    261
    262 _parser_defaults = {

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    719                 raise ValueError('skip_footer not supported for iteration')
    720
--> 721         ret = self._engine.read(nrows)
    722
    723         if self.options.get('as_recarray'):

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
   1168
   1169         try:
-> 1170             data = self._reader.read(nrows)
   1171         except StopIteration:
   1172             if nrows is None:

pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:7544)()
pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7784)()
pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:8401)()
pandas\parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8275)()
pandas\parser.pyx in pandas.parser.raise_parser_error (pandas\parser.c:20691)()

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

I tried using the csv reader as well:

import csv

with open("C:\work\DATA\Raw_data\\store.csv", 'rb') as f:
    reader = csv.reader(f)
    l = list(reader)

Error:-

Error                                     Traceback (most recent call last)
<ipython-input-36-9249469f31a6> in <module>()
      1 with open('C:\work\DATA\Raw_data\\store.csv', 'rb') as f:
      2     reader = csv.reader(f)
----> 3     l = list(reader)

Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
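As an aside on the question's code: strings like "C:\work\DATA\Raw_data\\store.csv" mix escaped and unescaped backslashes, which is fragile on Windows. A minimal, runnable sketch of the same read-and-concatenate loop, using os.path.join to sidestep the escaping issue (the throwaway folder and sample files here are invented so the sketch runs anywhere; in practice, point path at your own data folder):

```python
import glob
import os
import tempfile

import pandas as pd

# Throwaway folder with two small sample CSVs, purely for illustration
path = tempfile.mkdtemp()
for name in ("a.csv", "b.csv"):
    with open(os.path.join(path, name), "w") as fh:
        fh.write("col1,col2\n1,2\n3,4\n")

# os.path.join avoids hand-written backslash escapes in Windows paths
all_files = glob.glob(os.path.join(path, "*.csv"))

# Read each file, then concatenate once at the end
frames = [pd.read_csv(f, index_col=None, header=0) for f in all_files]
store = pd.concat(frames, ignore_index=True)
store.to_csv(os.path.join(path, "store.csv"), index=False)

# Read the combined file back
store1 = pd.read_csv(os.path.join(path, "store.csv"))
```

Alternatively, raw strings (r"C:\work\...") keep backslashes literal and avoid the same pitfall.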
Arman Sharma asked Nov 30 '15 12:11



2 Answers

I ran into this error too; the cause was stray carriage returns ("\r") in the data, which pandas was treating as line terminators as if they were "\n". I thought I'd post here since that may be a common reason this error comes up.

The solution I found was to pass lineterminator='\n' to the read_csv call, like this:

df_clean = pd.read_csv('test_error.csv', lineterminator='\n')
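To see the failure mode concretely, here is a small sketch (the file contents are invented for illustration). With default settings, the C parser treats a bare "\r" as a row break, silently splitting the row; forcing lineterminator='\n' keeps the "\r" inside the field, where it can then be scrubbed:

```python
import io

import pandas as pd

# Invented sample: the first data row has a stray "\r" inside a field
raw = "name,value\nfoo\rbar,1\nbaz,2\n"

# With lineterminator="\n", only "\n" ends a row; the "\r" stays in the field
df = pd.read_csv(io.StringIO(raw), lineterminator="\n")

# Scrub the leftover "\r" characters from the string column
df["name"] = df["name"].str.replace("\r", "", regex=False)
```

Note that lineterminator is only supported by the C parser engine.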
Louise Fallon answered Sep 20 '22 22:09


If you are using Python and it's a big file, you can pass engine='python' as below, and it should work.

df = pd.read_csv( file_, index_col=None, header=0, engine='python' )
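One caveat worth adding: the error_bad_lines flag shown in older answers was deprecated in pandas 1.3 and later removed; in newer versions the equivalent is on_bad_lines='skip'. A sketch with invented data showing the Python engine skipping a malformed row:

```python
import io

import pandas as pd

# Invented sample with one malformed row (3 fields instead of 2)
raw = "a,b\n1,2\n3,4,5\n6,7\n"

# engine="python" is slower but more tolerant than the C engine;
# on_bad_lines="skip" (pandas >= 1.3) drops rows with too many fields
df = pd.read_csv(io.StringIO(raw), engine="python", on_bad_lines="skip")
```

The trade-off is speed: the Python engine is noticeably slower on large files, so it is best used when the C engine chokes on the input.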

Firas Aswad answered Sep 21 '22 22:09