So I tried reading all the CSV files from a folder, concatenating them into one big CSV (the structure of all the files was the same), saving it, and then reading it back. All of this was done using pandas. The error occurs while reading. I am attaching the code and the error below.
import pandas as pd
import glob

path = r'somePath'  # use your path
allFiles = glob.glob(path + "/*.csv")
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    list_.append(df)
store = pd.concat(list_)
store.to_csv(r"C:\work\DATA\Raw_data\store.csv", sep=',', index=False)
store1 = pd.read_csv(r"C:\work\DATA\Raw_data\store.csv", sep=',')
Error:-
CParserError                              Traceback (most recent call last)
<ipython-input-48-2983d97ccca6> in <module>()
----> 1 store1 = pd.read_csv("C:\work\DATA\Raw_data\\store.csv", sep=',')

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    472                     skip_blank_lines=skip_blank_lines)
    473
--> 474     return _read(filepath_or_buffer, kwds)
    475
    476 parser_f.__name__ = name

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    258         return parser
    259
--> 260     return parser.read()
    261
    262 _parser_defaults = {

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    719             raise ValueError('skip_footer not supported for iteration')
    720
--> 721         ret = self._engine.read(nrows)
    722
    723         if self.options.get('as_recarray'):

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
   1168
   1169         try:
-> 1170             data = self._reader.read(nrows)
   1171         except StopIteration:
   1172             if nrows is None:

pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:7544)()
pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7784)()
pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:8401)()
pandas\parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8275)()
pandas\parser.pyx in pandas.parser.raise_parser_error (pandas\parser.c:20691)()

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
I tried using csv reader as well:-
import csv

with open(r"C:\work\DATA\Raw_data\store.csv", 'rb') as f:
    reader = csv.reader(f)
    l = list(reader)
Error:-
Error                                     Traceback (most recent call last)
<ipython-input-36-9249469f31a6> in <module>()
      1 with open('C:\work\DATA\Raw_data\\store.csv', 'rb') as f:
      2     reader = csv.reader(f)
----> 3     l = list(reader)

Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
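In Python 3, the usual fix for this csv-module error is to hand csv.reader a text-mode stream created with newline='' (for a real file: open(path, newline='')), rather than opening in 'rb' as the Python 2 snippet above does. A minimal sketch with an in-memory sample:

```python
import csv
import io

# Hypothetical sample with Windows-style line endings; newline=''
# lets the csv module see the terminators and handle them itself.
data = "a,b\r\n1,2\r\n"
rows = list(csv.reader(io.StringIO(data, newline="")))
```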
To solve pandas.parser.CParserError, try specifying the sep and/or header arguments when calling read_csv. The sep parameter defines your delimiter (e.g. ',', ';', or '\t').
pandas.errors.ParserError is the exception raised when an error is encountered while parsing file contents. It is a generic error raised by functions such as read_csv or read_html while they parse a file. (In older pandas versions it was named CParserError.)
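Since it is an ordinary exception, you can catch it and fall back to a more forgiving parse. A sketch with a hypothetical malformed sample (on_bad_lines requires pandas >= 1.3; older versions used error_bad_lines instead):

```python
import io
import pandas as pd
from pandas.errors import ParserError  # named CParserError before pandas 0.20

# Hypothetical malformed input: line 3 has an extra field, so the
# default parse raises ParserError; retry while skipping bad lines.
bad = "a,b\n1,2\n3,4,5\n"
try:
    df = pd.read_csv(io.StringIO(bad))
except ParserError:
    df = pd.read_csv(io.StringIO(bad), engine="python", on_bad_lines="skip")
```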
Use the snippet below to read the CSV file with the Python engine (note that error_bad_lines is deprecated in pandas 1.3+ in favour of on_bad_lines='skip'):

import pandas as pd

df = pd.read_csv('sample.csv', engine='python', error_bad_lines=False)
df
In today’s short guide, we discussed a few cases where pandas.errors.ParserError: Error tokenizing data is raised by the pandas parser when reading CSV files into pandas DataFrames. Additionally, we showed how to deal with the error by fixing the errors or typos in the data file itself, or by specifying the appropriate line terminator.
You can also solve the error by ignoring the offending lines and suppressing errors:

import pandas as pd

df = pd.read_csv('sample.csv', error_bad_lines=False, engine='python')
df
The most obvious solution to the problem is to fix the data file manually by removing the extra separators from the lines causing trouble. This is actually the best solution (assuming you have specified the right delimiter, headers, etc. when calling the read_csv function).
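To find which lines need fixing, one option is to scan the file with the csv module and report any row whose field count differs from the header. A sketch over a hypothetical in-memory sample (for a real file, replace the StringIO with open(path, newline='')):

```python
import csv
import io

# Hypothetical file contents: line 3 has one field too many.
raw = "a,b\n1,2\n3,4,5\n6,7\n"
reader = csv.reader(io.StringIO(raw))
header = next(reader)
# Collect 1-based line numbers whose width differs from the header.
bad_lines = [lineno for lineno, row in enumerate(reader, start=2)
             if len(row) != len(header)]
```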
I found this error too; the cause was that there were some carriage returns ("\r") in the data that pandas was using as line terminators, as if they were "\n". I thought I'd post here, as that might be a common reason this error comes up.
The solution I found was to add lineterminator='\n' to the read_csv call, like this:
df_clean = pd.read_csv('test_error.csv', lineterminator='\n')
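A self-contained sketch of the same fix, using a hypothetical in-memory sample where a stray "\r" sits inside a field; forcing "\n" as the terminator keeps the row intact:

```python
import io
import pandas as pd

# Hypothetical data with a stray "\r" mid-field; by default the C
# parser can treat it as a row break, so pin the terminator to "\n".
raw = "a,b\n1,hello\rworld\n2,x\n"
df = pd.read_csv(io.StringIO(raw), lineterminator="\n")
```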
If you are using Python and it's a big file, you may use engine='python' as below, and it should work:

df = pd.read_csv(file_, index_col=None, header=0, engine='python')