Getting error while trying to read csv using pandas Python due to extra column values

Tags:

python

pandas

Here is the scenario I am trying to get rid of:
I am trying to read a CSV file like the following:

para1,para2,para3,para4
1,2,3,4,
1,2,3,4,5,
1,2,3,4,
2,3,4,5,6,7,8,9,0,

I am using the following command and getting the following error:

>>> import pandas as pd
>>> df =pd.read_csv("test.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python35\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python35\lib\site-packages\pandas\io\parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "C:\Python35\lib\site-packages\pandas\io\parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "C:\Python35\lib\site-packages\pandas\io\parsers.py", line 1995, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 955, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2172, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 5

I searched for the issue and found this thread on SO:
Python Pandas Error tokenizing data

So I tried error_bad_lines=False. This is not what I was expecting: it drops the rows that have extra values.

>>> df =pd.read_csv("test.csv",error_bad_lines=False)
b'Skipping line 3: expected 4 fields, saw 5\nSkipping line 5: expected 4 fields, saw 9\n'
>>> df
   para1  para2  para3  para4
0      1      2      3      4
1      1      2      3      4

What I want is something like this:
if some rows have extra values, add columns named with integers (0, 1, 2, ...) up to the widest row found, fill the missing values in those columns with zero (0), and read the CSV.

The output I am expecting is something like this:

>>> df =pd.read_csv("test.csv")
>>> df
   para1  para2  para3  para4    0    1    2    3    4
0      1      2      3      4  NaN  NaN  NaN  NaN  NaN
1      1      2      3      4  5.0  NaN  NaN  NaN  NaN
2      1      2      3      4  NaN  NaN  NaN  NaN  NaN
3      2      3      4      5  6.0  7.0  8.0  9.0  0.0
>>> df = df.fillna(0)
>>> df
   para1  para2  para3  para4    0    1    2    3    4
0      1      2      3      4  0.0  0.0  0.0  0.0  0.0
1      1      2      3      4  5.0  0.0  0.0  0.0  0.0
2      1      2      3      4  0.0  0.0  0.0  0.0  0.0
3      2      3      4      5  6.0  7.0  8.0  9.0  0.0

But please note that I do not want to handle the columns by hand. The program must work out the column headers automatically, as shown above.

Second, please avoid suggesting that I write the header myself. There can be a large number of columns for which I cannot write a header, so the existing header should be left as it is and each missing column header should be an integer, as shown above. Does anyone have a solution for this? Please let me know.


asked May 20 '19 11:05

Jaffer Wilson

2 Answers

I'm not sure if there is a cleaner way to do this, but I tested it out and it works using just pandas:

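import pandas as pd

# read each line as one string; recent pandas versions may reject sep='\n'
df = pd.read_csv('test.csv', header=None, sep='\n')
# split on commas; shorter rows are padded with None
df = df[0].str.split(',', expand=True)
# first row holds the names; unnamed columns fall back to their position
new_header = df.iloc[0].fillna(df.columns.to_series())
df = df[1:]
df.columns = new_header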
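
As far as I can tell, the snippet above leaves every value as text, numbers the extra columns by position (4, 5, ...), and keeps an empty last column caused by the trailing commas. Below is a minimal sketch of one way to get closer to the output asked for in the question. It is not the answer author's code: the sample data is inlined in a string instead of being read from test.csv, and the names raw, wide, names and extras are only illustrative.

```python
import pandas as pd

# Sample data from the question, inlined so the sketch runs without test.csv.
raw = (
    "para1,para2,para3,para4\n"
    "1,2,3,4,\n"
    "1,2,3,4,5,\n"
    "1,2,3,4,\n"
    "2,3,4,5,6,7,8,9,0,\n"
)

# One string per line; strip the trailing comma so it does not create an
# extra empty field, then split on commas (short rows are padded).
lines = pd.Series(raw.splitlines())
wide = lines.str.rstrip(",").str.split(",", expand=True)

# Named columns come from the first row; extra columns are numbered from 0.
names = wide.iloc[0].dropna().tolist()
extras = list(range(wide.shape[1] - len(names)))
df = wide.iloc[1:].reset_index(drop=True)
df.columns = names + extras

# Convert the text to numbers and replace the missing cells with 0.
df = df.apply(pd.to_numeric).fillna(0)
print(df)
```

For a real file, the lines could come from open('test.csv').read().splitlines() instead of the inlined string.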


answered Sep 20 '22 15:09

Kilgore

Ok, that means that you will have to parse the file until its end to get the actual number of columns, because pandas.read_csv has no provision for that requirement.

If high performance is not a concern (*), a simple way is to rely on the good old csv module and dynamically add columns as needed:

import csv
import numpy as np
import pandas as pd

with open('test.csv', newline='') as fd:
    rd = csv.reader(fd)
    header = next(rd)     # initialize column names from first row
    next_key = 0          # additional columns will start at '0'
    data = {k: list() for k in header}  # initialize data list per column
    for row in rd:
        while len(row) > len(header):    # add new columns if needed
            header.append(str(next_key))
            # back-fill the new column with NaN for the rows already read
            data[header[-1]] = [np.nan] * len(data[header[0]])
            next_key += 1                # increase next column name
        # pad the row with NaN up to the header size if needed
        row.extend([np.nan] * (len(header) - len(row)))
        # and add data to the column lists
        for i, k in enumerate(header):
            data[k].append(row[i])

# data is now in a dict format, suitable to feed DataFrame
df = pd.DataFrame(data)

(*) The code above is not very efficient because it appends elements to lists one at a time. This would be terrible for a pandas DataFrame, and is not very nice even for Python lists. It could be improved by allocating chunks in a numpy.ndarray, but at the price of increased complexity.
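
A simpler variation on the same idea, given only as a sketch under a few assumptions, is to collect whole rows first, compute the final width once the file has been read, and build the DataFrame in a single call. The sample data is inlined here instead of being read from test.csv, the trailing empty field produced by the final comma on each row is dropped, and the names raw, rows, width and padded are illustrative only.

```python
import csv
import io

import pandas as pd

# Sample data from the question, inlined so the sketch runs without test.csv.
raw = (
    "para1,para2,para3,para4\n"
    "1,2,3,4,\n"
    "1,2,3,4,5,\n"
    "1,2,3,4,\n"
    "2,3,4,5,6,7,8,9,0,\n"
)

with io.StringIO(raw) as fd:
    rd = csv.reader(fd)
    header = next(rd)  # column names from the first row
    rows = []
    for row in rd:
        # Drop the empty last field created by the trailing comma.
        if row and row[-1] == "":
            row = row[:-1]
        rows.append(row)

# The final width is only known once every row has been seen.
width = max([len(header)] + [len(r) for r in rows])
# Extra columns are named '0', '1', ... as in the code above.
columns = header + [str(i) for i in range(width - len(header))]
# Pad short rows with None so every row has the same length.
padded = [r + [None] * (width - len(r)) for r in rows]

# Build the frame once, convert text to numbers, and fill the gaps with 0.
df = pd.DataFrame(padded, columns=columns).apply(pd.to_numeric).fillna(0)
print(df)
```

This still reads the whole file before the column count is known, so it does not remove the need for a full pass; it only avoids growing one list per column while parsing.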

answered Sep 22 '22 15:09

Serge Ballesta
