
Pandas error in Python: columns must be same length as key

Tags:

python

pandas

web-scraping

I am webscraping some data from a few websites, and using pandas to modify it.

On the first few chunks of data it worked well, but later I get this error message:

Traceback (most recent call last):
  File "data.py", line 394, in <module>
    df2[['STATUS_ID_1','STATUS_ID_2']] = df2['STATUS'].str.split(n=1, expand=True)
  File "/home/web/.local/lib/python2.7/site-packages/pandas/core/frame.py", line 2326, in __setitem__
    self._setitem_array(key, value)
  File "/home/web/.local/lib/python2.7/site-packages/pandas/core/frame.py", line 2350, in _setitem_array
    raise ValueError('Columns must be same length as key')
ValueError: Columns must be same length as key

My code is here:

df2 = pd.DataFrame(datatable,columns = cols)
df2['FLIGHT_ID_1'] = df2['FLIGHT'].str[:3]
df2['FLIGHT_ID_2'] = df2['FLIGHT'].str[3:].str.zfill(4)
df2[['STATUS_ID_1','STATUS_ID_2']] = df2['STATUS'].str.split(n=1, expand=True)

EDIT (for jezrael): I used your code and printed the result of the split. I hope this helps us find the problem, because the script seems to fail on this split at random.

             0         1
2       Landed   8:33 AM
3       Landed   9:37 AM
4       Landed   9:10 AM
5       Landed   9:57 AM
6       Landed   9:36 AM
8       Landed   8:51 AM
9       Landed   9:18 AM
11      Landed   8:53 AM
12      Landed   7:59 AM
13      Landed   7:52 AM
14      Landed   8:56 AM
15      Landed   8:09 AM
18      Landed   8:42 AM
19      Landed   9:39 AM
20      Landed   9:45 AM
21      Landed   7:44 AM
23      Landed   8:36 AM
27      Landed   9:53 AM
29      Landed   9:26 AM
30      Landed   8:23 AM
35      Landed   9:59 AM
36      Landed   8:38 AM
37      Landed   9:38 AM
38      Landed   9:37 AM
40      Landed   9:27 AM
43      Landed   9:14 AM
44      Landed   9:22 AM
45      Landed   8:18 AM
46      Landed  10:01 AM
47      Landed  10:21 AM
..         ...       ...
316    Delayed   5:00 PM
317    Delayed   4:34 PM
319  Estimated   2:58 PM
320  Estimated   3:02 PM
321    Delayed   4:47 PM
323  Estimated   3:08 PM
325    Delayed   3:52 PM
326  Estimated   3:09 PM
327  Estimated   2:37 PM
328  Estimated   3:17 PM
329  Estimated   3:20 PM
330  Estimated   2:39 PM
331    Delayed   4:04 PM
332    Delayed   4:36 PM
337  Estimated   3:47 PM
339  Estimated   3:37 PM
341    Delayed   4:32 PM
345  Estimated   3:34 PM
349  Estimated   3:24 PM
356    Delayed   4:56 PM
358  Estimated   3:45 PM
367  Estimated   4:09 PM
370  Estimated   4:04 PM
371  Estimated   4:11 PM
373    Delayed   5:21 PM
382  Estimated   3:56 PM
384    Delayed   4:28 PM
389    Delayed   4:41 PM
393  Estimated   4:02 PM
397    Delayed   5:23 PM

[240 rows x 2 columns]

281

asked Oct 05 '17 12:10

Harley

2 Answers

You need to modify the solution a bit, because the split sometimes returns two columns and sometimes only one:

df2 = pd.DataFrame({'STATUS':['Estimated 3:17 PM','Delayed 3:00 PM']})


df3 = df2['STATUS'].str.split(n=1, expand=True)
df3.columns = ['STATUS_ID{}'.format(x+1) for x in df3.columns]
print (df3)
  STATUS_ID1 STATUS_ID2
0  Estimated    3:17 PM
1    Delayed    3:00 PM

df2 = df2.join(df3)
print (df2)
              STATUS STATUS_ID1 STATUS_ID2
0  Estimated 3:17 PM  Estimated    3:17 PM
1    Delayed 3:00 PM    Delayed    3:00 PM

Another possible case: none of the values contain whitespace. The solution works for that too:

df2 = pd.DataFrame({'STATUS':['Canceled','Canceled']})

and the solution returns:

print (df2)
     STATUS STATUS_ID1
0  Canceled   Canceled
1  Canceled   Canceled

All together:

df3 = df2['STATUS'].str.split(n=1, expand=True)
df3.columns = ['STATUS_ID{}'.format(x+1) for x in df3.columns]
df2 = df2.join(df3)
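If later code expects both status columns to exist even when no row contains a space, one possible variation is to pad the split result to a fixed width before joining. This is only a sketch and not part of the answer above: the helper name split_status and its parameters are made up for illustration, and the default column names are taken from the question.

```python
import pandas as pd


def split_status(df, col="STATUS", names=("STATUS_ID_1", "STATUS_ID_2")):
    """Hypothetical helper: split `col` on whitespace into exactly len(names) columns."""
    parts = df[col].str.split(n=len(names) - 1, expand=True)
    # Add any missing columns (filled with NaN) so the width is always len(names).
    parts = parts.reindex(columns=list(range(len(names))))
    parts.columns = list(names)
    return df.join(parts)


mixed = split_status(pd.DataFrame({"STATUS": ["Estimated 3:17 PM", "Canceled"]}))
no_space = split_status(pd.DataFrame({"STATUS": ["Canceled", "Canceled"]}))
print(mixed)
print(no_space)
```

With this variation the second column is simply empty (NaN) for rows such as 'Canceled', instead of being absent from the frame.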

194

answered Oct 15 '22 09:10

jezrael

To solve this error, check the shape of the object you're trying to assign to the df columns (using np.shape). The second (or the last) dimension must match the number of columns you're trying to assign to. For example, if you try to assign a 2-column numpy array to 3 columns, you'll see this error.
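The shape check described here could look roughly like the following. It is a minimal sketch with made-up data (the names cols, vals and n_value_cols are illustrative); it compares the last dimension of the values with the number of target columns before attempting the assignment.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 1]})
cols = ["B", "C", "D"]                # three target columns
vals = np.array([[1, 2], [3, 4]])     # but only two columns of values

n_value_cols = np.shape(vals)[-1]     # last dimension of the values
shapes_match = n_value_cols == len(cols)

if shapes_match:
    df[cols] = vals
else:
    # Skip the assignment instead of letting pandas raise the ValueError.
    print(f"cannot assign: {n_value_cols} value column(s) vs {len(cols)} key(s)")
```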

A general workaround (for case 1 and case 2 below) is to cast the object you're trying to assign to a DataFrame and join() it to df, i.e. instead of (1), use (2).

df[cols] = vals   # (1)
df = df.join(vals) if isinstance(vals, pd.DataFrame) else df.join(pd.DataFrame(vals))  # (2)
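To make (2) concrete, here is a small sketch with invented values. The add_prefix call and the "new_" prefix are my own addition, so that the joined columns get string names instead of the default integer labels.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1]})
vals = [[2, 3], [4, 5]]               # two columns of values, no column names yet

# Wrap plain list-likes in a DataFrame; leave existing DataFrames alone.
extra = vals if isinstance(vals, pd.DataFrame) else pd.DataFrame(vals)
extra = extra.add_prefix("new_")      # columns 0, 1 become new_0, new_1

out = df.join(extra)                  # joins on the row index
print(out)
```

Because join() takes however many columns the right-hand frame has, no key list has to match its width.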

If you're trying to replace values in an existing column and got this error (case 3(a) below), convert the object to list and assign.

df[cols] = vals.values.tolist()
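As an illustration of the list conversion, assuming you want a purely positional replacement: a nested list carries no index or column labels, so the values are placed by position only. The frames below are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1], "B": [0, 0]})
# Replacement values whose labels differ from df's (index 10/11, columns x/y).
vals = pd.DataFrame([[2, 3], [4, 5]], index=[10, 11], columns=["x", "y"])

cols = ["A", "B"]
df[cols] = vals.values.tolist()       # plain nested list, placed positionally
print(df)
```

The number of inner list items still has to equal len(cols); the conversion removes the labels, not the width requirement.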

If you have duplicate columns (case 3(b) below), then there's no easy fix. You'll have to make the dimensions match manually.

This error occurs in 3 cases:

Case 1: When you try to assign a list-like object (e.g. lists, tuples, sets, numpy arrays, and pandas Series) to a list of DataFrame column(s) as new arrays¹ but the number of columns doesn't match the second (or last) dimension (found using np.shape) of the list-like object. So the following reproduces this error:

df = pd.DataFrame({'A': [0, 1]})
cols, vals = ['B'], [[2], [4, 5]]
df[cols] = vals # number of columns is 1 but the list has shape (2,)

Note that if the columns are not given as list, pandas Series, numpy array or Pandas Index, this error won't occur. So the following doesn't reproduce the error:

df[('B',)] = vals # the column is given as a tuple

One interesting edge case occurs when the list-like object is multi-dimensional (but not a numpy array). In that case, under the hood, the object is first cast to a pandas DataFrame, and then its last dimension is checked against the number of columns. This produces the following interesting case:

# the error occurs below because pd.DataFrame(vals1) has shape (2, 2) and len(['B']) != 2
vals1 = [[[2], [3]], [[4], [5]]]
df[cols] = vals1

# no error below because pd.DataFrame(vals2) has shape (2, 1) and len(['B']) == 1
vals2 = [[[[2], [3]]], [[[4], [5]]]]
df[cols] = vals2

Case 2: When you try to assign a DataFrame to a list (or pandas Series or numpy array or pandas Index) of columns but the respective numbers of columns don't match. This case is what caused the error in the OP. Each of the following assignments reproduces the error:

df = pd.DataFrame({'A': [0, 1]})
df[['B']] = pd.DataFrame([[2, 3], [4]]) # a 2-column df is trying to be assigned to a single column

df[['B', 'C']] = pd.DataFrame([[2], [4]]) # a single column df is trying to be assigned to 2 columns
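Tied back to the question, the same mismatch appears when every STATUS value lacks a space, because the split then yields a single column. The sketch below uses made-up sample data, and the try/except is there only to capture the message for display.

```python
import pandas as pd

df = pd.DataFrame({"STATUS": ["Canceled", "Canceled"]})
parts = df["STATUS"].str.split(n=1, expand=True)
print(parts.shape)                    # one column only, since there is nothing to split

try:
    df[["STATUS_ID_1", "STATUS_ID_2"]] = parts   # 2 keys, 1 value column
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```

This would explain why the failure looks random in a scraper: it depends on whether a given chunk happens to contain at least one status with a time attached.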

Case 3: When you try to replace the values of existing column(s) with a DataFrame (or a list-like object) whose number of columns doesn't match the number of columns it's replacing. So each of the following reproduces the error:

# case 3(a)
df1 = pd.DataFrame({'A': [0, 1]})
df1['A'] = pd.DataFrame([[2, 3], [4, 5]]) # df1 has a single column named 'A' but a 2-column-df is trying to be assigned

# case 3(b): duplicate column names matter too
df2 = pd.DataFrame([[0, 1], [2, 3]], columns=['A','A'])
df2['A'] = pd.DataFrame([[2], [4]]) # df2 has 2 columns named 'A' but a single column df is being assigned

¹: df.loc[:, cols] = vals may overwrite data inplace, so this won't produce the error but will create columns of NaN values.

answered Oct 15 '22 07:10

cottontail
