
Pandas error in Python: columns must be same length as key

I am scraping data from a few websites and using pandas to modify it.

It worked well on the first few chunks of data, but on later chunks I get this error:

Traceback (most recent call last):
  File "data.py", line 394, in <module>
    df2[['STATUS_ID_1','STATUS_ID_2']] = df2['STATUS'].str.split(n=1, expand=True)
  File "/home/web/.local/lib/python2.7/site-packages/pandas/core/frame.py", line 2326, in __setitem__
    self._setitem_array(key, value)
  File "/home/web/.local/lib/python2.7/site-packages/pandas/core/frame.py", line 2350, in _setitem_array
    raise ValueError('Columns must be same length as key')
ValueError: Columns must be same length as key

My code is here:

df2 = pd.DataFrame(datatable,columns = cols)
df2['FLIGHT_ID_1'] = df2['FLIGHT'].str[:3]
df2['FLIGHT_ID_2'] = df2['FLIGHT'].str[3:].str.zfill(4)
df2[['STATUS_ID_1','STATUS_ID_2']] = df2['STATUS'].str.split(n=1, expand=True)

EDIT (@jezrael): I used your code and printed the result of the split. I hope this helps us find the problem, because the split seems to fail at random.

                 0         1
2       Landed   8:33 AM
3       Landed   9:37 AM
4       Landed   9:10 AM
5       Landed   9:57 AM
6       Landed   9:36 AM
8       Landed   8:51 AM
9       Landed   9:18 AM
11      Landed   8:53 AM
12      Landed   7:59 AM
13      Landed   7:52 AM
14      Landed   8:56 AM
15      Landed   8:09 AM
18      Landed   8:42 AM
19      Landed   9:39 AM
20      Landed   9:45 AM
21      Landed   7:44 AM
23      Landed   8:36 AM
27      Landed   9:53 AM
29      Landed   9:26 AM
30      Landed   8:23 AM
35      Landed   9:59 AM
36      Landed   8:38 AM
37      Landed   9:38 AM
38      Landed   9:37 AM
40      Landed   9:27 AM
43      Landed   9:14 AM
44      Landed   9:22 AM
45      Landed   8:18 AM
46      Landed  10:01 AM
47      Landed  10:21 AM
..         ...       ...
316    Delayed   5:00 PM
317    Delayed   4:34 PM
319  Estimated   2:58 PM
320  Estimated   3:02 PM
321    Delayed   4:47 PM
323  Estimated   3:08 PM
325    Delayed   3:52 PM
326  Estimated   3:09 PM
327  Estimated   2:37 PM
328  Estimated   3:17 PM
329  Estimated   3:20 PM
330  Estimated   2:39 PM
331    Delayed   4:04 PM
332    Delayed   4:36 PM
337  Estimated   3:47 PM
339  Estimated   3:37 PM
341    Delayed   4:32 PM
345  Estimated   3:34 PM
349  Estimated   3:24 PM
356    Delayed   4:56 PM
358  Estimated   3:45 PM
367  Estimated   4:09 PM
370  Estimated   4:04 PM
371  Estimated   4:11 PM
373    Delayed   5:21 PM
382  Estimated   3:56 PM
384    Delayed   4:28 PM
389    Delayed   4:41 PM
393  Estimated   4:02 PM
397    Delayed   5:23 PM

[240 rows x 2 columns]
Harley asked Oct 05 '17 12:10





2 Answers

The solution needs a small change, because the split sometimes returns two columns and sometimes only one:

import pandas as pd

df2 = pd.DataFrame({'STATUS':['Estimated 3:17 PM','Delayed 3:00 PM']})


df3 = df2['STATUS'].str.split(n=1, expand=True)
df3.columns = ['STATUS_ID{}'.format(x+1) for x in df3.columns]
print (df3)
  STATUS_ID1 STATUS_ID2
0  Estimated    3:17 PM
1    Delayed    3:00 PM

df2 = df2.join(df3)
print (df2)
              STATUS STATUS_ID1 STATUS_ID2
0  Estimated 3:17 PM  Estimated    3:17 PM
1    Delayed 3:00 PM    Delayed    3:00 PM

Another possible case is data in which no value contains whitespace. The solution works here too:

df2 = pd.DataFrame({'STATUS':['Canceled','Canceled']})

and the solution returns:

print (df2)
     STATUS STATUS_ID1
0  Canceled   Canceled
1  Canceled   Canceled

All together:

df3 = df2['STATUS'].str.split(n=1, expand=True)
df3.columns = ['STATUS_ID{}'.format(x+1) for x in df3.columns]
df2 = df2.join(df3)
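
If later code expects both status columns to exist in every chunk, one possible variation (not part of the answer above, just a sketch) is to reindex the split result to two columns before renaming it. The column names `STATUS_ID_1` and `STATUS_ID_2` are taken from the question; using `reindex` here is an assumption about what the asker wants, and it fills the missing column with NaN.

```python
import pandas as pd

# A chunk where no status contains whitespace, so the split yields one column.
df2 = pd.DataFrame({'STATUS': ['Canceled', 'Canceled']})

parts = df2['STATUS'].str.split(n=1, expand=True)

# Force exactly two columns (labels 0 and 1); a missing column is filled with NaN.
parts = parts.reindex(columns=[0, 1])
parts.columns = ['STATUS_ID_1', 'STATUS_ID_2']

df2 = df2.join(parts)
print(df2)
```

With this variant every chunk ends up with the same columns, at the cost of NaN values where the status had no second part.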
jezrael answered Oct 15 '22 09:10



To solve this error, check the shape of the object you're trying to assign to the df columns (using np.shape). Its second (or last) dimension must match the number of columns you're assigning to. For example, if you try to assign a 2-column numpy array to 3 columns, you'll see this error.
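
As a rough sketch of that check, the last dimension reported by np.shape can be compared with the number of target columns before assigning. The helper name `fits_columns` is hypothetical and not part of pandas; it only illustrates the comparison described above.

```python
import numpy as np
import pandas as pd


def fits_columns(vals, cols):
    """Return True if the last dimension of vals equals len(cols).

    Hypothetical helper for illustration; not a pandas function.
    """
    return np.shape(vals)[-1] == len(cols)


df = pd.DataFrame({'A': [0, 1]})
vals = np.array([[2, 3], [4, 5]])  # shape (2, 2): two values per row

ok_two = fits_columns(vals, ['B', 'C'])         # 2 values per row, 2 columns
ok_three = fits_columns(vals, ['B', 'C', 'D'])  # 2 values per row, 3 columns

if ok_two:
    df[['B', 'C']] = vals  # dimensions match, so the assignment is safe

print(ok_two, ok_three)
print(df)
```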

A general workaround (for case 1 and case 2 below) is to cast the object you're trying to assign to a DataFrame and join() it to df, i.e. instead of (1), use (2).

df[cols] = vals   # (1)
df = df.join(vals) if isinstance(vals, pd.DataFrame) else df.join(pd.DataFrame(vals))  # (2)

If you're trying to replace values in an existing column and got this error (case 3(a) below), convert the object to a list and assign it.

df[cols] = vals.values.tolist()

If you have duplicate columns (case 3(b) below), then there's no easy fix. You'll have to make the dimensions match manually.



This error occurs in 3 cases:

Case 1: When you try to assign a list-like object (e.g. lists, tuples, sets, numpy arrays, and pandas Series) to a list of DataFrame column(s) as new arrays [1] but the number of columns doesn't match the second (or last) dimension (found using np.shape) of the list-like object. So the following reproduces this error:

df = pd.DataFrame({'A': [0, 1]})
cols, vals = ['B'], [[2], [4, 5]]
df[cols] = vals # number of columns is 1 but the list has shape (2,)

Note that if the columns are not given as list, pandas Series, numpy array or Pandas Index, this error won't occur. So the following doesn't reproduce the error:

df[('B',)] = vals # the column is given as a tuple

One interesting edge case occurs when the list-like object is multi-dimensional (but not a numpy array). In that case, under the hood, the object is first cast to a pandas DataFrame, and its last dimension is then checked against the number of columns. This produces the following behavior:

# the error occurs below because pd.DataFrame(vals1) has shape (2, 2) and len(['B']) != 2
vals1 = [[[2], [3]], [[4], [5]]]
df[cols] = vals1

# no error below because pd.DataFrame(vals2) has shape (2, 1) and len(['B']) == 1
vals2 = [[[[2], [3]]], [[[4], [5]]]]
df[cols] = vals2

Case 2: When you try to assign a DataFrame to a list (or pandas Series or numpy array or pandas Index) of columns but the respective numbers of columns don't match. This case is what caused the error in the OP. Each of the following assignments reproduces the error:

df = pd.DataFrame({'A': [0, 1]})
df[['B']] = pd.DataFrame([[2, 3], [4]]) # a 2-column df is trying to be assigned to a single column

df[['B', 'C']] = pd.DataFrame([[2], [4]]) # a single column df is trying to be assigned to 2 columns
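
A minimal sketch of how this arises in the question's setting, assuming a chunk in which no status contains whitespace (the 'Canceled' data is borrowed from the first answer). The split then returns a single column, which cannot be assigned to two column names:

```python
import pandas as pd

df2 = pd.DataFrame({'STATUS': ['Canceled', 'Canceled']})

# No whitespace to split on, so expand=True produces only one column.
parts = df2['STATUS'].str.split(n=1, expand=True)
print(parts.shape)

message = None
try:
    # One-column DataFrame assigned to two column names.
    df2[['STATUS_ID_1', 'STATUS_ID_2']] = parts
except ValueError as err:
    message = str(err)

print(message)
```

This would explain why the failure looks random in the question: it depends on whether a given chunk happens to contain at least one status with a time attached.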

Case 3: When you try to replace the values of existing column(s) with a DataFrame (or a list-like object) whose number of columns doesn't match the number of columns it's replacing. Each of the following reproduces the error:

# case 3(a)
df1 = pd.DataFrame({'A': [0, 1]})
df1['A'] = pd.DataFrame([[2, 3], [4, 5]]) # df1 has a single column named 'A' but a 2-column-df is trying to be assigned

# case 3(b): duplicate column names matter too
df2 = pd.DataFrame([[0, 1], [2, 3]], columns=['A','A'])
df2['A'] = pd.DataFrame([[2], [4]]) # df2 has 2 columns named 'A' but a single column df is being assigned

[1]: df.loc[:, cols] = vals may overwrite data in place, so it won't produce this error but will create columns of NaN values.

cottontail answered Oct 15 '22 07:10
