I am webscraping some data from a few websites, and using pandas to modify it.
On the first few chunks of data it worked well, but later I get this error message:
Traceback (most recent call last):
  File "data.py", line 394, in <module>
    df2[['STATUS_ID_1','STATUS_ID_2']] = df2['STATUS'].str.split(n=1, expand=True)
  File "/home/web/.local/lib/python2.7/site-packages/pandas/core/frame.py", line 2326, in __setitem__
    self._setitem_array(key, value)
  File "/home/web/.local/lib/python2.7/site-packages/pandas/core/frame.py", line 2350, in _setitem_array
    raise ValueError('Columns must be same length as key')
ValueError: Columns must be same length as key
My code is here:
df2 = pd.DataFrame(datatable,columns = cols)
df2['FLIGHT_ID_1'] = df2['FLIGHT'].str[:3]
df2['FLIGHT_ID_2'] = df2['FLIGHT'].str[3:].str.zfill(4)
df2[['STATUS_ID_1','STATUS_ID_2']] = df2['STATUS'].str.split(n=1, expand=True)
EDIT @jezrael: I used your code and printed the result of the split. I hope this helps us find the problem, because the script seems to fail on this split at random.
0 1
2 Landed 8:33 AM
3 Landed 9:37 AM
4 Landed 9:10 AM
5 Landed 9:57 AM
6 Landed 9:36 AM
8 Landed 8:51 AM
9 Landed 9:18 AM
11 Landed 8:53 AM
12 Landed 7:59 AM
13 Landed 7:52 AM
14 Landed 8:56 AM
15 Landed 8:09 AM
18 Landed 8:42 AM
19 Landed 9:39 AM
20 Landed 9:45 AM
21 Landed 7:44 AM
23 Landed 8:36 AM
27 Landed 9:53 AM
29 Landed 9:26 AM
30 Landed 8:23 AM
35 Landed 9:59 AM
36 Landed 8:38 AM
37 Landed 9:38 AM
38 Landed 9:37 AM
40 Landed 9:27 AM
43 Landed 9:14 AM
44 Landed 9:22 AM
45 Landed 8:18 AM
46 Landed 10:01 AM
47 Landed 10:21 AM
.. ... ...
316 Delayed 5:00 PM
317 Delayed 4:34 PM
319 Estimated 2:58 PM
320 Estimated 3:02 PM
321 Delayed 4:47 PM
323 Estimated 3:08 PM
325 Delayed 3:52 PM
326 Estimated 3:09 PM
327 Estimated 2:37 PM
328 Estimated 3:17 PM
329 Estimated 3:20 PM
330 Estimated 2:39 PM
331 Delayed 4:04 PM
332 Delayed 4:36 PM
337 Estimated 3:47 PM
339 Estimated 3:37 PM
341 Delayed 4:32 PM
345 Estimated 3:34 PM
349 Estimated 3:24 PM
356 Delayed 4:56 PM
358 Estimated 3:45 PM
367 Estimated 4:09 PM
370 Estimated 4:04 PM
371 Estimated 4:11 PM
373 Delayed 5:21 PM
382 Estimated 3:56 PM
384 Delayed 4:28 PM
389 Delayed 4:41 PM
393 Estimated 4:02 PM
397 Delayed 5:23 PM
[240 rows x 2 columns]
You need to modify the solution slightly, because the split sometimes returns two columns and sometimes only one:
df2 = pd.DataFrame({'STATUS':['Estimated 3:17 PM','Delayed 3:00 PM']})
df3 = df2['STATUS'].str.split(n=1, expand=True)
df3.columns = ['STATUS_ID{}'.format(x+1) for x in df3.columns]
print (df3)
STATUS_ID1 STATUS_ID2
0 Estimated 3:17 PM
1 Delayed 3:00 PM
df2 = df2.join(df3)
print (df2)
STATUS STATUS_ID1 STATUS_ID2
0 Estimated 3:17 PM Estimated 3:17 PM
1 Delayed 3:00 PM Delayed 3:00 PM
Another possible case: none of the values contain whitespace, and the solution still works:
df2 = pd.DataFrame({'STATUS':['Canceled','Canceled']})
and the solution returns:
print (df2)
STATUS STATUS_ID1
0 Canceled Canceled
1 Canceled Canceled
All together:
df3 = df2['STATUS'].str.split(n=1, expand=True)
df3.columns = ['STATUS_ID{}'.format(x+1) for x in df3.columns]
df2 = df2.join(df3)
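If you would rather keep the fixed column names STATUS_ID_1 and STATUS_ID_2 from the question, one option is to force the split result to always have two columns. This is only a sketch and not part of the solution above: it assumes the split columns are labelled 0 and 1 (which is what expand=True produces) and uses made-up sample data. The missing second part simply becomes NaN.

```python
import pandas as pd

# Hypothetical sample: no value contains whitespace, so the split yields one column.
df2 = pd.DataFrame({'STATUS': ['Canceled', 'Canceled']})

parts = df2['STATUS'].str.split(n=1, expand=True)

# Force exactly two columns (labels 0 and 1); a missing column is filled with NaN.
parts = parts.reindex(columns=[0, 1])

# The number of columns now always matches the number of keys.
df2[['STATUS_ID_1', 'STATUS_ID_2']] = parts
print(df2)
```

The same lines work unchanged when the values do contain whitespace, because reindex leaves existing columns 0 and 1 as they are.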
To solve this error, check the shape of the object you're trying to assign to the df columns (using np.shape). The second (or the last) dimension must match the number of columns you're trying to assign to. For example, if you try to assign a 2-column numpy array to 3 columns, you'll see this error.
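A minimal sketch of that check, using the 2-column array and 3 target columns from the example. The variable names (n_value_cols, shapes_match) are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1]})
cols = ['B', 'C', 'D']
vals = np.array([[2, 3], [4, 5]])  # 2 rows x 2 columns

# Compare the last dimension of the values with the number of target columns
# before assigning, instead of letting pandas raise.
n_value_cols = np.shape(vals)[-1]
shapes_match = n_value_cols == len(cols)
print(np.shape(vals), len(cols), shapes_match)  # (2, 2) 3 False

if shapes_match:
    df[cols] = vals  # skipped here because 2 != 3
```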
A general workaround (for case 1 and case 2 below) is to cast the object you're trying to assign to a DataFrame and join() it to df, i.e. instead of (1), use (2).
df[cols] = vals # (1)
df = df.join(vals) if isinstance(vals, pd.DataFrame) else df.join(pd.DataFrame(vals)) # (2)
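To make (2) concrete, here is a small runnable sketch with hypothetical values. Note that the joined columns get the default integer names 0 and 1 when vals is a plain list:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1]})
vals = [[2, 3], [4, 5]]  # hypothetical values: two rows, two columns

# Wrap non-DataFrame values in a DataFrame, then join on the index.
to_join = vals if isinstance(vals, pd.DataFrame) else pd.DataFrame(vals)
df = df.join(to_join)
print(df)
```

join() aligns on the index and raises if the joined frame reuses an existing column name, unless lsuffix or rsuffix is given, so rename the new columns first if there is any overlap.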
If you're trying to replace values in an existing column and got this error (case 3(a) below), convert the object to list and assign.
df[cols] = vals.values.tolist()
If you have duplicate columns (case 3(b) below), then there's no easy fix. You'll have to make the dimensions match manually.
This error occurs in 3 cases:
Case 1: When you try to assign a list-like object (e.g. lists, tuples, sets, numpy arrays, and pandas Series) to a list of DataFrame column(s) as new arrays (see footnote 1) but the number of columns doesn't match the second (or last) dimension (found using np.shape) of the list-like object. So the following reproduces this error:
df = pd.DataFrame({'A': [0, 1]})
cols, vals = ['B'], [[2], [4, 5]]
df[cols] = vals # number of columns is 1 but the list has shape (2,)
Note that if the columns are not given as list, pandas Series, numpy array or Pandas Index, this error won't occur. So the following doesn't reproduce the error:
df[('B',)] = vals # the column is given as a tuple
One interesting edge case occurs when the list-like object is multi-dimensional (but not a numpy array). In that case, under the hood, the object is cast to a pandas DataFrame first and is checked if its last dimension matches the number of columns. This produces the following interesting case:
# the error occurs below because pd.DataFrame(vals1) has shape (2, 2) and len(['B']) != 2
vals1 = [[[2], [3]], [[4], [5]]]
df[cols] = vals1
# no error below because pd.DataFrame(vals2) has shape (2, 1) and len(['B']) == 1
vals2 = [[[[2], [3]]], [[[4], [5]]]]
df[cols] = vals2
Case 2: When you try to assign a DataFrame to a list (or pandas Series or numpy array or pandas Index) of columns but the respective numbers of columns don't match. This case is what caused the error in the OP. The following reproduce the error:
df = pd.DataFrame({'A': [0, 1]})
df[['B']] = pd.DataFrame([[2, 3], [4]]) # a 2-column df is trying to be assigned to a single column
df[['B', 'C']] = pd.DataFrame([[2], [4]]) # a single column df is trying to be assigned to 2 columns
Case 3: When you try to replace the values of existing column(s) by a DataFrame (or a list-like object) whose number of columns doesn't match the number of columns it's replacing. So the following reproduce the error:
# case 3(a)
df1 = pd.DataFrame({'A': [0, 1]})
df1['A'] = pd.DataFrame([[2, 3], [4, 5]]) # df1 has a single column named 'A' but a 2-column-df is trying to be assigned
# case 3(b): duplicate column names matter too
df2 = pd.DataFrame([[0, 1], [2, 3]], columns=['A','A'])
df2['A'] = pd.DataFrame([[2], [4]]) # df2 has 2 columns named 'A' but a single column df is being assigned
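Case 2 is the one behind the original question. The sketch below, with made-up data, shows how it can arise from str.split: when no value contains whitespace, expand=True returns a single column, and assigning it to two keys raises.

```python
import pandas as pd

# Hypothetical data: no value contains whitespace.
df = pd.DataFrame({'STATUS': ['Canceled', 'Canceled']})

parts = df['STATUS'].str.split(n=1, expand=True)
print(parts.shape)  # only one column came back from the split

try:
    # Two target columns, but a one-column DataFrame on the right-hand side.
    df[['STATUS_ID_1', 'STATUS_ID_2']] = parts
except ValueError as exc:
    message = str(exc)
    print(message)
```

Printing parts.shape before the assignment is a quick way to see which chunks of scraped data will trigger the error.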
1: df.loc[:, cols] = vals may overwrite data inplace, so this won't produce the error but will create columns of NaN values.