Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Value Error when Slicing in Pandas

Tags:

python

pandas

I have a DataFrame that I would like to use the 'str.contrains()' method. I believed I had found how to do this when I read pandas + dataframe - select by partial string. However, I keep getting a value error.

My DataFrame is as follow:

ID,ENROLLMENT_DATE,TRAINER_MANAGING,TRAINER_OPERATOR,FIRST_VISIT_DATE
1536D,12-Feb-12,"06DA1B3-Lebanon NH",,15-Feb-12
F15D,18-May-12,"06405B2-Lebanon NH",,25-Jul-12
8096,8-Aug-12,"0643D38-Hanover NH","0643D38-Hanover NH",25-Jun-12
A036,1-Apr-12,"06CB8CF-Hanover NH","06CB8CF-Hanover NH",9-Aug-12
8944,19-Feb-12,"06D26AD-Hanover NH",,4-Feb-12
1004E,8-Jun-12,"06388B2-Lebanon NH",,24-Dec-11
11795,3-Jul-12,"0649597-White River VT","0649597-White River VT",30-Mar-12
30D7,11-Nov-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",30-Nov-11
3AE2,21-Feb-12,"06405B2-Lebanon NH",,26-Oct-12
B0FE,17-Feb-12,"06D1B9D-Hartland VT",,16-Feb-12
127A1,11-Dec-11,"064456E-Hanover NH","064456E-Hanover NH",11-Nov-12
161FF,20-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",3-Jul-12
A036,30-Nov-11,"063B208-Randolph VT","063B208-Randolph VT",
475B,25-Sep-12,"06D26AD-Hanover NH",,5-Nov-12
151A3,7-Mar-12,"06388B2-Lebanon NH",,16-Nov-12
CA62,3-Jan-12,,,
D31B,18-Dec-11,"06405B2-Lebanon NH",,9-Jan-12
20F5,8-Jul-12,"0669C50-Randolph VT",,3-Feb-12
8096,19-Dec-11,"0649597-White River VT","0649597-White River VT",9-Apr-12
14E48,1-Aug-12,"06D3206-Hanover NH",,
177F8,20-Aug-12,"063B208-Randolph VT","063B208-Randolph VT",5-May-12
553E,11-Oct-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",8-Mar-12
12D5F,18-Jul-12,"0649597-White River VT","0649597-White River VT",2-Nov-12
C6DC,13-Apr-12,"06388B2-Lebanon NH",,
11795,27-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",19-Jun-12
17B43,11-Aug-12,,,22-Oct-12
A036,11-Aug-12,"06D3206-Hanover NH",,19-Jun-12

Then I run the following code:

test = pandas.read_csv('testcsv.csv')
test[test.TRAINER_MANAGING.str.contains('Han', na=False)]

and I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-54-e0c4624c9346> in <module>()
----> 1 test[test.TRAINER_MANAGING.str.contains('Han', na=False)]

.virtualenvs/ipython/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1958 
   1959             # also raises Exception if object array with NA values
-> 1960             if com._is_bool_indexer(key):
   1961                 key = np.asarray(key, dtype=bool)
   1962             return self._getitem_array(key)

.virtualenvs/ipython/lib/python2.7/site-packages/pandas/core/common.pyc in _is_bool_indexer(key)
    685         if not lib.is_bool_array(key):
    686             if isnull(key).any():
--> 687                 raise ValueError('cannot index with vector containing '
    688                                  'NA / NaN values')
    689             return False

ValueError: cannot index with vector containing NA / NaN values

I feel like I am missing something simple. Any help would be appreciated.

like image 688
BigHandsome Avatar asked Feb 06 '13 14:02

BigHandsome


People also ask

What is value error in pandas?

One of the most commonly reported error in pandas is ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all() and it may sometimes be quite tricky to deal with, especially if you are new to pandas library (or even Python).

How do I remove a timestamp timezone?

tz_localize(None) method can be applied to the dataframe column to remove the timezone information. The output similar to the above example reflects that after manipulation, the UTC timezone information is no longer present in the timestamp column.

How do you slice the pandas index?

Slicing Rows and Columns by Index Position When slicing by index position in Pandas, the start index is included in the output, but the stop index is one step beyond the row you want to select. So the slice return row 0 and row 1, but does not return row 2. The second slice [:] indicates that all columns are required.


1 Answers

Your string search still returns nan values whereas the slicing operation works with booleans only. It appears 'na=False' is not working (in this case?), i can replicate it on my machine with the latest (released) Pandas version.

You can workaround it by first applying the .fillna() function to the results like:

test[test.TRAINER_MANAGING.str.contains('Han').fillna(False)]

Which returns:

       ID ENROLLMENT_DATE    TRAINER_MANAGING    TRAINER_OPERATOR FIRST_VISIT_DATE
2    8096        8-Aug-12  0643D38-Hanover NH  0643D38-Hanover NH        25-Jun-12
3    A036        1-Apr-12  06CB8CF-Hanover NH  06CB8CF-Hanover NH         9-Aug-12
4    8944       19-Feb-12  06D26AD-Hanover NH                 NaN         4-Feb-12
7    30D7       11-Nov-12  06D95A3-Hanover NH  06D95A3-Hanover NH        30-Nov-11
10  127A1       11-Dec-11  064456E-Hanover NH  064456E-Hanover NH        11-Nov-12
11  161FF       20-Feb-12  0643D38-Hanover NH  0643D38-Hanover NH         3-Jul-12
13   475B       25-Sep-12  06D26AD-Hanover NH                 NaN         5-Nov-12
19  14E48        1-Aug-12  06D3206-Hanover NH                 NaN              NaN
21   553E       11-Oct-12  06D95A3-Hanover NH  06D95A3-Hanover NH         8-Mar-12
24  11795       27-Feb-12  0643D38-Hanover NH  0643D38-Hanover NH        19-Jun-12
26   A036       11-Aug-12  06D3206-Hanover NH                 NaN        19-Jun-12

I have never used the str.contains function before so im not sure if it doesnt work correctly. We should open an issue on github if it should work as in your example.

like image 102
Rutger Kassies Avatar answered Nov 02 '22 23:11

Rutger Kassies