I have a DataFrame on which I would like to use the 'str.contains()' method. I thought I had found how to do this when I read "pandas + dataframe - select by partial string". However, I keep getting a ValueError.
My DataFrame is as follows:
ID,ENROLLMENT_DATE,TRAINER_MANAGING,TRAINER_OPERATOR,FIRST_VISIT_DATE
1536D,12-Feb-12,"06DA1B3-Lebanon NH",,15-Feb-12
F15D,18-May-12,"06405B2-Lebanon NH",,25-Jul-12
8096,8-Aug-12,"0643D38-Hanover NH","0643D38-Hanover NH",25-Jun-12
A036,1-Apr-12,"06CB8CF-Hanover NH","06CB8CF-Hanover NH",9-Aug-12
8944,19-Feb-12,"06D26AD-Hanover NH",,4-Feb-12
1004E,8-Jun-12,"06388B2-Lebanon NH",,24-Dec-11
11795,3-Jul-12,"0649597-White River VT","0649597-White River VT",30-Mar-12
30D7,11-Nov-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",30-Nov-11
3AE2,21-Feb-12,"06405B2-Lebanon NH",,26-Oct-12
B0FE,17-Feb-12,"06D1B9D-Hartland VT",,16-Feb-12
127A1,11-Dec-11,"064456E-Hanover NH","064456E-Hanover NH",11-Nov-12
161FF,20-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",3-Jul-12
A036,30-Nov-11,"063B208-Randolph VT","063B208-Randolph VT",
475B,25-Sep-12,"06D26AD-Hanover NH",,5-Nov-12
151A3,7-Mar-12,"06388B2-Lebanon NH",,16-Nov-12
CA62,3-Jan-12,,,
D31B,18-Dec-11,"06405B2-Lebanon NH",,9-Jan-12
20F5,8-Jul-12,"0669C50-Randolph VT",,3-Feb-12
8096,19-Dec-11,"0649597-White River VT","0649597-White River VT",9-Apr-12
14E48,1-Aug-12,"06D3206-Hanover NH",,
177F8,20-Aug-12,"063B208-Randolph VT","063B208-Randolph VT",5-May-12
553E,11-Oct-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",8-Mar-12
12D5F,18-Jul-12,"0649597-White River VT","0649597-White River VT",2-Nov-12
C6DC,13-Apr-12,"06388B2-Lebanon NH",,
11795,27-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",19-Jun-12
17B43,11-Aug-12,,,22-Oct-12
A036,11-Aug-12,"06D3206-Hanover NH",,19-Jun-12
Then I run the following code:
test = pandas.read_csv('testcsv.csv')
test[test.TRAINER_MANAGING.str.contains('Han', na=False)]
and I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-54-e0c4624c9346> in <module>()
----> 1 test[test.TRAINER_MANAGING.str.contains('Han', na=False)]
.virtualenvs/ipython/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
1958
1959 # also raises Exception if object array with NA values
-> 1960 if com._is_bool_indexer(key):
1961 key = np.asarray(key, dtype=bool)
1962 return self._getitem_array(key)
.virtualenvs/ipython/lib/python2.7/site-packages/pandas/core/common.pyc in _is_bool_indexer(key)
685 if not lib.is_bool_array(key):
686 if isnull(key).any():
--> 687 raise ValueError('cannot index with vector containing '
688 'NA / NaN values')
689 return False
ValueError: cannot index with vector containing NA / NaN values
I feel like I am missing something simple. Any help would be appreciated.
Your string search still returns NaN values, whereas boolean indexing works with booleans only. It appears that 'na=False' is not working (in this case?); I can replicate it on my machine with the latest released pandas version.
You can work around it by first applying the .fillna() function to the result, like this:
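To see why the mask is rejected, here is a minimal sketch. It assumes an object-dtype column, which is how older pandas releases store strings read from a CSV. The three-row Series is a made-up sample based on the question's data, and the behaviour may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample based on the TRAINER_MANAGING column.
# dtype=object mimics how older pandas stores strings read from a CSV.
trainers = pd.Series(
    ["06DA1B3-Lebanon NH", np.nan, "0643D38-Hanover NH"], dtype=object
)

# Without na=..., missing entries stay missing in the result, so the
# mask mixes True/False with NaN instead of being a pure boolean array.
mask = trainers.str.contains("Han")
missing_positions = mask.isna().tolist()

print(mask)
print(missing_positions)
```

Any position flagged in `missing_positions` is one that boolean indexing cannot interpret as True or False.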
test[test.TRAINER_MANAGING.str.contains('Han').fillna(False)]
Which returns:
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
4 8944 19-Feb-12 06D26AD-Hanover NH NaN 4-Feb-12
7 30D7 11-Nov-12 06D95A3-Hanover NH 06D95A3-Hanover NH 30-Nov-11
10 127A1 11-Dec-11 064456E-Hanover NH 064456E-Hanover NH 11-Nov-12
11 161FF 20-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 3-Jul-12
13 475B 25-Sep-12 06D26AD-Hanover NH NaN 5-Nov-12
19 14E48 1-Aug-12 06D3206-Hanover NH NaN NaN
21 553E 11-Oct-12 06D95A3-Hanover NH 06D95A3-Hanover NH 8-Mar-12
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12
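The same workaround can be tried on a small inline excerpt of the data. This is only a sketch: the four-row CSV string is a hypothetical subset of the question's file, and the extra `.astype(bool)` is my own assumption, added so the mask is a plain boolean array regardless of how a given pandas version handles `fillna` on an object-dtype Series.

```python
import io

import pandas as pd

# Hypothetical excerpt of the question's CSV, inlined so the sketch
# does not depend on testcsv.csv being present.
csv_text = """ID,ENROLLMENT_DATE,TRAINER_MANAGING,TRAINER_OPERATOR,FIRST_VISIT_DATE
1536D,12-Feb-12,"06DA1B3-Lebanon NH",,15-Feb-12
8096,8-Aug-12,"0643D38-Hanover NH","0643D38-Hanover NH",25-Jun-12
CA62,3-Jan-12,,,
8944,19-Feb-12,"06D26AD-Hanover NH",,4-Feb-12
"""
test = pd.read_csv(io.StringIO(csv_text))

# Build the mask, treat rows with a missing TRAINER_MANAGING as
# non-matches, and force a plain bool dtype before indexing.
mask = test["TRAINER_MANAGING"].str.contains("Han")
mask = mask.fillna(False).astype(bool)

hanover = test[mask]
print(hanover)
```

Only the rows whose TRAINER_MANAGING value contains "Han" remain; the row with no trainer at all (CA62) is dropped rather than raising an error.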
I have never used the str.contains function before, so I'm not sure whether it is working incorrectly here. We should open an issue on GitHub if it should work as in your example.