Searching a Pandas series using a string produces a KeyError

I'm trying to use df[df['col'].str.contains("string")] (described in these two SO questions: 1 & 2) to select rows based on a partial string match. Here's my code:

import requests
import json
import pandas as pd
import datetime

url = "http://api.turfgame.com/v4/zones/all" # get request returns .json 
r = requests.get(url)
df = pd.read_json(r.content) # create a df containing all zone info

print df[df['region'].str.contains("Uppsala")].head()

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-23-55bbf5679808> in <module>()
----> 1 print df[df['region'].str.contains("Uppsala")].head()

C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   1670         if isinstance(key, (Series, np.ndarray, list)):
   1671             # either boolean or fancy integer index
-> 1672             return self._getitem_array(key)
   1673         elif isinstance(key, DataFrame):
   1674             return self._getitem_frame(key)

C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\frame.pyc in _getitem_array(self, key)
   1714             return self.take(indexer, axis=0, convert=False)
   1715         else:
-> 1716             indexer = self.ix._convert_to_indexer(key, axis=1)
   1717             return self.take(indexer, axis=1, convert=True)
   1718 

C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
   1083                     if isinstance(obj, tuple) and is_setter:
   1084                         return {'key': obj}
-> 1085                     raise KeyError('%s not in index' % objarr[mask])
   1086 
   1087                 return indexer

KeyError: '[ nan  nan  nan ...,  nan  nan  nan] not in index'

I don't understand why I get a KeyError, because df.columns returns:

Index([u'dateCreated', u'id', u'latitude', u'longitude', u'name', u'pointsPerHour', u'region', u'takeoverPoints', u'totalTakeovers'], dtype='object')

So the key is in the list of columns, and if I open the URL in a browser I can find 739 instances of 'Uppsala'.

The column I'm searching contains nested .json objects that look like this: {"id":200,"name":"Scotland","country":"gb"}. Do I have to do something special to search between the '{}' characters? Could somebody explain where I've made my mistake(s)?

Jason asked Jan 10 '23

1 Answer

Looks to me like your region column contains dictionaries, which aren't really supported as element values, so .str isn't working on it. One way to solve the problem is to promote the region dictionaries to columns in their own right, maybe with something like:

>>> region = pd.DataFrame(df.pop("region").tolist())
>>> df = df.join(region, rsuffix="_region")

after which you have

>>> df.head()
                dateCreated     id   latitude  longitude         name  pointsPerHour  takeoverPoints  totalTakeovers country  id_region             name_region
0  2013-06-15T08:00:00+0000  14639  55.947079  -3.206477  GrandSquare              1             185              32      gb        200                Scotland
1  2014-06-15T20:02:37+0000  31571  55.649181  12.609056   Stenringen              1             185               6      dk        172             Hovedstaden
2  2013-06-15T08:00:00+0000  18958  54.593570  -5.955772  Hospitality              0             250               1      gb        206        Northern Ireland
3  2013-06-15T08:00:00+0000  18661  53.754283  -1.526638  LanshawZone              0             250               0      gb        202  Yorkshire & The Humber
4  2013-06-15T08:00:00+0000  17424  55.949285  -3.144777   NoDogsZone              0             250               5      gb        200                Scotland

and

>>> df[df["name_region"].str.contains("Uppsala")].head()
                  dateCreated     id   latitude  longitude          name  pointsPerHour  takeoverPoints  totalTakeovers country  id_region name_region
28   2013-07-16T18:53:48+0000  20828  59.793476  17.775389  MoraStenRast              5             125             536      se        142     Uppsala
59   2013-02-08T21:42:53+0000  14797  59.570418  17.482116      BålWoods              3             155             555      se        142     Uppsala
102  2014-06-19T12:00:00+0000  31843  59.617637  17.077094       EnaAlle              5             125             168      se        142     Uppsala
328  2012-09-24T20:08:22+0000  11461  59.634438  17.066398      BluePark              6             110            1968      se        142     Uppsala
330  2014-08-28T20:00:00+0000  33695  59.867027  17.710792  EnbackensBro              4             140              59      se        142     Uppsala

(A hack workaround would be df["region"].apply(str).str.contains("Uppsala"), but I think it's best to clean the data right at the start.)
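To make the idea concrete without hitting the API, here's a minimal self-contained sketch of the same promote-and-join approach, using a couple of made-up rows shaped like the Turf API response (the sample values are invented for illustration):

```python
import pandas as pd

# Tiny synthetic frame mimicking the API response: 'region' holds dicts,
# which is why .str.contains fails on it directly.
df = pd.DataFrame({
    "name": ["MoraStenRast", "GrandSquare"],
    "region": [{"id": 142, "name": "Uppsala", "country": "se"},
               {"id": 200, "name": "Scotland", "country": "gb"}],
})

# Promote the nested dicts to a DataFrame of their own...
region = pd.DataFrame(df.pop("region").tolist())

# ...and join it back; rsuffix renames region's clashing 'name' column
# to 'name_region'.
df = df.join(region, rsuffix="_region")

# Now the region name is a plain string column and .str works as expected.
matches = df[df["name_region"].str.contains("Uppsala")]
print(matches)
```

The pop/tolist/join dance is just one way to flatten the nested objects; the point is that once each dictionary field lives in its own string column, the usual string accessors apply.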

DSM answered Jan 26 '23