Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas | Read json file with list/array-like fields to Boolean columns

Here is a JSON string that contains a list of objects with each having another list embedded.

[
  {
    "name": "Alice",
    "hobbies": [
      "volleyball",
      "shopping",
      "movies"
    ]
  },
  {
    "name": "Bob",
    "hobbies": [
      "fishing",
      "movies"
    ]
  }
]

Using pandas.read_json() this turns into a DataFrame like this:

  name      hobbies
  --------------------------------------
1 Alice     [volleyball, shopping, movies]
2 Bob       [fishing, movies]

However, I would like to flatten the lists into Boolean columns like this:

  name      volleyball  shopping    movies  fishing 
  ----------------------------------------------------
1 Alice     True        True        True    False
2 Bob       False       False       True    True

I.e. when the list contains a value, the field in the corresponding column is filled with a Boolean True, otherwise with False.

I have also looked into pandas.io.json.json_normalize(), but that does not seem support this idea either. Is there any built-in way (either Python3, or pandas) to do this?

(PS. I realize that you can cook up your own code to 'normalize' the dictionary objects before loading the whole list into a DataFrame, but I might be reinventing the wheel with that and probably in a very inefficient way).

like image 669
imrek Avatar asked Oct 31 '22 07:10

imrek


1 Answers

you can do the following:

In [56]: data = [
   ....:   {
   ....:     "name": "Alice",
   ....:     "hobbies": [
   ....:       "volleyball",
   ....:       "shopping",
   ....:       "movies"
   ....:     ]
   ....:   },
   ....:   {
   ....:     "name": "Bob",
   ....:     "hobbies": [
   ....:       "fishing",
   ....:       "movies"
   ....:     ]
   ....:   }
   ....: ]

 In [57]: df = pd.io.json.json_normalize(data, 'hobbies', ['name']).rename(columns={0:'hobby'})

In [59]: df['count'] = 1

In [60]: df
Out[60]:
        hobby   name  count
0  volleyball  Alice      1
1    shopping  Alice      1
2      movies  Alice      1
3     fishing    Bob      1
4      movies    Bob      1

In [61]: df.pivot_table(index='name', columns='hobby', values='count').fillna(0)
Out[61]:
hobby  fishing  movies  shopping  volleyball
name
Alice      0.0     1.0       1.0         1.0
Bob        1.0     1.0       0.0         0.0

Or even better:

In [88]: r = df.pivot_table(index='name', columns='hobby', values='count').fillna(0)

In [89]: r
Out[89]:
hobby  fishing  movies  shopping  volleyball
name
Alice      0.0     1.0       1.0         1.0
Bob        1.0     1.0       0.0         0.0

let's generate list of 'boolean' columns dynamically

In [90]: cols_boolean = [c for c in r.columns.tolist() if c != 'name']

In [91]: r = r[cols_boolean].astype(bool)

In [92]: print(r)
hobby fishing movies shopping volleyball
name
Alice   False   True     True       True
Bob      True   True    False      False
like image 137
MaxU - stop WAR against UA Avatar answered Nov 09 '22 14:11

MaxU - stop WAR against UA