
Pandas: Reconstruct dataframe from strings of key:value pairs

Suppose I have the following dataset:

  0
0 foo:1 bar:2 baz:3
1 bar:4 baz:5
2 foo:6

Each line is essentially a dict serialized into a string, with key:value pairs separated by spaces. Each row holds hundreds of key:value pairs, while the number of unique keys is a few thousand, so the data is sparse.

What I want is a DataFrame where the keys are columns and the values are cells, with missing values replaced by zeros. Like this:

  foo bar baz
0   1   2   3
1   0   4   5
2   6   0   0

I know I can split each string into key:value pairs:

In: frame[0].str.split(' ')
Out:
  0
0 [foo:1, bar:2, baz:3]
1 [bar:4, baz:5]
2 [foo:6]

But what's next?

Edit: I'm running this in the AzureML Studio environment, so efficiency is important.

z4y4ts asked Dec 06 '25 05:12


1 Answer

You can use a list comprehension to build one dict per row, create a new DataFrame with from_records, and then fillna with 0:

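import pandas as pd

# the column label is the string '0' here; use df[0] if the label is the integer 0
s = df['0'].str.split(' ')

# one dict per row; split each pair on the first ':' only
d = [dict(w.split(':', 1) for w in x) for x in s]
print(d)
# [{'foo': '1', 'bar': '2', 'baz': '3'}, {'bar': '4', 'baz': '5'}, {'foo': '6'}]
# (key order follows the input on Python 3.7+)

print(pd.DataFrame.from_records(d).fillna(0))
#   foo bar baz
# 0   1   2   3
# 1   0   4   5
# 2   6   0   0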
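Note that the values parsed this way are still strings, so the filled frame mixes strings with integer zeros. A minimal sketch of one way to get numeric cells, assuming every value is an integer literal; the helper name parse_row is hypothetical and the sample frame is rebuilt here only so the snippet runs on its own:

```python
import pandas as pd

# Sample data rebuilt for illustration; the column label '0' is an assumption.
df = pd.DataFrame({'0': ['foo:1 bar:2 baz:3', 'bar:4 baz:5', 'foo:6']})

def parse_row(row):
    """Turn 'foo:1 bar:2' into {'foo': 1, 'bar': 2} (assumes integer values)."""
    pairs = (pair.split(':', 1) for pair in row.split())
    return {key: int(value) for key, value in pairs}

d = [parse_row(row) for row in df['0']]

# Missing keys come out as NaN; fill them with 0 and cast back to int.
out = pd.DataFrame.from_records(d, index=df.index).fillna(0).astype(int)
print(out)
```

Converting while parsing keeps the columns numeric from the start, so no per-column conversion pass is needed afterwards.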

EDIT:

You can get better performance if you pass the index and columns parameters to from_records:

print(df)
                               0
0              foo:1 bar:2 baz:3
1                    bar:4 baz:5
2                          foo:6
3  foo:1 bar:2 baz:3 bal:8 adi:5

s = df['0'].str.split(' ')
d = [dict(w.split(':', 1) for w in x) for x in s]
print(d)
[{'foo': '1', 'bar': '2', 'baz': '3'},
 {'bar': '4', 'baz': '5'},
 {'foo': '6'},
 {'foo': '1', 'bar': '2', 'baz': '3', 'bal': '8', 'adi': '5'}]

If the longest dictionary contains all keys, its keys give all possible columns:

# keys of the longest dict, as a plain list
cols = list(max(d, key=len))
print(cols)
['foo', 'bar', 'baz', 'bal', 'adi']

df = pd.DataFrame.from_records(d, index=df.index, columns=cols)
df = df.fillna(0)

print(df)
  foo bar baz bal adi
0   1   2   3   0   0
1   0   4   5   0   0
2   6   0   0   0   0
3   1   2   3   8   5
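This shortcut is only safe when the longest dictionary really contains every key. As far as I can tell, from_records only fills the columns it is given, so keys missing from cols would be dropped without an error. A small sketch of a check you could run first (the helper name covers_all_keys is hypothetical):

```python
def covers_all_keys(dicts, cols):
    """Return True if every key used in any dict is present in cols."""
    known = set(cols)
    return all(known.issuperset(dic) for dic in dicts)

d = [{'foo': '1', 'bar': '2', 'baz': '3'},
     {'bar': '4', 'baz': '5'},
     {'foo': '6'},
     {'foo': '1', 'bar': '2', 'baz': '3', 'bal': '8', 'adi': '5'}]
cols = list(max(d, key=len))
print(covers_all_keys(d, cols))      # True: the longest dict has all five keys

# Add a row whose keys do not appear in the longest dict.
d_bad = d + [{'foo1': '1', 'bar': '2', 'baz1': '3'}]
cols_bad = list(max(d_bad, key=len))
print(covers_all_keys(d_bad, cols_bad))  # False: 'foo1' and 'baz1' are missing
```

If the check returns False, collect the keys from all dictionaries instead.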

EDIT2: If the longest dictionary doesn't contain all keys because some keys appear only in other dictionaries, use:

list(set( val for dic in d for val in dic.keys()))

Sample:

print(df)
                               0
0            foo1:1 bar:2 baz1:3
1                    bar:4 baz:5
2                          foo:6
3  foo:1 bar:2 baz:3 bal:8 adi:5

s = df['0'].str.split(' ')
d = [dict(w.split(':', 1) for w in x) for x in s]

print(d)
[{'foo1': '1', 'bar': '2', 'baz1': '3'},
 {'bar': '4', 'baz': '5'},
 {'foo': '6'},
 {'foo': '1', 'bar': '2', 'baz': '3', 'bal': '8', 'adi': '5'}]

# sorted() makes the column order reproducible; plain set order is arbitrary
cols = sorted(set(key for dic in d for key in dic))
print(cols)
['adi', 'bal', 'bar', 'baz', 'baz1', 'foo', 'foo1']

df = pd.DataFrame.from_records(d, index=df.index, columns=cols)
df = df.fillna(0)

print(df)
  adi bal bar baz baz1 foo foo1
0   0   0   2   0    3   0    1
1   0   0   4   5    0   0    0
2   0   0   0   0    0   6    0
3   5   8   2   3    0   1    0
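If you would rather keep the columns in the order in which the keys first appear, dict.fromkeys can remove duplicates while preserving order (dict ordering is guaranteed from Python 3.7). A minimal sketch using the dictionaries from the sample above, written out literally so the snippet runs on its own:

```python
import pandas as pd

d = [{'foo1': '1', 'bar': '2', 'baz1': '3'},
     {'bar': '4', 'baz': '5'},
     {'foo': '6'},
     {'foo': '1', 'bar': '2', 'baz': '3', 'bal': '8', 'adi': '5'}]

# dict.fromkeys drops duplicates but keeps the first-seen order of the keys.
cols = list(dict.fromkeys(key for dic in d for key in dic))
print(cols)
# ['foo1', 'bar', 'baz1', 'baz', 'foo', 'bal', 'adi']

# Same construction as before, only with the order-preserving column list.
out = pd.DataFrame.from_records(d, columns=cols)
```

This avoids both the arbitrary order of a set and the extra sort, at the cost of the columns no longer being alphabetical.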
jezrael answered Dec 07 '25 18:12