Suppose I have the following dataset:
0
0 foo:1 bar:2 baz:3
1 bar:4 baz:5
2 foo:6
So each line is essentially a dict serialized into a string, where key:value pairs are separated by spaces. There are hundreds of key:value pairs in each row, while the number of unique keys is a few thousand. So the data is sparse, so to speak.
What I want is a nice DataFrame where keys are columns and values are cells, with missing values replaced by zeros. Like this:
foo bar baz
0 1 2 3
1 0 4 5
2 6 0 0
I know I can split each string into key:value pairs:
In: frame[0].str.split(' ')
Out:
0
0 [foo:1, bar:2, baz:3]
1 [bar:4, baz:5]
2 [foo:6]
But what's next?
Edit: I'm running within the AzureML Studio environment, so efficiency is important.
You can use a list comprehension to turn each row into a dict, then create a new DataFrame with from_records and replace missing values with fillna(0):
import pandas as pd

s = df['0'].str.split(' ')
# Build one dict per row, splitting each "key:value" pair on the first colon
d = [dict(w.split(':', 1) for w in x) for x in s]
print(d)
#[{'baz': '3', 'foo': '1', 'bar': '2'}, {'baz': '5', 'bar': '4'}, {'foo': '6'}]
print(pd.DataFrame.from_records(d).fillna(0))
# bar baz foo
#0 2 3 1
#1 4 5 0
#2 0 0 6
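Note that the parsed values above are still strings ('1', '2', ...), so the filled frame mixes strings with integer zeros. A minimal sketch of one way to get integer cells, assuming every value is an integer literal (the sample frame and the names `records` and `result` are made up for illustration):

```python
import pandas as pd

# Hypothetical sample frame; the column label '0' mirrors the answer above.
df = pd.DataFrame({'0': ['foo:1 bar:2 baz:3', 'bar:4 baz:5', 'foo:6']})

# Parse each row into a dict, converting values to int while parsing.
# Assumption: every value is an integer literal.
records = [
    {key: int(value)
     for key, value in (pair.split(':', 1) for pair in row.split(' '))}
    for row in df['0']
]

# Keys missing from a row become NaN; fill them with 0 and cast back to int.
result = pd.DataFrame.from_records(records).fillna(0).astype(int)
print(result)
```

If the values can be non-integer, swapping `int` for `float` and dropping the final `astype(int)` should work the same way.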
EDIT:
You may get better performance by passing the index and columns parameters to from_records:
print(df)
0
0 foo:1 bar:2 baz:3
1 bar:4 baz:5
2 foo:6
3 foo:1 bar:2 baz:3 bal:8 adi:5
s = df['0'].str.split(' ')
# Build one dict per row, splitting each "key:value" pair on the first colon
d = [dict(w.split(':', 1) for w in x) for x in s]
print(d)
[{'baz': '3', 'foo': '1', 'bar': '2'},
{'baz': '5', 'bar': '4'},
{'foo': '6'},
{'baz': '3', 'bal': '8', 'foo': '1', 'bar': '2', 'adi': '5'}]
If the longest dictionary contains every key, its keys give all the possible columns:
# Take the keys of the longest dict as the column list
cols = list(max(d, key=len).keys())
print(cols)
['baz', 'bal', 'foo', 'bar', 'adi']
df = pd.DataFrame.from_records(d, index=df.index, columns=cols)
df = df.fillna(0)
print(df)
baz bal foo bar adi
0 3 0 1 2 0
1 5 0 0 4 0
2 0 0 6 0 0
3 3 8 1 2 5
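Whether the longest dictionary really covers every key can be checked before relying on it. A small sketch, using a hand-written copy of the parsed rows from this sample (the names `longest`, `all_keys` and `covers_everything` are illustrative only):

```python
# Hypothetical parsed rows, matching the sample above.
d = [
    {'foo': '1', 'bar': '2', 'baz': '3'},
    {'bar': '4', 'baz': '5'},
    {'foo': '6'},
    {'foo': '1', 'bar': '2', 'baz': '3', 'bal': '8', 'adi': '5'},
]

# The dict with the most keys (the first one, if several tie).
longest = max(d, key=len)

# Union of the keys of every dict; iterating a dict yields its keys.
all_keys = set().union(*d)

# True only if no other row has a key that the longest row lacks.
covers_everything = set(longest) == all_keys
print(covers_everything)  # True for this sample
```

The check itself walks every dictionary once, so it costs about as much as collecting all keys directly; it is mainly useful as a sanity check.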
EDIT2: If the longest dictionary doesn't contain all the keys, because some keys appear only in other dictionaries, collect the keys from every dictionary instead:
list(set(val for dic in d for val in dic.keys()))
Sample:
print(df)
0
0 foo1:1 bar:2 baz1:3
1 bar:4 baz:5
2 foo:6
3 foo:1 bar:2 baz:3 bal:8 adi:5
s = df['0'].str.split(' ')
# Build one dict per row, splitting each "key:value" pair on the first colon
d = [dict(w.split(':', 1) for w in x) for x in s]
print(d)
[{'baz1': '3', 'bar': '2', 'foo1': '1'},
{'baz': '5', 'bar': '4'},
{'foo': '6'},
{'baz': '3', 'bal': '8', 'foo': '1', 'bar': '2', 'adi': '5'}]
# Union of the keys across all dicts (set order is arbitrary)
cols = list(set(val for dic in d for val in dic.keys()))
print(cols)
['bar', 'baz', 'baz1', 'bal', 'foo', 'foo1', 'adi']
df = pd.DataFrame.from_records(d, index=df.index, columns=cols)
df = df.fillna(0)
print(df)
bar baz baz1 bal foo foo1 adi
0 2 0 3 0 0 1 0
1 4 5 0 0 0 0 0
2 0 0 0 0 6 0 0
3 2 3 0 8 1 0 5
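Because a set has no defined order, the column order above can change between runs. A sketch of the same approach with sorted column names, so the layout is reproducible; the sample frame is rebuilt by hand, the names `out` and `key` are illustrative, and the final integer cast assumes all values are integer literals:

```python
import pandas as pd

# Hypothetical copy of the sample frame used above.
df = pd.DataFrame({'0': [
    'foo1:1 bar:2 baz1:3',
    'bar:4 baz:5',
    'foo:6',
    'foo:1 bar:2 baz:3 bal:8 adi:5',
]})

# One dict per row, splitting each "key:value" pair on the first colon.
d = [dict(w.split(':', 1) for w in x.split(' ')) for x in df['0']]

# Union of keys over all rows, sorted for a deterministic column order.
cols = sorted(set(key for dic in d for key in dic))

out = pd.DataFrame.from_records(d, index=df.index, columns=cols)
# Missing keys are NaN; fill with 0, then cast the string values to int.
out = out.fillna(0).astype(int)
print(out)
```

Sorting a few thousand column names is cheap next to parsing the rows, so the deterministic order should not noticeably affect run time.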