I'm moving some of my R stuff to Python, hence I have to use pandas.DataFrames. There are several things I'd like to optimise.

Suppose we've got a table

key    value
abc    1
abc    2
abd    1

and we want to get a dictionary of the form {key -> list[values]}. Here is how I get this done right now.
from functools import reduce
from io import StringIO

from pandas import DataFrame


def get_dict(df):
    """
    :param df:
    :type df: DataFrame
    """
    def f(accum, row):
        """
        :param accum:
        :type accum: dict
        """
        key, value = row[1]  # row is an (index, Series) pair
        return accum.setdefault(key, []).append(value) or accum
    return reduce(f, df.iterrows(), {})

table = StringIO("key\tvalue\nabc\t1\nabc\t2\nabd\t1")
parsed_table = [row.rstrip().split("\t") for row in table]
df = DataFrame(parsed_table[1:], columns=parsed_table[0])
result = get_dict(df)  # -> {'abc': ['1', '2'], 'abd': ['1']}
Two things I don't like about it:

1. reduce uses the standard Python iteration protocol, which kills the speed of NumPy-based data structures like DataFrame. I know that DataFrame.apply has a reduce mode, but it doesn't take a starting value, such as an empty dict.
2. I'd like to access row fields by name as in R, i.e. row$key instead of row[1][0].

Thank you in advance.
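On the second point, a side note (not part of the original question): DataFrame.itertuples yields namedtuples, so fields can be read by name much like R's row$key, and it is generally faster than iterrows. A small sketch, re-parsing the sample table with read_csv (which, unlike the string-splitting above, also keeps the values numeric):

```python
from io import StringIO

import pandas as pd

# Same sample table as in the question, parsed with read_csv.
df = pd.read_csv(StringIO("key\tvalue\nabc\t1\nabc\t2\nabd\t1"), sep="\t")

# itertuples yields namedtuples: fields are accessed by name,
# e.g. row.key instead of row[1][0].
for row in df.itertuples(index=False):
    print(row.key, row.value)
```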
One option is to use groupby and apply, ending up with a pandas Series:
In [2]: df
Out[2]:
   key  value
0  abc      1
1  abc      2
2  abd      1
In [3]: df.groupby("key").value.apply(list)
Out[3]:
key
abc    [1, 2]
abd       [1]
Name: value, dtype: object
In [4]: _3.loc['abc']
Out[4]: [1, 2]
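If you want the plain {key -> list[values]} dict the question asks for rather than a Series, the groupby result converts directly with Series.to_dict(); a minimal sketch, parsing the sample table with read_csv (so the values come out as integers, not strings):

```python
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO("key\tvalue\nabc\t1\nabc\t2\nabd\t1"), sep="\t")

# Group values by key, collect each group into a list,
# then convert the resulting Series into a plain dict.
result = df.groupby("key")["value"].apply(list).to_dict()
print(result)  # {'abc': [1, 2], 'abd': [1]}
```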