Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Write 2d dictionary into a dataframe or tab-delimited file using python

Tags:

python

pandas

I have a 2-d dictionary in the following format:

myDict = {('a','b'):10, ('a','c'):20, ('a','d'):30, ('b','c'):40, ('b','d'):50,('c','d'):60}

How can I write this into a tab-delimited file so that the file contains the following. While filling a tuple (x, y) will fill two locations: (x,y) and (y,x). (x,x) is always 0.

The output would be :

    a   b   c   d
a   0   10  20  30
b   10  0   40  50
c   20  40  0   60
d   30  50  60  0 

PS: If somehow the dictionary can be converted into a dataframe (using pandas) then it can be easily written into a file using pandas function

like image 248
learner Avatar asked Sep 12 '25 18:09

learner


1 Answers

You can do this with the lesser-known align method and a little unstack magic:

In [122]: s = Series(myDict, index=MultiIndex.from_tuples(myDict))

In [123]: df = s.unstack()

In [124]: lhs, rhs = df.align(df.T)

In [125]: res = lhs.add(rhs, fill_value=0).fillna(0)

In [126]: res
Out[126]:
    a   b   c   d
a   0  10  20  30
b  10   0  40  50
c  20  40   0  60
d  30  50  60   0

Finally, to write this to a CSV file, use the to_csv method:

In [128]: res.to_csv('res.csv', sep='\t')

In [129]: !cat res.csv
        a       b       c       d
a       0.0     10.0    20.0    30.0
b       10.0    0.0     40.0    50.0
c       20.0    40.0    0.0     60.0
d       30.0    50.0    60.0    0.0

If you want to keep things as integers, cast using DataFrame.astype(), like so:

In [137]: res.astype(int).to_csv('res.csv', sep='\t')

In [138]: !cat res.csv
        a       b       c       d
a       0       10      20      30
b       10      0       40      50
c       20      40      0       60
d       30      50      60      0

(It was cast to float because of the intermediate step of filling in nan values where indices from one frame were missing from the other)

@Dan Allan's answer using combine_first is nice:

In [130]: df.combine_first(df.T).fillna(0)
Out[130]:
    a   b   c   d
a   0  10  20  30
b  10   0  40  50
c  20  40   0  60
d  30  50  60   0

Timings:

In [134]: timeit df.combine_first(df.T).fillna(0)
100 loops, best of 3: 2.01 ms per loop

In [135]: timeit lhs, rhs = df.align(df.T); res = lhs.add(rhs, fill_value=0).fillna(0)
1000 loops, best of 3: 1.27 ms per loop

Those timings are probably a bit polluted by construction costs, so what do things look like with some really huge frames?

In [143]: df = DataFrame({i: randn(1e7) for i in range(1, 11)})

In [144]: df2 = DataFrame({i: randn(1e7) for i in range(10)})

In [145]: timeit lhs, rhs = df.align(df2); res = lhs.add(rhs, fill_value=0).fillna(0)
1 loops, best of 3: 4.41 s per loop

In [146]: timeit df.combine_first(df2).fillna(0)
1 loops, best of 3: 2.95 s per loop

DataFrame.combine_first() is faster for larger frames.

like image 102
Phillip Cloud Avatar answered Sep 14 '25 09:09

Phillip Cloud