Is it possible to do array-wise operation on assignment?

Tags:

python

pandas

I looked through Stack Overflow thoroughly and I couldn't find any helpful results. At this point I'm not even sure if this is possible, but because I'm only a beginner I thought I'd at least ask it here.

Basically, I have multiple data sets, each with about 8 million rows, and I do not want to loop over each row. I have read in multiple places that vectorization is almost always the fastest approach with pandas DataFrames, but I can't think of a way to write my script without a loop. Speed is crucial because I'd rather not keep my computer running for a month straight.

I have to take two values from one DataFrame, use them as the row and column labels in another DataFrame, and set that cell to 1. Suppose the following code:

>>> import pandas as pd
>>> df1 = pd.DataFrame([[1,2],[3,4],[5,6]])
>>> df1.columns = ['A','B']
>>> df1
   A  B
0  1  2
1  3  4
2  5  6
>>> df2 = pd.DataFrame(0, index = list(df1['B']), columns = list(df1['A']))
>>> df2
   1  3  5
2  0  0  0
4  0  0  0
6  0  0  0

Right now, I have a for loop that works like this:

>>> listA = list(df1['A'])
>>> listB = list(df1['B'])
>>> row_count = len(listB)
>>> for index in range(row_count):
...     col = listA[index]
...     row = listB[index]
...     df2[col][row] = 1

Using a for loop over the range() iterator seems significantly faster than iterrows(). But I want my script to run as fast as it possibly can (because I have a gigantic amount of data), so I was wondering if I could get rid of the loop entirely. I thought there was a fair chance that pandas has a DataFrame method I don't know about that can do this work very efficiently.

Any help is appreciated.

Edit: The possible duplicate does not address my problem because my goal is not to change diagonal values to 1; it was just a coincidence in my example because I have very simple DataFrames. Also, I'm sorry if this is not how I'm supposed to format edits; I'm new to this community.

Asked Dec 03 '16 by spicypumpkin

2 Answers

I think you need pd.get_dummies, but first set_index with column B:

print (df1.set_index('B')['A']) 
B
2    1
4    3
6    5
Name: A, dtype: int64

print (pd.get_dummies(df1.set_index('B')['A']))
   1  3  5
B         
2  1  0  0
4  0  1  0
6  0  0  1
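
Note that in newer pandas versions (2.0 and later) get_dummies returns boolean columns by default; to get the 0/1 integers shown above, you can pass dtype explicitly:

print (pd.get_dummies(df1.set_index('B')['A'], dtype=int))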

If there are duplicates, you need groupby with an aggregate max:

df1 = pd.DataFrame([[1,2],[3,4],[5,6], [1,6]])
df1.columns = ['A','B']
print (df1)
   A  B
0  1  2
1  3  4
2  5  6
3  1  6

df2 = pd.get_dummies(df1.set_index('B')['A'])
df2 = df2.groupby(level=0).max()
print (df2)
   1  3  5
B         
2  1  0  0
4  0  1  0
6  1  0  1

Alternative edit by DYZ (resets the index and refers to the column instead):

print(pd.get_dummies(df1.set_index('B')['A']).reset_index().groupby('B').max())
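
If you need the output aligned to a frame that was pre-built with all the row and column labels you care about (like df2 in the question), a minimal follow-up sketch, assuming target is that zero-filled frame:

dummies = pd.get_dummies(df1.set_index('B')['A']).groupby(level=0).max()
# align to target's labels; (row, column) combinations that never occur stay 0
result = dummies.reindex(index=target.index, columns=target.columns, fill_value=0)
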
Answered Oct 13 '22 by jezrael

option #6 is my best attempt.

edit:
For option 6, instead of overwriting with an assignment, you can increment instead. This small tweak should get you a count.

df2.values[row_indexers, col_indexers] += 1
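
One caveat: with NumPy fancy indexing, += applies only one increment per unique (row, column) position, so duplicated pairs are not counted twice. If you need true occurrence counts, np.add.at accumulates duplicates; a minimal sketch, assuming row_indexers and col_indexers are built as in option 6 below:

import numpy as np

# unbuffered add: every (row, col) pair contributes one increment,
# so repeated pairs produce a proper count
np.add.at(df2.values, (row_indexers, col_indexers), 1)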

option 1

df1 = pd.DataFrame([[1,2], [3,4], [5,6], [1,6]], columns=['A', 'B'])
df2 = pd.DataFrame(0, index = list(df1['B'].unique()),
                    columns = list(df1['A'].unique()))

df1.groupby(list('AB')).size().gt(0).mul(1) \
    .reindex(df2.unstack().index, fill_value=0) \
    .unstack(0)

   1  3  5
2  1  0  0
4  0  1  0
6  1  0  1


option 2

df1 = pd.DataFrame([[1,2], [3,4], [5,6], [1,6]], columns=['A', 'B'])
df2 = pd.DataFrame(0, index = list(df1['B'].unique()),
                    columns = list(df1['A'].unique()))

mux = pd.MultiIndex.from_arrays(df1.values.T).drop_duplicates()
df2.update(pd.Series(1, mux).unstack(0))
df2

(same output as option 1)


option 3

df1 = pd.DataFrame([[1,2], [3,4], [5,6], [1,6]], columns=['A', 'B'])
df2 = pd.DataFrame(0, index = list(df1['B'].unique()),
                    columns = list(df1['A'].unique()))

mux = pd.MultiIndex.from_arrays(df1.values.T).drop_duplicates()
df2.where(pd.Series(False, mux).unstack(0, fill_value=True), 1)

(same output as option 1)


option 4

df1 = pd.DataFrame([[1,2], [3,4], [5,6], [1,6]], columns=['A', 'B'])
df2 = pd.DataFrame(0, index = list(df1['B'].unique()),
                    columns = list(df1['A'].unique()))

mux = pd.MultiIndex.from_arrays(df1.values.T).drop_duplicates()
df2[pd.Series(True, mux).unstack(0, fill_value=False)] = 1
df2

(same output as option 1)


option 5

df1 = pd.DataFrame([[1,2], [3,4], [5,6], [1,6]], columns=['A', 'B'])
df2 = pd.DataFrame(0, index = list(df1['B'].unique()),
                    columns = list(df1['A'].unique()))

for i, (a, b) in df1.iterrows():
    df2.set_value(b, a, 1)
df2

(same output as option 1)
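
Note that set_value has since been deprecated and removed from pandas (as of 1.0); in current versions the same loop can be written with .at:

for i, (a, b) in df1.iterrows():
    # .at is the scalar label-based setter that replaced set_value
    df2.at[b, a] = 1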

option 6
inspired by @ayhan and @Divakar

df1 = pd.DataFrame([[1,2], [3,4], [5,6], [1,6]], columns=['A', 'B'])
df2 = pd.DataFrame(0, index = list(df1['B'].unique()),
                    columns = list(df1['A'].unique()))

row_indexers = df2.index.values.searchsorted(df1.B.values)
col_indexers = df2.columns.values.searchsorted(df1.A.values)

df2.values[row_indexers, col_indexers] = 1
df2

(same output as option 1)
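
One caveat about option 6: searchsorted only returns correct positions when df2's index and columns are sorted, which happens to be true for this sample. For arbitrary label order, Index.get_indexer is a safe alternative; a sketch:

row_indexers = df2.index.get_indexer(df1.B.values)
col_indexers = df2.columns.get_indexer(df1.A.values)
df2.values[row_indexers, col_indexers] = 1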


timing
given sample
code:

df1 = pd.DataFrame([[1,2], [3,4], [5,6], [1,6]], columns=['A', 'B'])
df2 = pd.DataFrame(0, index = list(df1['B'].unique()),
                    columns = list(df1['A'].unique()))

def pir1():
    return df1.groupby(list('AB')).size().gt(0).mul(1) \
        .reindex(df2.unstack().index, fill_value=0) \
        .unstack(0)

def pir2():
    mux = pd.MultiIndex.from_arrays(df1.values.T).drop_duplicates()
    df2.update(pd.Series(1, mux).unstack(0))

def pir3():
    mux = pd.MultiIndex.from_arrays(df1.values.T).drop_duplicates()
    return df2.where(pd.Series(False, mux).unstack(0, fill_value=True), 1)

def pir4():
    mux = pd.MultiIndex.from_arrays(df1.values.T).drop_duplicates()
    df2[pd.Series(True, mux).unstack(0, fill_value=False)] = 1

def pir5():
    for i, (a, b) in df1.iterrows():
        df2.set_value(b, a, 1)

def pir6():
    row_indexers = df2.index.values.searchsorted(df1.B.values)
    col_indexers = df2.columns.values.searchsorted(df1.A.values)

    df2.values[row_indexers, col_indexers] = 1
    return df2

def ayhan1():
    row_indexers = [df2.index.get_loc(r_label) for r_label in df1.B]
    col_indexers = [df2.columns.get_loc(c_label) for c_label in df1.A]

    df2.values[row_indexers, col_indexers] = 1

def jez1():
    return pd.get_dummies(df1.set_index('B')['A']).groupby(level=0).max()

(timing plot for the small sample omitted)
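
The original post showed the results as plots; a rough driver for reproducing the comparison with the standard timeit module might look like this (several of the functions mutate df2 in place, so strictly comparable numbers would require rebuilding df2 before each run):

from timeit import timeit

# time each candidate 100 times on the sample frames defined above
for f in (pir1, pir2, pir3, pir4, pir5, pir6, ayhan1, jez1):
    print(f.__name__, timeit(f, number=100))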

much larger sample
code:

import numpy as np
from itertools import combinations
from string import ascii_letters
letter_pairs = [t[0] + t[1] for t in combinations(ascii_letters, 2)]
df1 = pd.DataFrame(dict(A=np.random.randint(0, 100, 10000),
                        B=np.random.choice(letter_pairs, 10000)))
df2 = pd.DataFrame(0, index = list(df1['B'].unique()),
                    columns = list(df1['A'].unique()))

(timing plot for the larger sample omitted)

Answered Oct 13 '22 by piRSquared