Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient pairwise comparison of rows in pandas DataFrame

I am currently working with a smallish dataset (about 9 million rows). Unfortunately, most of the entries are strings, and even with coercion to categories, the frame sits at a few GB in memory.

What I would like to do is compare each row with other rows and do a straight comparison of contents. For example, given

   A   B     C      D
0 cat blue  old Saturday
1 dog red   old Saturday

I would like to compute

      d_A   d_B   d_C   d_D
0, 0  True  True  True  True
0, 1  False False True  True
1, 0  False False True  True
1, 1  True  True  True  True

Obviously, combinatorial explosion will preclude a comparison of every record with every other record. So we can instead use blocking, by applying groupby, say on column A.

My question is, is there a a way to do this in either pandas or dask, that is faster than the following sequence:

  1. Group by index
  2. Outer join each group to itself to produce pairs
  3. dataframe.apply comparison function on each row of pairs

For reference, assume I have access to a good number of cores (hundreds), and about 200G of memory.

like image 605
Fred Byrd Avatar asked Oct 23 '25 01:10

Fred Byrd


2 Answers

The solution turned out to be using numpy in place of step 3). While we cannot create an outer join of every row, we can group by values in column A and create smaller groups to outer join.

The trick is then to use numpy.equal.outer(df1, df2).ravel() When dataframes are passed as inputs to a numpy function in this way, the result is a much faster (at least 30x) vectorized result. For example:

>>> df = pd.DataFrame
   A   B     C      D
0 cat blue  old Saturday
1 dog red   old Saturday

>>> result = pd.DataFrame(columns=["A", "B", "C", "D"], 
                            index=pd.MultiIndex.from_product([df.index, df.index]))
>>> result["A"] = np.equal.outer(df["A"], df["A"]).ravel()
>>> result
        A     B     C     D
0, 0  True   NaN   NaN   NaN  
0, 1  False  NaN   NaN   NaN  
1, 0  False  NaN   NaN   NaN  
1, 1  True   NaN   NaN   NaN  

You can repeat for each column, or just automate the process with columnwise apply on result.

like image 63
Fred Byrd Avatar answered Oct 24 '25 15:10

Fred Byrd


  1. You might consider phrasing your problem as a join operation
  2. You might consider using categoricals to reduce memory use
like image 41
MRocklin Avatar answered Oct 24 '25 14:10

MRocklin



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!