Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast (vectorized) way to find points in one DF belonging to equally sized rectangles (given by two points) from the second DF

I have data frame "A" that looks like this:

type    latw    lngs    late    lngn
0   1000    45.457966   9.174864    45.458030   9.174907
1   1000    45.457966   9.174864    45.458030   9.174907
2   1000    45.458030   9.174864    45.458094   9.174907
3   1000    45.458094   9.174864    45.458157   9.174907
4   1000    45.458157   9.174864    45.458221   9.174907
5   1000    45.458221   9.174864    45.458285   9.174907
6   1000    45.458285   9.174864    45.458349   9.174907
7   1000    45.458349   9.174864    45.458413   9.174907
8   1000    45.458413   9.174864    45.458477   9.174907
9   1000    45.458477   9.174864    45.458540   9.174907
10  1000    45.458540   9.174864    45.458604   9.174907
11  1000    45.458604   9.174864    45.458668   9.174907
12  1000    45.458668   9.174864    45.458732   9.174907
13  1000    45.458732   9.174864    45.458796   9.174907
14  1000    45.458796   9.174864    45.458860   9.174907
15  1000    45.458860   9.174864    45.458923   9.174907
16  1000    45.458923   9.174864    45.458987   9.174907
17  1000    45.458987   9.174864    45.459051   9.174907
18  1000    45.459051   9.174864    45.459115   9.174907
19  1000    45.459115   9.174864    45.459179   9.174907
20  1000    45.459179   9.174864    45.459243   9.174907
21  1000    45.459243   9.174864    45.459306   9.174907
22  1000    45.459306   9.174864    45.459370   9.174907
23  1000    45.459370   9.174864    45.459434   9.174907
24  1000    45.459434   9.174864    45.459498   9.174907
25  1000    45.459498   9.174864    45.459562   9.174907
26  1000    45.459562   9.174864    45.459626   9.174907
27  1000    45.459626   9.174864    45.459689   9.174907
28  1000    45.459689   9.174864    45.459753   9.174907
29  1000    45.459753   9.174864    45.459817   9.174907
... ... ... ... ... ...
970 1000    45.460583   9.175545    45.460647   9.175587
971 1000    45.460647   9.175545    45.460711   9.175587
972 1000    45.460711   9.175545    45.460775   9.175587
973 1000    45.460775   9.175545    45.460838   9.175587
974 1000    45.460838   9.175545    45.460902   9.175587
975 1000    45.460902   9.175545    45.460966   9.175587
976 1000    45.460966   9.175545    45.461030   9.175587
977 1000    45.461030   9.175545    45.461094   9.175587
978 1000    45.461094   9.175545    45.461157   9.175587
979 1000    45.461157   9.175545    45.461221   9.175587
980 1000    45.461221   9.175545    45.461285   9.175587
981 1000    45.461285   9.175545    45.461349   9.175587
982 1000    45.461349   9.175545    45.461413   9.175587
983 1000    45.461413   9.175545    45.461477   9.175587
984 1000    45.461477   9.175545    45.461540   9.175587
985 1000    45.461540   9.175545    45.461604   9.175587
986 1000    45.461604   9.175545    45.461668   9.175587
987 1000    45.457966   9.175587    45.458030   9.175630
988 1000    45.458030   9.175587    45.458094   9.175630
989 1000    45.458094   9.175587    45.458157   9.175630
990 1000    45.458157   9.175587    45.458221   9.175630
991 1000    45.458221   9.175587    45.458285   9.175630
992 1000    45.458285   9.175587    45.458349   9.175630
993 1000    45.458349   9.175587    45.458413   9.175630
994 1000    45.458413   9.175587    45.458477   9.175630
995 1000    45.458477   9.175587    45.458540   9.175630
996 1000    45.458540   9.175587    45.458604   9.175630
997 1000    45.458604   9.175587    45.458668   9.175630
998 1000    45.458668   9.175587    45.458732   9.175630
999 1000    45.458732   9.175587    45.458796   9.175630

It has 22,000,000 rows × 5 columns and there is data frame "B" which looks like this:

    type        Lat       Lng
0      0  45.465739  9.180830
1      2  45.463950  9.187113
2      1  45.468015  9.180648
3      1  45.462209  9.187447
4      0  45.459578  9.184007
5      1  45.459822  9.187034
6      2  45.454988  9.180310
7      2  45.459818  9.189377
8      0  45.462200  9.187440
9      0  45.467160  9.180100
10     2  45.459407  9.183300
11     2  45.457699  9.187434
12     1  45.455319  9.186697
13     0  45.461138  9.191943
14     2  45.456397  9.189028
15     0  45.457062  9.185878
16     1  45.461980  9.187024
17     1  45.464319  9.183142
18     2  45.464227  9.187065
19     1  45.460886  9.185216

It has 2,000,000 rows × 3 columns. I want to replace type's value of data frame "A" with "B" Where:

A[latw]<B[lat]<A[late] and A[lngs]<B[lng]<B[lngn]

I want to check a location from B belongs to which one of the rectangles in A.

PS I'm looking for the fastest way in python such as using parallel processing.

like image 932
asikhalaban Avatar asked Jan 18 '17 18:01

asikhalaban


2 Answers

You can do something like this:

Assuming we have the following DFs:

In [150]: A
Out[150]:
     type       latw       late      lngs      lngn
0    1000  45.457966  45.458030  9.174864  9.174907
1    1001  45.457966  45.458030  9.174864  9.174907
2    1002  45.458030  45.458094  9.174864  9.174907
3    1003  45.458094  45.458157  9.174864  9.174907
16   1004  45.458923  45.458987  9.174864  9.174907
17   1005  45.458987  45.459051  9.174864  9.174907
970  1006  45.460583  45.460647  9.175545  9.175587
971  1007  45.460647  45.460711  9.175545  9.175587
972  1008  45.460711  45.460775  9.175545  9.175587
996  1009  45.458540  45.458604  9.175587  9.175630
997  1010  45.458604  45.458668  9.175587  9.175630
998  1011  45.458668  45.458732  9.175587  9.175630
999  1012  45.458732  45.458796  9.175587  9.175630

In [151]: B
Out[151]:
   type      Lat     Lng
0     0  45.4581  9.1749
1     1  45.4590  9.1749
2     2  45.4586  9.1756

Solution:

B['new'] = B.apply(lambda x: A.loc[  (A.latw < x.Lat) & (x.Lat < A.late)
                                   & (A.lngs < x.Lng) & (x.Lng < A.lngn), 'type'].head(1),
                   axis=1) \
            .values.diagonal()

In [153]: B
Out[153]:
   type      Lat     Lng     new
0     0  45.4581  9.1749  1003.0
1     1  45.4590  9.1749  1005.0
2     2  45.4586  9.1756  1009.0

PS I'm not sure this is the fastest way to achieve that...

like image 170
MaxU - stop WAR against UA Avatar answered Sep 22 '22 16:09

MaxU - stop WAR against UA


High Level Explanation

This answer is quite involved. The following are high-level points

  • Use np.searchsorted with the west latitudes in A relative to the sorted latitudes in B. Repeat this for east latitudes but with a parameter of side='right'. This tells me where each west/east latitudes fall in the spectrum defined in B. np.searchsorted are index values. So if the east based results are greater than the west based results, that implies that there was a latitude from B between west and east.
    • Track where east searches are greater than west searches.
    • Find the difference. Not only do I care if the east search is greater than the west search, the difference tells me how many B.Lat's are in between. I expand arrays to track all possible matches.
  • Repeat this process for south and north longitudes but only over the subset in which we found matching latitudes. No need to waste effort looking where we know we don't have a match.
    • Perform tricky slicing to unwind positions
  • Use numpy structured arrays to find intersections of 2-D arrays
    • I used this answer from @JoeKington please up vote his answer because this is awesome!
  • This final intersection has ordinal positions from A and matching ordinal positions from B.
    • remember I said we could possible match more than one row from B, I take the first.

Demonstration

use A and B provided by @MaxU

p = process(A, B)
print(p)

[[3 0]
 [5 1]
 [9 2]]

To show the results and compare. I'll put them side by side

pd.concat([
        A.iloc[p[:, 0]].reset_index(drop=True),
        B.iloc[p[:, 1]].reset_index(drop=True)
    ], axis=1, keys=['A', 'B'])

enter image description here

However, to swap the type values, I put a flag in the process function to reassign. Just run process(A, B, reassign_type=True)

timing
functions
I defined the functions to return the same thing and make it an apples to apples comparison

def pir(A, B):
    return A.values[process(A, B)[:, 0], 0]

def maxu(A, B):
    return B.apply(
        lambda x: A.loc[
            (A.latw < x.Lat) &
            (x.Lat < A.late) &
            (A.lngs < x.Lng) &
            (x.Lng < A.lngn), 'type'
        ].head(1),
        axis=1
    ).values.diagonal()

enter image description here

more thorough testing

from timeit import timeit
from itertools import product

rng = np.arange(20, 51, 5)
idx = pd.MultiIndex.from_arrays([rng, rng], names=['B', 'A'])
functions = pd.Index(['pir', 'maxu'], name='method')

results = pd.DataFrame(index=idx, columns=functions)

for bi, ai in idx:
    n = ai
    ws = np.random.rand(n, 2) * 10 + np.array([[40, 30]])
    A = pd.DataFrame(
        np.hstack([-np.ones((n, 1)), ws, ws + np.random.rand(n, 2) * 2]),
        columns=['type', 'latw', 'lngs', 'late', 'lngn']
    )
    m = int(bi ** .5)
    prng = np.arange(m) / m * 10 + .5
    B = pd.DataFrame(
        np.array(list(product(prng, prng))) + np.array([[40, 30]]),
        columns=['Lat', "Lng"])
    B.insert(0, 'type', np.random.randint(3, size=len(B)))
    for j in functions:
        results.set_value(
            (bi, ai), j,
            timeit(
                '{}(A, B)'.format(j),
                'from __main__ import {}, A, B'.format(j),
                number=10
            )
        )

results.plot(rot=45)

enter image description here


Code

from collections import namedtuple
TwoSided = namedtuple('TwoSided', ['match', 'idx', 'found'])

def two_sided_search(left, right, a):
    # sort `a` and save order for later
    argsort = a.argsort()
    asorted = a[argsort]

    # where does `left` fall in `asorted`
    s_left = asorted.searchsorted(left)
    # where does `right` fall in `asorted`
    s_right = asorted.searchsorted(right, side='right')

    rng = np.arange(left.size)

    # a match happens when where we found `left`, `right` was found later
    match_idx = rng[s_left < s_right]

    # ignore positions with no match
    left_pos = s_left[match_idx]
    right_pos = s_right[match_idx]
    # distance from where left was found to where right was found
    # this represents how many items in a are wrapped by left and right
    diff_pos = right_pos - left_pos
    # build an index on `match_idx` paired with all positions in
    # `asorted`.
    d2 = left_pos - np.append(0, diff_pos[:-1].cumsum())
    p1 = np.arange(len(left_pos)).repeat(diff_pos)
    p2 = np.arange(diff_pos.sum()) + d2.repeat(diff_pos)

    # returns
    # `match_idx` which is an integer slice of either `left` or `right`
    # for each element in `match_idx` there are one or more elements in
    # `a` that were surrounded by `left` and `right`
    # `p1` are the positions in `match_idx` that are paired with `argsort[p2]`
    # for example, consider `p1 = [1, 1, 2, 2]` and `argsort[p2] = [3, 5, 12, 5]`
    # this means that `left[match_idx[1]] < a[3] < right[match_idx[1]]`
    # and `left[match_idx[1]] < a[5] < right[match_idx[1]]`
    # and `left[match_idx[2]] < a[12] < right[match_idx[2]]`
    # and `left[match_idx[2]] < a[5] < right[match_idx[2]]`

    return TwoSided(match_idx, p1, argsort[p2])

def intersectNd(a, b):
    # taken from https://stackoverflow.com/a/8317403/2336654
    # returns the interesection of 2 2-D arrays
    # the strategy is to convert to a structured array then
    # use np.intersect1d then convert back to unstructured
    ncols = a.shape[1]
    dtype = dict(
        names=['f{}'.format(i) for i in range(ncols)],
        formats=ncols * [a.dtype]
    )

    return np.intersect1d(
        a.view(dtype), b.view(dtype)
    ).view(a.dtype).reshape(-1, ncols)

def where_in(b1, b2):
    # return a mask of lenght len(b2) where True
    # when element of b2 is in b1
    bsort = b1.argsort()
    bsrtd = b1[bsort]
    bsrch = bsrtd.searchsorted(b2) % bsrtd.size
    return bsrtd[bsrch] == b2

def process(A, B, reassign_type=False):
    lats = two_sided_search(
        A.latw.values,
        A.late.values,
        B.Lat.values,
    )

    # subset A & B
    # no point looking for longitude matches
    # in rows without a latitude match
    lngs = A.lngs.values[lats.match]
    lngn = A.lngn.values[lats.match]
    u, i = np.unique(lats.found, return_index=1)
    lng = B.Lng.values[u]

    lons = two_sided_search(
        lngs, lngn, lng
    )

    # `lats.match` is where we found successful latitudes
    # `lons.match` is where we found successful longitudes
    # since `lons.match` was based on `lats.match` we can slice
    # to get the original positions of `A` that satisfy both
    # the latw < blat < late & lngs < blng < lngn
    match = lats.match[lons.match]

    # however, for every row in A that might be a match for B,
    # there may be multilple B's
    #
    # We start addressing this by finding
    # where in lats.idx do we have matches
    # this gives me a boolean array of lats.match
    # index values repeated for each row in B
    # that satisfies our conditions
    wi = where_in(lons.match, lats.idx)

    # a (? x 2) array representing all combinations of
    # rows in A that matched the 
    lon_pos = np.hstack([
            lats.match[lons.match[lons.idx], None],
            u[lons.found, None]
        ])

    lat_pos = np.hstack([
            lats.match[lats.idx[wi], None],
            lats.found[wi, None]
        ])

    x = intersectNd(lat_pos, lon_pos)
    p, pi = np.unique(x[:, 0], return_index=1)

    if reassign_type:
        A.iloc[x[pi, 0], A.columns.get_loc('type')] = \
            B.iloc[x[pi, 1], B.columns.get_loc('type')].values
    else:
        return x[pi]
like image 41
piRSquared Avatar answered Sep 23 '22 16:09

piRSquared