I have data frame "A" that looks like this:
type latw lngs late lngn
0 1000 45.457966 9.174864 45.458030 9.174907
1 1000 45.457966 9.174864 45.458030 9.174907
2 1000 45.458030 9.174864 45.458094 9.174907
3 1000 45.458094 9.174864 45.458157 9.174907
4 1000 45.458157 9.174864 45.458221 9.174907
5 1000 45.458221 9.174864 45.458285 9.174907
6 1000 45.458285 9.174864 45.458349 9.174907
7 1000 45.458349 9.174864 45.458413 9.174907
8 1000 45.458413 9.174864 45.458477 9.174907
9 1000 45.458477 9.174864 45.458540 9.174907
10 1000 45.458540 9.174864 45.458604 9.174907
11 1000 45.458604 9.174864 45.458668 9.174907
12 1000 45.458668 9.174864 45.458732 9.174907
13 1000 45.458732 9.174864 45.458796 9.174907
14 1000 45.458796 9.174864 45.458860 9.174907
15 1000 45.458860 9.174864 45.458923 9.174907
16 1000 45.458923 9.174864 45.458987 9.174907
17 1000 45.458987 9.174864 45.459051 9.174907
18 1000 45.459051 9.174864 45.459115 9.174907
19 1000 45.459115 9.174864 45.459179 9.174907
20 1000 45.459179 9.174864 45.459243 9.174907
21 1000 45.459243 9.174864 45.459306 9.174907
22 1000 45.459306 9.174864 45.459370 9.174907
23 1000 45.459370 9.174864 45.459434 9.174907
24 1000 45.459434 9.174864 45.459498 9.174907
25 1000 45.459498 9.174864 45.459562 9.174907
26 1000 45.459562 9.174864 45.459626 9.174907
27 1000 45.459626 9.174864 45.459689 9.174907
28 1000 45.459689 9.174864 45.459753 9.174907
29 1000 45.459753 9.174864 45.459817 9.174907
... ... ... ... ... ...
970 1000 45.460583 9.175545 45.460647 9.175587
971 1000 45.460647 9.175545 45.460711 9.175587
972 1000 45.460711 9.175545 45.460775 9.175587
973 1000 45.460775 9.175545 45.460838 9.175587
974 1000 45.460838 9.175545 45.460902 9.175587
975 1000 45.460902 9.175545 45.460966 9.175587
976 1000 45.460966 9.175545 45.461030 9.175587
977 1000 45.461030 9.175545 45.461094 9.175587
978 1000 45.461094 9.175545 45.461157 9.175587
979 1000 45.461157 9.175545 45.461221 9.175587
980 1000 45.461221 9.175545 45.461285 9.175587
981 1000 45.461285 9.175545 45.461349 9.175587
982 1000 45.461349 9.175545 45.461413 9.175587
983 1000 45.461413 9.175545 45.461477 9.175587
984 1000 45.461477 9.175545 45.461540 9.175587
985 1000 45.461540 9.175545 45.461604 9.175587
986 1000 45.461604 9.175545 45.461668 9.175587
987 1000 45.457966 9.175587 45.458030 9.175630
988 1000 45.458030 9.175587 45.458094 9.175630
989 1000 45.458094 9.175587 45.458157 9.175630
990 1000 45.458157 9.175587 45.458221 9.175630
991 1000 45.458221 9.175587 45.458285 9.175630
992 1000 45.458285 9.175587 45.458349 9.175630
993 1000 45.458349 9.175587 45.458413 9.175630
994 1000 45.458413 9.175587 45.458477 9.175630
995 1000 45.458477 9.175587 45.458540 9.175630
996 1000 45.458540 9.175587 45.458604 9.175630
997 1000 45.458604 9.175587 45.458668 9.175630
998 1000 45.458668 9.175587 45.458732 9.175630
999 1000 45.458732 9.175587 45.458796 9.175630
It has 22,000,000 rows × 5 columns and there is data frame "B" which looks like this:
type Lat Lng
0 0 45.465739 9.180830
1 2 45.463950 9.187113
2 1 45.468015 9.180648
3 1 45.462209 9.187447
4 0 45.459578 9.184007
5 1 45.459822 9.187034
6 2 45.454988 9.180310
7 2 45.459818 9.189377
8 0 45.462200 9.187440
9 0 45.467160 9.180100
10 2 45.459407 9.183300
11 2 45.457699 9.187434
12 1 45.455319 9.186697
13 0 45.461138 9.191943
14 2 45.456397 9.189028
15 0 45.457062 9.185878
16 1 45.461980 9.187024
17 1 45.464319 9.183142
18 2 45.464227 9.187065
19 1 45.460886 9.185216
It has 2,000,000 rows × 3 columns. I want to replace type's value of data frame "A" with "B" Where:
A[latw]<B[lat]<A[late] and A[lngs]<B[lng]<B[lngn]
I want to check a location from B belongs to which one of the rectangles in A.
PS I'm looking for the fastest way in python such as using parallel processing.
You can do something like this:
Assuming we have the following DFs:
In [150]: A
Out[150]:
type latw late lngs lngn
0 1000 45.457966 45.458030 9.174864 9.174907
1 1001 45.457966 45.458030 9.174864 9.174907
2 1002 45.458030 45.458094 9.174864 9.174907
3 1003 45.458094 45.458157 9.174864 9.174907
16 1004 45.458923 45.458987 9.174864 9.174907
17 1005 45.458987 45.459051 9.174864 9.174907
970 1006 45.460583 45.460647 9.175545 9.175587
971 1007 45.460647 45.460711 9.175545 9.175587
972 1008 45.460711 45.460775 9.175545 9.175587
996 1009 45.458540 45.458604 9.175587 9.175630
997 1010 45.458604 45.458668 9.175587 9.175630
998 1011 45.458668 45.458732 9.175587 9.175630
999 1012 45.458732 45.458796 9.175587 9.175630
In [151]: B
Out[151]:
type Lat Lng
0 0 45.4581 9.1749
1 1 45.4590 9.1749
2 2 45.4586 9.1756
Solution:
B['new'] = B.apply(lambda x: A.loc[ (A.latw < x.Lat) & (x.Lat < A.late)
& (A.lngs < x.Lng) & (x.Lng < A.lngn), 'type'].head(1),
axis=1) \
.values.diagonal()
In [153]: B
Out[153]:
type Lat Lng new
0 0 45.4581 9.1749 1003.0
1 1 45.4590 9.1749 1005.0
2 2 45.4586 9.1756 1009.0
PS I'm not sure this is the fastest way to achieve that...
This answer is quite involved. The following are high-level points
np.searchsorted
with the west latitudes in A
relative to the sorted latitudes in B
. Repeat this for east latitudes but with a parameter of side='right'
. This tells me where each west/east latitudes fall in the spectrum defined in B
. np.searchsorted
are index values. So if the east based results are greater than the west based results, that implies that there was a latitude from B
between west and east.
B.Lat
's are in between. I expand arrays to track all possible matches.numpy
structured arrays to find intersections of 2-D arrays
A
and matching ordinal positions from B
.
B
, I take the first.use A
and B
provided by @MaxU
p = process(A, B)
print(p)
[[3 0]
[5 1]
[9 2]]
To show the results and compare. I'll put them side by side
pd.concat([
A.iloc[p[:, 0]].reset_index(drop=True),
B.iloc[p[:, 1]].reset_index(drop=True)
], axis=1, keys=['A', 'B'])
However, to swap the type
values, I put a flag in the process
function to reassign. Just run process(A, B, reassign_type=True)
timing
functions
I defined the functions to return the same thing and make it an apples to apples comparison
def pir(A, B):
return A.values[process(A, B)[:, 0], 0]
def maxu(A, B):
return B.apply(
lambda x: A.loc[
(A.latw < x.Lat) &
(x.Lat < A.late) &
(A.lngs < x.Lng) &
(x.Lng < A.lngn), 'type'
].head(1),
axis=1
).values.diagonal()
more thorough testing
from timeit import timeit
from itertools import product
rng = np.arange(20, 51, 5)
idx = pd.MultiIndex.from_arrays([rng, rng], names=['B', 'A'])
functions = pd.Index(['pir', 'maxu'], name='method')
results = pd.DataFrame(index=idx, columns=functions)
for bi, ai in idx:
n = ai
ws = np.random.rand(n, 2) * 10 + np.array([[40, 30]])
A = pd.DataFrame(
np.hstack([-np.ones((n, 1)), ws, ws + np.random.rand(n, 2) * 2]),
columns=['type', 'latw', 'lngs', 'late', 'lngn']
)
m = int(bi ** .5)
prng = np.arange(m) / m * 10 + .5
B = pd.DataFrame(
np.array(list(product(prng, prng))) + np.array([[40, 30]]),
columns=['Lat', "Lng"])
B.insert(0, 'type', np.random.randint(3, size=len(B)))
for j in functions:
results.set_value(
(bi, ai), j,
timeit(
'{}(A, B)'.format(j),
'from __main__ import {}, A, B'.format(j),
number=10
)
)
results.plot(rot=45)
from collections import namedtuple
TwoSided = namedtuple('TwoSided', ['match', 'idx', 'found'])
def two_sided_search(left, right, a):
# sort `a` and save order for later
argsort = a.argsort()
asorted = a[argsort]
# where does `left` fall in `asorted`
s_left = asorted.searchsorted(left)
# where does `right` fall in `asorted`
s_right = asorted.searchsorted(right, side='right')
rng = np.arange(left.size)
# a match happens when where we found `left`, `right` was found later
match_idx = rng[s_left < s_right]
# ignore positions with no match
left_pos = s_left[match_idx]
right_pos = s_right[match_idx]
# distance from where left was found to where right was found
# this represents how many items in a are wrapped by left and right
diff_pos = right_pos - left_pos
# build an index on `match_idx` paired with all positions in
# `asorted`.
d2 = left_pos - np.append(0, diff_pos[:-1].cumsum())
p1 = np.arange(len(left_pos)).repeat(diff_pos)
p2 = np.arange(diff_pos.sum()) + d2.repeat(diff_pos)
# returns
# `match_idx` which is an integer slice of either `left` or `right`
# for each element in `match_idx` there are one or more elements in
# `a` that were surrounded by `left` and `right`
# `p1` are the positions in `match_idx` that are paired with `argsort[p2]`
# for example, consider `p1 = [1, 1, 2, 2]` and `argsort[p2] = [3, 5, 12, 5]`
# this means that `left[match_idx[1]] < a[3] < right[match_idx[1]]`
# and `left[match_idx[1]] < a[5] < right[match_idx[1]]`
# and `left[match_idx[2]] < a[12] < right[match_idx[2]]`
# and `left[match_idx[2]] < a[5] < right[match_idx[2]]`
return TwoSided(match_idx, p1, argsort[p2])
def intersectNd(a, b):
# taken from https://stackoverflow.com/a/8317403/2336654
# returns the interesection of 2 2-D arrays
# the strategy is to convert to a structured array then
# use np.intersect1d then convert back to unstructured
ncols = a.shape[1]
dtype = dict(
names=['f{}'.format(i) for i in range(ncols)],
formats=ncols * [a.dtype]
)
return np.intersect1d(
a.view(dtype), b.view(dtype)
).view(a.dtype).reshape(-1, ncols)
def where_in(b1, b2):
# return a mask of lenght len(b2) where True
# when element of b2 is in b1
bsort = b1.argsort()
bsrtd = b1[bsort]
bsrch = bsrtd.searchsorted(b2) % bsrtd.size
return bsrtd[bsrch] == b2
def process(A, B, reassign_type=False):
lats = two_sided_search(
A.latw.values,
A.late.values,
B.Lat.values,
)
# subset A & B
# no point looking for longitude matches
# in rows without a latitude match
lngs = A.lngs.values[lats.match]
lngn = A.lngn.values[lats.match]
u, i = np.unique(lats.found, return_index=1)
lng = B.Lng.values[u]
lons = two_sided_search(
lngs, lngn, lng
)
# `lats.match` is where we found successful latitudes
# `lons.match` is where we found successful longitudes
# since `lons.match` was based on `lats.match` we can slice
# to get the original positions of `A` that satisfy both
# the latw < blat < late & lngs < blng < lngn
match = lats.match[lons.match]
# however, for every row in A that might be a match for B,
# there may be multilple B's
#
# We start addressing this by finding
# where in lats.idx do we have matches
# this gives me a boolean array of lats.match
# index values repeated for each row in B
# that satisfies our conditions
wi = where_in(lons.match, lats.idx)
# a (? x 2) array representing all combinations of
# rows in A that matched the
lon_pos = np.hstack([
lats.match[lons.match[lons.idx], None],
u[lons.found, None]
])
lat_pos = np.hstack([
lats.match[lats.idx[wi], None],
lats.found[wi, None]
])
x = intersectNd(lat_pos, lon_pos)
p, pi = np.unique(x[:, 0], return_index=1)
if reassign_type:
A.iloc[x[pi, 0], A.columns.get_loc('type')] = \
B.iloc[x[pi, 1], B.columns.get_loc('type')].values
else:
return x[pi]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With