I have loaded my application logs in BigQuery and I need to calculate country based on IP address from those logs.
I have written a join query between my table and a GeoIP mapping table that I downloaded from MaxMind.
An ideal query would be OUTER JOIN
with range filter, however BQ
supports only =
in join conditions.
So the query does an INNER JOIN
and handles missing values in each side of the JOIN
.
I have amended my original query so it could run on the Wikipedia public data set.
Can someone please help me make this run faster?
SELECT id, client_ip, client_ip_code, B.Country_Name as Country_Name
FROM
(SELECT id, contributor_ip as client_ip, INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code, 1 AS One
FROM [publicdata:samples.wikipedia] Limit 1000) AS A1
JOIN
(SELECT From_IP_Code, To_IP_Code, Country_Name, 1 AS One
FROM
-- 3 IP sets: 1.valid ranges, 2.Gaps, 3. Gap at the end of the set
-- all Ranges of valid IPs:
(SELECT From_IP_Code, To_IP_Code, Country_Name FROM [QA_DATASET.GeoIP])
-- Missing rages lower from From_IP
,(SELECT
PriorRangeEndIP + 1 From_IP_Code,
From_IP_Code - 1 AS To_IP_Code,
'NA' AS Country_Name
FROM
-- use of LAG function to find prior valid range
(SELECT
From_IP_Code,
To_IP_Code, Country_Name,
LAG(To_IP_Code, 1, INTEGER(0))
OVER(ORDER BY From_IP_Code asc) PriorRangeEndIP
FROM [QA_DATASET.GeoIP]) A
-- If gap from prior valid range is > 1 than its a gap to fill
WHERE From_IP_Code > PriorRangeEndIP + 1)
-- Missing rages higher tan Max To_IP
,(SELECT MAX(To_IP_Code) + 1 as From_IP_Code, INTEGER(4311810304) as To_IP_Code, 'NA' AS Country_Name
FROM [QA_DATASET.GeoIP])
) AS B
ON A1.ONE = B.ONE -- fake join condition to overcome allowed use of only = in joins
-- Join condition where valid IP exists on left
WHERE
A1.client_ip_code >= B.From_IP_Code
AND A1.client_ip_code <= B.To_IP_Code
OR (A1.client_ip_code IS NULL
AND B.From_IP_Code = 1) -- where there is no valid IP on left contributor_ip
Inspired by https://medium.com/@hoffa/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2 I came up with this solution that handles the ipv6 addresses also:
WITH test_data AS (
SELECT '2a02:2f0c:570c:fe00:1db7:21c4:21fa:f89' AS ip UNION ALL
SELECT '79.114.150.111' AS ip
)
-- replace the input_data with your data
, ipv4 AS (
SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
FROM test_data
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
), ipv4d AS (
SELECT ip, city_name, country_name, latitude, longitude
FROM (
SELECT ip, ip_bytes & NET.IP_NET_MASK(4, mask) network_bin, mask
FROM ipv4, UNNEST(GENERATE_ARRAY(8,32)) mask
)
JOIN `demo_bq_dataset.geoip_city_v4`
USING (network_bin, mask)
), ipv6 AS (
SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
FROM test_data
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 16
), ipv6d AS (
SELECT ip, city_name, country_name, latitude, longitude
FROM (
SELECT ip, ip_bytes & NET.IP_NET_MASK(16, mask) network_bin, mask
FROM ipv6, UNNEST(GENERATE_ARRAY(19,64)) mask
)
JOIN `demo_bq_dataset.geoip_city_v6`
USING (network_bin, mask)
)
SELECT * FROM ipv4d
UNION ALL
SELECT * FROM ipv6d
In order to get the geoip_city_v4
and geoip_city_v6
you need to download the geoip database from https://maxmind.com/
You can follow this tutorial in to update and prepare you dataset https://hodo.dev/posts/post37-gcp-bigquery-geoip/.
2019, much improved answer:
#standardSQL
# replace with your source of IP addresses
# here I'm using the same Wikipedia set from the previous article
WITH source_of_ip_addresses AS (
SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0') ip, COUNT(*) c
FROM `publicdata.samples.wikipedia`
WHERE contributor_ip IS NOT null
GROUP BY 1
)
SELECT country_name, SUM(c) c
FROM (
SELECT ip, country_name, c
FROM (
SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
)
JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`
USING (network_bin, mask)
)
GROUP BY 1
ORDER BY 2 DESC
Cleaned up version of this answer at: http://googlecloudplatform.blogspot.com/2014/03/geoip-geolocation-with-google-bigquery.html
Let me tidy the original query:
SELECT
id,
client_ip,
client_ip_code,
B.Country_Name AS Country_Name
FROM (
SELECT
id,
contributor_ip AS client_ip,
INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
1 AS One
FROM
[publicdata:samples.wikipedia]
WHERE contributor_ip IS NOT NULL
LIMIT
1000
) AS A1
LEFT JOIN
(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name,
1 AS One
FROM
--3 IP sets: 1.valid ranges, 2.Gaps, 3. Gap at the END of the set
(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name
FROM
[playscape-proj:GeoIP.GeoIP]) -- all Ranges ov valid IPs
,
(
SELECT
PriorRangeEndIP+1 From_IP_Code,
From_IP_Code-1 AS To_IP_Code,
'NA' AS Country_Name -- Missing rages lower FROM From_IP
from(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name
,
LAG(To_IP_Code,
1,
INTEGER(0)) OVER(
ORDER BY
From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
FROM
[playscape-proj:GeoIP.GeoIP])A
WHERE
From_IP_Code>PriorRangeEndIP+1) -- If gap FROM prior valid range IS >1 than its a gap to fill
,
(
SELECT
MAX(To_IP_Code)+1 AS From_IP_Code,
INTEGER (4311810304) AS To_IP_Code,
'NA' AS Country_Name -- Missing rages higher tan Max To_IP
FROM
[playscape-proj:GeoIP.GeoIP])
) AS B
ON A1.ONE=B.ONE --fake JOIN condition to overcome allowed use of = only IN joins
WHERE
A1.client_ip_code>=B.From_IP_Code
AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
OR (A1.client_ip_code IS NULL
AND B.From_IP_Code=1 ) -- WHERE there IS no valid IP ON left contributor_ip;
That's a long query! (and a very interesting one). It runs in 14 seconds. How can we optimize it?
Some tricks I found:
So I'm changing:
1 AS One
to INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One
(twice).And now it runs in 3 seconds! 5% of the ips could not be geolocated, probably by the described gaps (easy fix).
Now, how about going from the LIMIT 1000 to LIMIT 300000. How long will it take?
37 seconds! Much better than the described 25 minutes. If you want to go even higher, I would suggest turning the right side table into a static one - as once computed it doesn't change at all, it's just an expansion of the basic rules. Then you can use JOIN EACH.
SELECT
id,
client_ip,
client_ip_code,
B.Country_Name AS Country_Name
FROM (
SELECT
id,
contributor_ip AS client_ip,
INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One
FROM
[publicdata:samples.wikipedia]
WHERE contributor_ip IS NOT NULL
LIMIT
300000
) AS A1
JOIN
(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name,
INTEGER(From_IP_Code/(256*256*256)) AS One
FROM
--3 IP sets: 1.valid ranges, 2.Gaps, 3. Gap at the END of the set
(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name
FROM
[playscape-proj:GeoIP.GeoIP]) -- all Ranges ov valid IPs
,
(
SELECT
PriorRangeEndIP+1 From_IP_Code,
From_IP_Code-1 AS To_IP_Code,
'NA' AS Country_Name -- Missing rages lower FROM From_IP
from(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name
,
LAG(To_IP_Code,
1,
INTEGER(0)) OVER(
ORDER BY
From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
FROM
[playscape-proj:GeoIP.GeoIP])A
WHERE
From_IP_Code>PriorRangeEndIP+1) -- If gap FROM prior valid range IS >1 than its a gap to fill
,
(
SELECT
MAX(To_IP_Code)+1 AS From_IP_Code,
INTEGER (4311810304) AS To_IP_Code,
'NA' AS Country_Name -- Missing rages higher tan Max To_IP
FROM
[playscape-proj:GeoIP.GeoIP])
) AS B
ON A1.ONE=B.ONE --fake JOIN condition to overcome allowed use of = only IN joins
WHERE
A1.client_ip_code>=B.From_IP_Code
AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
OR (A1.client_ip_code IS NULL
AND B.From_IP_Code=1 ) -- WHERE there IS no valid IP ON left contributor_ip;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With