I am very new to R and searched about this on forums but couldn't get a close enough solution for this. I am trying to do a mapping between the ip address & corresponding geo locations. I have 2 data sets.
Set-a (1,60,000 rows):
ip(int) | ID(int)
Set-b (16,00,000 rows):
Ip1(int) | Ip2(int) | Code(str) | Country(str) | Area1(str) | Area2(str)
I am trying to do the following: if ip lies between Ip1 & Ip2 then add Country & Region to Set-a.
I am doing the following (obviously not a very good way to do this):
ip1<-as.numeric(b$Ip1)
ip2<-as.numeric(b$Ip2)
country<-b$Country
area1<-b$Area1
area2<-b$Area2
for(i in 1:160000){
for(j in 1:1674303){
if(a[i]>ip1[j] & a[i]<ip2[j]) {
a$country[i]<-country[j]
a$area1[i]<-area1[j]
a$area2[i]<-area2[j]}
}
}
Can someone please tell me an efficient way to do this. This is taking a lot of time. (for i=1 to 100 took some 10 mins to run).
The sample data set-b is:
Ip1, Ip2, Code, Country, Area1, Area2
"0","16777215","-","-","-","-"
"16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE"
"16777472","16778239","CN","CHINA","FUJIAN","FUZHOU"
"16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE"
"16778496","16778751","AU","AUSTRALIA","NEW SOUTH WALES","SYDNEY"
It is in continuos increasing order.
The dput(head(a)) & dput(head(b)) respectively are: (refer sample data above)
structure(IP_Addr = c("38825563", "38921619", "42470287", "42471923","42473368","42473428"),
Desc_value = c("0", "1.2", "4.97", "1", "5.9", "22.06")), .Names = c("IP_Addr", "Desc_value"), row.names = c(NA, 6L), class = "data.frame")
structure(list(Ip1 = c("0", "16777216", "16777472", "16778240",
"16778496", "16778752"), Ip2 = c("16777215", "16777471", "16778239",
"16778495", "16778751", "16779263"), Code = c("-", "AU", "CN",
"AU", "AU", "AU"), Country = c("-", "AUSTRALIA", "CHINA", "AUSTRALIA",
"AUSTRALIA", "AUSTRALIA"), Area1 = c("-", "QUEENSLAND", "FUJIAN",
"VICTORIA", "NEW SOUTH WALES", "-"), Area2 = c("-", "SOUTH BRISBANE",
"FUZHOU", "MELBOURNE", "SYDNEY", "-")), .Names = c("Ip1", "Ip2",
"Code", "Country", "Area1", "Area2"), row.names = c(NA, 6L), class = "data.frame")
Here's a data.table solution:
# Let's take Blue Magister's example set:
set.seed(10)
a <- data.frame(ip=sample(16777216:16778751,10,replace=TRUE))
b <- read.table(sep=",",header=TRUE,text='Ip1, Ip2, Code, Country, Area1, Area2
"0","16777215","-","-","-","-"
"16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE"
"16777472","16778239","CN","CHINA","FUJIAN","FUZHOU"
"16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE"
"16778496","16778751","AU","AUSTRALIA","NEW SOUTH WALES","SYDNEY"')
b$Ip1 <-as.numeric(b$Ip1)
# include library, convert to data.table
library(data.table)
a = data.table(a)
b = data.table(b, key = "Ip1")
# and now the actual computation
a = b[a, roll = Inf][, Ip2 := NULL] # yep, amazingly, it's *that* simple in data.table
setnames(a, "Ip1", "ip") # you can also include, exclude whatever columns you want
a
# ip Code Country Area1 Area2
# 1: 16777995 CN CHINA FUJIAN FUZHOU
# 2: 16777687 CN CHINA FUJIAN FUZHOU
# 3: 16777871 CN CHINA FUJIAN FUZHOU
# 4: 16778280 AU AUSTRALIA VICTORIA MELBOURNE
# 5: 16777346 AU AUSTRALIA QUEENSLAND SOUTH BRISBANE
# 6: 16777562 CN CHINA FUJIAN FUZHOU
# 7: 16777637 CN CHINA FUJIAN FUZHOU
# 8: 16777634 CN CHINA FUJIAN FUZHOU
# 9: 16778161 CN CHINA FUJIAN FUZHOU
#10: 16777875 CN CHINA FUJIAN FUZHOU
Had Ip1 been an exhaustive list of numbers that ip could match, then above would simply be a merge (of Ip1 in b with first column of a, i.e. ip), but data.table also provides an option of what to do when there is no exact match. You can tell it to e.g. roll the previous observation forward (which is what I did above), or roll it back or roll to the nearest observation - see ?data.table for a little more information.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With