I have a dataset consisting of 100k unique records. To benchmark my code, I need to test on data with 5 million unique records, but I don't want to generate purely random data. I would like to use the 100k records I have as a base dataset and generate the remaining records to be similar to it, with unique values for certain columns. How can I do that using Python or Scala?
Here's the sample data
latitude longitude step count
25.696395 -80.297496 1 1
25.699544 -80.297055 1 1
25.698612 -80.292015 1 1
25.939942 -80.341607 1 1
25.939221 -80.349899 1 1
25.944992 -80.346589 1 1
27.938951 -82.492018 1 1
27.944691 -82.48961 1 3
28.355484 -81.55574 1 1
Each pair of latitude and longitude values should be unique across the generated data, and I should be able to set min and max bounds for these columns as well.
You can easily generate data that follows a normal distribution fitted to your base dataset using R. Follow these steps:
# Read the data into a data.table, keeping only the two columns we need
library(data.table)
data = fread("data.csv", sep = ",", select = c("latitude", "longitude"))

# Remove duplicate and NA values
df = data.frame(Lat = data$latitude, Lon = data$longitude)
df1 = unique(df)
df2 = na.omit(df1)

# Determine the mean and standard deviation of the latitude and longitude values
meanLat = mean(df2$Lat)
meanLon = mean(df2$Lon)
sdLat = sd(df2$Lat)
sdLon = sd(df2$Lon)

# Use a normal distribution with the same mean and sd to generate 1 million new records
n = 1000000
newData = data.frame(Lat = rnorm(n, mean = meanLat, sd = sdLat),
                     Lon = rnorm(n, mean = meanLon, sd = sdLon))
finalData = rbind(df2, newData)
Now finalData contains both the original records and the newly generated ones. Write the finalData data frame to a CSV file, and you can read it from Scala or Python.
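Since the question asks for Python or Scala, here is a minimal Python sketch of the same idea using NumPy and pandas, extended to also enforce the uniqueness and min/max requirements from the question. The column names and bounds are assumptions based on the sample data above; it fits a normal distribution per column, then oversamples in a loop, dropping duplicates until enough unique pairs exist:

```python
import numpy as np
import pandas as pd

def generate_similar(base: pd.DataFrame, n_total: int,
                     lat_bounds=(-90.0, 90.0), lon_bounds=(-180.0, 180.0),
                     seed=42) -> pd.DataFrame:
    """Grow `base` to n_total rows with unique (latitude, longitude) pairs,
    sampling new coordinates from a normal distribution fitted to the base
    data and clipping them to the given bounds."""
    rng = np.random.default_rng(seed)
    # Deduplicate and drop missing values, as in the R version
    base = (base[["latitude", "longitude"]]
            .drop_duplicates()
            .dropna())
    mean = base.mean()
    std = base.std()

    frames = [base]
    seen = set(zip(base["latitude"], base["longitude"]))
    need = n_total - len(base)
    while need > 0:
        # Oversample slightly; duplicate pairs are dropped below.
        m = int(need * 1.1) + 10
        lat = np.clip(rng.normal(mean["latitude"], std["latitude"], m), *lat_bounds)
        lon = np.clip(rng.normal(mean["longitude"], std["longitude"], m), *lon_bounds)
        cand = pd.DataFrame({"latitude": lat, "longitude": lon})
        # Keep only pairs not yet seen, then trim to what we still need
        cand = cand[~cand.apply(tuple, axis=1).isin(seen)].drop_duplicates()
        cand = cand.head(need)
        seen.update(zip(cand["latitude"], cand["longitude"]))
        frames.append(cand)
        need -= len(cand)
    return pd.concat(frames, ignore_index=True)
```

The loop guarantees uniqueness of the generated pairs, which the plain R approach does not; for 5 million rows the per-row tuple check is slow but workable for a one-off benchmark dataset. Call it with your base data, e.g. `generate_similar(pd.read_csv("data.csv"), 5_000_000)`, and write the result out with `to_csv`.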