Conversion of latitude and longitude for fraud detection classification ML

I am trying to build an ML classification model for fraud detection at account signup. The data I have on hand are: name, email address, coordinates (latitude and longitude of the IP address at signup), and label (fraud vs non-fraud). Here is a short summary of my dataset:

>>> dataset.columns
Index(['name', 'email','latitude','longitude','label'],dtype='object')
>>> dataset.shape
(93207, 4)

So far I am having trouble understanding how to treat the coordinate variables when training my model. Some users on StackExchange recommended converting the latitude and longitude to X, Y and Z coordinates using combinations of sine and cosine functions (e.g. https://datascience.stackexchange.com/questions/13567/ways-to-deal-with-longitude-latitude-feature). But I don't know whether that's really necessary in my classification use case. I have thought about combining the latitude and longitude into one variable for each record. However, some regions have negative longitudes. Also, some fraudsters could be located in regions of high latitude and longitude, while others could be located in regions of low latitude and longitude. So perhaps combining the latitude and longitude into one variable won't help train the model?

I could also convert the latitude and longitude into a city name. But if I do, a city could have a similar spelling to another city that is very far away, which again might not help train the model. Any suggestions?

Stanleyrr asked Apr 10 '18 00:04



1 Answer

There are multiple ways to handle this problem. The link that you shared is about transforming the latitude and longitude into spherical (X, Y, Z) coordinates and performing feature scaling on them. That approach is good because points that are close to each other in those coordinates are also close to each other in real life.

But your problem is different. I guess you need to know how to handle the lat-long in your model. You can proceed in the following ways.
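For reference, the X, Y, Z conversion discussed in the linked question can be sketched as below. This is a minimal illustration on a unit sphere, not the exact code from that thread; the function name latlon_to_xyz is made up for this example.

```python
import math

def latlon_to_xyz(lat_deg, lon_deg):
    """Map a latitude/longitude pair (in degrees) to a point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return x, y, z

# Longitudes 179.9 and -179.9 look far apart as raw numbers,
# but the two points sit next to each other on the globe.
a = latlon_to_xyz(0.0, 179.9)
b = latlon_to_xyz(0.0, -179.9)
gap = math.dist(a, b)  # straight-line distance between the two 3D points
print(gap)
```

The point of the transformation is that the wrap-around at the 180th meridian disappears: the two points differ by 359.8 in raw longitude, yet their 3D distance is tiny.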

1. Choosing the right model

Not all machine learning techniques require you to scale or normalize the features. Scaling is usually done so that no feature dominates the others just because of its numeric range. This matters for models based on distance metrics, like KNN, and for models that are sensitive to feature magnitude when trained with regularization or gradient descent, like logistic regression. If you don't scale the features for such models, it can distort the learning. If you are using tree-based models like decision trees, Random Forests, XGBoost or GBMs, you can use the features without scaling, so you can put the lat-long directly in your feature set.
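If you do go with a scale-sensitive model, one common option is z-score standardization of each coordinate column. The sketch below uses only the standard library; the helper name standardize and the sample coordinates are illustrative assumptions, not values from the question's dataset.

```python
import statistics

def standardize(values):
    """Rescale a list of numbers to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Illustrative coordinates only (not taken from the asker's data)
latitudes = [12.97, 51.50, 40.71, -33.87]
longitudes = [77.59, -0.12, -74.01, 151.21]

lat_scaled = standardize(latitudes)
lon_scaled = standardize(longitudes)
print(lat_scaled)
print(lon_scaled)
```

After this step both columns are on a comparable scale, so a distance-based model no longer weights longitude more heavily just because its raw range is wider.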

2. Perform Clustering to create a dummy variable

In cases like this, a common approach is to cluster the lat-long pairs with a technique such as KMeans, add a feature called cluster to your dataset holding either the cluster number or the distance to the cluster center, and then remove the lat-long columns. You can also create one feature per cluster and store the distance from each cluster center in those features.
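As a rough sketch of the feature-building step, assume the clustering has already been run and produced a handful of centers. The CENTERS values and the cluster_features helper below are hypothetical and only serve the illustration; distances use the haversine formula in kilometres.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = p2 - p1, math.radians(lon2 - lon1)
    h = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

# Hypothetical cluster centers, standing in for the output of a clustering run
CENTERS = [(12.97, 77.59), (51.51, -0.13), (40.71, -74.01)]

def cluster_features(lat, lon):
    """Return (index of the nearest center, list of distances to every center)."""
    distances = [haversine_km(lat, lon, c_lat, c_lon) for c_lat, c_lon in CENTERS]
    cluster_id = min(range(len(distances)), key=distances.__getitem__)
    return cluster_id, distances

# Example: a signup located in Paris
cluster_id, distances = cluster_features(48.86, 2.35)
print(cluster_id, [round(d) for d in distances])
```

The cluster id can be used as a categorical feature and the distances as numeric ones, replacing the raw latitude and longitude columns.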

3. Reverse geocoding

As you mentioned, you can also perform reverse geocoding to get the city and country name, although in your case this might not be a strong predictor of fraud. Just for reference:

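# Requires the third-party pygeocoder package (a wrapper around the Google Geocoding API)
from pygeocoder import Geocoder

# Look up the address for a latitude/longitude pair (this one is in Bangalore, India)
location = Geocoder.reverse_geocode(12.9716, 77.5946)
print("City:", location.city)
print("Country:", location.country)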

4. My recommendation

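Consider a clustering method other than KMeans. KMeans minimizes within-cluster variance, so it works well when the clusters are compact and roughly spherical in the feature space, but geographic data often forms irregular, non-linear shapes. In that case density-based clustering like DBSCAN, or medoid-based methods like PAM and CLARA, are a better fit. (Strictly speaking, none of these three are hierarchical methods; hierarchical clustering, such as agglomerative clustering, is a separate alternative.)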
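To make the DBSCAN idea concrete, here is a deliberately simplified, quadratic-time sketch in plain Python using haversine distance. It is an illustration only: the function names, the eps_km and min_samples values, and the sample points are all assumptions, and for real data a library implementation (for example scikit-learn's DBSCAN) would be the more practical choice.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) tuples in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def dbscan(points, eps_km, min_samples):
    """Return one cluster label per point; -1 marks noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbors(i):
        # All points within eps_km of point i (including i itself)
        return [j for j, q in enumerate(points) if haversine_km(points[i], q) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = noise  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels

# Illustrative points: three near Bangalore, three near London, one isolated
points = [(12.97, 77.59), (12.98, 77.60), (12.96, 77.58),
          (51.50, -0.12), (51.51, -0.13), (51.52, -0.11),
          (0.0, 0.0)]
labels = dbscan(points, eps_km=50.0, min_samples=3)
print(labels)
```

Unlike KMeans, this does not need the number of clusters up front, and isolated signups are flagged as noise instead of being forced into the nearest cluster.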

Mayukh Sarkar answered Sep 22 '22 14:09