How to sample a large database and implement k-means and k-NN in R?

I'm new to R and trying to move away from SAS. I'm asking here because I'm a bit frustrated with all the packages and sources available for R, and I can't seem to get this working, mainly because of the data size.

I have the following:

A table called SOURCE in a local MySQL database with 200 predictor features and one class variable. The table has 3 million records and is 3 GB in size. The number of instances per class is not equal.

I want to:

  1. Randomly sample the SOURCE table to create a smaller dataset with an equal number of instances per class.
  2. Divide the sample into training and test sets.
  3. Perform k-means clustering on the training set to determine k centroids per class.
  4. Perform k-NN classification of the test data against those centroids.
erichfw asked Dec 02 '12 18:12

1 Answer

The way I would proceed is:

1) Extract the list of ids from your table into R. A simple SQL query through the RMySQL library will do.

2) Split the ids however you like in R, then fetch the selected rows with follow-up SQL queries, again through RMySQL. (I found this two-step approach much quicker than sampling directly in MySQL.)

3) Depending on how large your sample is, you may get away with the base R kmeans implementation. It might fail for bigger samples, though, in which case you should look into bigkmeans from the biganalytics library.
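The steps above could be sketched roughly as follows. This is a minimal illustration and not a tuned solution: the column names `id` and `class`, the numeric id type, and the helper names `balanced_ids`, `fetch_sample` and `class_centroids` are all assumptions of mine, since the question does not say how SOURCE is keyed. The database helper is only defined, not called, and `iris` stands in for the sampled rows so the rest runs without a MySQL server.

```r
library(class)  # provides knn(); a recommended package bundled with R

# Steps 1-2: draw the same number of ids from every class.
# `ids` is a data frame with the ASSUMED columns `id` and `class`.
# A class with fewer than n_per_class rows contributes all of its rows.
balanced_ids <- function(ids, n_per_class) {
  picked <- lapply(split(ids$id, ids$class), function(x) {
    # sample.int on the length avoids sample()'s surprise with length-1 input
    x[sample.int(length(x), min(n_per_class, length(x)))]
  })
  unlist(picked, use.names = FALSE)
}

# Two-step fetch: ids first, then only the sampled rows.
# Defined but not called here because it needs a live connection, e.g.
#   con <- DBI::dbConnect(RMySQL::MySQL(), dbname = "mydb")  # hypothetical db name
fetch_sample <- function(con, n_per_class) {
  ids <- DBI::dbGetQuery(con, "SELECT id, class FROM SOURCE")
  # format() keeps large numeric ids out of scientific notation
  keep <- format(balanced_ids(ids, n_per_class), scientific = FALSE, trim = TRUE)
  DBI::dbGetQuery(con, sprintf("SELECT * FROM SOURCE WHERE id IN (%s)",
                               paste(keep, collapse = ", ")))
}

# Step 3: run kmeans separately inside each class and stack the centroids.
# Assumes every level of y actually occurs in the data.
class_centroids <- function(x, y, k) {
  parts <- lapply(split(as.data.frame(x), y), function(d) {
    kmeans(d, centers = k, nstart = 10)$centers
  })
  list(centers = do.call(rbind, parts),
       labels  = factor(rep(names(parts), each = k), levels = names(parts)))
}

# Demo on built-in data (a stand-in for the result of fetch_sample()).
set.seed(1)
dat <- iris
train_rows <- sample.int(nrow(dat), floor(0.7 * nrow(dat)))
train <- dat[train_rows, ]
test  <- dat[-train_rows, ]

cen <- class_centroids(train[, 1:4], train$Species, k = 3)

# Step 4: 1-nearest-neighbour against the centroids instead of all training rows.
pred <- knn(train = cen$centers, test = test[, 1:4], cl = cen$labels, k = 1)
accuracy <- mean(pred == test$Species)
```

With a very large sample the `IN (...)` list can get long, so sending the ids in chunks may be necessary. If base `kmeans` runs out of memory on your sample, `bigkmeans` from biganalytics is the thing to try in its place inside `class_centroids`.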

ArturoSaCo answered Nov 15 '22 21:11
