How to sample large database and implement K-means and K-nn in R?

I'm new to R and trying to move away from SAS. I'm asking here because I'm a bit frustrated by the number of packages and sources available for R, and I can't get this working, mainly because of the size of the data.

I have the following:

A table called SOURCE in a local MySQL database with 200 predictor features and one class variable. The table has 3 million records and is 3 GB in size. The classes are imbalanced: the number of instances per class is not equal.

I want to:

Randomly sample the SOURCE table to create a smaller dataset with an equal number of instances per class.
Divide the sample into training and testing sets.
Perform k-means clustering on the training set to determine k centroids per class.
Perform k-NN classification of the test data using the centroids.

asked Dec 02 '12 18:12

erichfw

1 Answer

The way I would proceed is:

1) Extract the list of ids from your table into R. You can do this with a simple SQL query using the RMySQL package.

2) Split the ids any way you like in R, then run follow-up SQL queries, again through RMySQL, to fetch only the sampled rows (I found this two-step approach much quicker than sampling directly in MySQL).

3) Depending on how large your sample is, you may get away with the base R kmeans implementation. It may fail on bigger samples, though, in which case you should look into bigkmeans from the biganalytics package.
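
The three steps above can be sketched roughly as follows. This is only an illustration under assumptions: the question does not give column names, so `id` and `class` are made up, the helper functions (`fetch_ids`, `fetch_rows`, `balanced_ids`, `class_centroids`) are hypothetical, and a slice of the built-in `iris` data stands in for the SOURCE table so the script runs without a database. The final step uses `knn` from the `class` package with `k = 1`, i.e. each test row gets the class of its nearest centroid.

```r
# Sketch only. The `id` and `class` column names and all helper names are
# assumptions for illustration; they do not come from the question.

# Steps 1-2 (database side): pull only ids and classes, then fetch sampled rows.
# `con` would be a DBI connection, e.g. DBI::dbConnect(RMySQL::MySQL(), ...).
fetch_ids <- function(con) {
  DBI::dbGetQuery(con, "SELECT id, class FROM SOURCE")
}
fetch_rows <- function(con, ids) {
  # Assumes numeric ids; format() avoids scientific notation such as 1e+05.
  id_list <- paste(format(ids, scientific = FALSE, trim = TRUE), collapse = ",")
  DBI::dbGetQuery(con, sprintf("SELECT * FROM SOURCE WHERE id IN (%s)", id_list))
}

# Draw the same number of ids from every class (errors if a class is too small).
balanced_ids <- function(id_tbl, n_per_class) {
  by_class <- split(id_tbl$id, id_tbl$class)
  picked <- lapply(by_class, function(ids) ids[sample.int(length(ids), n_per_class)])
  unlist(picked, use.names = FALSE)
}

# Step 3: run k-means separately within each class, keep centroids and labels.
class_centroids <- function(x, y, k) {
  parts <- lapply(split(as.data.frame(x), y),
                  function(d) kmeans(d, centers = k, nstart = 5)$centers)
  list(centers = do.call(rbind, parts),
       labels  = factor(rep(names(parts), each = k)))
}

# ---- Demo with an in-memory stand-in for SOURCE (50/30/20 rows per class) ----
set.seed(42)
src    <- iris[c(1:50, 51:80, 101:120), ]
src$id <- seq_len(nrow(src))
id_tbl <- data.frame(id = src$id, class = src$Species)  # with MySQL: fetch_ids(con)

# Balanced sample: as many rows per class as the smallest class has.
ids <- balanced_ids(id_tbl, n_per_class = min(table(id_tbl$class)))
smp <- src[src$id %in% ids, ]                           # with MySQL: fetch_rows(con, ids)

# Stratified train/test split: 14 rows per class for training, the rest for testing.
train_ids <- balanced_ids(data.frame(id = smp$id, class = smp$Species), n_per_class = 14)
train <- smp[smp$id %in% train_ids, ]
test  <- smp[!smp$id %in% train_ids, ]

# Two centroids per class, then nearest-centroid (1-NN) classification.
feats <- names(iris)[1:4]
cc    <- class_centroids(train[, feats], train$Species, k = 2)
pred  <- class::knn(train = cc$centers, test = test[, feats], cl = cc$labels, k = 1)
print(table(predicted = pred, actual = test$Species))
```

For the real table, a single `IN (...)` list with many thousands of ids can get unwieldy, so fetching in chunks of ids may be more practical. If the sample is too large for base `kmeans`, `bigkmeans` could be substituted inside `class_centroids`.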

answered Nov 15 '22 21:11

ArturoSaCo

Tags: r, large-data, machine-learning, k-means, knn
