I have a corpus of opinions (2500) in raw text. I would like to use the scikit-learn library to split them into test/train sets. What would be the best approach to this task with scikit-learn? Could anybody provide me an example of splitting raw text into test/train sets (I'll probably use a tf-idf representation)?
The train_test_split() function is used to split our data into train and test sets. First, we divide our data into features (X) and labels (y). These then get divided into X_train, X_test, y_train and y_test; X_train and y_train are used for fitting the model.
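As a minimal sketch of that feature/label split, assuming a hypothetical pandas DataFrame df with a text column and a label column:

from sklearn.model_selection import train_test_split
import pandas as pd

# Hypothetical DataFrame: one column of raw opinions, one column of labels
df = pd.DataFrame({"text": ["great", "awful", "decent", "poor"],
                   "label": [1, 0, 1, 0]})

X = df["text"]   # features
y = df["label"]  # labels

# One call splits features and labels together, keeping rows aligned
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)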
Splitting the dataset into two pieces, a training set and a testing set, consists of randomly sampling (without replacement) about 75 percent of the rows for the training set; you can vary this proportion. The remaining 25 percent goes into the test set.
To split the dataset, we can use train_test_split on the original data first. Then, to get a validation set, we apply the same function again to the training set; the test_size argument is the fraction of the data we want to hold out at each step.
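A short sketch of that two-step split, using hypothetical placeholder texts and labels (an 80/20 split first, then 75/25 on the remainder, giving roughly 60/20/20 overall):

from sklearn.model_selection import train_test_split

# Hypothetical raw texts and labels; replace with your own data
X = ["great", "awful", "decent", "poor", "fine", "bad", "nice", "meh"]
y = [1, 0, 1, 0, 1, 0, 1, 0]

# First split: hold out 20% of the original data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% becomes the validation set,
# i.e. 20% of the original data
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)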
The sklearn train_test_split function helps us create our training data and test data because, typically, both come from the same original dataset: to build a model, we start with a single dataset and then split it into two, train and test.
Suppose your data is a list of strings, i.e.
data = ["....", "...", ]
Then you can split it into training (80%) and test (20%) sets using train_test_split, e.g. by doing:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2)
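Since the question mentions a tf-idf representation, one common follow-up is to fit a TfidfVectorizer on the training texts only and reuse it on the test texts. A rough sketch, continuing from the train and test lists above and assuming data holds real opinion texts:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train)  # learn vocabulary and idf weights from training texts only
X_test = vectorizer.transform(test)        # transform test texts with the same vocabulary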
Before you rush into it, though, read the docs through: 2500 documents is not a "large corpus", and you probably want something like k-fold cross-validation rather than a single holdout split.
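For instance, a k-fold cross-validation over a tf-idf pipeline might look roughly like this; the classifier choice and the placeholder texts/labels are only assumptions for illustration:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder opinions and labels; replace with your 2500 texts and their classes
texts = ["great product", "terrible service", "really enjoyed it",
         "waste of money", "works as expected", "broke after a day"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

# tf-idf vectorization followed by a simple classifier, evaluated over the folds
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, texts, labels, cv=3)
print(scores.mean(), scores.std())

With 2500 real documents you would typically use cv=5 or cv=10 instead of the tiny cv=3 used here for the placeholder data.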