I have a corpus of opinions (2500) in raw text. I would like to use the scikit-learn library to split them into test/train sets. What would be the best approach to this task with scikit-learn? Could anybody provide me an example of splitting raw text into test/train sets (I'll probably use a tf-idf representation)?
The train_test_split() function is used to split our data into train and test sets. First, we divide our data into features (X) and labels (y). These then get divided into X_train, X_test, y_train and y_test; X_train and y_train are used for fitting the model.
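As a minimal sketch of that feature/label split, assuming a hypothetical pandas DataFrame df with a text column and a label column:

from sklearn.model_selection import train_test_split
import pandas as pd

# Hypothetical DataFrame: one column of raw opinions, one column of labels
df = pd.DataFrame({"text": ["great", "awful", "decent", "poor"],
                   "label": [1, 0, 1, 0]})

X = df["text"]   # features
y = df["label"]  # labels

# One call splits features and labels together, keeping rows aligned
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)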
Splitting the dataset into two pieces, a training set and a testing set, consists of randomly sampling (without replacement) about 75 percent of the rows for the training set; you can vary this proportion. The remaining 25 percent goes into the test set.
To split the dataset, we can use train_test_split on the original data first. Then, to get a validation set, we apply the same function again to the training set; the test_size argument is the fraction of the data we want to hold out at each step.
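A short sketch of that two-step split, using hypothetical placeholder texts and labels (an 80/20 split first, then 75/25 on the remainder, giving roughly 60/20/20 overall):

from sklearn.model_selection import train_test_split

# Hypothetical raw texts and labels; replace with your own data
X = ["great", "awful", "decent", "poor", "fine", "bad", "nice", "meh"]
y = [1, 0, 1, 0, 1, 0, 1, 0]

# First split: hold out 20% of the original data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% becomes the validation set,
# i.e. 20% of the original data
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)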
The sklearn train_test_split function helps us create our training data and test data because, typically, both come from the same original dataset: to build a model, we start with a single dataset and then split it into two, train and test.
Suppose your data is a list of strings, i.e.
data = ["....", "...", ]
Then you can split it into training (80%) and test (20%) sets using train_test_split, e.g. by doing:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2)
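Since the question mentions a tf-idf representation, one common follow-up is to fit a TfidfVectorizer on the training texts only and reuse it on the test texts. A rough sketch, continuing from the train and test lists above and assuming data holds real opinion texts:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train)  # learn vocabulary and idf weights from training texts only
X_test = vectorizer.transform(test)        # transform test texts with the same vocabulary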
Before you rush into it, though, read the docs through: 2500 documents is not a "large corpus", and you probably want something like k-fold cross-validation rather than a single holdout split.
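For instance, a k-fold cross-validation over a tf-idf pipeline might look roughly like this; the classifier choice and the placeholder texts/labels are only assumptions for illustration:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder opinions and labels; replace with your 2500 texts and their classes
texts = ["great product", "terrible service", "really enjoyed it",
         "waste of money", "works as expected", "broke after a day"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

# tf-idf vectorization followed by a simple classifier, evaluated over the folds
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, texts, labels, cv=3)
print(scores.mean(), scores.std())

With 2500 real documents you would typically use cv=5 or cv=10 instead of the tiny cv=3 used here for the placeholder data.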