
Preparing CSV file data for Scikit-Learn Using Pandas?

I have a CSV file without headers which I'm importing into Python using pandas. The last column is the target class, while the rest of the columns are pixel values for images. How can I split this dataset into a training set and a testing set using pandas (80/20)?

Also, once that is done, how would I split each of those sets so that I can define x (all columns except the last one) and y (the last column)?

I've imported my file using:

dataset = pd.read_csv('example.csv', header=None, sep=',')

Thanks

KingPolygon asked Mar 28 '16 05:03





2 Answers

I'd recommend using sklearn's train_test_split:

from sklearn.model_selection import train_test_split
# for scikit-learn versions before 0.18, import from sklearn.cross_validation instead:
# from sklearn.cross_validation import train_test_split
X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]
kwargs = dict(test_size=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, **kwargs)
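
As a quick sanity check of the split above, here is a minimal sketch that runs the same calls end to end on a small synthetic frame. The 100-row random DataFrame is only a stand-in for example.csv (an assumption, not the asker's data), and stratify=y is an optional extra, not part of the original answer, that keeps class proportions the same in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for pd.read_csv('example.csv', header=None):
# 4 "pixel" columns followed by a binary target in the last column.
rng = np.random.RandomState(0)
pixels = rng.randint(0, 256, size=(100, 4))
labels = np.repeat([0, 1], 50).reshape(-1, 1)
dataset = pd.DataFrame(np.hstack([pixels, labels]))

# Features are every column except the last; the target is the last column.
X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]

# 80/20 split; stratify=y preserves the class balance in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
print(y_train.shape, y_test.shape)  # (80,) (20,)
```

Because the rows are shuffled before splitting, X_train and y_train keep matching index labels, so the features and targets stay aligned.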
ayhan answered Sep 27 '22 23:09



You can try this.

Separating the target class from the rest:

# all columns except the last one (pixel values)
pixel_values = dataset[dataset.columns[:-1]]
# the last column (target class), as a Series
target_class = dataset[dataset.columns[-1]]

Now to create test and training samples:

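I would just use numpy's rand:

import numpy as np
# each row lands in the training set with probability 0.8
mask = np.random.rand(len(pixel_values)) < 0.8
train = pixel_values[mask]
test = pixel_values[~mask]
# apply the same mask to target_class to get the matching labels

Now you have training and test samples in train and test with an approximately 80:20 ratio. The ratio is not exact because each row is assigned independently at random.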
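
A random mask only gives an 80:20 split on average. If an exact ratio matters, one pandas-only alternative (not part of the answer above) is to draw 80% of the rows with DataFrame.sample and take the remaining rows with drop. The sketch below assumes a default integer index with unique labels, and uses a small random frame as a hypothetical stand-in for the asker's CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_csv('example.csv', header=None):
# 50 rows, 4 "pixel" columns plus a target in the last column.
rng = np.random.RandomState(0)
dataset = pd.DataFrame(rng.randint(0, 256, size=(50, 5)))

# Draw exactly 80% of the rows without replacement; the rest become the test set.
train = dataset.sample(frac=0.8, random_state=1)
test = dataset.drop(train.index)

# Split each set into features (all but the last column) and target (last column).
X_train, y_train = train.iloc[:, :-1], train.iloc[:, -1]
X_test, y_test = test.iloc[:, :-1], test.iloc[:, -1]

print(len(train), len(test))  # 40 10
```

Splitting the rows first and separating x from y afterwards keeps each target value attached to its own row of pixel values.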

Randhawa answered Sep 28 '22 01:09
