Hi there. I just started with machine learning and am working through a simple example to learn. I want to classify the files on my disk by file type using a classifier. The code I have written is:
import sklearn
import numpy as np
#Importing a local data set from the desktop
import pandas as pd
mydata = pd.read_csv('file_format.csv',skipinitialspace=True)
print mydata
x_train = mydata.script
y_train = mydata.label
#print x_train
#print y_train
x_test = mydata.script
from sklearn import tree
classi = tree.DecisionTreeClassifier()
classi.fit(x_train, y_train)
predictions = classi.predict(x_test)
print predictions
This is the output and the error I am getting:
   script  class  div   label
0       5      6    7    html
1       0      0    0  python
2       1      1    1     csv
Traceback (most recent call last):
  File "newtest.py", line 21, in <module>
    classi.fit(x_train, y_train)
  File "/home/initiouser2/.local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 790, in fit
    X_idx_sorted=X_idx_sorted)
  File "/home/initiouser2/.local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 116, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "/home/initiouser2/.local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 410, in check_array
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[ 5. 0. 1.].
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
If anyone can help me with the code, it would be very helpful!
When passing your input to a classifier, pass 2D arrays (of shape (M, N), where N >= 1), not 1D arrays (which have shape (N,)). The error message is pretty clear:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
In your code, mydata.script is a 1D pandas Series. Selecting the column with double brackets, mydata[['script']], gives a 2D DataFrame instead:
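To make the two reshape calls from the error message concrete, here is a minimal sketch using the three values shown in the traceback. The variable names (script, as_column, as_row) are only illustrative and do not come from the original code.

```python
import numpy as np

# The 1D array from the error message: three samples of one feature.
script = np.array([5.0, 0.0, 1.0])
print(script.shape)

# Single feature: one column, one row per sample.
as_column = script.reshape(-1, 1)
print(as_column.shape)

# Single sample: one row, one column per feature.
as_row = script.reshape(1, -1)
print(as_row.shape)
```

For the question's data, where each row of the CSV is one file, the column form (-1, 1) is the one that matches.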
from sklearn.model_selection import train_test_split
# X.shape should be (N, M) where M >= 1
X = mydata[['script']]
# y.shape should be (N,)
y = mydata['label']
# perform label encoding if "label" contains strings
# y = pd.factorize(mydata['label'])[0]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
...
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
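Since the snippet above elides how clf is created, here is one possible end-to-end sketch. It assumes clf is the DecisionTreeClassifier from the question, and it rebuilds the DataFrame inline from the output printed in the question instead of reading file_format.csv. It also skips train_test_split, because three rows are too few for a meaningful split; that is a simplification for illustration, not a recommendation.

```python
import pandas as pd
from sklearn import tree

# Stand-in for file_format.csv, copied from the output printed in the question.
mydata = pd.DataFrame({
    "script": [5, 0, 1],
    "class": [6, 0, 1],
    "div": [7, 0, 1],
    "label": ["html", "python", "csv"],
})

X = mydata[["script"]]   # double brackets -> DataFrame, 2D
y = mydata["label"]      # single brackets -> Series, 1D

# Assumption: the same classifier type the question used.
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Predicting on the training rows only confirms that fit() accepts the input;
# it says nothing about how well the model generalizes.
predictions = clf.predict(X)
print(X.shape, y.shape)
print(predictions)
```

With a real dataset you would keep the train_test_split step and score the model on the held-out rows.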
Some other helpful tips -
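from sklearn.linear_model import LinearRegression
X = dataset.iloc[:, 0].values  # 1D array, shape (N,)
y = dataset.iloc[:, 1].values
regressor = LinearRegression()
# reshape returns a new array, so assign the result back to X
X = X.reshape(-1, 1)  # now 2D, shape (N, 1)
regressor.fit(X, y)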
I had the code above. reshape is not an in-place operation, so we have to assign the reshaped array back to the variable, as done above.
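A small sketch of that point, using a made-up three-element array in place of the dataset column: calling reshape without assigning the result leaves the original array's shape untouched.

```python
import numpy as np

X = np.array([5.0, 0.0, 1.0])

# The reshaped array is returned and discarded here; X keeps its shape.
X.reshape(-1, 1)
shape_after_discarded_call = X.shape
print(shape_after_discarded_call)

# Rebinding X to the returned array is what actually changes it.
X = X.reshape(-1, 1)
shape_after_assignment = X.shape
print(shape_after_assignment)
```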