I have a multi-label classification problem. I looked online and saw that MultiLabelBinarizer is the recommended way to one-hot encode the labels.
I use it on my labels (which I separate from the dataset itself) as follows:
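from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

ohe = MultiLabelBinarizer()
labels = ohe.fit_transform(labels)
train, test, train_labels, test_labels = train_test_split(dataset, labels, test_size=0.2)  # 80% train split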
But it throws the following error:
Traceback (most recent call last): 
  File "learn.py", line 114, in <module> 
    train, test, train_labels, test_labels = train_test_split(dataset, labels, test_size=0.2) #80% train split
  File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\model_selection\_split.py", line 2127, in train_test_split
    arrays = indexable(*arrays)
  File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\utils\validation.py", line 293, in indexable
    check_consistent_length(*result)
  File "C:\Users\xwb18152\AppData\Roaming\Python\Python38\site-packages\sklearn\utils\validation.py", line 256, in check_consistent_length
    raise ValueError("Found input variables with inconsistent numbers of"
ValueError: Found input variables with inconsistent numbers of samples: [83292, 5]
--
EDIT: The labels dataset looks as follows (ignore the Interval column; it shouldn't be there and is not actually counted in the rows, not sure why):
          Movement  Distance  Speed  Delay  Loss 
Interval
0                1         1     25      0     0
2                1         1     25      0     0
4                1         1     25      0     0
6                1         1     25      0     0
8                1         1     25      0     0
...            ...       ...    ...    ...   ...
260              3         5     50      0     0
262              3         5     50      0     0
264              3         5     50      0     0
266              3         5     50      0     0
268              3         5     50      0     0
From this we can see that it is a multi-label multi-class classification problem. The shape of the dataset and labels before and after the Binarizer are as follows:
             Before             After
dataset      (83292, 15)        (83292, 15)
labels       (83292, 5)         (5, 18)
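As you stated, the original shape of labels is (83292, 5), and after applying MultiLabelBinarizer it became (5, 18). Iterating over a DataFrame yields its column names rather than its rows, so the binarizer most likely treated each of the 5 column names as one sample and each distinct character in those names as a class.
train_test_split(X, y) expects X and y to have the same number of rows: with 83292 data points in X, y must hold the 83292 corresponding labels.
Hence, in your case the shapes of X and y should be (83292, 15) and (83292, n_classes), where n_classes is the number of distinct label values.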
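The (5, 18) shape can be reproduced on a small scale. The sketch below is only an illustration: it assumes labels is a pandas DataFrame with the five column names from the question, and the four rows of values are made up.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Small stand-in for the labels DataFrame (values are illustrative only).
labels = pd.DataFrame(
    {
        "Movement": [1, 1, 3, 3],
        "Distance": [1, 1, 5, 5],
        "Speed": [25, 25, 50, 50],
        "Delay": [0, 0, 0, 0],
        "Loss": [0, 0, 0, 0],
    }
)

# Iterating over a DataFrame yields its column names, not its rows.
print(list(labels))

mlb = MultiLabelBinarizer()
wrong = mlb.fit_transform(labels)

# One output row per column name, one output column per distinct character
# found in those names, regardless of how many data rows there are.
print(wrong.shape)   # (5, 18)
print(mlb.classes_)  # single characters such as 'D', 'L', 'M', ...
```

The number of data rows never enters the result, which is why the 83292 samples disappear from the shape.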
Try this:
The MultiLabelBinarizer output has the wrong dimensions here. If labels is a DataFrame, pass its rows as a list of lists: mlb.fit_transform(labels.values.tolist()).
This produces the same number of rows as the input, here 83292.
Your labels should be in a format like the one below.
The y input can be a list of lists, or a DataFrame of shape (no_of_rows, 1) whose single column holds a list of values per row (pass that column, not the DataFrame itself, to the binarizer). Make sure X and y have the same number of rows. A multi-label multi-class y can be represented like this:
[[1, 1, 25, 0, 0],
 [1, 1, 25, 0, 0],
 [1, 1, 25, 0, 0],
 [1, 1, 25, 0, 0],
 [1, 1, 25, 0, 0],
 [3, 5, 50, 0, 0],
 [3, 5, 50, 0, 0],
 [3, 5, 50, 0, 0],
 [3, 5, 50, 0, 0],
 [3, 5, 50, 0, 0]]
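A minimal sketch of the suggested fix, using the ten example rows above. The dataset frame and its feature_a / feature_b columns are hypothetical stand-ins for the real features, and random_state is set only to make the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# The ten example label rows shown above.
rows = [[1, 1, 25, 0, 0]] * 5 + [[3, 5, 50, 0, 0]] * 5
labels = pd.DataFrame(
    rows, columns=["Movement", "Distance", "Speed", "Delay", "Loss"]
)

# Hypothetical feature matrix with the same number of rows as labels.
dataset = pd.DataFrame({"feature_a": range(10), "feature_b": range(10, 20)})

mlb = MultiLabelBinarizer()
# Pass the rows as a list of lists so each row becomes one sample.
encoded = mlb.fit_transform(labels.values.tolist())

print(encoded.shape)  # (10, 6): one row per sample, one column per distinct value
print(mlb.classes_)   # [ 0  1  3  5 25 50]

train, test, train_labels, test_labels = train_test_split(
    dataset, encoded, test_size=0.2, random_state=0
)
print(train.shape, train_labels.shape)  # (8, 2) (8, 6)
```

Note that MultiLabelBinarizer treats each row as a set of labels, so a 1 in Movement and a 1 in Distance both map to the same output column. If the column a value came from matters, this encoding may not be what you want.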
This means that the lengths of the arrays you're trying to split don't match. For X and y, sklearn takes the same indices, here a random sample of 80% of the indices of your data, so the lengths have to match.
Imagine it's trying to keep these indices. What would sklearn do when there's nothing at some index?
 0 1 0 0 1 0 1 0 0 1 0 1 0 1
 a b b a b a b a a b b b 
 ^   ^     ^ ^   ^   ^   ^ ^ 
Do this check to verify that the lengths match. Does this return True?
len(dataset) == len(labels)
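As a rough illustration of that check, the sketch below uses small made-up arrays (dataset, good_labels and bad_labels are placeholders, not the real data) to show that a matching pair splits on the same indices while a mismatched pair raises the error from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

dataset = np.arange(20).reshape(10, 2)  # 10 samples; row i is [2*i, 2*i + 1]
good_labels = np.arange(10)             # 10 labels; label i belongs to row i
bad_labels = np.arange(5)               # only 5 labels

print(len(dataset) == len(good_labels))  # True
print(len(dataset) == len(bad_labels))   # False

# Matching lengths: rows and labels are selected with the same indices.
train, test, train_labels, test_labels = train_test_split(
    dataset, good_labels, test_size=0.2, random_state=0
)
print((train[:, 0] // 2 == train_labels).all())  # True: pairs stay aligned

# Mismatched lengths: sklearn refuses to split.
try:
    train_test_split(dataset, bad_labels, test_size=0.2)
except ValueError as err:
    message = str(err)
    print(message)
```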