Since this is a complicated problem (at least for me), I will try to keep this as brief as possible.
My data is of the form
import pandas as pd
import numpy as np
# edit: a1 and a2 are linked as they are part of the same object
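# dtype=object is needed because recent NumPy refuses to build ragged arrays implicitly
a1 = np.array([[1, 2, 3], [4, 5], [7, 8, 9, 10]], dtype=object)
a2 = np.array([[5, 6, 5], [2, 3], [3, 4, 8, 1]], dtype=object)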
b = np.array([6, 15, 24])
y = np.array([0, 1, 1])
df = pd.DataFrame(dict(a1=a1.tolist(),a2=a2.tolist(), b=b, y=y))
              a1            a2   b  y
0      [1, 2, 3]     [5, 6, 5]   6  0
1         [4, 5]        [2, 3]  15  1
2  [7, 8, 9, 10]  [3, 4, 8, 1]  24  1
which I would like to use in sklearn for classification, e.g.
from sklearn import tree
X = df[['a1', 'a2', 'b']]
Y = df['y']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
print(clf.predict([[2., 2.]]))
However, while pandas can handle lists as entries, sklearn, by design, cannot. In this example, clf.fit will raise ValueError: setting an array element with a sequence, an error for which you can find plenty of answers.
But how do you deal with such data?
I tried to split the data up into multiple columns (i.e. a1[0] ... a1[3]; the code for that is a bit lengthy), but a1[3] will be empty (NaN, 0 or whatever invalid value you can think of). Imputation does not make sense here, since no value is supposed to be there.
Of course, such a procedure has an impact on the result of the classification as the algorithm might pick up the "zero" value as something meaningful.
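The column split described above does not have to be lengthy. A minimal sketch, assuming pandas is allowed to pad the shorter lists with NaN (the a1_0 ... a1_3 column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "a1": [[1, 2, 3], [4, 5], [7, 8, 9, 10]],
    "b": [6, 15, 24],
})

# Build one column per list position; rows with shorter lists get NaN.
a1_cols = pd.DataFrame(df["a1"].tolist(), index=df.index).add_prefix("a1_")

# Put the expanded columns next to the remaining scalar feature.
padded = pd.concat([a1_cols, df[["b"]]], axis=1)
print(padded)
```

This only reproduces the padded layout; it does not solve the concern that the padding value may be picked up as something meaningful.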
If the dataset is large enough, so I thought, it might be worth splitting it up into subsets of equal a1 length. But this procedure can reduce the power of the classification algorithm, since the length of a1 might help to distinguish between classes.
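For reference, splitting by the length of a1 could look roughly like the sketch below. It assumes the lists are stored in a pandas column; the subsets dictionary is a hypothetical name, and each subset would need its own model because the number of features differs.

```python
import pandas as pd

df = pd.DataFrame({
    "a1": [[1, 2, 3], [4, 5], [7, 8, 9, 10]],
    "y": [0, 1, 1],
})

# .str.len() also works on list entries, not just strings.
lengths = df["a1"].str.len()

# One sub-frame per distinct list length.
subsets = {n: group.reset_index(drop=True) for n, group in df.groupby(lengths)}

for n, group in subsets.items():
    print(n, len(group))
```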
I also thought of using warm_start for algorithms that support it (e.g. Perceptron) and fitting it to data split by the length of a1. But this would surely fail, would it not? The datasets would have different numbers of features, so I assume that something would go wrong.
Solutions to this problem surely must exist and I've simply not found the right place in the documentation.
The first and simplest way of handling variable length input is to set a special mask value in the dataset, and pad out the length of each input to the standard length with this mask value set for all additional entries created. Then, create a Masking layer in the model, placed ahead of all downstream layers.
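The padding-plus-mask idea can be sketched without any deep learning library. The snippet below is not the Keras Masking layer itself; it only assumes a sentinel of -1 (an arbitrary choice that must not occur in the real data) and shows how a boolean mask lets later computations skip the padded entries.

```python
import numpy as np

sequences = [[1, 2, 3], [4, 5], [7, 8, 9, 10]]
MASK_VALUE = -1  # assumed sentinel, must not appear in the real data

# Pad every sequence to the length of the longest one.
max_len = max(len(seq) for seq in sequences)
padded = np.full((len(sequences), max_len), MASK_VALUE, dtype=float)
for i, seq in enumerate(sequences):
    padded[i, :len(seq)] = seq

# True where a real value is present, False where padding was added.
mask = padded != MASK_VALUE

# Example of a computation that ignores the padding: a per-row mean.
masked_mean = (padded * mask).sum(axis=1) / mask.sum(axis=1)
print(masked_mean)
```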
A variable-length data type has a length that can change. There are two kinds: variable-length fields with an explicit length and variable-length fields with an implicit length.
The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators. In general, learning algorithms benefit from standardization of the data set.
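As a small illustration of standardization, the sketch below applies StandardScaler to the scalar column b from the question. Whether scaling helps depends on the estimator; tree-based models such as the DecisionTreeClassifier used above are generally insensitive to it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# sklearn expects a 2-D array: one row per sample, one column per feature.
b = np.array([[6.0], [15.0], [24.0]])

scaler = StandardScaler()
b_scaled = scaler.fit_transform(b)  # zero mean, unit variance per column
print(b_scaled.ravel())
```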
Let's assume for a second that those numbers are numerical categories. What you can do is transform column 'a' into a set of binary columns, each of which corresponds to one possible value of 'a'.
Taking your example code, we would:
import pandas as pd
import numpy as np
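# dtype=object is needed because recent NumPy refuses to build ragged arrays implicitly
a = np.array([[1, 2, 3], [4, 5], [7, 8, 9, 10]], dtype=object)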
b = np.array([6, 15, 24])
y = np.array([0, 1, 1])
df = pd.DataFrame(dict(a=a.tolist(),b=b,y=y))
from sklearn.preprocessing import MultiLabelBinarizer
MLB = MultiLabelBinarizer()
df_2 = pd.DataFrame(MLB.fit_transform(df['a']), columns=MLB.classes_)
df_2
   1  2  3  4  5  7  8  9  10
0  1  1  1  0  0  0  0  0   0
1  0  0  0  1  1  0  0  0   0
2  0  0  0  0  0  1  1  1   1
Then we can just concatenate the old and new data:
new_df = pd.concat([df_2, df.drop(columns='a')], axis=1)
   1  2  3  4  5  7  8  9  10   b  y
0  1  1  1  0  0  0  0  0   0   6  0
1  0  0  0  1  1  0  0  0   0  15  1
2  0  0  0  0  0  1  1  1   1  24  1
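To close the loop with the question, here is a sketch of fitting the decision tree on the binarized frame. One assumption worth flagging: the column names are converted to strings, because recent scikit-learn versions reject frames whose column labels mix integers and strings.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "a": [[1, 2, 3], [4, 5], [7, 8, 9, 10]],
    "b": [6, 15, 24],
    "y": [0, 1, 1],
})

# One binary column per distinct value that appears in 'a'.
mlb = MultiLabelBinarizer()
a_bin = pd.DataFrame(mlb.fit_transform(df["a"]), columns=mlb.classes_, index=df.index)

# Features: binarized 'a' plus the scalar 'b'; labels stay separate.
X = pd.concat([a_bin, df[["b"]]], axis=1)
X.columns = X.columns.astype(str)  # avoid mixed int/str column labels
y = df["y"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
pred = clf.predict(X)
print(pred)
```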
Please note that if you have a training and a test set, it would be wise to first concatenate them, do the transform, and then separate them again. That's because one of the data sets can contain values that do not appear in the other.
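A sketch of that advice, assuming two small hypothetical frames named train and test: the binarizer is fitted on the values from both sets, so both transformed frames end up with the same columns.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical split of the example data into a training and a test part.
train = pd.DataFrame({"a": [[1, 2, 3], [4, 5]], "b": [6, 15]})
test = pd.DataFrame({"a": [[7, 8, 9, 10]], "b": [24]})

# Learn the full set of values from both parts together.
mlb = MultiLabelBinarizer()
mlb.fit(pd.concat([train["a"], test["a"]]))

# Transform each part separately; the column layout is identical.
train_bin = pd.DataFrame(mlb.transform(train["a"]), columns=mlb.classes_, index=train.index)
test_bin = pd.DataFrame(mlb.transform(test["a"]), columns=mlb.classes_, index=test.index)
print(test_bin)
```

If the binarizer were fitted on train alone, the values 7 to 10 would be unknown at transform time and would not get columns of their own.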
Hope that helps
Edit:
If you are worried that this might make your df too big, it's perfectly okay to apply PCA to the binarized variables. It will reduce the number of columns while retaining a chosen share of the variance.
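A minimal sketch of the PCA step on the binarized matrix from above. The 0.95 threshold is an arbitrary assumption; passing a float as n_components asks scikit-learn to keep just enough components to explain that share of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Binarized 'a' column from the example: 3 samples, 9 binary features.
binarized = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1, 1],
])

# Keep enough components to explain at least 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(binarized)

print(reduced.shape)
print(pca.explained_variance_ratio_)
```

With only three samples the reduction is trivial; on a real dataset the number of retained components depends on how correlated the binary columns are.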