Preparing variable-length data for sklearn

Tags: python, pandas, scikit-learn

Since this is a complicated problem (at least for me), I will try to keep this as brief as possible.

My data has the following form:

import pandas as pd
import numpy as np
# edit: a1 and a2 are linked as they are part of the same object
# dtype=object is needed: recent NumPy versions refuse to build ragged arrays implicitly
a1 = np.array([[1, 2, 3], [4, 5], [7, 8, 9, 10]], dtype=object)
a2 = np.array([[5, 6, 5], [2, 3], [3, 4, 8, 1]], dtype=object)

b = np.array([6, 15, 24])
y = np.array([0, 1, 1])

df = pd.DataFrame(dict(a1=a1.tolist(), a2=a2.tolist(), b=b, y=y))


                  a1            a2   b  y
0      [1, 2, 3]     [5, 6, 5]   6  0
1         [4, 5]        [2, 3]  15  1
2  [7, 8, 9, 10]  [3, 4, 8, 1]  24  1

which I would like to use in sklearn for classification, e.g.

from sklearn import tree
X = df[['a1', 'a2', 'b']]
Y = df['y']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)  # fails here: the list-valued columns cannot be converted to a numeric array
print(clf.predict([[2., 2.]]))

However, while pandas can handle lists as entries, sklearn, by design, cannot. In this example, clf.fit raises ValueError: setting an array element with a sequence. There are plenty of answers explaining that error.

But how do you deal with such data?

I tried to split the data up into multiple columns (i.e. a1[0] ... a1[3]; the code for that is a bit lengthy), but a1[3] will be empty for the shorter lists (NaN, 0, or whatever placeholder value you choose). Imputation does not make sense here, since no value is supposed to be there.

Of course, such a procedure has an impact on the result of the classification as the algorithm might pick up the "zero" value as something meaningful.
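The column-splitting approach described above can be sketched in a few lines. This is only an illustration: the helper name expand, the column names a1_0, a1_1, ... and the extra a_len column are assumptions and do not come from the original code. It does not remove the concern about padded values; it only makes the padding explicit (NaN) and keeps the list length as a separate feature.

```python
import pandas as pd

df = pd.DataFrame({
    "a1": [[1, 2, 3], [4, 5], [7, 8, 9, 10]],
    "a2": [[5, 6, 5], [2, 3], [3, 4, 8, 1]],
    "b": [6, 15, 24],
    "y": [0, 1, 1],
})

def expand(col):
    # One column per list position; shorter lists are padded with NaN.
    wide = pd.DataFrame(df[col].tolist(), index=df.index)
    wide.columns = [f"{col}_{i}" for i in wide.columns]
    return wide

wide_df = pd.concat([expand("a1"), expand("a2"), df[["b", "y"]]], axis=1)

# Keep the list length as its own feature so that this information is not lost.
wide_df["a_len"] = df["a1"].map(len)

print(wide_df)
```

Whether the NaN padding is acceptable depends on the estimator. Some, such as HistGradientBoostingClassifier, accept missing values directly, while others require the gaps to be filled first.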

If the dataset is large enough, so I thought, it might be worth splitting it into subsets in which a1 has the same length. But this procedure can reduce the power of the classification algorithm, since the length of a1 might help to distinguish between classes.
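For reference, the split by length could look roughly like this. It is a sketch only, and the names lengths and subsets are made up for illustration. Each subset has a fixed list length, so it could be expanded into a rectangular feature matrix, at the cost of training one model per length.

```python
import pandas as pd

df = pd.DataFrame({
    "a1": [[1, 2, 3], [4, 5], [7, 8, 9, 10]],
    "a2": [[5, 6, 5], [2, 3], [3, 4, 8, 1]],
    "b": [6, 15, 24],
    "y": [0, 1, 1],
})

# Length of each a1 list; a2 is linked to a1 and has the same length.
lengths = df["a1"].map(len)

# One sub-frame per distinct length.
subsets = {n: part for n, part in df.groupby(lengths)}

for n, part in subsets.items():
    print(n, len(part))
```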

I also thought of using warm start for algorithms that support it (e.g. Perceptron) and fitting the model to the data split by the length of a1. But this would surely fail, would it not? The datasets would have different numbers of features, so I assume that something would go wrong.

Solutions to this problem surely must exist and I've simply not found the right place in the documentation.


asked Jan 31 '17 09:01

DragonTux

1 Answer

Let's assume for a second that those numbers are numerical categories. What you can do is transform column 'a' into a set of binary columns, each of which corresponds to a possible value of 'a'.

Taking your example code, we would do the following:

import pandas as pd
import numpy as np

# dtype=object keeps the ragged lists intact on recent NumPy versions
a = np.array([[1, 2, 3], [4, 5], [7, 8, 9, 10]], dtype=object)
b = np.array([6, 15, 24])
y = np.array([0, 1, 1])

df = pd.DataFrame(dict(a=a.tolist(),b=b,y=y))

from sklearn.preprocessing import MultiLabelBinarizer
MLB = MultiLabelBinarizer()
df_2 = pd.DataFrame(MLB.fit_transform(df['a']), columns=MLB.classes_)
df_2

    1   2   3   4   5   7   8   9   10
0   1   1   1   0   0   0   0   0   0
1   0   0   0   1   1   0   0   0   0
2   0   0   0   0   0   1   1   1   1

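Then, we can just concat the old and new data:

# keyword arguments are used because recent pandas versions no longer accept a positional axis
new_df = pd.concat([df_2, df.drop(columns='a')], axis=1)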

    1   2   3   4   5   7   8   9   10  b   y
0   1   1   1   0   0   0   0   0   0   6   0
1   0   0   0   1   1   0   0   0   0   15  1
2   0   0   0   0   0   1   1   1   1   24  1
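With the lists replaced by binary columns, the frame can be passed to an estimator. The following is a minimal sketch of that last step, not part of the original answer. The a_ prefix on the column names and the sample [4, 5, 7] are assumptions made for illustration. String column names are used because recent scikit-learn versions reject frames whose column labels mix integers and strings.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "a": [[1, 2, 3], [4, 5], [7, 8, 9, 10]],
    "b": [6, 15, 24],
    "y": [0, 1, 1],
})

mlb = MultiLabelBinarizer()
binarized = pd.DataFrame(
    mlb.fit_transform(df["a"]),
    columns=[f"a_{c}" for c in mlb.classes_],  # string names only
    index=df.index,
)
features = pd.concat([binarized, df[["b"]]], axis=1)

clf = DecisionTreeClassifier(random_state=0).fit(features, df["y"])

# Encode a new sample with the already fitted binarizer, then predict.
new_row = pd.DataFrame(mlb.transform([[4, 5, 7]]), columns=binarized.columns)
new_row["b"] = 15
pred = clf.predict(new_row)
print(pred)
```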

Please note that if you have a training and a test set, it would be wise to first concatenate them, do the transform, and then separate them again. That's because one of the data sets can contain terms that do not appear in the other.
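A sketch of that concatenate, transform and separate workflow, assuming two small hypothetical frames called train and test (these names and the a_ column prefix are illustrative only):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

train = pd.DataFrame({"a": [[1, 2, 3], [4, 5]], "b": [6, 15]})
test = pd.DataFrame({"a": [[7, 8, 9, 10]], "b": [24]})

# Stack both sets; the outer index level remembers where each row came from.
combined = pd.concat([train, test], keys=["train", "test"])

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(combined["a"]),
    columns=[f"a_{c}" for c in mlb.classes_],
    index=combined.index,
)
encoded["b"] = combined["b"]

# Separate again; both parts now share the same columns.
train_enc = encoded.loc["train"]
test_enc = encoded.loc["test"]

print(train_enc.shape, test_enc.shape)
```

If the full set of possible values is known in advance, passing it through the classes parameter of MultiLabelBinarizer should give the same column layout without having to combine the two sets.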

Hope that helps

Edit:

If you are worried that this might make your df too big, it's perfectly okay to apply PCA to the binarized variables. It will reduce the dimensionality while retaining as much of the variance as you choose to keep.
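As a rough illustration of the PCA step (a sketch on the toy data, where n_components=2 is an arbitrary choice and not a recommendation):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import MultiLabelBinarizer

a = [[1, 2, 3], [4, 5], [7, 8, 9, 10]]

# Binary indicator matrix: one row per sample, one column per distinct value.
binary = MultiLabelBinarizer().fit_transform(a)

# Project the indicator columns onto two principal components.
pca = PCA(n_components=2)
reduced = pca.fit_transform(binary)

print(binary.shape, reduced.shape)
print(pca.explained_variance_ratio_)
```

PCA also accepts a float between 0 and 1 for n_components, in which case it keeps as many components as are needed to explain that fraction of the variance.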


answered Sep 30 '22 22:09

epattaro
