How to deal with array of string features in traditional machine learning?

Problem

Let's say we have a dataframe that looks like this:

age  job         friends                                    label
23   'engineer'  ['World of Warcraft', 'Netflix', '9gag']   1
35   'manager'   NULL                                       0
...

If we are interested in training a classifier that predicts label using age, job, and friends as features, how would we go about transforming the features into a numerical array which can be fed into a model?

  • Age is pretty straightforward since it is already numerical.
  • Job can be hashed / indexed since it is a categorical variable.
  • Friends is a list of categorical variables. How would I go about representing this feature?

Approaches:

Hash each element of the list. Using the example dataframe, let's assume our hashing function has the following mapping:

NULL                -> 0
engineer            -> 42069
World of Warcraft   -> 9001
Netflix             -> 14
9gag                -> 9
manager             -> 250
 

Let's further assume that the maximum length of friends is 5. Anything shorter gets zero-padded on the right-hand side; if friends has more than 5 elements, only the first 5 are kept.

Approach 1: Hash and Stack

The dataframe after feature transformation would look like this:

feature                             label
[23, 42069, 9001, 14, 9, 0, 0]      1
[35, 250,   0,    0,  0, 0, 0]      0
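
A minimal sketch of this transform in pandas/NumPy (the hand-built HASHES lookup, the pad length, and the toy dataframe are assumptions for illustration, not part of the original data):

import numpy as np
import pandas as pd

MAX_FRIENDS = 5  # assumed maximum list length, per the example above

# Toy lookup standing in for a real hashing function; in practice you might
# hash the string directly, e.g. hash(s) % some_large_number.
HASHES = {'engineer': 42069, 'manager': 250,
          'World of Warcraft': 9001, 'Netflix': 14, '9gag': 9}

def encode_row(row):
    friends = row['friends'] or []                      # treat NULL/None as an empty list
    hashes = [HASHES.get(f, 0) for f in friends[:MAX_FRIENDS]]
    hashes += [0] * (MAX_FRIENDS - len(hashes))         # zero-pad on the right
    return np.array([row['age'], HASHES.get(row['job'], 0)] + hashes)

df = pd.DataFrame({
    'age': [23, 35],
    'job': ['engineer', 'manager'],
    'friends': [['World of Warcraft', 'Netflix', '9gag'], None],
    'label': [1, 0],
})
X = np.vstack(df.apply(encode_row, axis=1).to_numpy())
y = df['label'].to_numpy()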

Limitations

Consider the following:

age  job           friends                                        label
23   'engineer'    ['World of Warcraft', 'Netflix', '9gag']       1
35   'manager'      NULL                                          0
26   'engineer'    ['Netflix', '9gag', 'World of Warcraft']       1
...

Compare the features of the first and third record:

feature                             label
[23, 42069, 9001, 14, 9, 0, 0]      1
[35, 250,   0,    0,  0, 0, 0]      0
[26, 42069, 14,   9, 9001, 0, 0]    1

Both records have the same set of friends, but the lists are ordered differently, resulting in different feature vectors even though they should be identical.

Approach 2: Hash, Order, and Stack

To address the limitation of Approach 1, simply sort the hashes of the friends feature. This results in the following feature transform (assuming descending order):

feature                             label
[23, 42069, 9001, 14, 9, 0, 0]      1
[35, 250,   0,    0,  0, 0, 0]      0
[26, 42069, 9001, 14, 9, 0, 0]      1
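
The only change relative to the Approach 1 sketch above is sorting the friend hashes before padding (descending, as assumed):

def encode_row_ordered(row):
    friends = row['friends'] or []
    hashes = sorted((HASHES.get(f, 0) for f in friends[:MAX_FRIENDS]), reverse=True)
    hashes += [0] * (MAX_FRIENDS - len(hashes))          # zero-pad on the right
    return np.array([row['age'], HASHES.get(row['job'], 0)] + hashes)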

This approach has a limitation too. Consider the following:

age  job           friends                                        label
23   'engineer'    ['World of Warcraft', 'Netflix', '9gag']       1
35   'manager'      NULL                                          0
26   'engineer'    ['Netflix', '9gag', 'World of Warcraft']       1
42   'manager'     ['Netflix', '9gag']                            1
...

Applying feature transform with ordering we get:

row  feature                             label
1    [23, 42069, 9001, 14, 9, 0, 0]      1
2    [35, 250,   0,    0,  0, 0, 0]      0
3    [26, 42069, 9001, 14, 9, 0, 0]      1
4    [42, 250,   14,   9,  0, 0, 0]      1

What is the problem with the above features? The hashes for Netflix and 9gag occupy the same positions in rows 1 and 3, but different positions in row 4. This would interfere with training.

Approach 3: Convert Array to Columns

What if we convert friends into a set of 5 columns and deal with each of the resulting columns just like we deal with any categorical variable?

Well, let's assume the friends vocabulary is large (>100k). It would then be madness to create >100k columns, one for the hash of each vocabulary element.

Approach 4: One-Hot-Encoding and then Sum

How about this: convert each hash to a one-hot vector, and add up all of these vectors.

In this case, the feature for row 1, for example, would look like this:

[23, 42069, 01x8, 1, 01x4, 1, 01x8986, 1, 01x(max_hash_size - 9001)]

Where 01xN denotes a run of N zeros: the friends part is a vector of length max_hash_size with 1s at positions 9, 14, and 9001 (the hashes of 9gag, Netflix, and World of Warcraft).

The problem with this approach is that these vectors are very large and sparse.
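
Sparsity is usually handled by keeping the summed one-hot vectors in a sparse matrix instead of a dense array; here is a rough sketch using scipy.sparse (the hash-space size and the use of Python's built-in hash are assumptions for illustration):

import numpy as np
from scipy import sparse

MAX_HASH_SIZE = 2 ** 18        # assumed size of the hash space

def multi_hot(friend_lists):
    # Sum of one-hot vectors per row, stored as a sparse CSR matrix.
    rows, cols = [], []
    for i, friends in enumerate(friend_lists):
        for f in (friends or []):
            rows.append(i)
            cols.append(hash(f) % MAX_HASH_SIZE)   # use a stable hash in practice
    data = np.ones(len(rows))
    return sparse.csr_matrix((data, (rows, cols)),
                             shape=(len(friend_lists), MAX_HASH_SIZE))

friend_block = multi_hot([['World of Warcraft', 'Netflix', '9gag'], None])
# Dense columns (age, hashed job) can be stacked alongside the sparse block
# with sparse.hstack, and most scikit-learn models accept the result directly.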

Approach 5: Use Embedding Layer and 1D-Conv

With this approach, we feed each element of the friends array to an embedding layer and then apply a 1D convolution, similar to the Keras IMDB example: https://keras.io/examples/imdb_cnn/

Limitation: it requires a deep learning framework, and I want something that works with traditional machine learning, such as logistic regression or a decision tree.

What are your thoughts on this?

asked Jun 16 '20 by tooskoolforkool
1 Answer

As another answer mentioned, you've already listed a number of alternatives that could work, depending on the dataset, the model, and so on.

For what it's worth, a typical logistic regression model that I've encountered would use Approach 3, and convert each of your friends strings into a binary feature. If you're opposed to having 100k features, you could treat these features like a bag-of-words model and discard the stopwords (very common features).
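
One way to build those binary friend features without hand-rolling anything is scikit-learn's MultiLabelBinarizer, optionally restricted to friends that occur often enough (the toy dataframe and the frequency cutoff below are illustrative assumptions):

from collections import Counter
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    'age': [23, 35, 26, 42],
    'friends': [['World of Warcraft', 'Netflix', '9gag'], None,
                ['Netflix', '9gag', 'World of Warcraft'], ['Netflix', '9gag']],
    'label': [1, 0, 1, 1],
})
friend_lists = [fr or [] for fr in df['friends']]

# Keep only friends seen at least MIN_COUNT times, analogous to dropping
# rare words in a bag-of-words model (the cutoff is arbitrary here).
MIN_COUNT = 2
counts = Counter(f for friends in friend_lists for f in friends)
vocab = {f for f, c in counts.items() if c >= MIN_COUNT}
filtered = [[f for f in friends if f in vocab] for friends in friend_lists]

mlb = MultiLabelBinarizer()
friend_features = mlb.fit_transform(filtered)        # one binary column per kept friend

# The job column would be one-hot encoded in the same spirit; here only age is added.
X = np.hstack([df[['age']].to_numpy(), friend_features])
clf = LogisticRegression(max_iter=1000).fit(X, df['label'])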

I'll also throw a hashing variant into the mix:

Bloom Filter

You could store the strings in question in a bloom filter for each training example, and use the bits of the bloom filter as a feature in your logistic regression model. This is basically a hashing solution like you've already mentioned, but it takes care of some of the indexing/sorting issues, and provides a more principled tradeoff between sparsity and feature uniqueness.
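
A minimal sketch of that idea: hash each friend string with k independent hash functions, set the corresponding bits, and use the resulting bit array as the feature vector (the filter size, number of hash functions, and md5-based hashing below are illustrative choices):

import hashlib
import numpy as np

N_BITS = 1024        # size of the bloom filter, i.e. length of the feature vector
N_HASHES = 3         # number of hash functions per string

def bloom_features(friends):
    # Bit array with up to N_HASHES bits set per friend string.
    bits = np.zeros(N_BITS, dtype=np.uint8)
    for f in (friends or []):
        for seed in range(N_HASHES):
            digest = hashlib.md5(f'{seed}:{f}'.encode()).hexdigest()
            bits[int(digest, 16) % N_BITS] = 1
    return bits

X_friends = np.vstack([
    bloom_features(['World of Warcraft', 'Netflix', '9gag']),
    bloom_features(None),
])

A smaller filter keeps the feature vector compact at the cost of more hash collisions; a larger one reduces collisions but brings back some of the sparsity.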

answered Nov 08 '22 by lmjohns3