
Dealing with datasets with repeated multivalued features

We have a dataset in a sparse representation with 25 features and 1 binary label. For example, one record in the dataset looks like this:

Label: 0
exid: 24924687
Features:
11:0 12:1 13:0 14:6 15:0 17:2 17:2 17:2 17:2 17:2 17:2
21:11 21:42 21:42 21:42 21:42 21:42 
22:35 22:76 22:27 22:28 22:25 22:15 24:1888
25:9 33:322 33:452 33:452 33:452 33:452 33:452 35:14

So a feature sometimes has multiple values, which can be the same or different. The dataset's website says:

Some categorical features are multi-valued (order does not matter)

We don't know the semantics of the features or of the values assigned to them (they are hidden from the public because of privacy concerns).

We only know:

  • The label indicates whether the user clicked on the recommended ad.
  • The features describe the product that was recommended to the user.
  • The task is to predict the probability that the user clicks, given an ad for a product.

Any comments on the following problems are appreciated:

  1. What is the best way to import this kind of dataset into a Python data structure?
  2. How should we deal with multi-valued features, especially when the same value is repeated k times?
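
To make both questions concrete, here is a minimal sketch of one possible approach, not a definitive answer. It assumes the features arrive as whitespace-separated `id:value` tokens, as in the sample record above; the real file layout may differ. The function names `parse_features` and `to_sparse_dict` are hypothetical. Each feature id maps to a `Counter` of its values, so the repeats are kept, and the record can then be flattened either with set semantics (each distinct value counts once, which matches "order does not matter") or with the repeat count as the weight. Whether the repeat count carries any signal is unknown here, since the feature semantics are hidden.

```python
from collections import Counter, defaultdict


def parse_features(text):
    """Parse whitespace-separated 'id:value' tokens into {feature_id: Counter(values)}."""
    features = defaultdict(Counter)
    for token in text.split():
        feature_id, value = token.split(":", 1)
        # Repeated tokens such as '17:2 17:2' increment the same counter entry.
        features[int(feature_id)][int(value)] += 1
    return dict(features)


def to_sparse_dict(features, use_counts=False):
    """Flatten to {'id=value': weight}.

    With use_counts=False every distinct (id, value) pair gets weight 1.0
    (multi-hot / set semantics); with use_counts=True the weight is the
    number of times the pair was repeated in the record.
    """
    return {
        f"{fid}={val}": float(cnt) if use_counts else 1.0
        for fid, counts in features.items()
        for val, cnt in counts.items()
    }


# Feature lines copied from the sample record in the question.
sample = """
11:0 12:1 13:0 14:6 15:0 17:2 17:2 17:2 17:2 17:2 17:2
21:11 21:42 21:42 21:42 21:42 21:42
22:35 22:76 22:27 22:28 22:25 22:15 24:1888
25:9 33:322 33:452 33:452 33:452 33:452 33:452 35:14
"""

parsed = parse_features(sample)
binary_row = to_sparse_dict(parsed)                  # set semantics
count_row = to_sparse_dict(parsed, use_counts=True)  # repeat counts as weights
```

A list of such dictionaries, one per record, is a common input shape for dictionary-based vectorizers (for example scikit-learn's `DictVectorizer` or `FeatureHasher`), which turn them into a sparse matrix. Training once on the binary rows and once on the count rows is a simple way to check empirically whether the repetitions help.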