I have 2 questions about formatting data for contextual bandit model training.
If I have data such as below...
1:1:0.2 | d1:us d2:female d3:12
Question 1) I read on the VW wiki that each feature is optionally followed by a float. When the values are categorical (such as us and female), what is the best way to reformat them? I am thinking of leaving off the float suffix and letting them take the default value of 1. I'm hoping this would achieve one-hot encoding.
Question 2) I've been training the model incorrectly by logging the data as below:
1:1:0.2 | us female 12 
What I now realize is that "us", "female", and "12" are each treated as a feature with the default value of 1. Am I correct?
Yes, you're correct.
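The input feature format is a space-separated list of features, each written as <name>:<value>, where :<value> is optional and, if present, must be numeric. A feature with no value gets the default value 1. This also means that d1:us in your first example is not valid input, because us is not a number.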
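To make the rule concrete, here is a minimal Python sketch of how a feature string is read under that format. It is a simplified illustration, not vw's actual parser: the function name parse_features is hypothetical, and labels and namespaces are ignored.

```python
def parse_features(text):
    """Split a space-separated feature string into (name, value) pairs.

    Simplified illustration of the <name>[:<value>] rule only;
    it is NOT vw's parser and ignores labels and namespaces.
    """
    features = []
    for token in text.split():
        name, sep, raw = token.partition(":")
        if not sep:
            # No ":" at all: the whole token is the name, value defaults to 1.
            features.append((token, 1.0))
            continue
        try:
            value = float(raw)
        except ValueError:
            # e.g. "d1:us": the part after ":" must be numeric.
            raise ValueError(f"non-numeric value in {token!r}") from None
        features.append((name, value))
    return features


print(parse_features("us female 12"))
print(parse_features("d1=us d2=female d3:12"))
```

With this reading, "us female 12" gives three features that each carry the value 1, while a token such as d1:us is rejected.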
To represent categorical values, use something other than : as the separator between <name> and <value>, for example d1=us. The whole string is then treated as the feature name, with the default value 1. This is often called "one-hot encoding": each possible feature+value combination is treated as a separate feature.
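As a rough sketch of how the logging code could emit such lines, the Python helper below writes string values as name=value and numeric values as name:value. The helper name to_vw_line and the rule "strings are categorical, numbers are numeric" are assumptions made for illustration. Whether d3 should be numeric (d3:12) or categorical (d3=12) depends on what it means in your data.

```python
def to_vw_line(action, cost, probability, context):
    """Build one contextual bandit example: action:cost:probability | features.

    Hypothetical helper. Assumes names and values contain no spaces,
    "|" or ":" characters, which would break the line format.
    """
    tokens = []
    for name, value in context.items():
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            # Categorical: fold the value into the feature name (one-hot).
            tokens.append(f"{name}={value}")
        else:
            # Numeric: keep the value as the feature's weight.
            tokens.append(f"{name}:{value}")
    return f"{action}:{cost}:{probability} | " + " ".join(tokens)


line = to_vw_line(1, 1, 0.2, {"d1": "us", "d2": "female", "d3": 12})
print(line)  # 1:1:0.2 | d1=us d2=female d3:12
```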
Also note that the feature name 12 is not hashed: vw maps it directly to slot 12 (modulo 2^bits) in the hash table, on the assumption that this is what the user wants, since numeric feature names are common (and are the libSVM convention). This can be disabled with the option --hash all on the command line. The default is --hash strings, meaning: apply the murmur3 hash to feature names that look like strings (not integers), but leave feature names that look like numbers unhashed.
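A toy Python model of that rule is sketched below. It is not vw's implementation: zlib.crc32 stands in for murmur3, namespaces are ignored, and feature_index and its parameters are hypothetical names. The bits=18 default mirrors vw's default table size.

```python
import zlib


def feature_index(name, bits=18, hash_all=False):
    """Toy model of the --hash strings / --hash all rule.

    NOT vw's code: zlib.crc32 is a stand-in for murmur3 and
    namespaces are ignored.
    """
    mask = (1 << bits) - 1
    if not hash_all and name.isascii() and name.isdigit():
        # Integer-looking names are used as the slot directly (modulo 2^bits).
        return int(name) & mask
    # Everything else (or everything, with hash_all=True) goes through the hash.
    return zlib.crc32(name.encode("utf-8")) & mask


print(feature_index("12"))                 # 12
print(feature_index("12", hash_all=True))  # some hashed slot instead of 12
print(feature_index("d3=12"))              # hashed, because it is not an integer
```

One practical consequence under this model: a bare name such as 12 can land on the same slot as a hashed string feature, which is another reason to write it as d3=12 or d3:12.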
See also: How to represent categorical features in vowpal-wabbit, which includes a cheat-sheet for representing input features in vw.