I am new to neural networks and I have a question about classification with the nnet package.
I have data which is a mixture of numeric and categorical variables. I wanted to make a win/lose prediction using nnet with a function call such as
nnet(WL~., data=training, size=10)
but this gives a different result than if I use a data frame with only numeric versions of the variables (i.e. converting all the factors, except my prediction WL, to numeric).
Can someone explain to me what is happening here? I guess nnet is interpreting the variables differently, but I would like to understand what is happening. I appreciate it's difficult without any data to recreate the problem, but I am just looking for a high-level explanation of how neural networks are fitted using nnet. I can't find this anywhere. Many thanks.
str(training)
'data.frame': 1346 obs. of 9 variables:
$ WL : Factor w/ 2 levels "win","lose": 2 2 1 1 NA 1 1 2 2 2 ...
$ team.rank : int 17 19 19 18 17 16 15 14 14 16 ...
$ opponent.rank : int 14 12 36 16 12 30 11 38 27 31 ...
$ HA : Factor w/ 2 levels "A","H": 1 1 2 2 2 2 2 1 1 2 ...
$ comp.stage : Factor w/ 3 levels "final","KO","league": 3 3 3 3 3 3 3 3 3 3 ...
$ days.since.last.match: num 132 9 5 7 14 7 7 7 14 7 ...
$ days.to.next.match : num 9 5 7 14 7 9 7 9 7 8 ...
$ comp.last.match : Factor w/ 5 levels "Anglo-Welsh Cup",..: 5 5 5 5 5 5 3 5 3 5 ...
$ comp.next.match : Factor w/ 4 levels "Anglo-Welsh Cup",..: 4 4 4 4 4 3 4 3 4 3 ...
vs
str(training.nnet)
'data.frame': 1346 obs. of 9 variables:
$ WL : Factor w/ 2 levels "win","lose": 2 2 1 1 NA 1 1 2 2 2 ...
$ team.rank : int 17 19 19 18 17 16 15 14 14 16 ...
$ opponent.rank : int 14 12 36 16 12 30 11 38 27 31 ...
$ HA : num 1 1 2 2 2 2 2 1 1 2 ...
$ comp.stage : num 3 3 3 3 3 3 3 3 3 3 ...
$ days.since.last.match: num 132 9 5 7 14 7 7 7 14 7 ...
$ days.to.next.match : num 9 5 7 14 7 9 7 9 7 8 ...
$ comp.last.match : num 5 5 5 5 5 5 3 5 3 5 ...
$ comp.next.match : num 4 4 4 4 4 3 4 3 4 3 ...
The difference you are looking for can be explained with a very small example:
fit.factors <- nnet(y ~ x, data.frame(y=c('W', 'L', 'W'), x=c('1', '2' , '3')), size=1)
fit.factors
# a 2-1-1 network with 5 weights
# inputs: x2 x3
# output(s): y
# options were - entropy fitting
fit.numeric <- nnet(y ~ x, data.frame(y=c('W', 'L', 'W'), x=c(1, 2, 3)), size=1)
fit.numeric
# a 1-1-1 network with 4 weights
# inputs: x
# output(s): y
# options were - entropy fitting
When R fits a model from a formula, factor variables are split out into several indicator (dummy) variables. So a factor variable x = c('1', '2', '3') is actually split into three variables x1, x2 and x3, exactly one of which holds the value 1 while the others hold 0. Moreover, since the levels {1, 2, 3} are exhaustive, one (and only one) of x1, x2, x3 must be 1, so x1 + x2 + x3 = 1 and the three variables are not independent. We can therefore drop the first variable x1 and keep only x2 and x3 in the model, and conclude that the level is 1 whenever x2 == 0 and x3 == 0.
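You can see this expansion directly with model.matrix(), which is the same machinery R's formula interface uses (nnet's formula method builds its input matrix this way and then drops the intercept column). A small illustration, with the attribute output trimmed:

x <- factor(c('1', '2', '3'))
model.matrix(~ x - 1)   # full indicator coding: one column per level
#   x1 x2 x3
# 1  1  0  0
# 2  0  1  0
# 3  0  0  1
model.matrix(~ x)       # default treatment coding: x1 dropped as the reference level
#   (Intercept) x2 x3
# 1           1  0  0
# 2           1  1  0
# 3           1  0  1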
That is what you see in the output of nnet: when x is a factor there are actually length(levels(x)) - 1 inputs to the neural network, and when x is numeric there is only one input, namely x itself.
Most R regression functions (nnet, randomForest, glm, gbm, etc.) do this mapping from factor levels to dummy variables internally, so as a user you normally don't need to be aware of it.
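For instance, lm() shows the same expansion in its coefficient names; the data frame below is made up purely for illustration:

d <- data.frame(y = c(1, 0, 1, 0, 1, 0),
                x = factor(c('1', '2', '3', '1', '2', '3')))
coef(lm(y ~ x, data = d))
# (Intercept)          x2          x3
# one coefficient per non-reference level, named like the nnet inputs above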
Now the difference between using the dataset with factors and the dataset with numbers in place of the factors should be clear. If you do the conversion to numbers, you are fixing a specific (and possibly arbitrary) ordering of the levels, and you are assuming that the distance between consecutive levels is the same. This does result in a slightly simpler model (with fewer inputs, since we no longer need a dummy variable for each level), but it is often not the correct thing to do.
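To make concrete what the numeric conversion assumes, here is a small sketch using one of the factors from the question (with the levels fixed explicitly, so the integer codes do not depend on locale sorting):

comp.stage <- factor(c('final', 'KO', 'league'),
                     levels = c('final', 'KO', 'league'))
as.numeric(comp.stage)
# [1] 1 2 3
# 'KO' is now exactly halfway between 'final' and 'league': the conversion
# imposes an ordering and an equal spacing that the original categories never had.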