I use SPSS Modeler v18.2.1 with R v3.5.1 (or v3.3.3) and Essentials for R 18.2.1.
I'm trying to build "Extension Transform (R syntax)" nodes to handle problems that are difficult in SPSS (eventually I want to turn them into Extension Bundles). I want them to add multiple columns, create new data, etc., AND pass a data.frame to the next node. But the data.frame is recognized incorrectly by the downstream SPSS nodes (i.e., the output of a following Table node differs from the console output of print(modelerData)).
How can I do this? (Or is it a bug?)
Any help would be greatly appreciated. Below is a simple reproducible example:
[Preparing the R environment and data (please run this in plain R)]
# if not installed
install.packages("randomForest")
set.seed(1) # to reproduce
write.csv(iris[sort(sample(1:150, 100)), ], "iris_train_seed1.csv", row.names = FALSE)
[My node flow]
[R code of Extension Transform]
### library ###
library(randomForest)
# make_model
set.seed(1)
modelerModel <- randomForest(formula = Species ~ . ,
                             data = modelerData,
                             ntree = 100)
#### predict
pred_forest <- data.frame(pred = predict(modelerModel,
                                         newdata = modelerData))
prob_forest <- as.data.frame(predict(modelerModel,
                                     newdata = modelerData,
                                     type = "prob"))
# overwriting modelerData
modelerData <- cbind(modelerData, pred_forest, prob_forest)
# function definition to make modelerDataModel
getMetaData <- function (data) {
  if (dim(data)[1]<=0) {
    print("Warning : modelerData has no line, all fieldStorage fields set to strings")
    getStorage <- function(x){return("string")}
  } else {
    getStorage <- function(x) {
      res <- NULL
      #if x is a factor, typeof will return an integer so we treat the case on the side
      if(is.factor(x)) {
        res <- "string"
      } else {
        res <- switch(typeof(unlist(x)),
                      integer = "integer",
                      # integer = "real",
                      double = "real",
                      character = "string",
                      "string")
      }
      return (res)
    }
  }
  col = vector("list", dim(data)[2])
  for (i in 1:dim(data)[2]) {
    col[[i]] <- c(fieldName=names(data[i]),
                  fieldLabel="",
                  fieldStorage=getStorage(data[[i]]),
                  fieldMeasure="",
                  fieldFormat="",
                  fieldRole="")
  }
  mdm<-do.call(cbind,col)
  mdm<-data.frame(mdm)
  return(mdm)
}
# overwriting modelerDataModel
modelerDataModel <- getMetaData(modelerData)
# to check
print(dim(modelerData))
print(head(modelerData))
print(dim(modelerDataModel))
print(modelerDataModel)
[Console output of the "to check" part (print(modelerData) is my desired output for the Table node)]
# print(dim(modelerData))
[1] 100 9
# print(head(modelerData))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species pred setosa
1 4.9 3.0 1.4 0.2 setosa setosa 1
2 4.7 3.2 1.3 0.2 setosa setosa 1
3 5.0 3.6 1.4 0.2 setosa setosa 1
4 5.4 3.9 1.7 0.4 setosa setosa 1
5 4.6 3.4 1.4 0.3 setosa setosa 1
6 5.0 3.4 1.5 0.2 setosa setosa 1
versicolor virginica
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
# print(dim(modelerDataModel))
[1] 6 9
# print(modelerDataModel)
X1 X2 X3 X4 X5 X6
fieldName Sepal.Length Sepal.Width Petal.Length Petal.Width Species pred
fieldLabel
fieldStorage real real real real string string
fieldMeasure
fieldFormat
fieldRole
X7 X8 X9
fieldName setosa versicolor virginica
fieldLabel
fieldStorage real real real
fieldMeasure
fieldFormat
fieldRole
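The storage mapping shown in the console output can also be checked in plain R, outside Modeler. The following is only a minimal sketch: storage_of is a hypothetical helper that mirrors the getStorage logic above, and df is a stand-in built from iris rather than the real modelerData.

```r
# Hypothetical helper mirroring the getStorage logic (factor -> "string").
storage_of <- function(x) {
  if (is.factor(x)) return("string")
  switch(typeof(x),
         integer = "integer",
         double = "real",
         character = "string",
         "string")
}

# Stand-in for the node's output: iris plus one factor and one numeric column.
df <- cbind(iris, pred = iris$Species, setosa = 1)

# One storage label per column, named after the column.
storage <- vapply(df, storage_of, character(1))
print(storage)

# The metadata should describe exactly the columns present in the data.
stopifnot(length(storage) == ncol(df),
          identical(names(storage), names(df)))
```

If this check passes in plain R but the Table node still shows extra columns, the mismatch is more likely to come from how Modeler reads the data than from the metadata function itself.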
[Output of the Table node (why does it have 11 columns?)]
This might be because your Species and pred columns are of type factor, not character, and looking at the SPSS node docs, they don't have a type for factor. Since factor has two levels, the 2 additional columns in the Table node output could be representing the factor levels for those two columns as SPSS tries to coerce them to strings. You need them as factors for the predict function at the start of your script, but right before you export to the Table node, try:
modelerData[] <- lapply(modelerData, function(x) if (is.factor(x)) as.character(x) else {x})
I don't have SPSS to be able to test this theory, but hopefully that solves your problem or gets you a little closer.
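The conversion itself can at least be verified in plain R, without SPSS. This is a small sketch under assumptions: modelerData here is a stand-in built from iris with an extra factor column named pred, not the data frame that Modeler actually supplies.

```r
# Stand-in for modelerData: iris plus a factor prediction column (assumption).
modelerData <- cbind(iris, pred = iris$Species)

# Column classes before conversion: Species and pred are factors.
before <- vapply(modelerData, function(x) class(x)[1], character(1))

# Convert every factor column to character, leaving other columns untouched.
# Assigning into modelerData[] keeps the object a data.frame.
modelerData[] <- lapply(modelerData, function(x) {
  if (is.factor(x)) as.character(x) else x
})

# Column classes after conversion.
after <- vapply(modelerData, function(x) class(x)[1], character(1))
print(before)
print(after)
```

After the conversion, no column should report the class factor, and the number of columns stays the same. Whether this removes the extra columns in the Table node still has to be confirmed inside Modeler.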