 

How to build "modelerData" and "modelerDataModel" correctly in an "Extension Transform (R syntax)" node that adds multiple columns

Tags: r, spss-modeler

I use SPSS Modeler v18.2.1 with R v3.5.1 (or v3.3.3) and Essentials for R 18.2.1.

I'm trying to build "Extension Transform (R syntax)" nodes to handle some problems that are difficult to solve in SPSS (and, in the future, to package them as Extension Bundles). I want them to add multiple columns, create new data, etc., and pass the resulting data.frame to the next node. But the data.frame is not recognized correctly by the SPSS nodes (i.e., the output of a downstream Table node differs from the console output of print(modelerData)).
How can I do this correctly? (Or is it a bug?)

Any help would be greatly appreciated. Below is a simple reproducible example:

[Preparing the R environment and data (please run this in plain R)]

# if not installed (the package name must be quoted)
install.packages("randomForest")

set.seed(1)  # to reproduce
write.csv(iris[sort(sample(1:150, 100)), ], "iris_train_seed1.csv", row.names = FALSE)

[My node flow]
(image: screenshot of the node flow)

[R code of Extension Transform]

### library ###
library(randomForest)

# make_model
set.seed(1)
modelerModel <- randomForest(formula = Species ~ . ,
                             data = modelerData,
                             ntree = 100)

#### predict
pred_forest <- data.frame(pred = predict(modelerModel, 
                                         newdata = modelerData))
prob_forest <- as.data.frame(predict(modelerModel, 
                                     newdata = modelerData,
                                     type = "prob"))


# overwriting modelerData
modelerData <- cbind(modelerData, pred_forest, prob_forest)

# function definition to make modelerDataModel 
getMetaData <- function (data) {
  if (dim(data)[1]<=0) {
    print("Warning : modelerData has no line, all fieldStorage fields set to strings")
    getStorage <- function(x){return("string")}
  } else {
    getStorage <- function(x) {
      res <- NULL
      #if x is a factor, typeof will return an integer so we treat the case on the side
      if(is.factor(x)) {
        res <- "string"
      } else {
        res <- switch(typeof(unlist(x)),
                      integer = "integer",
                      #  integer = "real",      
                      double = "real",
                      character = "string",
                      "string")
      }
      return (res)
    }
  }
  col = vector("list", dim(data)[2])
  for (i in 1:dim(data)[2]) {
    col[[i]] <- c(fieldName=names(data[i]),
                  fieldLabel="",
                  fieldStorage=getStorage(data[[i]]), 
                  fieldMeasure="",
                  fieldFormat="",
                  fieldRole="")
  }
  mdm<-do.call(cbind,col)
  mdm<-data.frame(mdm)
  return(mdm)
}

# overwriting modelerDataModel
modelerDataModel <- getMetaData(modelerData)

# to check
print(dim(modelerData))
print(head(modelerData))
print(dim(modelerDataModel))
print(modelerDataModel)

[Console output of the "to check" part (the print(head(modelerData)) output is what I want the Table node to show)]

# print(dim(modelerData))
[1] 100   9

# print(head(modelerData))
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   pred setosa
1          4.9         3.0          1.4         0.2  setosa setosa      1
2          4.7         3.2          1.3         0.2  setosa setosa      1
3          5.0         3.6          1.4         0.2  setosa setosa      1
4          5.4         3.9          1.7         0.4  setosa setosa      1
5          4.6         3.4          1.4         0.3  setosa setosa      1
6          5.0         3.4          1.5         0.2  setosa setosa      1
  versicolor virginica
1          0         0
2          0         0
3          0         0
4          0         0
5          0         0
6          0         0

# print(dim(modelerDataModel))
[1] 6 9

# print(modelerDataModel)
                       X1          X2           X3          X4      X5     X6
fieldName    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   pred
fieldLabel                                                                   
fieldStorage         real        real         real        real  string string
fieldMeasure                                                                 
fieldFormat                                                                  
fieldRole                                                                    
                 X7         X8        X9
fieldName    setosa versicolor virginica
fieldLabel                              
fieldStorage   real       real      real
fieldMeasure                            
fieldFormat                             
fieldRole  
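
The printed output above suggests that both objects describe 9 fields. A stricter consistency check can be sketched in plain R as follows. This is only a sketch under stated assumptions: `demo_data` and `demo_model` are stand-ins built from `iris` (not objects supplied by SPSS Modeler), and `describe_field` and `check_data_model` are hypothetical helper names. It does not show how the Table node interprets the objects, only whether the data and its metadata agree on field count and field names.

```r
# Stand-in for modelerData (assumption): iris plus a factor "pred" column
demo_data <- cbind(iris, pred = iris$Species)

# Hypothetical helper: one metadata column per field, in the same layout
# as the printed modelerDataModel (rows fieldName, fieldLabel, ...)
describe_field <- function(nm) {
  x <- demo_data[[nm]]
  storage <- if (is.factor(x) || is.character(x)) {
    "string"
  } else if (is.integer(x)) {
    "integer"
  } else {
    "real"
  }
  c(fieldName = nm, fieldLabel = "", fieldStorage = storage,
    fieldMeasure = "", fieldFormat = "", fieldRole = "")
}

# Stand-in for modelerDataModel (assumption): 6 metadata rows, one column per field
demo_model <- as.data.frame(
  vapply(names(demo_data), describe_field, character(6)),
  stringsAsFactors = FALSE
)

# Hypothetical check: same number of fields, and the same names in the same order
check_data_model <- function(data, model) {
  field_names <- as.character(unlist(model["fieldName", ]))
  list(
    same_count = ncol(data) == ncol(model),
    same_names = identical(field_names, names(data))
  )
}

print(check_data_model(demo_data, demo_model))
```

If both entries are TRUE inside the node as well, the extra columns in the Table node are probably not caused by a mismatch between the two objects.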

[Output of the Table node (why are there 11 columns?)]
(image: screenshot of the Table node output, showing 11 columns)

Asked by cuttlefish44, Jun 10 '20 03:06


1 Answer

This might be because your Species and pred columns are factors rather than character vectors, and, looking at the SPSS node docs, there is no storage type for factors. Since there are two factor columns, the two additional columns in the Table node output could represent the factor levels of those columns as SPSS tries to coerce them to strings. You need them as factors for the predict function at the start of your script, but right before the data is passed on to the Table node, try:

modelerData[] <- lapply(modelerData, function(x) if (is.factor(x)) as.character(x) else {x})

I don't have SPSS to be able to test this theory, but hopefully that solves your problem or gets you a little closer.
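
What that line does to the column classes can at least be checked in plain R. A minimal sketch, assuming a stand-in data frame built from `iris` (the names `demo_data` and `to_character_cols` are hypothetical and not part of SPSS Modeler); it shows only the effect on column classes, not how the Table node reacts:

```r
# Stand-in for modelerData (assumption): iris plus a factor "pred" column
demo_data <- cbind(iris, pred = iris$Species)

# Column classes before the conversion: Species and pred are factors
classes_before <- vapply(demo_data, function(x) class(x)[1], character(1))

# Same conversion as in the answer, wrapped in a hypothetical helper.
# Assigning into df[] keeps the data.frame structure and column names.
to_character_cols <- function(df) {
  df[] <- lapply(df, function(x) if (is.factor(x)) as.character(x) else x)
  df
}
demo_converted <- to_character_cols(demo_data)

# Column classes after the conversion: the former factors are now character
classes_after <- vapply(demo_converted, function(x) class(x)[1], character(1))

print(classes_before)
print(classes_after)
```

The numeric columns are left untouched, and the dimensions of the data frame do not change.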

Answered by Anna Nevison, Sep 27 '22 21:09