Consider the following example:
dtrain <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "Chinese Macao",
                              "Tokyo Japan Chinese"),
                     doc_id = 1:4,
                     class = c(1, 1, 1, 0))
dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)
> dtrain_spark
# Source: table<dtrain> [?? x 3]
# Database: spark_connection
  text                     doc_id class
  <chr>                     <int> <dbl>
1 Chinese Beijing Chinese       1     1
2 Chinese Chinese Shanghai      2     1
3 Chinese Macao                 3     1
4 Tokyo Japan Chinese           4     0
Here I have the classic Naive Bayes example, where class identifies documents falling into the China category.
I am able to run a Naive Bayes classifier in sparklyr by doing the following:
dtrain_spark %>%
  ft_tokenizer(input_col = "text", output_col = "tokens") %>%
  ft_count_vectorizer(input_col = "tokens", output_col = "myvocab") %>%
  select(myvocab, class) %>%
  ml_naive_bayes(label_col = "class",
                 features_col = "myvocab",
                 prediction_col = "pcol",
                 probability_col = "prcol",
                 raw_prediction_col = "rpcol",
                 model_type = "multinomial",
                 smoothing = 0.6,
                 thresholds = c(0.2, 0.4))
which outputs:
NaiveBayesModel (Transformer)
<naive_bayes_5e946aec597e>
(Parameters -- Column Names)
features_col: myvocab
label_col: class
prediction_col: pcol
probability_col: prcol
raw_prediction_col: rpcol
(Transformer Info)
num_classes: int 2
num_features: int 6
pi: num [1:2] -1.179 -0.368
theta: num [1:2, 1:6] -1.417 -0.728 -2.398 -1.981 -2.398 ...
thresholds: num [1:2] 0.2 0.4
However, I have two major questions:
How can I assess the performance of this classifier in-sample? Where are the accuracy metrics?
Even more importantly, how can I use this trained model to predict new values, say, in the following spark test dataframe?
Test data:
dtest <- data_frame(text = c("Chinese Chinese Chinese Tokyo Japan",
                             "random stuff"))
dtest_spark <- copy_to(sc, dtest, overwrite = TRUE)
> dtest_spark
# Source: table<dtest> [?? x 1]
# Database: spark_connection
  text
  <chr>
1 Chinese Chinese Chinese Tokyo Japan
2 random stuff
Thanks!
How can I assess the performance of this classifier in-sample? Where are the accuracy metrics?
In general (a few models do provide some form of summary), evaluation on the training dataset is a separate step in Apache Spark. This fits nicely into the native Pipeline API.
Background:
Spark ML Pipelines are primarily built from two types of objects:
Transformers - objects that provide a transform method, which maps a DataFrame to an updated DataFrame.
You can apply a Transformer with the ml_transform method.
Estimators - objects that provide a fit method, which maps a DataFrame to a Transformer. By convention, corresponding Estimator / Transformer pairs are called Foo / FooModel.
You can fit an Estimator in sparklyr using the ml_fit method.
Additionally, ML Pipelines can be combined with Evaluators (see the ml_*_evaluator and ml_*_eval methods), which can be used to compute different metrics on the transformed data, based on the columns generated by a model (usually a probability column or raw prediction).
You can apply an Evaluator using the ml_evaluate method.
Related components include cross-validators and train-validation splits, which can be used for parameter tuning.
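To make the distinction concrete, here is a minimal sketch (my own, assuming the sc connection and the dtrain_spark table from the question):
# A Transformer: ft_tokenizer bound to a connection simply maps columns
tokenizer <- ft_tokenizer(sc, input_col = "text", output_col = "tokens")
ml_transform(tokenizer, dtrain_spark)  # dtrain_spark plus a tokens column

# An Estimator: ml_naive_bayes bound to a connection has to be fitted first;
# ml_fit maps a DataFrame to a NaiveBayesModel (itself a Transformer)
nb <- ml_naive_bayes(sc, label_col = "class", features_col = "myvocab")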
Examples:
sparklyr PipelineStages can be evaluated eagerly (as in your own code) by passing data directly, or lazily by passing a spark_connection instance and calling the aforementioned methods (ml_fit, ml_transform, etc.).
This means you can define a Pipeline as follows:
pipeline <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = "tokens", output_col = "myvocab"),
  ml_naive_bayes(sc, label_col = "class",
                 features_col = "myvocab",
                 prediction_col = "pcol",
                 probability_col = "prcol",
                 raw_prediction_col = "rpcol",
                 model_type = "multinomial",
                 smoothing = 0.6,
                 thresholds = c(0.2, 0.4),
                 uid = "nb")
)
Fit the PipelineModel:
model <- ml_fit(pipeline, dtrain_spark)
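If you want to peek at the fitted stages (including the NaiveBayesModel), a small sketch:
ml_stages(model)  # fitted stages: Tokenizer, CountVectorizerModel, NaiveBayesModel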
Transform, and apply one of the available Evaluators:
ml_transform(model, dtrain_spark) %>%
  ml_binary_classification_evaluator(
    label_col = "class", raw_prediction_col = "rpcol",
    metric_name = "areaUnderROC")
[1] 1
or
evaluator <- ml_multiclass_classification_evaluator(
  sc,
  label_col = "class", prediction_col = "pcol",
  metric_name = "f1")
ml_evaluate(evaluator, ml_transform(model, dtrain_spark))
[1] 1
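If you want a plain accuracy number, one option (a sketch of mine, using the column names defined above) is to compare the prediction column with the label directly via dplyr; depending on your Spark version, the multiclass evaluator may also accept metric_name = "accuracy":
# Fraction of training rows where the predicted class equals the label
ml_transform(model, dtrain_spark) %>%
  mutate(correct = as.numeric(pcol == class)) %>%
  summarise(accuracy = mean(correct))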
Even more importantly, how can I use this trained model to predict new values, say, in the following spark test dataframe?
Use either ml_transform or ml_predict (the latter is a convenience wrapper, which applies further transformations to the output):
ml_transform(model, dtest_spark)
# Source: table<sparklyr_tmp_cc651477ec7> [?? x 6]
# Database: spark_connection
  text                                tokens     myvocab   rpcol   prcol   pcol
  <chr>                               <list>     <list>    <list>  <list> <dbl>
1 Chinese Chinese Chinese Tokyo Japan <list [5]> <dbl [6]> <dbl [… <dbl …     0
2 random stuff                        <list [2]> <dbl [6]> <dbl [… <dbl …     1
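If you only need the predicted classes back in R, a small sketch:
ml_transform(model, dtest_spark) %>%
  select(text, pcol) %>%  # keep only the text and the predicted class
  collect()               # bring the (small) result into a local tibble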
Cross validation:
There is not enough data in the example, but you can cross-validate and tune hyperparameters as shown below:
# dontrun
ml_cross_validator(
  dtrain_spark,
  pipeline,
  list(nb = list(smoothing = list(0.8, 1.0))),  # note that the name matches the uid
  evaluator = evaluator)
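Assuming enough data for the folds, you can store the fitted cross-validator and inspect the metric for each parameter combination with ml_validation_metrics (a sketch continuing the call above):
# dontrun
cv_model <- ml_cross_validator(
  dtrain_spark,
  pipeline,
  list(nb = list(smoothing = list(0.8, 1.0))),
  evaluator = evaluator)
ml_validation_metrics(cv_model)  # one row per smoothing candidate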
Notes:
If you use Pipelines with Vector columns (not formula-based calls), I strongly recommend using standardized (default) column names:
label for the dependent variable.
features for the assembled independent variables.
rawPrediction, prediction, probability for the raw prediction, prediction and probability columns respectively.
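For example, a sketch of the same pipeline with default names (my naming; only the label has to be renamed on the data side):
pipeline_std <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = "tokens", output_col = "features"),
  ml_naive_bayes(sc, model_type = "multinomial")  # label / features defaults apply
)
model_std <- ml_fit(pipeline_std, dtrain_spark %>% mutate(label = class))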