I am trying to build, for each of my users, a vector containing the average number of records per hour of the day. Hence the vector needs to have 24 dimensions.
My original DataFrame has userID and hour columns, and I am starting by doing a groupBy and counting the number of records per user per hour as follows:
val hourFreqDF = df.groupBy("userID", "hour").agg(count("*") as "hfreq")
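(For illustration, here is a minimal sketch of this step on toy data; the sample rows and the spark session variable are assumptions, not part of my real dataset.)

import org.apache.spark.sql.functions.count
import spark.implicits._  // assumes an existing SparkSession named spark

// Toy (userID, hour) records standing in for the real df.
val toyDF = Seq(
  ("u1", 0), ("u1", 0), ("u1", 13),
  ("u2", 5)
).toDF("userID", "hour")

val toyHourFreqDF = toyDF.groupBy("userID", "hour").agg(count("*") as "hfreq")
toyHourFreqDF.show()
// Produces rows such as (u1, 0, 2), (u1, 13, 1), (u2, 5, 1); row order may vary.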
Now, in order to generate a vector per user, I am doing the following, based on the first suggestion in this answer:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.{avg, lit, when}
import spark.implicits._  // assumes an existing SparkSession named spark, needed for the $-syntax

// One column name per hour of the day: "0" .. "23"
val hours = (0 to 23).map(n => s"$n").toArray

val assembler = new VectorAssembler()
  .setInputCols(hours)
  .setOutputCol("hourlyConnections")

// For each hour, average hfreq when the row matches that hour, 0 otherwise
val exprs = hours.map(c => avg(when($"hour" === c, $"hfreq").otherwise(lit(0))).alias(c))

val transformed = assembler.transform(
  hourFreqDF.groupBy($"userID").agg(exprs.head, exprs.tail: _*))
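For reference, a quick way to inspect the assembled vectors (just a usage sketch; the column names are taken from the snippet above):

// Print the 24-dimensional vector per user; truncate = false shows the full vector.
transformed.select("userID", "hourlyConnections").show(truncate = false)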
When I run this example, I get the following warning:
Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
I presume this is because the expression is too long?
My question is: can I safely ignore this warning?
You can safely ignore it if you are not interested in seeing the SQL schema logs. Otherwise, you might want to set the property to a higher value, but doing so may affect the performance of your job:
spark.debug.maxToStringFields=100
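For example (a sketch, assuming you set the property when building the session in code rather than in SparkEnv.conf; the value 100 is just an illustration):

import org.apache.spark.sql.SparkSession

// Raise the truncation threshold for plan/schema string representations.
val spark = SparkSession.builder()
  .appName("hourly-vectors")
  .config("spark.debug.maxToStringFields", "100")
  .getOrCreate()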
Default value is: DEFAULT_MAX_TO_STRING_FIELDS = 25
The performance overhead of creating and logging strings for wide schemas can be large. To limit the impact, we bound the number of fields to include by default. This can be overridden by setting the 'spark.debug.maxToStringFields' conf in SparkEnv.
Taken from: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L90