I have 100 million rows stored in many .csv files in a distributed file system. I'm using spark_read_csv() to load the data without issue. Many of my columns are stored as character logical values: "true", "false", "<na>". I do not have control over this.
When I attempt to convert the values to logical, the "<na>" values are converted to FALSE along with the "false" values. Any thoughts on how to overcome this?
library(dplyr)  # also re-exports tibble()

test_lgl <-
  tibble(a = c(TRUE, TRUE, NA, NA, FALSE, FALSE),
         b = c("true", "true", "na", "<na>", "false", "f"))
test_lgl %>% mutate_if(is.character, as.logical)
# this works
a b
<lgl> <lgl>
1 TRUE TRUE
2 TRUE TRUE
3 NA NA
4 NA NA
5 FALSE FALSE
6 FALSE NA
library(sparklyr)

sc <- spark_connect(master = "local")
spark_lgl <- copy_to(sc, test_lgl)
spark_lgl %>% mutate_if(is.character, as.logical)
# this does not
a b
<lgl> <lgl>
1 TRUE TRUE
2 TRUE TRUE
3 FALSE FALSE
4 FALSE FALSE
5 FALSE FALSE
6 FALSE FALSE
When I attempt to convert the values to logical, the "<na>" values are converted to FALSE

Surprisingly, they are not. If you inspect the result further:
spark_lgl_boolean <- spark_lgl %>% mutate_if(is.character, as.logical)
spark_lgl_boolean %>% mutate_all(is.na)
Applying predicate on the first 100 rows
# Source: lazy query [?? x 2]
# Database: spark_connection
a b
<lgl> <lgl>
1 FALSE FALSE
2 FALSE FALSE
3 TRUE TRUE
4 TRUE TRUE
5 FALSE FALSE
6 FALSE FALSE
This is consistent with the NA count:
spark_lgl_boolean %>%
mutate_all(is.na) %>%
mutate_all(as.numeric) %>%
summarize_all(sum)
# Source: lazy query [?? x 2]
# Database: spark_connection
a b
<dbl> <dbl>
1 2 2
Spark execution plan:
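(optimizedPlan is not a sparklyr export; a minimal sketch of such a helper, built on sparklyr's spark_dataframe() and invoke() interface to Spark's query execution, could look like this:

optimizedPlan <- function(df) {
  df %>%
    spark_dataframe() %>%          # extract the underlying Spark DataFrame (a jobj)
    invoke("queryExecution") %>%   # call df.queryExecution on the JVM side
    invoke("optimizedPlan")        # fetch the optimized logical plan from Catalyst
}
)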
spark_lgl %>% mutate_if(is.character, as.logical) %>% optimizedPlan
Applying predicate on the first 100 rows
<jobj[1074]>
org.apache.spark.sql.catalyst.plans.logical.Project
Project [a#10, cast(b#11 as boolean) AS b#2037]
+- InMemoryRelation [a#10, b#11], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `test_lgl`
+- *FileScan csv [a#10,b#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/tmp/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:boolean,b:string>
and Spark casting logic for StringType -> BooleanType, where:

- TRUE / T (case insensitive) and 1 are converted to a true literal.
- FALSE / F (case insensitive) and 0 are converted to a false literal.
- Anything else is converted to NULL (~ R NA).

scala> Seq("tRUE", "FALSE", "f", "<na>", "NA", "1", "0").toDF("x").select($"x".cast("boolean")).show
+-----+
| x|
+-----+
| true|
|false|
|false|
| null|
| null|
| true|
|false|
+-----+
The problem seems to be introduced specifically by the sparklyr conversions. See Improve Serialization on GitHub (credit to kevinykuo for pointing this one out).
However, if you stick to Spark-based logic without fetching the data into R, things should work just fine (for example, if you write the data to file).
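For instance, the cast plus write can stay entirely in Spark. A minimal sketch (the output path is illustrative):

spark_lgl %>%
  mutate_if(is.character, as.logical) %>%       # cast runs in Spark, so "<na>" becomes NULL
  spark_write_parquet(path = "/tmp/test_lgl")   # illustrative output path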
I'm using spark_read_csv() to load the data
In that case you might take a look at the nullValue and nanValue options of the CSV reader. For example:
spark_read_csv(..., options=list(nullValue="<na>"))
or
spark_read_csv(..., options=list(nanValue="<na>"))
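A fuller sketch, with an illustrative table name and path; once the null marker is handled at read time, the subsequent logical cast never sees "<na>":

df <- spark_read_csv(
  sc,
  name = "logical_data",                # illustrative table name
  path = "hdfs:///path/to/csvs",        # illustrative path
  options = list(nullValue = "<na>")    # treat "<na>" as NULL while parsing
)
df %>% mutate_if(is.character, as.logical)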
But please keep in mind that the Spark semantics of NULL / NaN are not the same as R's NA / NaN.