
Convert a string to logical in R with sparklyr


I have 100 million rows stored in many .csv files in a distributed file system. I'm using spark_read_csv() to load the data without issue. Many of my columns are stored as character logical values: "true", "false", "<na>". I do not have control over this.

When I attempt to convert the values to logical, the "<na>" values are converted to FALSE with the "false" values. Any thoughts on how to overcome this?

test_lgl <- 
  tibble(a = c(TRUE, TRUE, NA, NA, FALSE, FALSE),
         b = c("true", "true", "na", "<na>", "false", "f"))

test_lgl %>% mutate_if(is.character, as.logical)

# this works
   a     b
  <lgl> <lgl>
1  TRUE  TRUE
2  TRUE  TRUE
3    NA    NA
4    NA    NA
5 FALSE FALSE
6 FALSE    NA

sc <- spark_connect(master = "local")
spark_lgl <- copy_to(sc, test_lgl)

spark_lgl %>% mutate_if(is.character, as.logical)

# this does not
      a     b
  <lgl> <lgl>
1  TRUE  TRUE
2  TRUE  TRUE
3 FALSE FALSE
4 FALSE FALSE
5 FALSE FALSE
6 FALSE FALSE
asked Nov 28 '17 by kputschko

1 Answer

When I attempt to convert the values to logical, the "<na>" values are converted to FALSE

Surprisingly, they are not. If you inspect the result further:

spark_lgl_boolean <- spark_lgl %>% mutate_if(is.character, as.logical)
spark_lgl_boolean %>% mutate_all(is.na)

Applying predicate on the first 100 rows
# Source:   lazy query [?? x 2]
# Database: spark_connection
      a     b
  <lgl> <lgl>
1 FALSE FALSE
2 FALSE FALSE
3  TRUE  TRUE
4  TRUE  TRUE
5 FALSE FALSE
6 FALSE FALSE

This is consistent with NA count:

spark_lgl_boolean %>%
  mutate_all(is.na) %>% 
  mutate_all(as.numeric) %>%
  summarize_all(sum)
# Source:   lazy query [?? x 2]
# Database: spark_connection
      a     b
  <dbl> <dbl>
1     2     2

Spark execution plan:

spark_lgl %>% mutate_if(is.character, as.logical) %>% optimizedPlan
Applying predicate on the first 100 rows
<jobj[1074]>
  org.apache.spark.sql.catalyst.plans.logical.Project
  Project [a#10, cast(b#11 as boolean) AS b#2037]
+- InMemoryRelation [a#10, b#11], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `test_lgl`
      +- *FileScan csv [a#10,b#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/tmp/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:boolean,b:string>

and with Spark's casting logic for StringType -> BooleanType, where:

  • Strings TRUE / T (case insensitive) and 1 are converted to true literal.
  • Strings FALSE / F (case insensitive) and 0 are converted to false literal.
  • Strings not matched above are converted to NULL (~NA).
scala> Seq("tRUE", "FALSE", "f", "<na>", "NA", "1", "0").toDF("x").select($"x".cast("boolean")).show
+-----+
|    x|
+-----+
| true|
|false|
|false|
| null|
| null|
| true|
|false|
+-----+
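For comparison, base R's as.logical() follows different rules: it accepts only "TRUE" / "True" / "true" / "T" and their false counterparts, and maps every other string, including "1" and "0", to NA. That is why the local tibble example above behaves as expected:

```r
as.logical(c("TRUE", "T", "true", "False", "f", "1", "0", "<na>"))
# [1]  TRUE  TRUE  TRUE FALSE    NA    NA    NA    NA
```

Note in particular that "1" and "0" convert in Spark but not in R, while single lowercase "f" converts in Spark but not in R.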

The problem seems to be introduced specifically by sparklyr conversions. See Improve Serialization on GitHub (credits to kevinykuo for pointing this one out).

However, if you stick to Spark-based logic without fetching data into R, things should work just fine (for example, if you write the data to a file).
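A minimal sketch of that route, assuming the connection and table from above (the output path is hypothetical):

```r
# Cast on the Spark side and write the result straight back out,
# without ever collecting the data into an R data frame.
spark_lgl %>%
  mutate(b = as.logical(b)) %>%
  spark_write_csv(path = "file:///tmp/test_lgl_out")
```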

I'm using spark_read_csv() to load the data

In that case you might take a look at the nullValue and nanValue options of the CSV reader. For example:

spark_read_csv(..., options=list(nullValue="<na>"))

or

spark_read_csv(..., options=list(nanValue="<na>"))

but please keep in mind that Spark's semantics for NULL / NaN are not the same as R's NA / NaN.
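Putting it together, a sketch of the read-time fix (the table name and path are hypothetical, standing in for the question's setup):

```r
# Map the literal string "<na>" to NULL while the CSV is parsed,
# so the boolean column arrives with proper NULLs instead of strings.
df <- spark_read_csv(
  sc,
  name = "test_lgl",
  path = "file:///data/test_lgl/*.csv",
  options = list(nullValue = "<na>")
)
```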

answered Sep 19 '22 by zero323