Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a way to deal with nested data with sparklyr?

Tags:

r

tidyr

sparklyr

In the following example I've loaded a parquet file that contains a nested record of map objects in the meta field. sparklyr seems to do a nice job of dealing with these. However tidyr::unnest does not translate to SQL (or HQL - understandably - like LATERAL VIEW explode()) and is thus not usable. Is there a way to unnest data some other way?

tfl <- head(tf)
tfl
Source:   query [?? x 10]
Database: spark connection master=yarn-client app=sparklyr local=FALSE

                            trkKey             meta     sources startTime
                             <chr>           <list>      <list>    <list>
1 3juPe-k0yiMcANNMa_YiAJfJyU7WCQ3Q <S3: spark_jobj> <list [24]> <dbl [1]>
2 3juPe-k0yiAJX3ocJj1fVqru-e0syjvQ <S3: spark_jobj>  <list [1]> <dbl [1]>
3 3juPe-k0yisY7UY_ufUPUo5mE1xGfmNw <S3: spark_jobj>  <list [7]> <dbl [1]>
4 3juPe-k0yikXT5FhqNj87IwBw1Oy-6cw <S3: spark_jobj> <list [24]> <dbl [1]>
5 3juPe-k0yi4MMU63FEWYTNKxvDpYwsRw <S3: spark_jobj>  <list [7]> <dbl [1]>
6 3juPe-k0yiFBz2uPbOQqKibCFwn7Fmlw <S3: spark_jobj> <list [19]> <dbl [1]>
# ... with 6 more variables: endTime <list>, durationInMinutes <dbl>,
#   numPoints <int>, maxSpeed <dbl>, maxAltitude <dbl>, primaryKey <chr>

There is also an issue when the data is collected. E.g.,

tfl <- head(tf) %>% collect()
tfl
# A tibble: 6 × 10
                            trkKey             meta     sources startTime
                             <chr>           <list>      <list>    <list>
1 3juPe-k0yiMcANNMa_YiAJfJyU7WCQ3Q <S3: spark_jobj> <list [24]> <dbl [1]>
2 3juPe-k0yiAJX3ocJj1fVqru-e0syjvQ <S3: spark_jobj>  <list [1]> <dbl [1]>
3 3juPe-k0yisY7UY_ufUPUo5mE1xGfmNw <S3: spark_jobj>  <list [7]> <dbl [1]>
4 3juPe-k0yikXT5FhqNj87IwBw1Oy-6cw <S3: spark_jobj> <list [24]> <dbl [1]>
5 3juPe-k0yi4MMU63FEWYTNKxvDpYwsRw <S3: spark_jobj>  <list [7]> <dbl [1]>
6 3juPe-k0yiFBz2uPbOQqKibCFwn7Fmlw <S3: spark_jobj> <list [19]> <dbl [1]>
# ... with 6 more variables: endTime <list>, durationInMinutes <dbl>,
#   numPoints <int>, maxSpeed <dbl>, maxAltitude <dbl>, primaryKey <chr>

tfl %>% unnest(meta)
Error: Each column must either be a list of vectors or a list of data frames [meta]

In the above, the meta file still contains spark_jobj elements instead of lists, data.frames, or even JSON strings (which is how Hive would return such data). This creates a situation where tidyr doesn't even work on the collected data.

Is there a way to get sparklyr to work more nicely with tidyr that I am missing? If not, is this planned for future sparklyr development?

like image 539
Matt Pollock Avatar asked Sep 01 '16 16:09

Matt Pollock


1 Answers

I finally have my answer to this. See https://mitre.github.io/sparklyr.nested/ (source: https://github.com/mitre/sparklyr.nested)

tf %>% 
  sdf_unnest(meta)

This will behave for Spark data frames similarly to how tidyr::unnest behaves for local data frames. Nested select and explode operations are also implemented.

Update:

as @cem-bilge notes explode can be used inside mutate. This is effective in situations where the array is simple (character or numeric) but less great in other situations.

iris2 <- copy_to(sc, iris, name="iris")
iris_nst <- iris2 %>%
  sdf_nest(Sepal_Length, Sepal_Width, Petal.Length, Petal.Width, .key="data") %>%
  group_by(Species) %>%
  summarize(data=collect_list(data))

Then

iris_nst %>% mutate(data = explode(data)) %>% sdf_schema_viewer()

produces

enter image description here

with the fields still nested (though exploded) whereas sdf_unnest yields

iris_nst %>% sdf_unnest(data) %>% sdf_schema_viewer()

enter image description here

like image 180
Matt Pollock Avatar answered Oct 15 '22 09:10

Matt Pollock