Is there a way to replicate the rows of a Spark dataframe using sparklyr/dplyr functions?
sc <- spark_connect(master = "spark://####:7077")
df_tbl <- copy_to(sc, data.frame(row1 = 1:3, row2 = LETTERS[1:3]), "df")
This is the desired output, which should be saved into a new Spark tbl:
> df2_tbl
row1 row2
<int> <chr>
1 1 A
2 1 A
3 1 A
4 2 B
5 2 B
6 2 B
7 3 C
8 3 C
9 3 C
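For reference, on a local data frame the desired result would just be base-R row indexing; the question is how to get the same effect on the Spark side:
df <- data.frame(row1 = 1:3, row2 = LETTERS[1:3])
df[rep(seq_len(nrow(df)), each = 3), ]   # each row repeated 3 times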
With sparklyr you can use array and explode as suggested by @Oli:
# explode() emits one output row per array element; the helper column is dropped afterwards
df_tbl %>%
  mutate(arr = explode(array(1, 1, 1))) %>%
  select(-arr)
# # Source: lazy query [?? x 2]
# # Database: spark_connection
# row1 row2
# <int> <chr>
# 1 1 A
# 2 1 A
# 3 1 A
# 4 2 B
# 5 2 B
# 6 2 B
# 7 3 C
# 8 3 C
# 9 3 C
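If the result should end up as a new Spark tbl (like df2_tbl in the question), one way, sketched here with the illustrative table name "df2", is to force the lazy query with dplyr's compute() or register it with sparklyr's sdf_register():
df2_tbl <- df_tbl %>%
  mutate(arr = explode(array(1, 1, 1))) %>%
  select(-arr) %>%
  compute("df2")   # or sdf_register(., "df2") to register a temporary view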
The same approach can be generalized:
library(rlang)
df_tbl %>%
  mutate(arr = !!rlang::parse_quo(
    # build "explode(array( 1,1,1 ))" as a string and parse it into a quosure
    paste("explode(array(", paste(rep(1, 3), collapse = ","), "))"),
    env = rlang::caller_env()
  )) %>%
  select(-arr)
# # Source: lazy query [?? x 2]
# # Database: spark_connection
# row1 row2
# <int> <chr>
# 1 1 A
# 2 1 A
# 3 1 A
# 4 2 B
# 5 2 B
# 6 2 B
# 7 3 C
# 8 3 C
# 9 3 C
where the number of copies per row is easy to adjust.
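If this is needed repeatedly, the same idea can be wrapped in a small helper (a sketch; the name replicate_rows and its interface are hypothetical, with rlang loaded as above):
# Repeat every row of a Spark tbl n times using the explode(array(...)) trick
replicate_rows <- function(sdf, n) {
  q <- rlang::parse_quo(
    paste("explode(array(", paste(rep(1, n), collapse = ","), "))"),
    env = rlang::caller_env()
  )
  sdf %>% mutate(arr = !!q) %>% select(-arr)
}
df_tbl %>% replicate_rows(3)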
With SparkR, the idea that comes to mind first is to use the explode function (that is exactly what it is meant for in Spark). Yet arrays do not seem to be supported in SparkR (to the best of my knowledge).
> structField("a", "array")
Error in checkType(type) : Unsupported type for SparkDataframe: array
I can however propose two other methods:
A straightforward but not very elegant one:
head(rbind(df, df, df), n = 30)   # stack three copies; head() collects up to 30 rows for display
# row1 row2
# 1 1 A
# 2 2 B
# 3 3 C
# 4 1 A
# 5 2 B
# 6 3 C
# 7 1 A
# 8 2 B
# 9 3 C
Or, with a for loop, for more generality:
df2 <- df
for (i in 1:2) df2 <- rbind(df, df2)   # appends two more copies, three in total
Note that this would also work with union().
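For example, with SparkR's union() (unionAll() on Spark versions before 2.0), the loop becomes:
df2 <- df
for (i in 1:2) df2 <- union(df2, df)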
The second, more elegant method (because it involves only one Spark operation) is based on a cross join (Cartesian product) with a dataframe of size 3 (or any other number):
j <- as.DataFrame(data.frame(s = 1:3))        # 3-row helper SparkDataFrame
head(drop(crossJoin(df, j), "s"), n = 100)    # Cartesian product, then drop the helper column
# row1 row2
# 1 1 A
# 2 1 A
# 3 1 A
# 4 2 B
# 5 2 B
# 6 2 B
# 7 3 C
# 8 3 C
# 9 3 C
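For completeness, the same cross-join idea can be sketched on the sparklyr side by joining on a constant helper column (the table name "replicator" and the column names are illustrative; depending on the Spark version, the resulting Cartesian product may require spark.sql.crossJoin.enabled):
replicator <- copy_to(sc, data.frame(s = 1:3, dummy = 1), "replicator", overwrite = TRUE)
df_tbl %>%
  mutate(dummy = 1) %>%
  inner_join(replicator, by = "dummy") %>%   # every row matches all 3 helper rows
  select(-dummy, -s)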