
Changing column data type to factor with sparklyr

I am pretty new to Spark and am currently using it through the R API via the sparklyr package. I created a Spark data frame from a Hive query. The data types are not specified correctly in the source table, and I am trying to reset them by leveraging functions from the dplyr package. Below is the code I tried:

prod_dev <- sdf_load_table(...)
num_var <-  c("var1", "var2"....)
cat_var <-  c("var_a","var_b", ...)

pos1 <- which(colnames(prod_dev) %in% num_var)
pos2 <- which(colnames(prod_dev) %in% cat_var)

prod_model_tbl <- prod_dev %>% 
                mutate(age = 2016- as.numeric(substr(dob_yyyymmdd,1,4))) %>%
                mutate(msa_fg = ifelse(is.na(msacode2000), 0, 1)) %>% 
                mutate(csa_fg = ifelse(is.na(csacode), 0, 1)) %>%
                mutate_each(funs(factor), pos2) %>%
                mutate_each(funs(as.numeric), pos1)

The code works if prod_dev is an R data frame, but using it on a Spark data frame does not produce the correct result:

> head(prod_model_tbl)


    Source:   query [?? x 99]
    Database: spark connection master=yarn-client app=sparklyr_test local=FALSE

    Error: org.apache.spark.sql.AnalysisException: undefined function     FACTOR; line 97 pos 2248 at org.apache.spark.sql.hive.HiveFunctionRegistry....

Can someone please advise how to make the desired changes to the Spark Data Frame?

Asked by b396958 on Dec 21 '16



1 Answer

In general you can use standard R generic functions for type casting. For example:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")  # or your existing Spark connection

df <- data.frame(x = c(1, NA), y = c("-1", "2"))

copy_to(sc, df, "df", overwrite = TRUE) %>% 
  mutate(x_char = as.character(x)) %>% 
  mutate(y_numeric = as.numeric(y))

Source:   query [2 x 4]
Database: spark connection master=...

      x     y x_char y_numeric
  <dbl> <chr>  <chr>     <dbl>
1     1    -1    1.0        -1
2   NaN     2   <NA>         2

The problem is Spark doesn't provide any direct equivalent of R factor.

In Spark SQL, categorical variables are represented using a double type plus column metadata, and the encoding is handled by ML Transformers, which are not part of SQL. Therefore there is no place for factor / as.factor. SparkR provides some automatic conversions when working with ML, but I am not sure if there is a similar mechanism in sparklyr (the closest thing I am aware of is ml_create_dummy_variables).
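If you need a factor-like encoding on the Spark side, one option is sparklyr's ft_string_indexer(), which wraps Spark ML's StringIndexer and maps string categories to numeric indices. The following is only a sketch: it assumes a live local connection sc and an illustrative string column var_a, and the argument names (input_col / output_col) vary across sparklyr versions, so check your installed version:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Hypothetical toy data standing in for one of the cat_var columns
cats <- copy_to(sc, data.frame(var_a = c("red", "blue", "red")), "cats")

# StringIndexer assigns one numeric index per distinct category,
# ordered by frequency (most frequent category gets index 0)
cats %>%
  ft_string_indexer(input_col = "var_a", output_col = "var_a_idx")
```

Unlike as.factor, this does not attach R-style levels to the column; it just produces a numeric index column that Spark ML pipelines can consume.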

Answered by zero323 on Oct 22 '22