
How to convert the datasets of Spark Row into string?

I have written the code to access the Hive table using SparkSQL. Here is the code:

SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark Hive Example")
        .master("local[*]")
        .config("hive.metastore.uris", "thrift://localhost:9083")
        .enableHiveSupport()
        .getOrCreate();
Dataset<Row> df =  spark.sql("select survey_response_value from health").toDF();
df.show();

How can I convert the complete output to a String or a String array? I need to pass the result to another module that accepts only String or String[] values.
I have tried calling .toString() and casting the values to String, but neither worked.
How can I convert the Dataset values to String?

Jaffer Wilson asked Feb 22 '17 10:02



1 Answer

Here is the sample code in Java.

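import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
            .builder()
            .appName("SparkSample")
            .master("local[*]")
            .getOrCreate();

        // Create a single-column DataFrame of strings
        List<String> myList = Arrays.asList("one", "two", "three", "four", "five");
        Dataset<Row> df = spark.createDataset(myList, Encoders.STRING()).toDF();
        df.show();

        // Option 1: df.as(...) works when the DataFrame has exactly one string column
        List<String> listOne = df.as(Encoders.STRING()).collectAsList();
        System.out.println(listOne);

        // Option 2: df.map(...) turns each Row into a String; the MapFunction cast
        // avoids an ambiguous-overload compile error on some Spark/Scala versions
        List<String> listTwo = df
            .map((MapFunction<Row, String>) row -> row.mkString(), Encoders.STRING())
            .collectAsList();
        System.out.println(listTwo);
    }
}

"row" is the parameter of a Java 8 lambda expression. For background, see developer.com/java/start-using-java-lambda-expressions.html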

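The question asks for a String or a String[], while collectAsList() returns a List<String>. The last step is plain Java and could look roughly like the sketch below. CollectedStrings and its two methods are hypothetical names used only for illustration, not part of Spark, and the hard-coded list stands in for the collected result. Keep in mind that collectAsList() pulls every row to the driver, so this only suits results that fit in memory.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Spark or the original answer.
final class CollectedStrings {
    private CollectedStrings() {}

    // Copies the collected values into a String[] in the same order.
    static String[] toArray(List<String> values) {
        return values.toArray(new String[0]);
    }

    // Joins the collected values into one String using the given separator.
    static String join(List<String> values, String separator) {
        return String.join(separator, values);
    }

    public static void main(String[] args) {
        // Stand-in for: df.as(Encoders.STRING()).collectAsList()
        List<String> collected = Arrays.asList("one", "two", "three");

        String[] asArray = toArray(collected);
        String asText = join(collected, ",");

        System.out.println(Arrays.toString(asArray)); // prints [one, two, three]
        System.out.println(asText);                   // prints one,two,three
    }
}
```

Either result can then be passed to a module that accepts only String or String[] arguments.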
abaghel answered Oct 02 '22 11:10