
Creating a simple 1-row Spark DataFrame with Java API

In Scala, I can create a single-row DataFrame from an in-memory string like so:

val stringAsList = List("buzz")
val df = sqlContext.sparkContext.parallelize(stringAsList).toDF("fizz")
df.show()

When df.show() runs, it outputs:

+-----+
| fizz|
+-----+
| buzz|
+-----+

Now I'm trying to do this from inside a Java class. Apparently JavaRDDs don't have a toDF(String) method. I've tried:

List<String> stringAsList = new ArrayList<String>();
stringAsList.add("buzz");
SQLContext sqlContext = new SQLContext(sparkContext);
DataFrame df = sqlContext.createDataFrame(sparkContext
    .parallelize(stringAsList), StringType);
df.show();

...but I still seem to be coming up short. Now when df.show(); executes, I get:

++
||
++
||
++

(An empty DF.) So I ask: Using the Java API, how do I read an in-memory string into a DataFrame that has only 1 row and 1 column in it, and also specify the name of that column? (So that the df.show() is identical to the Scala one above)?

asked Oct 10 '16 by smeeb

People also ask

How do I create a DataFrame in Java Spark?

Create a DataFrame from a list of Row objects: populate a list with Row objects, build the StructFields and add them to a list, pass that list to createStructType, and then pass the resulting schema (along with the rows) to createDataFrame.

How do I create a row in Spark?

To create a new Row, use RowFactory.create() in Java or Row.apply() in Scala. A Row object can be constructed by providing field values.
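
The steps above can be sketched as follows. This is a minimal example, assuming Spark 2.x with a local SparkSession (the class name and master setting are illustrative, not from the question):

```java
import java.util.Collections;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class SingleRowExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("single-row")
                .getOrCreate();

        // One Row holding the single in-memory value, built via RowFactory.create()
        List<Row> rows = Collections.singletonList(RowFactory.create("buzz"));

        // Schema with a single, explicitly named string column
        StructType schema = new StructType()
                .add("fizz", DataTypes.StringType, false);

        Dataset<Row> df = spark.createDataFrame(rows, schema);
        df.show();

        spark.stop();
    }
}
```

This prints the same one-row, one-column table as the Scala snippet above.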


1 Answer

Building on what @jgp suggested: if you want to do this for mixed types, you can do:

List<Tuple2<Integer, Boolean>> mixedTypes = Arrays.asList(
                new Tuple2<>(1, false),
                new Tuple2<>(1, false),
                new Tuple2<>(1, false));

// sparkContext is a JavaSparkContext; map each tuple to a Row
JavaRDD<Row> rowRDD = sparkContext.parallelize(mixedTypes)
                .map(row -> RowFactory.create(row._1, row._2));

// Schema naming and typing the two tuple fields
StructType mySchema = new StructType()
                .add("id", DataTypes.IntegerType, false)
                .add("flag", DataTypes.BooleanType, false);

// createDataFrame already returns a Dataset<Row>, so no toDF() is needed
Dataset<Row> df = spark.createDataFrame(rowRDD, mySchema);

This might also help with @jdk2588's question.
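
For the original single-column case there is also a shorter route. This is a sketch, assuming Spark 2.x where SparkSession.createDataset and Encoders are available (the class name and master setting are illustrative): build a Dataset&lt;String&gt; from the in-memory list, then rename its default "value" column with toDF:

```java
import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OneColumnExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("one-column")
                .getOrCreate();

        // Dataset<String> gets a default column named "value"; toDF renames it
        Dataset<Row> df = spark
                .createDataset(Collections.singletonList("buzz"), Encoders.STRING())
                .toDF("fizz");
        df.show();

        spark.stop();
    }
}
```

This avoids building a Row and StructType by hand when there is only one column of a primitive type.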

answered Sep 22 '22 by cauchy_cat