In Scala, I can create a single-row DataFrame from an in-memory string like so:
val stringAsList = List("buzz")
import sqlContext.implicits._  // provides .toDF on RDDs
val df = sqlContext.sparkContext.parallelize(stringAsList).toDF("fizz")
df.show()
When df.show() runs, it outputs:
+-----+
| fizz|
+-----+
| buzz|
+-----+
Now I'm trying to do this from inside a Java class. Apparently JavaRDDs don't have a toDF(String) method. I've tried:
List<String> stringAsList = new ArrayList<String>();
stringAsList.add("buzz");
SQLContext sqlContext = new SQLContext(sparkContext);
DataFrame df = sqlContext.createDataFrame(sparkContext
.parallelize(stringAsList), StringType);
df.show();
...but still seem to be coming up short. Now when df.show() executes, I get:
++
||
++
||
++
(An empty DataFrame.) So I ask: using the Java API, how do I read an in-memory string into a DataFrame that has only 1 row and 1 column, and also specify the name of that column, so that df.show() is identical to the Scala output above?
Let's create a DataFrame from a list of Row objects. First populate the list with Row objects, then create the StructFields and add them to a list. Pass that list into the createStructType function, and pass the resulting schema into the createDataFrame function, as sketched below.
To create a new Row, use RowFactory.create() in Java or Row.apply() in Scala. A Row object is constructed by providing its field values.
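A minimal sketch of that approach in Java, assuming Spark 2.x's SparkSession (the class name, appName, and local master here are just placeholders; on Spark 1.x the same pattern works with SQLContext.createDataFrame on a parallelized JavaRDD<Row>):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SingleRowDataFrame {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("single-row-df")
                .master("local[*]")
                .getOrCreate();

        // One Row holding the single string value.
        List<Row> rows = Arrays.asList(RowFactory.create("buzz"));

        // One StructField named "fizz" of type String.
        List<StructField> fields = Arrays.asList(
                DataTypes.createStructField("fizz", DataTypes.StringType, false));
        StructType schema = DataTypes.createStructType(fields);

        // createDataFrame takes the rows and the schema directly.
        Dataset<Row> df = spark.createDataFrame(rows, schema);
        df.show();  // one column named "fizz" containing the single row "buzz"

        spark.stop();
    }
}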
Building on what @jgp suggested: if you want to do this for mixed types, you can do:
// Assumes a JavaSparkContext named sparkContext and a SparkSession named spark.
List<Tuple2<Integer, Boolean>> mixedTypes = Arrays.asList(
        new Tuple2<>(1, false),
        new Tuple2<>(1, false),
        new Tuple2<>(1, false));

// From Java, Tuple2 fields are read via the _1() and _2() accessor methods.
JavaRDD<Row> rowRDD = sparkContext.parallelize(mixedTypes)
        .map(row -> RowFactory.create(row._1(), row._2()));

StructType mySchema = new StructType()
        .add("id", DataTypes.IntegerType, false)
        .add("flag", DataTypes.BooleanType, false);

Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, mySchema);
This might help with @jdk2588's question.
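As a side note, not taken from the answers above: for the original one-string, one-column case you can skip Row objects entirely by building a Dataset<String> with an Encoder and renaming the column, which mirrors the Scala toDF("fizz") call. A sketch assuming Spark 2.x with an existing SparkSession named spark:

// Needed imports: java.util.Collections and
// org.apache.spark.sql.{Dataset, Encoders, Row}.
Dataset<Row> df = spark
        .createDataset(Collections.singletonList("buzz"), Encoders.STRING())
        .toDF("fizz");  // rename the default "value" column to "fizz"
df.show();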