How do I convert this one row to a dataframe?
val oneRowDF = myDF.first // gives a Row, not a DataFrame
Thanks
In my answer, df1 is a DataFrame with schema [text: string, y: int], just for testing:
val df1 = sc.parallelize(List(("a", 1))).toDF("text", "y")
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(
  StructField("text", StringType, nullable = false) ::
  StructField("y", IntegerType, nullable = false) :: Nil)

val arr = df1.head(3) // Array[Row]
val dfFromArray = sqlContext.createDataFrame(sparkContext.parallelize(arr), schema)
You can also map the parallelized array and cast every row:
val dfFromArray = sparkContext.parallelize(arr)
  .map(row => (row.getString(0), row.getInt(1)))
  .toDF("text", "y")
In case of a single row, you can run:
val row = df1.head // a single Row
val dfFromRow = sparkContext.parallelize(Seq(row))
  .map(r => (r.getString(0), r.getInt(1)))
  .toDF("text", "y")
In Spark 2.0+, use SparkSession instead of SQLContext.
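Putting the above together for Spark 2.x, a minimal sketch (the session builder settings, sample data, and variable names here are illustrative, not from the original answer):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("rowsToDF").getOrCreate()
import spark.implicits._

val df1 = Seq(("a", 1), ("b", 2)).toDF("text", "y")

// Collect some rows back to the driver: Array[Row]
val arr = df1.head(2)

// Rebuild a DataFrame from the Array[Row] with an explicit schema
val schema = StructType(
  StructField("text", StringType, nullable = false) ::
  StructField("y", IntegerType, nullable = false) :: Nil)

val dfFromArray = spark.createDataFrame(spark.sparkContext.parallelize(arr), schema)
```

Note that `spark.sparkContext` replaces the bare `sc` of the Spark 1.x shell, and `spark.createDataFrame` replaces `sqlContext.createDataFrame`.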
You do not want to do that: if you just want a subpart of the whole DataFrame, use the limit API.
Example:
scala> val d=sc.parallelize(Seq((1,3),(2,4))).toDF
d: org.apache.spark.sql.DataFrame = [_1: int, _2: int]
scala> d.show
+---+---+
| _1| _2|
+---+---+
| 1| 3|
| 2| 4|
+---+---+
scala> d.limit(1)
res1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: int, _2: int]
scala> d.limit(1).show
+---+---+
| _1| _2|
+---+---+
| 1| 3|
+---+---+
Still, if you want to explicitly convert an Array[Row] to a DataFrame, you can do something like:
scala> val value=d.take(1)
value: Array[org.apache.spark.sql.Row] = Array([1,3])
scala> val asTuple=value.map(a=>(a.getInt(0),a.getInt(1)))
asTuple: Array[(Int, Int)] = Array((1,3))
scala> sc.parallelize(asTuple).toDF
res6: org.apache.spark.sql.DataFrame = [_1: int, _2: int]
Hence you can now show it as a DataFrame.
If you have a List<Row>, it can be used directly to create a DataFrame or Dataset<Row> using spark.createDataFrame(List<Row> rows, StructType schema), where spark is the SparkSession in Spark 2.x.
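A short sketch of this approach from Scala (the sample rows, schema, and session setup are illustrative; `asJava` converts the Scala list to the java.util.List the overload expects):

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("listToDF").getOrCreate()

// A java.util.List[Row], as required by this createDataFrame overload
val rows: java.util.List[Row] = List(Row("a", 1), Row("b", 2)).asJava

val schema = StructType(Seq(
  StructField("text", StringType, nullable = false),
  StructField("y", IntegerType, nullable = false)))

val df = spark.createDataFrame(rows, schema)
```

This skips the detour through an RDD entirely, which is convenient when the rows are already on the driver.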