
get specific row from spark dataframe

Is there an alternative to R's df[100, c("column")] for Scala Spark DataFrames? I want to select a specific row from a column of a Spark DataFrame, for example the 100th row, as in the R code above.

nareshbabral asked Feb 06 '16

People also ask

How do I select a row in PySpark DataFrame?

Selecting rows using the filter() function: the first option you have when it comes to filtering DataFrame rows is the pyspark.sql.DataFrame.filter() function, which performs filtering based on the specified conditions.

How do I get the first row of a DataFrame in Spark?

In Spark/PySpark, you can use the show() action to get the top/first N (5, 10, 100, ...) rows of the DataFrame and display them on a console or in a log. There are also several Spark actions such as take(), tail(), collect(), head(), and first() that return the top or last N rows as a list of Rows (Array[Row] in Scala).
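As a rough pure-Python sketch (no Spark required; the list of tuples below is a made-up stand-in for collected DataFrame rows), the actions above map onto list slicing like this:

```python
rows = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]  # stand-in for a DataFrame's rows

n = 2
take_n = rows[:n]    # ~ df.take(n) / df.head(n): the first n rows
tail_n = rows[-n:]   # ~ df.tail(n): the last n rows
first = rows[0]      # ~ df.first(): the single first row

print(take_n)  # [('a', 1), ('b', 2)]
print(tail_n)  # [('c', 3), ('d', 4)]
print(first)   # ('a', 1)
```

The key point is that all of these are positional actions on the driver side; none of them lets you address "row 100" directly inside a distributed DataFrame, which is what the answers below work around.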


2 Answers

First, you must understand that DataFrames are distributed, which means you can't access them in a typical procedural way; you must run an analysis first. Although you are asking about Scala, I suggest you read the PySpark documentation, because it has more examples than any of the other language bindings.

Continuing with my explanation, I would use some methods of the RDD API, because every DataFrame exposes its underlying RDD as an attribute. Please see my example below, and notice how I take the 2nd record.

df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])
myIndex = 1
values = (df.rdd.zipWithIndex()
            .filter(lambda x: x[1] == myIndex)  # x is a (Row, index) pair
            .map(lambda x: tuple(x[0]))         # keep the Row, drop the index
            .collect())

print(values[0])
# ('b', 2)

Hopefully, someone gives another solution with fewer steps.
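Without a cluster at hand, the indexing logic of this answer can be sketched in plain Python. Here enumerate plays the role of zipWithIndex, with the caveat that it yields (index, element) pairs while zipWithIndex yields (element, index):

```python
rows = [("a", 1), ("b", 2), ("c", 3)]  # stand-in for the RDD's rows
my_index = 1

# pair each row with its position, keep only the one at my_index, drop the index
values = [row for i, row in enumerate(rows) if i == my_index]

print(values[0])  # ('b', 2)
```

The Spark version does exactly this, except the pairing and filtering run across partitions, and collect() brings the surviving row back to the driver.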

Alberto Bonsanto answered Sep 23 '22


This is how I achieved the same in Scala. I am not sure whether it is more efficient than the accepted answer, but it requires less code:

val parquetFileDF = sqlContext.read.parquet("myParquetFile.parquet")
val myRow7th = parquetFileDF.rdd.take(7).last
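To see why take(7).last yields the 7th row, here is a plain-Python analogue of the same slicing (the row values are made up for illustration):

```python
rows = ["row%d" % i for i in range(1, 11)]  # stand-in for the RDD's rows

# take(7) returns the first 7 rows; taking the last of those gives
# the 7th row overall (zero-based index 6)
my_row_7th = rows[:7][-1]

print(my_row_7th)  # 'row7'
```

Note that take(7) still pulls all 7 rows to the driver, so for a very large n the zipWithIndex approach above scales better.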
Ignacio Alorre answered Sep 23 '22