Is there an alternative to R's df[100, c("column")] for Scala Spark DataFrames?
I want to select a specific row from a column of a Spark DataFrame, for example the 100th row, as in the R code above.
Selecting rows using the filter() function: the first option for filtering DataFrame rows is pyspark.sql.DataFrame.filter(), which keeps only the rows that satisfy the given condition.
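Since the question asks about Scala, here is a minimal sketch of the equivalent Dataset.filter() call; the session setup, DataFrame, and column names are illustrative, not from the question:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("letter", "number")

    // filter() selects rows by value, not by position, so on its own
    // it cannot answer "give me the 100th row".
    df.filter($"number" > 1).show()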
In Spark/PySpark you can use the show() action to print the top N rows (5, 10, 100, ...) of a DataFrame to the console or a log. There are also actions such as take(), tail(), collect(), head(), and first() that return the first or last n rows as a list of Rows (Array[Row] in Scala).
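A short sketch of those actions in Scala, reusing the df from the sketch above (note that tail() is only available from Spark 3.0 onward):

    // take(n) and head(n) return the first n rows as Array[Row];
    // first() returns a single Row; tail(n) returns the last n rows (Spark 3.0+).
    val firstTwo = df.take(2)
    val firstRow = df.first()
    val lastTwo  = df.tail(2)
    df.show(2)  // prints the first 2 rows to the console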
Firstly, you must understand that DataFrames are distributed: you can't access them in a typical procedural way; you have to run an analysis first. Although you are asking about Scala, I suggest you read the PySpark documentation, because it has more examples than the other language bindings. Continuing with the explanation, I would use some methods of the RDD API, because every DataFrame exposes its underlying RDD as an attribute. See my example below, and notice how I take the 2nd record.
    df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])
    myIndex = 1
    values = (df.rdd.zipWithIndex()                     # pair each Row with its index
              .filter(lambda pair: pair[1] == myIndex)  # keep only the row at myIndex
              .map(lambda pair: tuple(pair[0]))         # drop the index, keep the record
              .collect())
    print(values[0])
    # ('b', 2)
Hopefully someone else can give a solution with fewer steps.
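For the Scala case the question actually asks about, here is a sketch of the same zipWithIndex approach; it assumes a DataFrame df like the one above, and the index value is illustrative:

    import org.apache.spark.sql.Row

    val myIndex = 99L  // zero-based, so this selects the 100th row
    val nthRow: Row = df.rdd
      .zipWithIndex()                              // RDD[(Row, Long)]
      .filter { case (_, idx) => idx == myIndex }  // keep only the target index
      .map { case (row, _) => row }                // drop the index again
      .first()

Note that without an explicit orderBy the row order of a DataFrame is not guaranteed, so "the 100th row" is only well defined relative to the current ordering and partitioning.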
This is how I achieved the same thing in Scala. I am not sure whether it is more efficient than the accepted answer, but it requires less code:
    val parquetFileDF = sqlContext.read.parquet("myParquetFile.parquet")
    val myRow7th = parquetFileDF.rdd.take(7).last
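Keep in mind that take(7) ships the first seven rows to the driver and then discards all but the last, so this stays cheap only for small indices; for a large n, the zipWithIndex approach above avoids collecting all the preceding rows.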