Is there an alternative to R's df[100, c("column")] for Scala Spark DataFrames?
I want to select a specific row from a column of a Spark DataFrame, for example the 100th row, as in the R code above.
Selecting rows using the filter() function: the first option for filtering DataFrame rows is pyspark.sql.DataFrame.filter(), which keeps only the rows that satisfy the given condition.
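Since the question asks about Scala, here is a minimal sketch of the equivalent Dataset.filter() call; the session setup, DataFrame, and column names are illustrative, not from the question:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("letter", "number")

    // filter() selects rows by value, not by position, so on its own
    // it cannot answer "give me the 100th row".
    df.filter($"number" > 1).show()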
In Spark/PySpark you can use the show() action to print the top N rows (5, 10, 100, ...) of a DataFrame to the console or a log. There are also actions such as take(), tail(), collect(), head(), and first() that return the first or last n rows as a list of Rows (Array[Row] in Scala).
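A short sketch of those actions in Scala, reusing the df from the sketch above (note that tail() is only available from Spark 3.0 onward):

    // take(n) and head(n) return the first n rows as Array[Row];
    // first() returns a single Row; tail(n) returns the last n rows (Spark 3.0+).
    val firstTwo = df.take(2)
    val firstRow = df.first()
    val lastTwo  = df.tail(2)
    df.show(2)  // prints the first 2 rows to the console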
Firstly, you must understand that DataFrames are distributed: you can't access them in a typical procedural way; you have to run an analysis first. Although you are asking about Scala, I suggest you read the PySpark documentation, because it has more examples than the other language bindings. Continuing with the explanation, I would use some methods of the RDD API, because every DataFrame exposes its underlying RDD as an attribute. See my example below, and notice how I take the 2nd record.
    df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])
    myIndex = 1
    values = (df.rdd.zipWithIndex()                     # pair each Row with its index
              .filter(lambda pair: pair[1] == myIndex)  # keep only the row at myIndex
              .map(lambda pair: tuple(pair[0]))         # drop the index, keep the record
              .collect())
    print(values[0])
    # ('b', 2)
Hopefully someone else can give a solution with fewer steps.
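For the Scala case the question actually asks about, here is a sketch of the same zipWithIndex approach; it assumes a DataFrame df like the one above, and the index value is illustrative:

    import org.apache.spark.sql.Row

    val myIndex = 99L  // zero-based, so this selects the 100th row
    val nthRow: Row = df.rdd
      .zipWithIndex()                              // RDD[(Row, Long)]
      .filter { case (_, idx) => idx == myIndex }  // keep only the target index
      .map { case (row, _) => row }                // drop the index again
      .first()

Note that without an explicit orderBy the row order of a DataFrame is not guaranteed, so "the 100th row" is only well defined relative to the current ordering and partitioning.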
This is how I achieved the same thing in Scala. I am not sure whether it is more efficient than the accepted answer, but it requires less code:
    val parquetFileDF = sqlContext.read.parquet("myParquetFile.parquet")
    val myRow7th = parquetFileDF.rdd.take(7).last
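Keep in mind that take(7) ships the first seven rows to the driver and then discards all but the last, so this stays cheap only for small indices; for a large n, the zipWithIndex approach above avoids collecting all the preceding rows.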