From a PySpark SQL dataframe like
name  age  city
abc   20   A
def   30   B
How do I get the last row? (With df.limit(1) I can get the first row of the dataframe into a new dataframe.)
And how can I access dataframe rows by index, like row no. 12 or 200?
In pandas I can do
df.tail(1) # for last row
df.ix[rowno or index] # by index
df.loc[] # or df.iloc[]
I am just curious how to access pyspark dataframe in such ways or alternative ways.
Thanks
Use the tail() action to get the last N rows from a DataFrame; it returns a list of Row objects in PySpark and an Array[Row] in Spark with Scala.
If any value from the column will do, take(1) is fine. If you have some kind of order, you can sort in descending order and take the first row.
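For example, a minimal sketch of both ideas against the example dataframe above (tail() assumes Spark 3.0+, and "age" is just an arbitrary orderable column from the sample data):
from pyspark.sql.functions import col

last_rows = df.tail(1)                                 # list with a single Row, collected to the driver
last_by_age = df.orderBy(col("age").desc()).limit(1)   # DataFrame holding the "last" row by age
last_by_age.show()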
How to get the last row.
Long and ugly way which assumes that all columns are orderable:
from pyspark.sql.functions import (
    col, max as max_, struct, monotonically_increasing_id
)

last_row = (df
    .withColumn("_id", monotonically_increasing_id())
    .select(max_(struct("_id", *df.columns)).alias("tmp"))
    .select(col("tmp.*"))
    .drop("_id"))
If not all columns can be ordered you can try:
with_id = df.withColumn("_id", monotonically_increasing_id())
i = with_id.select(max_("_id")).first()[0]    # largest generated id
with_id.where(col("_id") == i).drop("_id")    # the row carrying that id
Note: there is a last function in pyspark.sql.functions / o.a.s.sql.functions, but considering the description of the corresponding expressions it is not a good choice here.
how can I access the dataframe rows by index, like row no. 12 or 200
You cannot. A Spark DataFrame is distributed and not accessible by index. You can add indices using zipWithIndex and filter later. Just keep in mind that this is an O(N) operation.
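A minimal sketch of that approach (the helper name with_row_index and the column name idx are only for illustration; assumes an active SparkSession called spark):
from pyspark.sql import Row
from pyspark.sql.functions import col

def with_row_index(df):
    # zipWithIndex pairs each row with a 0-based position; this scans the whole dataset
    indexed = df.rdd.zipWithIndex().map(
        lambda pair: Row(**pair[0].asDict(), idx=pair[1])
    )
    return spark.createDataFrame(indexed)

with_row_index(df).where(col("idx") == 12).show()   # "row no. 12" (0-based)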
How to get the last row.
If you have a column that you can use to order the dataframe, for example "index", then one easy way to get the last record is with SQL: 1) order your table in descending order and 2) take the first value from that order:
df.createOrReplaceTempView("table_df")
query_latest_rec = """SELECT * FROM table_df ORDER BY index DESC LIMIT 1"""
latest_rec = spark.sql(query_latest_rec)   # spark = active SparkSession
latest_rec.show()
And how can I access dataframe rows by index, like row no. 12 or 200?
In a similar way you can get the record at any row number:
row_number = 12
df.createOrReplaceTempView("table_df")
query_latest_rec = """SELECT * FROM (SELECT * FROM table_df ORDER BY index ASC LIMIT {0}) ord_lim ORDER BY index DESC LIMIT 1"""
latest_rec = spark.sql(query_latest_rec.format(row_number))
latest_rec.show()
If you do not have an "index" column, you can create one using:
from pyspark.sql.functions import monotonically_increasing_id

# ids are monotonically increasing but not consecutive, which is still enough for ordering
df = df.withColumn("index", monotonically_increasing_id())
from pyspark.sql import functions as F

# take the "last" value of every column in one aggregation;
# without an explicit ordering the result depends on how Spark arranges the rows
expr = [F.last(col).alias(col) for col in df.columns]
df.agg(*expr)
Just a tip: it looks like you still have the mindset of someone working with pandas or R. Spark is a different paradigm in the way we work with data. You don't access data inside individual cells anymore; you work with whole chunks of it. If you keep collecting stuff and doing actions, like you just did, you lose the whole concept of parallelism that Spark provides. Take a look at the concept of transformations vs actions in Spark.
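As a rough illustration of that distinction (a minimal sketch; the columns come from the example dataframe above, and the age threshold is made up):
from pyspark.sql.functions import col

# Transformations are lazy: these lines only build an execution plan
adults = df.where(col("age") > 21)        # transformation (hypothetical filter)
names = adults.select("name", "city")     # transformation

# Actions trigger the actual distributed computation
names.show()             # action
total = names.count()    # action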