
How to extract a single (column/row) value from a dataframe using PySpark?

Here's my Spark code. It works fine and returns 2517. All I want to do is print "2517 degrees", but I'm not sure how to extract that 2517 into a variable. I can only display the dataframe, not extract values from it. It sounds super easy, but unfortunately I'm stuck! Any help will be appreciated. Thanks!

df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").option("delimiter", "\t").load("dbfs:/databricks-datasets/power-plant/data")

df.createOrReplaceTempView("MyTable")

df = spark.sql("SELECT COUNT (DISTINCT AP) FROM MyTable")

display(df)
Chadee Fouad asked Mar 28 '19 20:03




2 Answers

Here is an alternative:

df.first()['column name']

It returns the value of that column from the first row, and you can store it in a variable.
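
To make this concrete for the query in the question, here is a minimal sketch. It assumes an existing SparkSession named spark and the MyTable temp view from the question. The helper first_value, the demo function, and the distinct_ap alias are illustrative names, not part of PySpark or the original post. Aliasing the aggregate means you do not have to guess the column name Spark generates.

```python
def first_value(df, column=None):
    """Return one value from the first row of a DataFrame, or None if it is empty."""
    row = df.first()  # a Row object, or None when the DataFrame has no rows
    if row is None:
        return None
    # Index by position when no column name is given, otherwise by name.
    return row[0] if column is None else row[column]


def demo(spark):
    # Assumption: `spark` is an existing SparkSession and the "MyTable"
    # temp view from the question has already been registered.
    df = spark.sql("SELECT COUNT(DISTINCT AP) AS distinct_ap FROM MyTable")
    value = first_value(df, "distinct_ap")
    print(f"{value} degrees")
```

Calling demo(spark) in the same notebook as the question's code should print the count followed by "degrees". Since first() only fetches a single row, it is a reasonable fit for one-row aggregates like this.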

Jagdish Singh answered Nov 10 '22 22:11



I think you're looking for collect. Something like this should get you the value:

df.collect()[0]['count(DISTINCT AP)']

This assumes the column name is 'count(DISTINCT AP)'.
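
If you would rather not hard-code the generated name, one option is to read it from df.columns. The following is a rough sketch: only_value is a hypothetical helper, not a PySpark function. Keep in mind that collect() brings every row back to the driver, so this only makes sense for small results such as a one-row aggregate.

```python
def only_value(df):
    """Return the single value of a one-row DataFrame, using its first column."""
    name = df.columns[0]  # whatever name Spark gave the first column
    rows = df.collect()   # pulls all rows to the driver as a list of Row objects
    if len(rows) != 1:
        raise ValueError(f"expected exactly one row, got {len(rows)}")
    return rows[0][name]
```

With the question's code, value = only_value(spark.sql("SELECT COUNT (DISTINCT AP) FROM MyTable")) followed by print(f"{value} degrees") should give the desired text.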

Patrick answered Nov 10 '22 23:11
