Hello, I use Spark with Python. I ran a basic count(*) query on a DataFrame as follows:
myquery = sqlContext.sql("SELECT count(*) FROM myDF")
The result is:
+--------+
|count(1)|
+--------+
|    3469|
+--------+
How can I save this value in order to perform further operations on it?
For instance, dividing 3469 by 24 (whatever 24 means).
Spark SQL is not a database but a Spark module for structured data processing. It provides DataFrames as its programming abstraction and can also act as a distributed SQL query engine.
The COUNT(*) function counts all rows in the table, including rows that contain NULL values.
In Spark, the count() action returns the number of elements in the dataset (the number of rows, for a DataFrame).
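To make the NULL behaviour concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a Spark table (an assumption made only so the snippet runs without a cluster; the table and column names are made up). COUNT(*) versus COUNT(column) is standard SQL behaviour, and Spark SQL treats NULLs the same way:

```python
import sqlite3

# In-memory SQLite table used only as a stand-in for a Spark table.
# The table name and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, value TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(1, "a"), (2, None), (3, "c")],  # the second row has a NULL value
)

# COUNT(*) counts every row; COUNT(value) skips rows where value is NULL.
all_rows, non_null = conn.execute(
    "SELECT COUNT(*), COUNT(value) FROM my_table"
).fetchone()
print(all_rows, non_null)
conn.close()
```

So if the goal is "how many rows are there", COUNT(*) is the right choice; COUNT(some_column) answers a different question.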
The sql function on a SparkSession lets applications run SQL queries programmatically and returns the result as a DataFrame. Find the full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
Suppose your query returns a DataFrame like this:
+-----+
|count|
+-----+
|3469 |
+-----+
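You need to get the first (and only) row, and then its (only) field 'count'. Note that with the query from the question the column is named count(1) unless you alias it, so either use that name or index the row by position with [0].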
count = dataframe.first()['count']
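Put together, pulling the number out and dividing it might look like the sketch below. The helper names single_value and per_bucket are hypothetical, not part of the Spark API; the only Spark behaviour relied on is that DataFrame.first() returns a Row (or None for an empty result) and that a Row can be indexed by position. The helper is duck-typed, so it can be tried without a Spark session:

```python
def single_value(df):
    """Return the only cell of a one-row, one-column result as a plain Python value.

    `df` is expected to behave like a pyspark.sql.DataFrame, i.e. to have a
    first() method that returns a Row (or None when the result is empty).
    """
    row = df.first()
    if row is None:
        raise ValueError("query returned no rows")
    # Positional access avoids depending on the column name ('count(1)', 'count', ...).
    return row[0]


def per_bucket(total, buckets=24):
    """Divide a total evenly over `buckets`, using ordinary Python float division."""
    return total / buckets


# Usage with the query from the question (needs a running Spark session):
#   total = single_value(sqlContext.sql("SELECT count(*) FROM myDF"))
#   print(per_bucket(total))
```

Once the value is a plain Python int, any further arithmetic happens on the driver and no longer involves Spark.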
Given that you have a DataFrame like this:
+-----+
|count|
+-----+
|3469 |
+-----+
You can perform mathematical operations on columns and create new columns, or overwrite existing ones, using the .withColumn API. In PySpark, df.count is the count() method, so the column has to be referenced as df['count']:
df.withColumn('divided', df['count'] / 24).show(truncate=False)
You should get
+-----+------------------+
|count|divided           |
+-----+------------------+
|3469 |144.54166666666666|
+-----+------------------+