I have a Spark DataFrame built through pyspark from a JSON file as

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)
users_df = sqlc.read.json('users.json')
Now I want to access the data for a chosen user, identified by its _id field. I can do

users_df[users_df._id == chosen_user].show()

and this prints the full Row for that user. But suppose I just want one specific field from the Row, say the user's gender; how would I obtain it?
Just filter and select:
result = users_df.where(users_df._id == chosen_user).select("gender")
or, equivalently, with col:
from pyspark.sql.functions import col
result = users_df.where(col("_id") == chosen_user).select(col("gender"))
Finally, a PySpark Row is just a tuple with some extensions, so you can, for example, flatMap:
result.rdd.flatMap(list).first()
or map with something like this:
result.rdd.map(lambda x: x.gender).first()