Querying json object in dataframe using Pyspark

Question

I have a MySql table with following schema:

id-int
path-varchar
info-json {"name":"pat", "address":"NY, USA"....}

I used JDBC driver to connect pyspark to MySql. I can retrieve data from mysql using

df = sqlContext.sql("select * from dbTable")

This query works all fine. My question is, how can I query on "info" column? For example, below query works all fine in MySQL shell and retrieve data but this is not supported in Pyspark (2+).

select id, info->"$.name" from dbTable where info->"$.name"='pat'

Zhang Tong · Accepted Answer

from pyspark.sql.functions import *
res = df.select(get_json_object(df['info'],"$.name").alias('name'))
res = df.filter(get_json_object(df['info'], "$.name") == 'pat')

There is already a function named get_json_object

For your situation:

df = spark.read.jdbc(url='jdbc:mysql://localhost:3306', table='test.test_json',
                     properties={'user': 'hive', 'password': '123456'})
df.createOrReplaceTempView('test_json')
res = spark.sql("""
select col_json,get_json_object(col_json,'$.name') from test_json
""")
res.show()

Spark sql is almost like HIVE sql, you can see

https://cwiki.apache.org/confluence/display/Hive/Home

Querying json object in dataframe using Pyspark

Tags:

python

json

mysql

apache-spark

pyspark

ciri

1 Answers

Zhang Tong

Recent Activity

Donate For Us

Querying json object in dataframe using Pyspark

Tags:

python

json

mysql

apache-spark

pyspark

ciri

1 Answers

Zhang Tong

Related questions

Recent Activity

Donate For Us