I have a MySQL table with the following schema:
id-int
path-varchar
info-json {"name":"pat", "address":"NY, USA"....}
I used the JDBC driver to connect PySpark to MySQL. I can retrieve data from MySQL using
df = sqlContext.sql("select * from dbTable")
This query works fine. My question is: how can I query on the "info" column? For example, the query below works in the MySQL shell and returns data, but the same syntax is not supported in PySpark (2+).
select id, info->"$.name" from dbTable where info->"$.name"='pat'
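from pyspark.sql.functions import get_json_object
# Extract the "name" field from the JSON string in the info column
res = df.select(get_json_object(df['info'], "$.name").alias('name'))
# Keep only the rows whose "name" field equals 'pat'
res = df.filter(get_json_object(df['info'], "$.name") == 'pat')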
Spark SQL already has a built-in function for this, named get_json_object.
For your situation:
df = spark.read.jdbc(url='jdbc:mysql://localhost:3306', table='test.test_json',
                     properties={'user': 'hive', 'password': '123456'})
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView('test_json')
res = spark.sql("""
select col_json, get_json_object(col_json, '$.name') from test_json
""")
res.show()
Spark SQL is very similar to Hive SQL, so the Hive documentation is a useful reference:
https://cwiki.apache.org/confluence/display/Hive/Home