Querying a JSON object in a DataFrame using PySpark

I have a MySQL table with the following schema:

id   - int
path - varchar
info - json, e.g. {"name": "pat", "address": "NY, USA", ...}

I used the JDBC driver to connect PySpark to MySQL. I can retrieve data from MySQL using:

df = sqlContext.sql("select * from dbTable")

This query works fine. My question is: how can I query the "info" column? For example, the query below works fine in the MySQL shell and retrieves data, but it is not supported in PySpark (2+):

select id, info->"$.name" from dbTable where info->"$.name"='pat'
asked Jan 10 '17 by ciri

1 Answer

There is already a built-in function for this: get_json_object.

from pyspark.sql.functions import get_json_object

# Extract the "name" field from the JSON column
res = df.select(get_json_object(df['info'], "$.name").alias('name'))

# Filter rows where the "name" field equals 'pat'
res = df.filter(get_json_object(df['info'], "$.name") == 'pat')
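The two calls can also be combined to reproduce the original MySQL query (both the projection and the WHERE filter). A minimal sketch, assuming df is the DataFrame loaded from your dbTable:

from pyspark.sql.functions import get_json_object

# Equivalent of: select id, info->"$.name" from dbTable where info->"$.name" = 'pat'
res = (df
       .filter(get_json_object(df['info'], "$.name") == 'pat')
       .select(df['id'], get_json_object(df['info'], "$.name").alias('name')))
res.show()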


For your situation:

# Load the MySQL table over JDBC
df = spark.read.jdbc(url='jdbc:mysql://localhost:3306', table='test.test_json',
                     properties={'user': 'hive', 'password': '123456'})

# Register the DataFrame as a temp view so it can be queried with Spark SQL
df.createOrReplaceTempView('test_json')
res = spark.sql("""
select col_json, get_json_object(col_json, '$.name') from test_json
""")
res.show()
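get_json_object also works inside a WHERE clause, so the filter from the original MySQL query can be expressed entirely in Spark SQL. A sketch against the same test_json view, assuming some rows have name = 'pat':

res = spark.sql("""
select col_json, get_json_object(col_json, '$.name') as name
from test_json
where get_json_object(col_json, '$.name') = 'pat'
""")
res.show()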

Spark SQL is largely compatible with Hive SQL; see:

https://cwiki.apache.org/confluence/display/Hive/Home

answered Oct 12 '22 by Zhang Tong