I am trying to process a month's worth of website traffic, which is stored in an S3 bucket as JSON (one JSON object per line/website traffic hit). The data is large enough that I can't ask Spark to infer the schema (OOM errors). If I specify the schema explicitly, it loads fine. But the issue is that the fields contained in each JSON object differ, so even if I build a schema from one day's worth of traffic, the monthly schema will be different (more fields) and my Spark job fails.
So I'm curious how others deal with this issue. I could, for example, use a traditional RDD map-reduce job to extract the fields I'm interested in, export, and then load everything into a dataframe. But that is slow and seems a bit self-defeating.
I've found a similar question here but no relevant info for me.
Thanks.
If you know the fields you're interested in, just provide a subset of the schema. The JSON reader can gracefully ignore unexpected fields. Let's say your data looks like this:
import json
import tempfile
object = {"foo": {"bar": {"x": 1, "y": 1}, "baz": [1, 2, 3]}}
_, f = tempfile.mkstemp()
with open(f, "w") as fw:
json.dump(object, fw)
and you're interested only in foo.bar.x and foo.bar.z (which doesn't exist):
from pyspark.sql.types import StructType
schema = StructType.fromJson({
    'type': 'struct',
    'fields': [{
        'name': 'foo', 'nullable': True, 'metadata': {},
        'type': {
            'type': 'struct',
            'fields': [{
                'name': 'bar', 'nullable': True, 'metadata': {},
                'type': {
                    'type': 'struct',
                    'fields': [
                        {'name': 'x', 'nullable': True, 'metadata': {}, 'type': 'long'},
                        {'name': 'z', 'nullable': True, 'metadata': {}, 'type': 'double'},
                    ],
                },
            }],
        },
    }],
})
df = spark.read.schema(schema).json(f)
df.show()
## +----------+
## | foo|
## +----------+
## |[[1,null]]|
## +----------+
df.printSchema()
## root
## |-- foo: struct (nullable = true)
## | |-- bar: struct (nullable = true)
## | | |-- x: long (nullable = true)
## | | |-- z: double (nullable = true)
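If you find the fromJson dictionary hard to read, the same subset schema can be built with the StructType/StructField constructors. This is just an alternative way to write the schema above, not something required by the JSON reader:

from pyspark.sql.types import StructType, StructField, LongType, DoubleType

# Equivalent subset schema built with constructors instead of fromJson
schema = StructType([
    StructField("foo", StructType([
        StructField("bar", StructType([
            StructField("x", LongType(), True),    # present in the data
            StructField("z", DoubleType(), True),  # absent; read back as null
        ]), True),
    ]), True),
])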
If you do let Spark infer the schema, you can also reduce the sampling ratio used for inference to improve overall performance.
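For example, a minimal sketch assuming a reasonably recent Spark version and a hypothetical input path, which infers the schema from roughly 10% of the records instead of scanning everything:

# samplingRatio controls what fraction of the input rows is scanned
# when inferring the JSON schema (default 1.0 = scan all rows)
df = spark.read.json("s3://bucket/traffic/*.json", samplingRatio=0.1)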