
Issue parsing a MongoDB collection that has multiple schemas in Spark

I'm moving data from one collection to another collection in a different cluster using Spark. The data's schema is not consistent: a single collection contains several schemas, with different data types and small variations between documents. When I try to read the data with Spark, the sampling step is unable to pick up all of the schemas in the data and throws the error below. (The schema is complex, so I can't specify it explicitly; I rely on Spark to infer it by sampling.)

com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast ARRAY into a NullType (value: BsonArray{values=[{ "type" : "GUEST_FEE", "appliesPer" : "GUEST_PER_NIGHT", "description" : null, "minAmount" : 33, "maxAmount" : 33 }]})

I tried reading the collection as an RDD and writing it back as an RDD, but the issue still persists.

Any help on this would be appreciated!

Thanks.

asked Jun 20 '18 by knowledge_seeker

People also ask

Does Spark support MongoDB?

The MongoDB Connector for Spark provides integration between MongoDB and Apache Spark. Version 10.x of the MongoDB Connector for Spark is an all-new connector based on the latest Spark API. Install and migrate to version 10.x.

What is Pyspark schema?

A Spark schema is the structure of a DataFrame or Dataset. You can define it using the StructType class, which is a collection of StructField objects that define the column name (String), column type (DataType), whether the column is nullable (Boolean), and metadata (MetaData).
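For example, a minimal schema definition in PySpark might look like the following sketch; the column names, sample row, and the spark SparkSession are illustrative, not taken from the question:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Illustrative schema: a nullable string column and a non-nullable integer column
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), False)
])

# Build a small DataFrame with that explicit schema and show it
df = spark.createDataFrame([("Alice", 30)], schema=schema)
df.printSchema()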

What is Spark and MongoDB?

Apache Spark is a powerful processing engine designed for speed, ease of use, and sophisticated analytics. Spark particularly excels when fast performance is required. MongoDB is a popular NoSQL database that enterprises rely on for real-time analytics from their operational data.


1 Answer

All these com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast SOME_TYPE into a NullType errors come from incorrect schema inference. For schema-less data sources such as JSON files or MongoDB, Spark scans a small fraction of the data to determine the types. If a particular field has lots of NULLs, you can get unlucky and its type will be inferred as NullType.

One thing you can do is increase the number of entries scanned for schema inference.
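With the MongoDB Spark connector, the number of documents sampled is controlled by the sampleSize read option (the default is 1000). The sketch below assumes the 2.x connector; the URI, database, and collection names are placeholders:

# Sketch: sample more documents during schema inference so rarely populated
# fields are seen with their real types. The URI is a placeholder.
df = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .option("uri", "mongodb://host:27017/mydb.mycollection") \
    .option("sampleSize", 100000) \
    .load()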

Another is to get the inferred schema first, fix it, and reload the DataFrame with the fixed schema:

import pyspark.sql.types

def fix_spark_schema(schema):
  # Walk the schema recursively and replace every NullType with StringType
  if schema.__class__ == pyspark.sql.types.StructType:
    return pyspark.sql.types.StructType([fix_spark_schema(f) for f in schema.fields])
  if schema.__class__ == pyspark.sql.types.StructField:
    return pyspark.sql.types.StructField(schema.name, fix_spark_schema(schema.dataType), schema.nullable)
  if schema.__class__ == pyspark.sql.types.NullType:
    return pyspark.sql.types.StringType()
  return schema

# First pass: read only to obtain the inferred schema
collection_schema = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .options(...) \
    .load() \
    .schema

# Second pass: reload the data using the corrected schema
collection = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .options(...) \
    .load(schema=fix_spark_schema(collection_schema))

In my case all problematic fields could be represented with StringType; you can make the logic more complex if needed.
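If the NullType shows up inside an array (as in the error above, where an ARRAY could not be cast into a NullType), the same idea can be extended to recurse into ArrayType and MapType element types as well. This deeper variant is my own sketch under that assumption, not part of the original answer:

from pyspark.sql import types as T

def fix_spark_schema_deep(schema):
    # Recurse into structs, arrays and maps; replace any NullType with StringType
    if isinstance(schema, T.StructType):
        return T.StructType([fix_spark_schema_deep(f) for f in schema.fields])
    if isinstance(schema, T.StructField):
        return T.StructField(schema.name, fix_spark_schema_deep(schema.dataType), schema.nullable)
    if isinstance(schema, T.ArrayType):
        return T.ArrayType(fix_spark_schema_deep(schema.elementType), schema.containsNull)
    if isinstance(schema, T.MapType):
        return T.MapType(fix_spark_schema_deep(schema.keyType),
                         fix_spark_schema_deep(schema.valueType),
                         schema.valueContainsNull)
    if isinstance(schema, T.NullType):
        return T.StringType()
    return schema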

answered Nov 02 '22 by vlyubin