Write PySpark dataframe to MongoDB inserting field as ObjectId

I was trying to write to MongoDB a Spark dataframe containing the string representation of an ObjectId, which is the _id of a document in another collection.

The problem is that PySpark has no support for ObjectId (Scala and Java ObjectId support is explained here: https://github.com/mongodb/mongo-spark/blob/master/doc/1-sparkSQL.md), so how can I insert an ObjectId into MongoDB from PySpark using the Spark Connector?
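
For illustration, a minimal sketch of the situation (the column name ref_id is hypothetical): a string column holding the hex form of another collection's _id is written by the connector as a plain BSON string, not as an ObjectId.

    # Hypothetical example: ref_id holds the string form of another
    # collection's _id. Written as-is through the connector, it lands
    # in MongoDB as a BSON string, not an ObjectId.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("5b8f7fe430c49e04fdb91599",)],
        ["ref_id"],
    )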

Asked by Luis A.G.

1 Answer

The accepted answer seems to be outdated as of today, but it led me to a working version. Thank you.

Here is my working version of the code:

import pyspark.sql.functions as sfunc
from pyspark.sql.types import StructType, StructField, StringType

# This user-defined function turns a string ID like "5b8f7fe430c49e04fdb91599"
# into the struct {"oid": "5b8f7fe430c49e04fdb91599"},
# which MongoDB will recognize as an ObjectId.
udf_struct_id = sfunc.udf(
    lambda x: (str(x),),
    StructType([StructField("oid", StringType(), True)])
)

df = df.withColumn('future_object_id_field', udf_struct_id('string_object_id_column'))
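
To complete the picture, here is a minimal write sketch using the connector's DataFrame writer; the URI, database, and collection names are placeholders, and it assumes the mongo-spark-connector package is on the Spark classpath:

    # Minimal write sketch. The URI, database, and collection names
    # below are assumptions; adjust them to your deployment.
    df.write \
        .format("com.mongodb.spark.sql.DefaultSource") \
        .mode("append") \
        .option("uri", "mongodb://localhost:27017/mydb.mycollection") \
        .save()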

My setup: MongoDB 4.0, the gettyimages/spark:2.3.1-hadoop-3.0 Docker image for Spark, Python 3.6.

The documentation for the PySpark MongoDB connector gave me the idea of naming the field oid, which is what the connector needs in order to write the value as an ObjectId.
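
As a side note, the same struct can be built without a Python UDF by using the built-in struct function; this is my own variation rather than part of the original answer, and it should be faster since it avoids serializing rows through Python:

    # UDF-free variant (my assumption, not from the original answer):
    # build the {"oid": ...} struct directly with the built-in struct()
    # and alias the inner field "oid" so the connector converts it to
    # an ObjectId.
    df = df.withColumn(
        'future_object_id_field',
        sfunc.struct(sfunc.col('string_object_id_column').alias('oid'))
    )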

Answered by Victor


