I want to use word2vec with PySpark to process some data.
I was previously using Google's pre-trained model GoogleNews-vectors-negative300.bin with gensim in Python.
Is there a way to load this .bin file with mllib.word2vec?
Or does it make more sense to export the data from Python as a dictionary {word : [vector]} (or a .csv file) and then load it in PySpark?
Thanks
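Spark 3.x can read binary files through the binaryFile data source:
spark.read.format("binaryFile").option("pathGlobFilter", "*.bin").load("/path/to/data")
However, this only returns each file as a single row of raw bytes, so you would still have to parse the word2vec format yourself. Exporting the vectors from gensim is therefore the recommended route: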
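# Export the gensim KeyedVectors (gensim 4.x API) as a ";"-separated file with a header row
import csv
with open("stored_model.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(["word"] + [f"v{i}" for i in range(trained_model.vector_size)])
    writer.writerows([w] + trained_model[w].tolist() for w in trained_model.index_to_key)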
Then load the exported file in PySpark:
df = spark.read.load("stored_model.csv",
                     format="csv",
                     sep=";",
                     inferSchema="true",
                     header="true")
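For reference, here is a rough sketch of what "processing the binary data" would involve if you read the .bin file directly. It assumes the layout written by the original word2vec tool: an ASCII header line with the vocabulary size and vector size, then for each word its bytes followed by a space and the vector as little-endian float32 values, with a newline closing each record. The function name read_word2vec_binary and the sample data are made up for illustration, and I have not checked this against the actual GoogleNews file.

```python
import io
import struct


def read_word2vec_binary(stream):
    """Parse word2vec binary data from a file-like object into {word: [floats]}."""
    # Header line: b"<vocab_size> <vector_size>\n" in ASCII.
    vocab_size, vector_size = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        # The word is stored as raw bytes terminated by a single space.
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("unexpected end of data while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that closes the previous record
                chars.extend(ch)
        word = chars.decode("utf-8", errors="replace")
        # Assumption: the vector is vector_size little-endian float32 values.
        raw = stream.read(4 * vector_size)
        if len(raw) != 4 * vector_size:
            raise EOFError("unexpected end of data while reading a vector")
        vectors[word] = list(struct.unpack(f"<{vector_size}f", raw))
    return vectors


# Tiny hand-made sample (hypothetical data, not taken from the GoogleNews model).
sample = (
    b"2 3\n"
    + b"cat " + struct.pack("<3f", 0.5, -1.0, 2.0) + b"\n"
    + b"dog " + struct.pack("<3f", 1.5, 0.25, -4.0) + b"\n"
)
vectors = read_word2vec_binary(io.BytesIO(sample))
print(vectors["cat"])  # [0.5, -1.0, 2.0]
```

This runs on a single machine and builds the whole dictionary in memory, so for a model of this size the gensim export above is still the more practical option.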