I have a newbie question on Spark and pandas. I would like to use pandas, NumPy, etc. with Spark, but when I import the library I get an error. Can you help me, please? This is my code:
from pyspark import SparkContext, SQLContext
from pyspark import SparkConf
import pandas
# Config
conf = SparkConf().setAppName("Script")
sc = SparkContext(conf=conf)
log4j = sc._jvm.org.apache.log4j
log4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)
sqlCtx = SQLContext(sc)
# Importation of csv out of HDFS
data_name = "file_on_hdfs.csv"
data_textfile = sc.textFile(data_name)
This is the error:
ImportError: No module named pandas
How can I use pandas? It's not running in local mode.
Spark has its own DataFrame object, which can be created from RDDs.
You can still use libraries such as pandas and NumPy, but you must install them first on every node that runs your Python code (the driver and all workers), e.g. with pip install pandas. The ImportError means pandas is simply not installed where your script executes.
You can use Apache Arrow for this problem.
Apache Arrow
It's an initial version, but it will become more powerful in the future (we'll see).
For installation: click