I have a newbie question on Spark and pandas. I would like to use pandas, NumPy, etc. with Spark, but when I import the library I get an error. Can you help me, please? This is my code:
from pyspark import SparkContext, SQLContext
from pyspark import SparkConf
import pandas
# Config
conf = SparkConf().setAppName("Script")
sc = SparkContext(conf=conf)
log4j = sc._jvm.org.apache.log4j
log4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)
sqlCtx = SQLContext(sc)
# Importation of csv out of HDFS
data_name = "file_on_hdfs.csv"
data_textfile = sc.textFile(data_name)
This is the error:
ImportError: No module named pandas
How can I use pandas? It's not running in local mode.
Spark has its own DataFrame object, which can be created from RDDs.
You can still use libraries such as pandas and NumPy, but you must install them first on every node that runs your Python code (the driver and all workers), e.g. with pip install pandas. The ImportError means pandas is simply not installed where your script executes.
You can use Apache Arrow for this problem.
Apache Arrow
It's an initial version, but it will become more powerful in the future (we'll see).
For installation: click