 

Use pandas with Spark

I have a noob question about Spark and pandas. I would like to use pandas, NumPy, etc. with Spark, but when I import a library I get an error. Can you help me, please? This is my code:

from pyspark import SparkContext, SQLContext
from pyspark import SparkConf
import pandas

# Config
conf = SparkConf().setAppName("Script")
sc = SparkContext(conf=conf)
log4j = sc._jvm.org.apache.log4j
log4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)
sqlCtx = SQLContext(sc)

# Import the CSV from HDFS
data_name = "file_on_hdfs.csv"
data_textfile = sc.textFile(data_name)

This is the error:

ImportError: No module named pandas

How can I use pandas? This is not a local-mode setup.

asked Jan 23 '17 by Zop

2 Answers

Spark has its own DataFrame object that can be created from RDDs.

You can still use libraries such as pandas and NumPy, but you must install them first, on the driver and on every worker node, since this is not a local-mode setup.
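As an illustration, here is a minimal sketch, assuming pandas has been installed on the driver (e.g. with pip install pandas) and using made-up column names, that parses the question's CSV RDD into a Spark DataFrame and pulls a small sample back to the driver as a pandas DataFrame:

from pyspark import SparkConf, SparkContext, SQLContext
import pandas  # succeeds only once pandas is installed on this machine

conf = SparkConf().setAppName("Script")
sc = SparkContext(conf=conf)
sqlCtx = SQLContext(sc)

# Read the CSV from HDFS and split each line into fields
data_textfile = sc.textFile("file_on_hdfs.csv")
rows = data_textfile.map(lambda line: line.split(","))

# Build a Spark DataFrame; the column names are hypothetical
df = sqlCtx.createDataFrame(rows, ["col1", "col2", "col3"])

# Bring only a small sample to the driver as a pandas DataFrame
sample_pdf = df.limit(100).toPandas()
print(sample_pdf.head())

The heavy work (reading and parsing) stays distributed in Spark; pandas only ever sees the limited sample that fits on the driver.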

answered Nov 13 '22 by AndreyF


You can use Apache Arrow for this problem.

Apache Arrow

It is still an initial version, but it will become more powerful in the future (we'll see).

For installation instructions, see the Apache Arrow documentation.
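A minimal sketch of how that looks from PySpark, assuming Spark 2.3 or later (where Arrow-backed conversion was introduced) and that both pandas and pyarrow are installed on the driver:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ArrowExample").getOrCreate()

# Tell Spark to use Arrow's columnar format for pandas conversion
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.read.csv("file_on_hdfs.csv", header=True, inferSchema=True)

# With Arrow enabled, toPandas() transfers whole columns at once
# instead of serializing row by row, which is considerably faster
pdf = df.toPandas()

Note that Arrow does not remove the need to install pandas; it only speeds up the hand-off between Spark's JVM side and Python.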

answered Nov 13 '22 by Beyhan Gul