I am trying to use the sc.textFile() function to read a CSV file, but I am getting the error "unbound method textFile() must be called with SparkContext instance as first argument (got str instance instead)". I searched Stack Overflow for possible answers but couldn't find any. Please help.
I am using the script below in an IPython Notebook:
import os.path
from pyspark import SparkContext
import csv
basedir = os.path.join('data')
inputpath = os.path.join('train_set.csv')
filename = os.path.join(basedir,inputpath)
numpart = 2
sc = SparkContext
train_data = sc.textFile(filename,numpart)
First, thanks for your question; I used it to find a solution to the same kind of problem. I want to respond to your 11 Aug 2015 comment.
Have a look at what happens in the command shell behind the notebook; it explains the behavior.
When you first call:
sc = SparkContext()
you will see Spark being initialized, much as when you launch Spark from the command shell. So sc is initialized (by default, when Spark launches).
If you call it a second time, you get the error you mention in your comment. So I would say you already had a SparkContext sc initialized the first time you tried the suggestion (or maybe you ran the suggestion twice).
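For illustration, here is a minimal sketch of that failure (the exact error wording is an assumption based on PySpark 1.x behavior and may differ in your version):
from pyspark import SparkContext
sc = SparkContext()   # first call: Spark starts up
sc = SparkContext()   # second call fails
# ValueError: Cannot run multiple SparkContexts at once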
The reason it works when you remove the SparkContext definition is that sc is already defined; but the next time you launch your IPython notebook, I think you will have to run sc = SparkContext() once.
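As an aside, if your PySpark version provides it (newer releases do; check yours), SparkContext.getOrCreate() sidesteps the whole issue by reusing a running context instead of raising an error:
from pyspark import SparkContext
# returns the existing context if one is running, otherwise creates a new one
sc = SparkContext.getOrCreate()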
To make it clearer: from an IPython Notebook point of view, I would organize the code as follows.
One cell that you run once each time you restart the kernel, to customize your Python environment and initialize your Spark environment:
import os.path
import csv
from pyspark import SparkContext

# create the SparkContext once per kernel session
sc = SparkContext()
A second cell that you run multiple times for testing purposes:
basedir = os.path.join('data')
inputpath = os.path.join('train_set.csv')
filename = os.path.join(basedir, inputpath)
numpart = 2  # minimum number of partitions for the resulting RDD
train_data = sc.textFile(filename, numpart)
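To check that the file actually loaded, you could follow this cell with a couple of standard RDD actions, and, since the script already imports csv, parse each line with it (a sketch that assumes a simple CSV with no embedded newlines):
print(train_data.count())   # number of lines in the file
print(train_data.take(2))   # first two lines, as raw strings
# parse each line into a list of fields with the csv module
parsed = train_data.map(lambda line: next(csv.reader([line])))
print(parsed.take(2))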
But if you want a single cell, you could also call the stop() method on the SparkContext object at the end of your code:
# your initialization
import os.path
import csv
from pyspark import SparkContext
sc = SparkContext()

# your job
basedir = os.path.join('data')
inputpath = os.path.join('train_set.csv')
filename = os.path.join(basedir, inputpath)
numpart = 2
train_data = sc.textFile(filename, numpart)

# closing the SparkContext
sc.stop()
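If you go this single-cell route, a try/finally wrapper (my own sketch, not part of the code above) guarantees that stop() runs even when the job raises, so the next run can create a fresh context:
import os.path
from pyspark import SparkContext

sc = SparkContext()
try:
    filename = os.path.join('data', 'train_set.csv')
    train_data = sc.textFile(filename, 2)
    print(train_data.count())
finally:
    # release the context even if the job above failed
    sc.stop()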
I really recommend the O'Reilly book Learning Spark: Lightning-Fast Big Data Analysis by Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia, particularly Chapter 2 for this kind of question about core Spark concepts.
Maybe you already know this by now; apologies if so, but it might help someone else!