"unbound method textFile() must be called with SparkContext instance as first argument (got str instance instead)"

I am trying to use the sc.textFile() function to read a CSV file, but I am getting the error "unbound method textFile() must be called with SparkContext instance as first argument (got str instance instead)". I searched Stack Overflow for possible answers but couldn't find any. Please help.

Here is the script I am using in an IPython Notebook:

import os.path
import csv
from pyspark import SparkContext

basedir = os.path.join('data')
inputpath = os.path.join('train_set.csv')
filename = os.path.join(basedir, inputpath)
numpart = 2
sc = SparkContext
train_data = sc.textFile(filename, numpart)

Just to clarify, basedir ('data') is the folder where the CSV file resides.

asked Aug 10 '15 by Thomas Joseph

1 Answer

First, thanks for your question; it helped me find a solution to the same kind of problem. I want to answer your 11 Aug 2015 comment.

To see an explanation, have a look at what happens in the command shell behind the notebook.

When you first call:

sc = SparkContext()

you will see Spark being initialized, much like when you launch Spark from the command shell (where sc is initialized for you by default).
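Incidentally, this distinction between the class and an instance is also what produces the error message in the question: sc = SparkContext without parentheses binds the class itself, so the later textFile() call receives the filename string where Python 2 expects a SparkContext instance. A minimal sketch of the difference:

from pyspark import SparkContext

sc = SparkContext        # binds the class itself; no Spark context is started
# sc.textFile('data/train_set.csv') would now raise:
# TypeError: unbound method textFile() must be called with SparkContext
# instance as first argument (got str instance instead)

sc = SparkContext()      # calls the constructor, creating a real context
train_data = sc.textFile('data/train_set.csv')  # works as expected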

If you call SparkContext() a second time, you get the error you mention in your comment. So I would say you already had a Spark context sc initialized the first time you tried the suggestion (or maybe you ran the suggestion twice).

The reason it works when you remove the SparkContext definition is that sc is already defined; the next time you launch your IPython notebook, though, I think you will have to run sc = SparkContext() once.
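If you would rather not keep track of whether the initialization cell has already run, one option (just a sketch) is to guard the creation; newer PySpark releases also provide SparkContext.getOrCreate(), which returns the existing context instead of raising an error:

from pyspark import SparkContext

# Re-runnable initialization: only create a context if sc is not bound yet.
try:
    sc
except NameError:
    sc = SparkContext()

# In newer PySpark releases you can write this instead:
# sc = SparkContext.getOrCreate()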

To make it clearer, from an IPython Notebook point of view I would organize the code as follows:

One cell that you run once each time you restart the kernel, to set up your Python environment and initialize your Spark environment:

import os.path
import csv
from pyspark import SparkContext
sc = SparkContext()

A second cell that you run multiple times for testing purposes:

basedir = os.path.join('data')
inputpath = os.path.join('train_set.csv')
filename = os.path.join(basedir, inputpath)
numpart = 2
train_data = sc.textFile(filename, numpart)
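To confirm the file was actually read, you can add a quick sanity check at the end of that test cell, for example:

print(train_data.count())  # number of lines in the CSV file
print(train_data.take(2))  # first two lines, still raw unparsed strings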

But if you want everything in one cell, you could also call the stop() method on the SparkContext object at the end of your code:

# your initialization
import os.path
import csv
from pyspark import SparkContext
sc = SparkContext()
# your job
basedir = os.path.join('data')
inputpath = os.path.join('train_set.csv')
filename = os.path.join(basedir, inputpath)
numpart = 2
train_data = sc.textFile(filename, numpart)
# closing the SparkContext
sc.stop()
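If the job itself can fail, a slightly more robust variant of the same single-cell idea (again, just a sketch) wraps the work in try/finally so the context is always stopped:

import os.path
from pyspark import SparkContext

sc = SparkContext()
try:
    filename = os.path.join('data', 'train_set.csv')
    train_data = sc.textFile(filename, 2)
    print(train_data.count())
finally:
    sc.stop()  # release the context even if the job raised an error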

I really recommend the O'Reilly book Learning Spark: Lightning-Fast Big Data Analysis by Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia, particularly Chapter 2 for this kind of question about understanding core Spark concepts.

Maybe you already know this; apologies if so, but it might help someone else!

answered Sep 21 '22 by Pierre Cordier