
reading a file in hdfs from pyspark

I'm trying to read a file in my HDFS. Here's my Hadoop file structure:

hduser@GVM:/usr/local/spark/bin$ hadoop fs -ls -R /
drwxr-xr-x   - hduser supergroup          0 2016-03-06 17:28 /inputFiles
drwxr-xr-x   - hduser supergroup          0 2016-03-06 17:31 /inputFiles/CountOfMonteCristo
-rw-r--r--   1 hduser supergroup    2685300 2016-03-06 17:31 /inputFiles/CountOfMonteCristo/BookText.txt

Here's my pyspark code:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("myFirstApp").setMaster("local")
sc = SparkContext(conf=conf)

textFile = sc.textFile("hdfs://inputFiles/CountOfMonteCristo/BookText.txt")

The error I get is:

Py4JJavaError: An error occurred while calling o64.partitions.
: java.lang.IllegalArgumentException: java.net.UnknownHostException: inputFiles

Is this because I'm setting up my SparkContext incorrectly? I'm running this in an Ubuntu 14.04 virtual machine through VirtualBox.

I'm not sure what I'm doing wrong here.

user1357015 asked Mar 07 '16 03:03



3 Answers

There are two general ways to read files in Spark: one for huge distributed files that are processed in parallel, and one for small files such as lookup tables and configuration stored on HDFS. For the latter, you might want to read the file on the driver node or on the workers as a single read (not a distributed read). In that case, use the SparkFiles module as shown below.

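# spark is a SparkSession instance
import json
from pyspark import SparkFiles

# SparkFiles.get only resolves files that were registered with addFile first;
# the HDFS path here is a placeholder, not taken from the original answer.
spark.sparkContext.addFile('hdfs:///path/to/myfile.json')
with open(SparkFiles.get('myfile.json'), 'rb') as handle:
    j = json.load(handle)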
bekce answered Oct 26 '22 03:10


If no default file system is configured, you can access HDFS files by their full URI, of the form hdfs://namenodehost/path (namenodehost is localhost if HDFS runs in your local environment).
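As a rough illustration of the full-path form, here is a minimal sketch that builds such a URI for the file from the question. The helper name `hdfs_uri` is hypothetical, `localhost` assumes HDFS runs on the same machine, and the optional port is an assumption that would have to match the `fs.defaultFS` value in your `core-site.xml`.

```python
def hdfs_uri(path, host="localhost", port=None):
    """Build a fully qualified HDFS URI for an absolute path (hypothetical helper)."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    # Only append the port when one is given explicitly.
    authority = host if port is None else f"{host}:{port}"
    return f"hdfs://{authority}{path}"


book = hdfs_uri("/inputFiles/CountOfMonteCristo/BookText.txt")
print(book)

# With the SparkContext from the question, the read would then be:
# textFile = sc.textFile(book)
```

The point is only that the host comes directly after `hdfs://`, and the HDFS path follows it starting with a slash.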

Shawn Guo answered Oct 26 '22 04:10

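Since you don't provide an authority, the URI should look like this:

hdfs:///inputFiles/CountOfMonteCristo/BookText.txt

Otherwise inputFiles is interpreted as a hostname. With correct configuration you shouldn't need a scheme at all and can use:

/inputFiles/CountOfMonteCristo/BookText.txt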
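The hostname behaviour can be checked without a cluster. This is a small sketch using Python's standard `urllib.parse`, not Hadoop's own Java URI handling, on the assumption that both split a URI into scheme, authority, and path in the same way. It compares the URI from the question with the three-slash form:

```python
from urllib.parse import urlparse

# Two slashes: the first path segment is parsed as the authority (host).
wrong = urlparse("hdfs://inputFiles/CountOfMonteCristo/BookText.txt")

# Three slashes: the authority is empty, so the whole path is preserved.
right = urlparse("hdfs:///inputFiles/CountOfMonteCristo/BookText.txt")

print(repr(wrong.netloc), wrong.path)
print(repr(right.netloc), right.path)
```

In the first case `inputFiles` ends up in the host position, which matches the UnknownHostException in the question.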



zero323 answered Oct 26 '22 05:10
