I'm trying to read a file from my HDFS. Here's a listing of my Hadoop file structure:
hduser@GVM:/usr/local/spark/bin$ hadoop fs -ls -R /
drwxr-xr-x - hduser supergroup 0 2016-03-06 17:28 /inputFiles
drwxr-xr-x - hduser supergroup 0 2016-03-06 17:31 /inputFiles/CountOfMonteCristo
-rw-r--r-- 1 hduser supergroup 2685300 2016-03-06 17:31 /inputFiles/CountOfMonteCristo/BookText.txt
Here's my pyspark code:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("myFirstApp").setMaster("local")
sc = SparkContext(conf=conf)
textFile = sc.textFile("hdfs://inputFiles/CountOfMonteCristo/BookText.txt")
textFile.first()
The error I get is:
Py4JJavaError: An error occurred while calling o64.partitions.
: java.lang.IllegalArgumentException: java.net.UnknownHostException: inputFiles
Is this because I'm setting up my SparkContext incorrectly? I'm running this in an Ubuntu 14.04 virtual machine through VirtualBox.
I'm not sure what I'm doing wrong here.
You can read this easily with Spark using the csv method, or by specifying format("csv"). In your case you should either not specify hdfs:// at all, or specify the complete path, e.g. hdfs://localhost:8020/input/housing.csv. Here is a snippet of code that can read a csv.
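A minimal sketch of such a snippet, assuming a SparkSession named spark; the path and port are the example values from above and may differ on your cluster:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("readCsv").getOrCreate()

# header/inferSchema are optional; adjust them to your file's layout.
df = spark.read.csv("hdfs://localhost:8020/input/housing.csv",
                    header=True, inferSchema=True)
df.show(5)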
For the write side, the Spark docs describe saveAsTextFile(path) like this: "Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file."
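For example, a minimal sketch (the output directory here is hypothetical):

rdd = sc.parallelize(["line one", "line two"])
# Produces a directory of part-* files on HDFS, one per partition.
rdd.saveAsTextFile("hdfs:///outputFiles/example")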
There are two general ways to read files in Spark: one for huge distributed files, to process them in parallel, and one for small files like lookup tables and configuration on HDFS. For the latter, you might want to read a file on the driver node or the workers as a single read (not a distributed read). In that case, you should use the SparkFiles module like below.
import json

from pyspark import SparkFiles

# spark is a SparkSession instance
spark.sparkContext.addFile('hdfs:///user/bekce/myfile.json')
with open(SparkFiles.get('myfile.json'), 'rb') as handle:
    j = json.load(handle)
    # ... do whatever with j
You can access HDFS files via the full path if no configuration is provided (namenodehost is your localhost if HDFS is running locally):
hdfs://namenodehost/inputFiles/CountOfMonteCristo/BookText.txt
Since you don't provide an authority, the URI should look like this:
hdfs:///inputFiles/CountOfMonteCristo/BookText.txt
otherwise inputFiles is interpreted as a hostname. With the correct configuration you shouldn't need a scheme at all and can use:
/inputFiles/CountOfMonteCristo/BookText.txt
instead.
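Applied to the code in the question, a minimal sketch of the fix (the namenode host and port 9000 below are assumptions; check fs.defaultFS in your core-site.xml for the real value):

# Empty authority: the namenode comes from your Hadoop configuration.
textFile = sc.textFile("hdfs:///inputFiles/CountOfMonteCristo/BookText.txt")

# Or spell out the full URI (host/port are assumptions; see fs.defaultFS).
# textFile = sc.textFile("hdfs://localhost:9000/inputFiles/CountOfMonteCristo/BookText.txt")
print(textFile.first())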