I have one question: how do I load a local file (not on HDFS, not on S3) with sc.textFile in PySpark?
I read this article, copied sales.csv to the master node's local filesystem (not HDFS), and finally executed the following:
sc.textFile("file:///sales.csv").count()
but it returned the following error, saying file:/sales.csv does not exist:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 10, ip-17x-xx-xx-xxx.ap-northeast-1.compute.internal): java.io.FileNotFoundException: File file:/sales.csv does not exist
I tried file://sales.csv
and file:/sales.csv
but both failed as well.
I would appreciate any advice on how to load a local file.
I confirmed that loading a file from HDFS or S3 works.
Here is the code for loading from HDFS: download the csv and copy it to HDFS in advance, then load it with sc.textFile("/path/at/hdfs"):
import commands  # Python 2 stdlib; in Python 3, use subprocess.check_output instead
commands.getoutput('wget -q https://raw.githubusercontent.com/phatak-dev/blog/master/code/DataSourceExamples/src/main/resources/sales.csv')
commands.getoutput('hadoop fs -copyFromLocal -f ./sales.csv /user/hadoop/')
sc.textFile("/user/hadoop/sales.csv").count()  # returns 15, the number of lines in the csv
Here is the code for loading from S3: put the csv file on S3 in advance, then load it with sc.textFile("s3n://path/on/s3") using the "s3n://" scheme:
sc.textFile("s3n://my-test-bucket/sales.csv").count() # also returns "15"
One of our Spark applications depends on a local file for some of its business logic. We can read the file by referring to it with a file:/// URI, but for this to work, a copy of the file must exist on every worker, or every worker must have access to a common shared drive, such as an NFS mount.
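As a rough sketch of that approach, assuming sales.csv has already been copied to the same (hypothetical) path on every worker, or sits on an NFS mount visible to all of them:

rdd = sc.textFile("file:///home/hadoop/sales.csv")  # each executor opens its own local copy
print(rdd.count())  # succeeds only once the path exists on every worker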
The file read occurs on the executor nodes, so for your code to work you should distribute the file to all nodes.
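One way to do that distribution from PySpark itself is sc.addFile, which ships a driver-local file to every executor; SparkFiles.get then resolves the node-local copy inside a task. A minimal sketch, assuming a hypothetical /home/hadoop/sales.csv on the driver:

from pyspark import SparkFiles

sc.addFile("/home/hadoop/sales.csv")  # ship the file to every executor

def count_lines(_):
    # inside a task, SparkFiles.get returns that node's local copy of the file
    with open(SparkFiles.get("sales.csv")) as f:
        return sum(1 for _ in f)

print(sc.parallelize([0], 1).map(count_lines).first())  # e.g. 15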
Alternatively, if the Spark driver program runs on the same machine where the file is located, you can read the file on the driver (e.g. with f = open("file").read() in Python) and then call sc.parallelize to convert the file's contents into an RDD.
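A minimal sketch of that workaround, assuming the file sits at a hypothetical /home/hadoop/sales.csv on the driver machine:

with open("/home/hadoop/sales.csv") as f:
    lines = f.read().splitlines()  # read entirely on the driver

rdd = sc.parallelize(lines)  # the content now lives in the cluster, not on local disk
print(rdd.count())  # e.g. 15, the number of lines in the csv

Note that this pulls the whole file into driver memory, so it only suits files small enough to fit there.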