I have one question: how do I load a local file (not on HDFS, not on S3) with sc.textFile in PySpark?
I read this article, copied sales.csv to the master node's local filesystem (not HDFS), and finally executed the following:
sc.textFile("file:///sales.csv").count()
but it returned the following error, saying file:/sales.csv does not exist:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 10, ip-17x-xx-xx-xxx.ap-northeast-1.compute.internal): java.io.FileNotFoundException: File file:/sales.csv does not exist
I tried file://sales.csv
and file:/sales.csv
but both failed as well.
I would appreciate any advice on how to load a local file.
I confirmed that loading a file from HDFS or S3 works.
Here is the code for loading from HDFS: download the csv and copy it to HDFS in advance, then load it with sc.textFile("/path/at/hdfs"):
import commands  # Python 2 stdlib; in Python 3, use subprocess.check_output instead
commands.getoutput('wget -q https://raw.githubusercontent.com/phatak-dev/blog/master/code/DataSourceExamples/src/main/resources/sales.csv')
commands.getoutput('hadoop fs -copyFromLocal -f ./sales.csv /user/hadoop/')
sc.textFile("/user/hadoop/sales.csv").count()  # returns 15, the number of lines in the csv
Here is the code for loading from S3: put the csv file on S3 in advance, then load it with sc.textFile("s3n://path/on/s3") using the "s3n://" scheme:
sc.textFile("s3n://my-test-bucket/sales.csv").count() # also returns "15"
One of our Spark applications depends on a local file for some of its business logic. We can read the file by referring to it with a file:/// URI, but for this to work, a copy of the file must exist on every worker, or every worker must have access to a common shared drive, such as an NFS mount.
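As a rough sketch of that approach, assuming sales.csv has already been copied to the same (hypothetical) path on every worker, or sits on an NFS mount visible to all of them:

rdd = sc.textFile("file:///home/hadoop/sales.csv")  # each executor opens its own local copy
print(rdd.count())  # succeeds only once the path exists on every worker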
The file read occurs on the executor nodes, so for your code to work you should distribute the file to all nodes.
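One way to do that distribution from PySpark itself is sc.addFile, which ships a driver-local file to every executor; SparkFiles.get then resolves the node-local copy inside a task. A minimal sketch, assuming a hypothetical /home/hadoop/sales.csv on the driver:

from pyspark import SparkFiles

sc.addFile("/home/hadoop/sales.csv")  # ship the file to every executor

def count_lines(_):
    # inside a task, SparkFiles.get returns that node's local copy of the file
    with open(SparkFiles.get("sales.csv")) as f:
        return sum(1 for _ in f)

print(sc.parallelize([0], 1).map(count_lines).first())  # e.g. 15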
Alternatively, if the Spark driver program runs on the same machine where the file is located, you can read the file on the driver (e.g. with f = open("file").read() in Python) and then call sc.parallelize to convert the file's contents into an RDD.
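A minimal sketch of that workaround, assuming the file sits at a hypothetical /home/hadoop/sales.csv on the driver machine:

with open("/home/hadoop/sales.csv") as f:
    lines = f.read().splitlines()  # read entirely on the driver

rdd = sc.parallelize(lines)  # the content now lives in the cluster, not on local disk
print(rdd.count())  # e.g. 15, the number of lines in the csv

Note that this pulls the whole file into driver memory, so it only suits files small enough to fit there.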