How to load local file in sc.textFile, instead of HDFS

People also ask

How do I access local files in spark?

To access a file that was distributed with SparkContext.addFile, use SparkFiles.get(fileName) to find its download location. A directory can be passed to addFile if the recursive option is set to true; currently, directories are supported only for Hadoop-supported filesystems.

How do I read a local file in Databricks?

If you use the Databricks Connect client library, you can read local files into memory on a remote Databricks Spark cluster. The alternative is to use the Databricks CLI (or REST API) to push local data to a location on DBFS, where it can be read into Spark from within a Databricks notebook.

Try explicitly specifying the scheme: sc.textFile("file:///path to the file/"). The error occurs when a Hadoop environment is configured.

SparkContext.textFile internally calls org.apache.hadoop.mapred.FileInputFormat.getSplits, which in turn uses org.apache.hadoop.fs.FileSystem.getDefaultUri if the scheme is absent. This method reads the "fs.defaultFS" parameter of the Hadoop configuration. If you set the HADOOP_CONF_DIR environment variable, the parameter is usually set to "hdfs://..."; otherwise it defaults to "file:///".
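The resolution rule described above can be sketched in a few lines of plain Python. This is a simplified model for illustration only, not Hadoop's actual code: the function name resolve_path is hypothetical, the default_fs argument stands in for the "fs.defaultFS" value, and the host namenode:8020 is a made-up example.

```python
from urllib.parse import urlparse


def resolve_path(path, default_fs="file:///"):
    """Return the URI that a path would resolve to under a given default FS.

    Simplified model of the behaviour described above; it handles only
    absolute paths and is not how Hadoop implements the lookup.
    """
    if urlparse(path).scheme:
        # An explicit scheme (file://, hdfs://, ...) always wins.
        return path
    if not path.startswith("/"):
        raise ValueError(f"expected an absolute path, got {path!r}")
    # No scheme: borrow scheme and authority from the default filesystem.
    base = urlparse(default_fs)
    return f"{base.scheme}://{base.netloc}{path}"


# Hypothetical cluster where fs.defaultFS points at HDFS.
print(resolve_path("/data/a.txt", "hdfs://namenode:8020"))
# The same path with an explicit file:// scheme is left untouched.
print(resolve_path("file:///data/a.txt", "hdfs://namenode:8020"))
```

Under this model, a bare /data/a.txt goes to HDFS whenever the default filesystem is HDFS, which is why writing the file:// scheme explicitly avoids the error.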

gonbe's answer is excellent. Still, I want to mention that file:/// refers to the filesystem root (that is, ~/../../ for a typical home directory), not $SPARK_HOME. I hope this saves some time for newbs like me.

While Spark supports loading files from the local filesystem, it requires that the files are available at the same path on all nodes in your cluster.

Some network filesystems, like NFS, AFS, and MapR’s NFS layer, are exposed to the user as a regular filesystem.

If your data is already in one of these systems, then you can use it as an input by just specifying a file:// path; Spark will handle it as long as the filesystem is mounted at the same path on each node.

 rdd = sc.textFile("file:///path/to/file")

If your file isn’t already on all nodes in the cluster, you can load it locally on the driver without going through Spark and then call parallelize to distribute the contents to workers.
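A minimal sketch of that approach in PySpark follows. The helper name load_on_driver and its parameters are hypothetical; the only Spark call used is SparkContext.parallelize. It assumes the whole file fits in the driver's memory and is UTF-8 text.

```python
def load_on_driver(sc, local_path, num_slices=None):
    """Read a driver-local text file and distribute its lines as an RDD.

    `sc` is an existing SparkContext. The file is read with plain Python
    on the driver, so it does not need to exist on any worker node.
    """
    with open(local_path, encoding="utf-8") as f:
        # Drop the trailing newline from each line.
        lines = [line.rstrip("\n") for line in f]
    # parallelize ships the in-memory list to the executors.
    return sc.parallelize(lines, num_slices)
```

With a running context, this would be called as rdd = load_on_driver(sc, "/path/to/file"). Because every line passes through the driver, this suits small files such as configuration or lookup data rather than large datasets.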

Take care to put file:// in front of the path, and to use "/" or "\" as the path separator according to your OS.
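One way to avoid separator mistakes is to build the URI from an absolute path in code. The helper below is a hypothetical sketch using only the standard library; it relies on the convention that file URIs use forward slashes even for Windows paths, and it does no percent-encoding.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_file_uri(path, windows=False):
    """Turn an absolute local path into a file:// URI string."""
    pure = PureWindowsPath(path) if windows else PurePosixPath(path)
    if not pure.is_absolute():
        raise ValueError(f"expected an absolute path, got {path!r}")
    # as_posix() converts backslashes to forward slashes.
    posix = pure.as_posix()
    # Windows paths start with a drive letter, so add the leading slash.
    if not posix.startswith("/"):
        posix = "/" + posix
    return "file://" + posix


print(to_file_uri("/data/input.txt"))
print(to_file_uri("C:\\data\\input.txt", windows=True))
```

The resulting string can then be passed to sc.textFile in place of a hand-written URI.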

Attention:

Make sure that you run Spark in local mode when you load data from the local filesystem (sc.textFile("file:///path to the file/")), or you will get an error like this: Caused by: java.io.FileNotFoundException: File file:/data/sparkjob/config2.properties does not exist. This happens because executors running on different workers will not find the file at that path on their own local filesystems.



Tags:

scala

apache-spark
