I am trying to perform data transformations using PySpark.
I have a text file with data(CHANGES.txt)
I am able to execute the command:
RDDread = sc.textFile("file:///home/test/desktop/CHANGES.txt")
but when I run:
RDDread.first()
then I get the error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/srv/spark/python/pyspark/rdd.py", line 1328, in first
rs = self.take(1)
File "/srv/spark/python/pyspark/rdd.py", line 1280, in take
totalParts = self.getNumPartitions()
File "/srv/spark/python/pyspark/rdd.py", line 356, in getNumPartitions
return self._jrdd.partitions().size()
File "/srv/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/srv/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/srv/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o256.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/test/desktop/CHANGES.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:60)
at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
It seems that is mentions the file path does not exist. How can I resolve this. I have python, Java, and spark installed on my linux machine.
If your running in a clustered mode you need to copy the file across all the nodes of same shared file system. Then spark reads that file otherwise you should use HDFS
I copied txt file into HDFS and spark takes file from HDFS.
I copied txt file on the shared filesystem of all nodes then spark read that file.
Both worked for me
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With