I am new to PySpark and I am trying to do a simple count, but it gives me the error below. The text file is in HDFS.
CODE:
>>> mydata = sc.textFile("hdfs://user/poem.txt")
>>> mydata.count()
ERROR:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.py", line 1008, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/usr/local/lib/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.py", line 999, in sum
return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
File "/usr/local/lib/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.py", line 873, in fold
vals = self.mapPartitions(func).collect()
File "/usr/local/lib/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.py", line 776, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/usr/local/lib/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/local/lib/spark-2.0.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'java.net.UnknownHostException: user'
You are missing a "/". With only two slashes after the scheme, the same error is reproduced:
r = sc.textFile("hdfs://user/myFile")
r.count()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p1464.1349/lib/spark/python/pyspark/rdd.py", line 1004, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p1464.1349/lib/spark/python/pyspark/rdd.py", line 995, in sum
return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
File "/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p1464.1349/lib/spark/python/pyspark/rdd.py", line 869, in fold
vals = self.mapPartitions(func).collect()
File "/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p1464.1349/lib/spark/python/pyspark/rdd.py", line 771, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p1464.1349/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p1464.1349/lib/spark/python/pyspark/sql/utils.py", line 53, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'java.net.UnknownHostException: user'
However, if you do
>>> r = sc.textFile("hdfs:///user/myFile")
>>> r.count()
318199
This is because hdfs:// is only the URI scheme; the fully qualified syntax is hdfs://<namenode-host>/<path>, or hdfs:/// (empty authority) to use the cluster's default NameNode. So with hdfs://user/myFile, Spark treats the token "user" as the NameNode host, which is exactly the UnknownHostException: user you are seeing.
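For reference, a minimal sketch of the two equivalent ways to spell the path, assuming your cluster's default filesystem is HDFS; the NameNode host and port below are placeholders, not values from your setup:
>>> # empty authority: resolve against the default NameNode (fs.defaultFS)
>>> mydata = sc.textFile("hdfs:///user/poem.txt")
>>> # or name the NameNode explicitly (host/port are placeholders)
>>> # mydata = sc.textFile("hdfs://namenode.example.com:8020/user/poem.txt")
>>> mydata.count()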