How to access a web URL using a spark context

Tags:

apache-spark

I am trying to read a web URL from spark-shell using the textFile method, but I get an error. This is probably not the right way to do it, so can someone please tell me how to access a web URL from a Spark context?

I am using Spark version 1.3.0, Scala version 2.10.4 and Java 1.7.0_21.

hduser@ubuntu:~$ spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_21)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val pagecount = sc.textFile( "https://www.google.co.in/?gws_rd=ssl" )
pagecount: org.apache.spark.rdd.RDD[String] = https://www.google.co.in/?gws_rd=ssl MapPartitionsRDD[1] at textFile at <console>:21

scala> pagecount.count()
java.io.IOException: No FileSystem for scheme: https
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1383)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
 at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
 at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1511)
 at org.apache.spark.rdd.RDD.count(RDD.scala:1006)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
 at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
 at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
 at $iwC$$iwC$$iwC.<init>(<console>:37)
 at $iwC$$iwC.<init>(<console>:39)
 at $iwC.<init>(<console>:41)
 at <init>(<console>:43)
 at .<init>(<console>:47)
 at .<clinit>(<console>)
 at .<init>(<console>:7)
 at .<clinit>(<console>)
 at $print(<console>)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
 at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
 at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
 at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
 at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
 at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
 at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
 at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
 at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
 at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
 at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
 at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
 at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
 at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


asked Apr 20 '15 06:04

Koushik Chandra

1 Answer

You cannot fetch URL content with textFile directly. According to its documentation, textFile is meant to:

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI

As you can see, HTTP and HTTPS URLs are not included, which is why the call fails with "No FileSystem for scheme: https".

Instead, you can fetch the content first and then turn it into an RDD.

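// textFile cannot read http(s), so download the page on the driver first
val html = scala.io.Source.fromURL("https://spark.apache.org/").mkString
// split into lines and drop the empty ones
val list = html.split("\n").filter(_.nonEmpty)
// distribute the lines as an RDD[String]
val rdds = sc.parallelize(list)
// count the lines that mention "Spark"
val count = rdds.filter(_.contains("Spark")).count()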
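
If you need this in more than one place, the fetch-and-split steps above could be pulled into a small helper. This is only a sketch: the object name UrlText and its two methods are hypothetical names chosen for illustration, it assumes the page is UTF-8 encoded, and unlike the snippet above it closes the source after reading and also drops whitespace-only lines.

```scala
import scala.io.Source

// Hypothetical helper; the names here are illustrative, not part of Spark.
object UrlText {
  // Download the body of `url` on the driver as a single string.
  // The encoding default is an assumption; the source is closed after reading.
  def fetch(url: String, encoding: String = "UTF-8"): String = {
    val source = Source.fromURL(url, encoding)
    try source.mkString finally source.close()
  }

  // Split text into lines (accepting \n and \r\n) and drop blank lines,
  // including lines that contain only whitespace.
  def nonEmptyLines(text: String): Seq[String] =
    text.split("\r?\n").toSeq.filter(_.trim.nonEmpty)
}
```

In spark-shell you would then pass the result to sc.parallelize, for example sc.parallelize(UrlText.nonEmptyLines(UrlText.fetch("https://spark.apache.org/"))). The download still happens on the driver, so this only suits pages that fit comfortably in driver memory.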


answered Sep 30 '22 16:09

chenzhongpu
