My development environment:
Dependencies:
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>2.2.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.10</artifactId>
<version>2.2.0</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>2.2.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.10.6</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-reflect -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-reflect</artifactId>
<version>2.10.6</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.4</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.4</version>
</dependency>
</dependencies>
Problem:
I want to read a remote CSV file into a DataFrame.
I tried the following:
val weburl = "http://myurl.com/file.csv"
val tfile = spark.read.option("header","true").option("inferSchema","true").csv(weburl)
It returns the following error:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: http
Following internet searches (including Stack Overflow), I tried this:
val content = scala.io.Source.fromURL(weburl).mkString
val list = content.split("\n")
// ...parse each line, cast the types, and split the rows into a DataFrame.
It works, but I think there should be a smarter way to load a CSV file from a web source.
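One way to avoid the hand-parsing step while still fetching on the driver: Spark 2.2 (the version in the dependencies above) added a `csv(Dataset[String])` overload, so the downloaded text can be handed straight to the CSV reader. A minimal sketch, assuming the question's URL and a plain comma-separated file:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("HttpCsv").getOrCreate()
import spark.implicits._

val weburl = "http://myurl.com/file.csv"

// Fetch the whole file on the driver (fine for small files only).
val content = scala.io.Source.fromURL(weburl).mkString
val lines = content.split("\n").filter(_.trim.nonEmpty)

// Let Spark's CSV parser handle quoting, headers, and type inference
// instead of splitting the strings by hand.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(spark.createDataset(lines))
```

Note this still pulls the entire file through the driver, so it only suits data that fits in driver memory.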
Is there any way for DataFrameReader to read CSV over HTTP?
I think setting SparkContext.hadoopConfiguration is the key, so I tried many snippets from the internet, but none worked, and I don't know what each configuration line means.
The following is one of my attempts; it didn't work (same error message when accessing "http"):
val sc = new SparkContext(spark_conf)
val spark = SparkSession.builder.appName("Test").getOrCreate()
val hconf = sc.hadoopConfiguration
hconf.set("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
hconf.set("fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
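For context on what these settings do (my understanding, so treat it as an assumption): each `fs.<scheme>.impl` key maps a URI scheme to a Hadoop `FileSystem` class, and Hadoop 2.7 ships classes for `hdfs://` and `file://` but, as far as I know, none for `http://`. That is why these settings leave the error unchanged:

```scala
// Each fs.<scheme>.impl entry tells Hadoop which FileSystem class
// handles a given URI scheme.
val hconf = spark.sparkContext.hadoopConfiguration
hconf.set("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
hconf.set("fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
// There is no bundled FileSystem implementation for the "http" scheme in
// Hadoop 2.7, so no combination of these settings alone can make
// spark.read.csv("http://...") work.
```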
Is setting this configuration the key, or not?
Or can DataFrameReader simply not read directly from a remote source? If so, how can I do it?
Do I need to import some special library for the HTTP scheme?
What I want to know:
Is there any way for DataFrameReader to read from an HTTP source,
without parsing the data myself (as in Best way to convert online csv to dataframe scala)?
I need to read CSV format. CSV is a standard format, so I'd expect a general way to read it, like dataframereader.csv("local file").
I know this question's level is low. I'm sorry for my low level of understanding.
As far as I know, it is not possible to read HTTP data directly. Probably the simplest thing you can do is to download the file using SparkFiles, but it will duplicate the data to each worker:
import org.apache.spark.SparkFiles
spark.sparkContext.addFile("http://myurl.com/file.csv")
spark.read.csv(SparkFiles.get("file.csv"))
Personally, I'd just download the file upfront and put it in distributed storage.
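If the header and schema options from the question are also needed, the same SparkFiles approach can carry them. A sketch under the question's URL assumption (the `file://` prefix makes the local path explicit, since `SparkFiles.get` returns a bare filesystem path):

```scala
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("HttpCsv").getOrCreate()

// Download once on the driver; Spark copies the file to each worker node.
spark.sparkContext.addFile("http://myurl.com/file.csv")

// SparkFiles.get resolves the local path of the downloaded copy on each node.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("file://" + SparkFiles.get("file.csv"))
```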