 

Reading data from a URL using Spark on the Databricks platform

I am trying to read data from a URL using Spark on the Databricks Community Edition platform. I tried spark.read.csv together with SparkFiles, but I am still missing some simple point.

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
# sc.addFile(url)
# sqlContext = SQLContext(sc)
# df = sqlContext.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema=True)

df = spark.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema=True)

I got a path-related error:

Path does not exist: dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-96358357fb75/adult.csv;'

I also tried another way:

val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv").mkString

# val list = content.split("\n").filter(_ != "")
val rdd = sc.parallelize(content)
val df = rdd.toDF

SyntaxError: invalid syntax
  File "<command-332010883169993>", line 16
    val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv").mkString
              ^
SyntaxError: invalid syntax
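
A side note on that SyntaxError: the snippet is Scala, so it cannot run in a Python notebook cell. A rough Python equivalent would be something like the following (a sketch only; it assumes a Databricks Python cell where `spark` is predefined, and `nonempty_lines` is a helper name introduced here for illustration):

```python
from urllib.request import urlopen

def nonempty_lines(text):
    """Split raw CSV text into lines, dropping empties (mirrors filter(_ != ""))."""
    return [line for line in text.split("\n") if line != ""]

# In a Databricks Python cell, where `spark` is predefined:
# url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
# content = urlopen(url).read().decode("utf-8")
# rdd = spark.sparkContext.parallelize(nonempty_lines(content))
# df = spark.read.csv(rdd, header=True, inferSchema=True)
```

(In PySpark, `spark.read.csv` also accepts an RDD of strings, which is why the parallelized lines can be passed to it directly.)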

Should the data be loaded directly into a Databricks folder, or should I be able to load it directly from the URL using spark.read? Any suggestions?

asked Jul 12 '19 by arya



1 Answer

Try this.

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)

df = spark.read.csv("file://" + SparkFiles.get("adult.csv"), header=True, inferSchema=True)

Fetching just a few columns of your CSV:

>>> df.select("age","workclass","fnlwgt","education").show(10)
+---+----------------+------+---------+
|age|       workclass|fnlwgt|education|
+---+----------------+------+---------+
| 39|       State-gov| 77516|Bachelors|
| 50|Self-emp-not-inc| 83311|Bachelors|
| 38|         Private|215646|  HS-grad|
| 53|         Private|234721|     11th|
| 28|         Private|338409|Bachelors|
| 37|         Private|284582|  Masters|
| 49|         Private|160187|      9th|
| 52|Self-emp-not-inc|209642|  HS-grad|
| 31|         Private| 45781|  Masters|
| 42|         Private|159449|Bachelors|
+---+----------------+------+---------+

SparkFiles.get returns the absolute path of the file on the local filesystem of your driver or worker. Without the file:// scheme, spark.read on Databricks resolves the path against DBFS instead, which is why it was not able to find it.
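
An alternative sketch that skips SparkFiles entirely: fetch the URL yourself and hand the parsed result to Spark via pandas. This assumes pandas is available on the cluster (it is on Databricks) and that `spark` is the notebook's SparkSession; `csv_text_to_pandas` is a helper name introduced here for illustration.

```python
import io
import pandas as pd

def csv_text_to_pandas(text):
    """Parse raw CSV text into a pandas DataFrame."""
    return pd.read_csv(io.StringIO(text))

# In a notebook cell:
# import urllib.request
# text = urllib.request.urlopen(url).read().decode("utf-8")
# df = spark.createDataFrame(csv_text_to_pandas(text))
```

This trades Spark's distributed read for a single-node download, which is fine for a file of this size but not for very large datasets.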

answered Sep 30 '22 by vikrant rana