 

Reading data from a URL using Spark on the Databricks platform

I am trying to read data from a URL using Spark on the Databricks Community Edition platform. I tried spark.read.csv together with SparkFiles, but I am still missing some simple point.

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
# sc.addFile(url)
# sqlContext = SQLContext(sc)
# df = sqlContext.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema=True)

df = spark.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema=True)

I got a path-related error:

Path does not exist: dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-96358357fb75/adult.csv;'

I also tried another way:

val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv").mkString

# val list = content.split("\n").filter(_ != "")
val rdd = sc.parallelize(content)
val df = rdd.toDF

SyntaxError: invalid syntax
  File "<command-332010883169993>", line 16
    val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv").mkString
              ^
SyntaxError: invalid syntax
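
A side note on that SyntaxError: the snippet is Scala, so it cannot run in a Python notebook cell. A rough Python equivalent would be something like the following (a sketch only; it assumes a Databricks Python cell where `spark` is predefined, and `nonempty_lines` is a helper name introduced here for illustration):

```python
from urllib.request import urlopen

def nonempty_lines(text):
    """Split raw CSV text into lines, dropping empties (mirrors filter(_ != ""))."""
    return [line for line in text.split("\n") if line != ""]

# In a Databricks Python cell, where `spark` is predefined:
# url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
# content = urlopen(url).read().decode("utf-8")
# rdd = spark.sparkContext.parallelize(nonempty_lines(content))
# df = spark.read.csv(rdd, header=True, inferSchema=True)
```

(In PySpark, `spark.read.csv` also accepts an RDD of strings, which is why the parallelized lines can be passed to it directly.)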

Should the data be loaded directly into a Databricks folder, or should I be able to load it directly from the URL using spark.read? Any suggestions?

asked Jul 12 '19 by arya



1 Answer

Try this.

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)

df = spark.read.csv("file://" + SparkFiles.get("adult.csv"), header=True, inferSchema=True)

Fetching just a few columns of your CSV:

>>> df.select("age","workclass","fnlwgt","education").show(10)
+---+----------------+------+---------+
|age|       workclass|fnlwgt|education|
+---+----------------+------+---------+
| 39|       State-gov| 77516|Bachelors|
| 50|Self-emp-not-inc| 83311|Bachelors|
| 38|         Private|215646|  HS-grad|
| 53|         Private|234721|     11th|
| 28|         Private|338409|Bachelors|
| 37|         Private|284582|  Masters|
| 49|         Private|160187|      9th|
| 52|Self-emp-not-inc|209642|  HS-grad|
| 31|         Private| 45781|  Masters|
| 42|         Private|159449|Bachelors|
+---+----------------+------+---------+

SparkFiles.get returns the absolute path of the file on the local filesystem of your driver or worker. Without the file:// scheme, spark.read on Databricks resolves the path against DBFS instead, which is why it was not able to find it.
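
An alternative sketch that skips SparkFiles entirely: fetch the URL yourself and hand the parsed result to Spark via pandas. This assumes pandas is available on the cluster (it is on Databricks) and that `spark` is the notebook's SparkSession; `csv_text_to_pandas` is a helper name introduced here for illustration.

```python
import io
import pandas as pd

def csv_text_to_pandas(text):
    """Parse raw CSV text into a pandas DataFrame."""
    return pd.read_csv(io.StringIO(text))

# In a notebook cell:
# import urllib.request
# text = urllib.request.urlopen(url).read().decode("utf-8")
# df = spark.createDataFrame(csv_text_to_pandas(text))
```

This trades Spark's distributed read for a single-node download, which is fine for a file of this size but not for very large datasets.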

answered Sep 30 '22 by vikrant rana