I am trying to read data from a URL using Spark on the Databricks Community Edition platform. I tried spark.read.csv together with SparkFiles, but I am still missing some simple point.
url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
# sc.addFile(url)
# sqlContext = SQLContext(sc)
# df = sqlContext.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True)
df = spark.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True)
This fails with a path-related error:
Path does not exist: dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-96358357fb75/adult.csv;'
I also tried another way:
val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv").mkString
# val list = content.split("\n").filter(_ != "")
val rdd = sc.parallelize(content)
val df = rdd.toDF
File "<command-332010883169993>", line 16
  val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv").mkString
      ^
SyntaxError: invalid syntax
Should the data first be loaded into a Databricks folder, or should I be able to load it directly from the URL with spark.read? Any suggestions?
Try this.
url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
# Prefix the local path with file:// so Spark reads from the driver's filesystem:
df = spark.read.csv("file://" + SparkFiles.get("adult.csv"), header=True, inferSchema=True)
Fetching just a few columns of the CSV:
df.select("age", "workclass", "fnlwgt", "education").show(10)
+---+----------------+------+---------+
|age| workclass|fnlwgt|education|
+---+----------------+------+---------+
| 39| State-gov| 77516|Bachelors|
| 50|Self-emp-not-inc| 83311|Bachelors|
| 38| Private|215646| HS-grad|
| 53| Private|234721| 11th|
| 28| Private|338409|Bachelors|
| 37| Private|284582| Masters|
| 49| Private|160187| 9th|
| 52|Self-emp-not-inc|209642| HS-grad|
| 31| Private| 45781| Masters|
| 42| Private|159449|Bachelors|
+---+----------------+------+---------+
SparkFiles.get returns an absolute path on the local filesystem of your driver (or workers). On Databricks, spark.read.csv resolves a bare path against DBFS by default, which is why the file was not found; the file:// prefix tells Spark to read from the local filesystem instead.
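As for the second attempt in the question: that snippet is Scala, so it raises a SyntaxError in a Python notebook. The same idea works in Python, since spark.read.csv in PySpark also accepts an RDD of CSV-row strings. A minimal sketch using only the standard library:

import urllib.request

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"

# Download the file and split it into non-empty lines.
content = urllib.request.urlopen(url).read().decode("utf-8")
lines = [line for line in content.split("\n") if line]

# Parallelize the rows; spark.read.csv accepts an RDD of strings.
rdd = spark.sparkContext.parallelize(lines)
df = spark.read.csv(rdd, header=True, inferSchema=True)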
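Alternatively, you can skip SparkFiles entirely and let pandas fetch the URL (pandas ships with the Databricks runtimes), then convert to a Spark DataFrame. A minimal sketch; note that this pulls the whole file onto the driver, so it only suits small datasets:

import pandas as pd

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"

# pandas reads straight from the URL; Spark takes over from there.
pdf = pd.read_csv(url)
df = spark.createDataFrame(pdf)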