How can I read a CSV at a URL into a DataFrame in PySpark without writing it to disk?
I've tried the following with no luck:
import urllib.request
from io import StringIO
url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"
response = urllib.request.urlopen(url)
data = response.read()
text = data.decode('utf-8')
f = StringIO(text)
df1 = sqlContext.read.csv(f, header = True, schema=customSchema)
df1.show()
TL;DR It is not possible, and in general transferring data through the driver is a dead end.
The csv reader can read only from a URI (and http is not supported). In Spark 2.3 you can use an RDD:
spark.read.csv(sc.parallelize(text.splitlines()))
but data will be written to disk.
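For context, a minimal end-to-end sketch of this RDD route (assuming Spark 2.3+, a SparkSession named spark, and the iris URL from the question):

import urllib.request
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"
text = urllib.request.urlopen(url).read().decode("utf-8")

# In Spark 2.3+ DataFrameReader.csv also accepts an RDD of CSV strings
lines = spark.sparkContext.parallelize(text.splitlines())
df = spark.read.csv(lines, header=True, inferSchema=True)
df.show()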
You can also createDataFrame from a Pandas DataFrame:
spark.createDataFrame(pd.read_csv(url))
but this once again writes the data to disk.
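A self-contained sketch of the Pandas route (again assuming a SparkSession named spark; pandas fetches the URL itself, so no urllib call is needed):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"

# pandas downloads the CSV over HTTP into a local DataFrame,
# which Spark then converts into a distributed DataFrame
pdf = pd.read_csv(url)
df = spark.createDataFrame(pdf)
df.show()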
If the file is small, I'd just use SparkFiles:
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
spark.read.csv(SparkFiles.get("iris.csv"), header=True)
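Put together with a schema, the SparkFiles route could look like the sketch below. The customSchema here is hypothetical (the question never shows its definition), and the column names simply assume the standard iris layout:

from pyspark import SparkFiles
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType, StringType

spark = SparkSession.builder.getOrCreate()

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"

# Hypothetical schema standing in for the question's customSchema;
# the column names assume the usual iris columns
customSchema = StructType([
    StructField("SepalLength", DoubleType(), True),
    StructField("SepalWidth", DoubleType(), True),
    StructField("PetalLength", DoubleType(), True),
    StructField("PetalWidth", DoubleType(), True),
    StructField("Name", StringType(), True),
])

spark.sparkContext.addFile(url)        # download once via the driver
path = SparkFiles.get("iris.csv")      # local path to the fetched copy
df = spark.read.csv(path, header=True, schema=customSchema)
df.show()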