
pyspark csv at url to dataframe, without writing to disk

How can I read a csv at a url into a dataframe in Pyspark without writing it to disk?

I've tried the following with no luck:

import urllib.request
from io import StringIO

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"
response = urllib.request.urlopen(url)
data = response.read()      
text = data.decode('utf-8')  


f = StringIO(text)

df1 = sqlContext.read.csv(f, header = True, schema=customSchema)
df1.show()
asked Dec 16 '17 by RobinL
1 Answer

TL;DR It is not possible, and in general transferring data through the driver is a dead end.

  • Before Spark 2.3, the csv reader can read only from a URI (and http is not supported).
  • In Spark 2.3 you can pass an RDD (see the sketch after this list):

    spark.read.csv(sc.parallelize(text.splitlines()))

    but the data will still be written to disk.

  • You can use createDataFrame with Pandas:

    spark.createDataFrame(pd.read_csv(url))

    but this once again writes to disk.
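
For completeness, here is a minimal end-to-end sketch of the Spark 2.3+ RDD route; the spark session, the customSchema and the iris URL are assumptions carried over from the question:

import urllib.request

# Fetch the CSV over HTTP and keep it in memory on the driver.
url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"
text = urllib.request.urlopen(url).read().decode("utf-8")

# Spark 2.3+ lets csv() take an RDD of strings, one CSV line per element.
lines = spark.sparkContext.parallelize(text.splitlines())

df = spark.read.csv(lines, header=True, schema=customSchema)
df.show()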

If the file is small, I'd just use SparkFiles:

from pyspark import SparkFiles

spark.sparkContext.addFile(url)

spark.read.csv(SparkFiles.get("iris.csv"), header=True)
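
As a usage note, a self-contained version of the SparkFiles route might look like the following sketch; the session setup, app name, and the iris URL are assumptions, and note that addFile still downloads the file into Spark's temporary directory on each node:

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-from-url").getOrCreate()

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"

# addFile fetches the file once and makes it resolvable on every node
# via SparkFiles.get(), which returns the local path of the download.
spark.sparkContext.addFile(url)

df = spark.read.csv(SparkFiles.get("iris.csv"), header=True, inferSchema=True)
df.show()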
answered Oct 13 '22 by Alper t. Turker