
pyspark csv at url to dataframe, without writing to disk

How can I read a csv at a url into a dataframe in Pyspark without writing it to disk?

I've tried the following with no luck:

import urllib.request
from io import StringIO

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"
response = urllib.request.urlopen(url)
data = response.read()      
text = data.decode('utf-8')  


f = StringIO(text)

df1 = sqlContext.read.csv(f, header = True, schema=customSchema)
df1.show()
asked Dec 16 '17 by RobinL
1 Answer

TL;DR It is not possible, and in general transferring data through the driver is a dead end.

  • Before Spark 2.3, the csv reader can read only from a URI (and http is not supported).
  • In Spark 2.3 you can pass an RDD (see the sketch after this list):

    spark.read.csv(sc.parallelize(text.splitlines()))

    but the data will still be written to disk.

  • You can use createDataFrame with Pandas:

    spark.createDataFrame(pd.read_csv(url))

    but this once again writes to disk.
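
For completeness, here is a minimal end-to-end sketch of the Spark 2.3+ RDD route; the spark session, the customSchema and the iris URL are assumptions carried over from the question:

import urllib.request

# Fetch the CSV over HTTP and keep it in memory on the driver.
url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"
text = urllib.request.urlopen(url).read().decode("utf-8")

# Spark 2.3+ lets csv() take an RDD of strings, one CSV line per element.
lines = spark.sparkContext.parallelize(text.splitlines())

df = spark.read.csv(lines, header=True, schema=customSchema)
df.show()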

If the file is small, I'd just use SparkFiles:

from pyspark import SparkFiles

spark.sparkContext.addFile(url)

spark.read.csv(SparkFiles.get("iris.csv"), header=True)
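
As a usage note, a self-contained version of the SparkFiles route might look like the following sketch; the session setup, app name, and the iris URL are assumptions, and note that addFile still downloads the file into Spark's temporary directory on each node:

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-from-url").getOrCreate()

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"

# addFile fetches the file once and makes it resolvable on every node
# via SparkFiles.get(), which returns the local path of the download.
spark.sparkContext.addFile(url)

df = spark.read.csv(SparkFiles.get("iris.csv"), header=True, inferSchema=True)
df.show()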
answered Oct 13 '22 by Alper t. Turker