 

Load an Amazon S3 file whose filename contains colons through PySpark

I have an S3 bucket containing multiple files with colons in their file names.

Example:

s3://my_bucket/my_data/en/2015120/batch:222:111:00000.jl.gz

I am trying to load this into a Spark RDD and access the first line as follows.

my_data = sc.textFile("s3://my_bucket/my_data/en/2015120/batch:222:111:00000.jl.gz")
my_data.take(1)

But this throws:

IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 

Any suggestions on how to load these files, either individually or, preferably, as a whole folder?

asked Sep 21 '25 by rclakmal

1 Answer

I got it to work by URL-encoding the colons in the object key.

i.e.

: is replaced with %3A
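A minimal sketch of that encoding step in Python, using the standard-library urllib.parse.quote (the bucket and key names below are the ones from the question; keeping "/" in the safe set leaves the path separators intact):

```python
from urllib.parse import quote

# Key from the question, containing colons that break Hadoop's URI parsing.
key = "my_data/en/2015120/batch:222:111:00000.jl.gz"

# Percent-encode the colons (':' -> '%3A') but leave '/' as a path separator.
encoded = quote(key, safe="/")
path = "s3://my_bucket/" + encoded
# path == "s3://my_bucket/my_data/en/2015120/batch%3A222%3A111%3A00000.jl.gz"

# Then load as usual:
# my_data = sc.textFile(path)
# my_data.take(1)
```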

To double-check the encoded form, click on one of the objects in the S3 console and look at its "Link" URL.

[Screenshot: S3 console showing an object's URL-encoded "Link" URL]

answered Sep 22 '25 by Dominic Cabral