
Scala code doesn't fetch S3 file

I am trying to run an EMR Scalding job in which the Scala code is supposed to fetch the content of a text file located in an S3 bucket. The scala.io.Source library seems to mangle the S3 path.

I pass the parameter runidfile to the EMR job:

--runidfile s3://my-bucket/input.txt

The Scala code does the following:

val runid_path = args("runidfile")
val runid = Source.fromFile(runid_path).getLines().mkString

The code somehow doesn't accept the "//" in the S3 path and I get an error:

Caused by: java.io.FileNotFoundException: s3:/my-bucket/input.txt (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at scala.io.Source$.fromFile(Source.scala:90)
at scala.io.Source$.fromFile(Source.scala:75)
at scala.io.Source$.fromFile(Source.scala:53)
at com.move.scalding.userEvents.RecommenderValidator.<init>(RecommenderValidator.scala:37)

Is there a solution or workaround for this? I tried using Source.fromURL, but s3 is not a valid URL protocol, so it doesn't accept it.

Rachit Raut asked Sep 16 '15 23:09



1 Answer

The scala.io.Source library is not meant to access files directly from Amazon S3. You need another library for that.
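
The stack trace in the question shows Source.fromFile ending up in java.io.FileInputStream, so the argument is treated as a local file path rather than a URI. That would also explain the "s3:/" in the error message: on Unix-like systems java.io.File collapses repeated slashes. A minimal sketch of that effect, assuming a Linux or macOS JVM (the result differs on Windows):

```scala
import java.io.File

// java.io.File interprets the string as a local path, not as a URI,
// and normalizes it according to the rules of the host file system.
val asLocalPath = new File("s3://my-bucket/input.txt").getPath
println(asLocalPath)
```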

You can use the official AWS SDK for Java. Here is some sample code (pieced together from this question and its answers):

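import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.GetObjectRequest
import scala.io.Source

val credentials = new BasicAWSCredentials("myKey", "mySecretKey") // placeholder credentials
val s3Client = new AmazonS3Client(credentials)
// Fetch the object by bucket name and key, then wrap its content stream in a Source
val s3Object = s3Client.getObject(new GetObjectRequest("my-bucket", "input.txt"))
val myData = Source.fromInputStream(s3Object.getObjectContent())

val runid = myData.getLines().mkString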
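
The sample above hard-codes the bucket and key, while the job in the question receives a single s3:// path through --runidfile. One possible way to bridge the two is to split the path with java.net.URI. This is only a sketch: splitS3Path is a hypothetical helper name, not part of the AWS SDK, and it assumes the key contains no characters that would need URI escaping (such as spaces).

```scala
import java.net.URI

// Hypothetical helper (not part of the AWS SDK):
// splits "s3://bucket/some/key" into (bucket, key).
def splitS3Path(path: String): (String, String) = {
  val uri = new URI(path)
  require(uri.getScheme == "s3", s"not an s3:// path: $path")
  val bucket = uri.getAuthority           // text between "s3://" and the next "/"
  val key = uri.getPath.stripPrefix("/")  // object key without the leading slash
  (bucket, key)
}

val (bucket, key) = splitS3Path("s3://my-bucket/input.txt")
```

The resulting bucket and key could then be passed to new GetObjectRequest(bucket, key) in place of the literal strings.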
Sven Koschnicke answered Sep 18 '22 21:09
