I am loading a CSV text file from S3 into Spark, filtering and mapping the records, and writing the result back to S3.
I have tried several input sizes: 100k rows, 1M rows, and 3.5M rows.
The first two finish successfully, while the last one (3.5M rows) hangs in a strange state: the job stages monitoring web app (the one on port 4040) stops responding, and the command-line console gets stuck and does not even respond to Ctrl-C. The master's web monitoring app still responds and shows the state as FINISHED.
In S3, I see an empty directory with a single zero-sized entry named _temporary_$folder$. The S3 URL is given using the s3n:// protocol.
I did not see any errors in the logs in the web console. I also tried several cluster sizes (1 master + 1 worker, 1 master + 5 workers) and ended up in the same state.
Has anyone encountered such an issue? Any idea what's going on?
It's possible you are running up against the 5 GB object-size limitation of the s3n FileSystem. You may be able to get around this by using the s3 FileSystem (not s3n), or by partitioning your output.
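To make the partitioning workaround concrete, here is a minimal Scala sketch (not from the original answer; the helper name, the 0.8 headroom factor, and the bucket paths are my own illustrative assumptions). The idea is to pick enough partitions that each output part file stays safely under S3's 5 GB single-object limit before calling saveAsTextFile:

```scala
// Sketch: size output partitions so each part file stays under the
// 5 GB limit that s3n inherits from S3's single-object size cap.
object PartitionSizing {
  // S3's per-object limit as enforced by the s3n FileSystem.
  val MaxS3nObjectBytes: Long = 5L * 1024 * 1024 * 1024

  /** Minimum number of partitions so that each part file stays under
    * the 5 GB limit, with some headroom (default: use only 80% of it). */
  def minPartitions(estimatedOutputBytes: Long, headroom: Double = 0.8): Int = {
    val targetBytesPerPart = (MaxS3nObjectBytes * headroom).toLong
    math.max(1, math.ceil(estimatedOutputBytes.toDouble / targetBytesPerPart).toInt)
  }
}

// In the Spark job this would be used roughly as:
//   val n = PartitionSizing.minPartitions(estimatedOutputBytes)
//   result.repartition(n).saveAsTextFile("s3n://my-bucket/output")
```

Estimating the output size is up to you (e.g. extrapolate from the smaller runs that did succeed); the helper just turns that estimate into a partition count.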
Here's what the AmazonS3 page on the Hadoop Wiki says:
S3 Native FileSystem (URI scheme: s3n) A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. [...] The disadvantage is the 5GB limit on file size imposed by S3.
...
S3 Block FileSystem (URI scheme: s3) A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem [...] The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.
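If you go the s3:// route, the classic Hadoop configuration for it looks roughly like the fragment below (a sketch for old Hadoop versions of that era; substitute your own bucket and credentials, and remember the quoted caveat that the bucket becomes dedicated to the block filesystem and its contents are not readable by other S3 tools):

```xml
<!-- core-site.xml: credentials for the s3 (block-based) FileSystem -->
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```

Your job's output path then changes from s3n://bucket/path to s3://bucket/path.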
...
AmazonS3 (last edited 2014-07-01 13:27:49 by SteveLoughran)