
Spark Write to S3 V4 SignatureDoesNotMatch Error

I encounter an S3 SignatureDoesNotMatch error while trying to write a DataFrame to S3 with Spark.

The symptoms and things I have tried:

  • The code sometimes fails and sometimes works;
  • The code can read from S3 without any problem, and is able to write to S3 from time to time, which rules out wrong config settings such as S3A / enableV4 / wrong key / region endpoint, etc.;
  • The S3A endpoint has been set according to the S3 docs (S3 Endpoint);
  • Made sure the AWS secret key does not contain any non-alphanumeric characters, as suggested here;
  • Made sure the server time is in sync by using NTP;
  • The following was tested on an EC2 m3.xlarge with spark-2.0.2-bin-hadoop2.7 running in local mode;
  • The issue is gone when the files are written to the local fs;
  • Right now the workaround is to mount the bucket with s3fs and write there; however, this is not ideal as s3fs dies quite often under the stress Spark puts on it.

The code can be boiled down to:

spark-submit\
    --verbose\
    --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
    --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem\
    --packages org.apache.hadoop:hadoop-aws:2.7.3\
    --driver-java-options '-Dcom.amazonaws.services.s3.enableV4'\
    foobar.py


# foobar.py
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
# S3A credentials and region endpoint (keys redacted)
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", 'xxx')
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", 'xxx')
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", 's3.dualstack.ap-southeast-2.amazonaws.com')

hc = SparkSession.builder.enableHiveSupport().getOrCreate()
dataframe = hc.read.parquet(in_file_path)

dataframe.write.csv(
    path=out_file_path,
    mode='overwrite',
    compression='gzip',
    sep=',',
    quote='"',
    escape='\\',
    escapeQuotes='true',
)

Spark then throws the SignatureDoesNotMatch error.


With log4j set to verbose, it appears the following happened:

  • Each individual partition is written to a staging location on S3, /_temporary/foobar.part-xxx;
  • A PUT call then moves each partition into its final location;
  • After a few successful PUT calls, all subsequent PUT calls fail with a 403;
  • As the requests are made by aws-java-sdk, I am not sure what to do at the application level. The following log is from another event with the exact same error:

 >> PUT XXX/part-r-00025-ae3d5235-932f-4b7d-ae55-b159d1c1343d.gz.parquet HTTP/1.1
 >> Host: XXX.s3-ap-southeast-2.amazonaws.com
 >> x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
 >> X-Amz-Date: 20161104T005749Z
 >> x-amz-metadata-directive: REPLACE
 >> Connection: close
 >> User-Agent: aws-sdk-java/1.10.11 Linux/3.13.0-100-generic OpenJDK_64-Bit_Server_VM/25.91-b14/1.8.0_91 com.amazonaws.services.s3.transfer.TransferManager/1.10.11
 >> x-amz-server-side-encryption-aws-kms-key-id: 5f88a222-715c-4a46-a64c-9323d2d9418c
 >> x-amz-server-side-encryption: aws:kms
 >> x-amz-copy-source: /XXX/_temporary/0/task_201611040057_0001_m_000025/part-r-00025-ae3d5235-932f-4b7d-ae55-b159d1c1343d.gz.parquet
 >> Accept-Ranges: bytes
 >> Authorization: AWS4-HMAC-SHA256 Credential=AKIAJZCSOJPB5VX2B6NA/20161104/ap-southeast-2/s3/aws4_request, SignedHeaders=accept-ranges;connection;content-length;content-type;etag;host;last-modified;user-agent;x-amz-content-sha256;x-amz-copy-source;x-amz-date;x-amz-metadata-directive;x-amz-server-side-encryption;x-amz-server-side-encryption-aws-kms-key-id, Signature=48e5fe2f9e771dc07a9c98c7fd98972a99b53bfad3b653151f2fcba67cff2f8d
 >> ETag: 31436915380783143f00299ca6c09253
 >> Content-Type: application/octet-stream
 >> Content-Length: 0
DEBUG wire:  << "HTTP/1.1 403 Forbidden[\r][\n]"
DEBUG wire:  << "x-amz-request-id: 849F990DDC1F3684[\r][\n]"
DEBUG wire:  << "x-amz-id-2: 6y16TuQeV7CDrXs5s7eHwhrpa1Ymf5zX3IrSuogAqz9N+UN2XdYGL2FCmveqKM2jpGiaek5rUkM=[\r][\n]"
DEBUG wire:  << "Content-Type: application/xml[\r][\n]"
DEBUG wire:  << "Transfer-Encoding: chunked[\r][\n]"
DEBUG wire:  << "Date: Fri, 04 Nov 2016 00:57:48 GMT[\r][\n]"
DEBUG wire:  << "Server: AmazonS3[\r][\n]"
DEBUG wire:  << "Connection: close[\r][\n]"
DEBUG wire:  << "[\r][\n]"
DEBUG DefaultClientConnection: Receiving response: HTTP/1.1 403 Forbidden
 << HTTP/1.1 403 Forbidden
 << x-amz-request-id: 849F990DDC1F3684
 << x-amz-id-2: 6y16TuQeV7CDrXs5s7eHwhrpa1Ymf5zX3IrSuogAqz9N+UN2XdYGL2FCmveqKM2jpGiaek5rUkM=
 << Content-Type: application/xml
 << Transfer-Encoding: chunked
 << Date: Fri, 04 Nov 2016 00:57:48 GMT
 << Server: AmazonS3
 << Connection: close
DEBUG requestId: x-amzn-RequestId: not available
asked Dec 15 '16 by Paul Liang


2 Answers

I experienced exactly the same problem and found a solution with the help of this article (other resources point in the same direction). After setting these configuration options, writing to S3 succeeded:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.speculation false

I am using Spark 2.1.1 with Hadoop 2.7. My final spark-submit command looked like this:

spark-submit \
    --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
    --conf spark.hadoop.fs.s3a.endpoint=s3.eu-central-1.amazonaws.com \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
    --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
    --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
    --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
    --conf spark.speculation=false \
    ...

Additionally, I defined these environment variables:

AWS_ACCESS_KEY_ID=****
AWS_SECRET_ACCESS_KEY=****
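
For reference, a minimal PySpark sketch (not the answerer's exact code) that applies the same committer and speculation settings programmatically; the bucket paths are placeholders, and the enableV4 Java options still need to be passed on the spark-submit line, since they must be set before the driver JVM starts:

# Minimal sketch, assuming the same Hadoop 2.7 / s3a setup as above.
# Bucket paths are hypothetical; credentials come from the environment
# variables AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY mentioned above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Commit task output directly to its final location (algorithm v2)
    # instead of relying on the rename-based v1 commit.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    # No speculative duplicate task attempts racing on the same S3 keys.
    .config("spark.speculation", "false")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/input/")          # placeholder path
df.write.mode("overwrite").csv("s3a://my-bucket/output/")  # placeholder path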
answered by Christoph Hösler


I had the same issue and resolved it by upgrading from aws-java-sdk:1.7.4 to aws-java-sdk:1.11.199 and hadoop-aws:2.7.7 to hadoop-aws:3.0.0.

However, to avoid dependency mismatches when interacting with AWS, I had to rebuild Spark and provide it with my own Hadoop 3.0.0 build.

I speculate that the root cause is the way the v4 signature algorithm takes in the current timestamp: all Spark executors use the same signature to authenticate their PUT requests, but if one slips outside the 'window' of time allowed by the algorithm, that request and all subsequent requests fail, causing Spark to roll back the changes and error out. This explains why calling .coalesce(1) or .repartition(1) always works, while the failure rate climbs in proportion to the number of partitions being written.
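
As a hedged illustration of that workaround (paths are placeholders, not the answerer's code), writing through a single partition keeps the whole commit on one executor, at the cost of parallelism:

# Sketch only: a single partition means a single writer, so only one
# signed PUT/copy sequence is involved in the commit.
df = spark.read.parquet("s3a://my-bucket/input/")   # placeholder path

(
    df.coalesce(1)          # or .repartition(1)
      .write
      .mode("overwrite")
      .csv("s3a://my-bucket/output/", compression="gzip")
)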

answered by sid