I encounter an S3 SignatureDoesNotMatch error while trying to write a DataFrame to S3 with Spark.

Symptoms / things I have tried:
- The AWS secret key does not contain any non-alphanumeric characters, as suggested here;
- m3.xlarge with spark-2.0.2-bin-hadoop2.7, running in local mode.

The code can be boiled down to:
spark-submit \
    --verbose \
    --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
    --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
    --packages org.apache.hadoop:hadoop-aws:2.7.3 \
    --driver-java-options '-Dcom.amazonaws.services.s3.enableV4' \
    foobar.py
# foobar.py
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", 'xxx')
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", 'xxx')
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", 's3.dualstack.ap-southeast-2.amazonaws.com')

hc = SparkSession.builder.enableHiveSupport().getOrCreate()
dataframe = hc.read.parquet(in_file_path)
dataframe.write.csv(
    path=out_file_path,
    mode='overwrite',
    compression='gzip',
    sep=',',
    quote='"',
    escape='\\',
    escapeQuotes='true',
)
Spark spills the following error. With log4j set to verbose, it appears the following had happened: the output parts are first written to a temporary location such as /_temporary/foorbar.part-xxx, and the subsequent PUT (server-side copy) of each part to its final key is rejected with 403 Forbidden:
>> PUT XXX/part-r-00025-ae3d5235-932f-4b7d-ae55-b159d1c1343d.gz.parquet HTTP/1.1
>> Host: XXX.s3-ap-southeast-2.amazonaws.com
>> x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
>> X-Amz-Date: 20161104T005749Z
>> x-amz-metadata-directive: REPLACE
>> Connection: close
>> User-Agent: aws-sdk-java/1.10.11 Linux/3.13.0-100-generic OpenJDK_64-Bit_Server_VM/25.91-b14/1.8.0_91 com.amazonaws.services.s3.transfer.TransferManager/1.10.11
>> x-amz-server-side-encryption-aws-kms-key-id: 5f88a222-715c-4a46-a64c-9323d2d9418c
>> x-amz-server-side-encryption: aws:kms
>> x-amz-copy-source: /XXX/_temporary/0/task_201611040057_0001_m_000025/part-r-00025-ae3d5235-932f-4b7d-ae55-b159d1c1343d.gz.parquet
>> Accept-Ranges: bytes
>> Authorization: AWS4-HMAC-SHA256 Credential=AKIAJZCSOJPB5VX2B6NA/20161104/ap-southeast-2/s3/aws4_request, SignedHeaders=accept-ranges;connection;content-length;content-type;etag;host;last-modified;user-agent;x-amz-content-sha256;x-amz-copy-source;x-amz-date;x-amz-metadata-directive;x-amz-server-side-encryption;x-amz-server-side-encryption-aws-kms-key-id, Signature=48e5fe2f9e771dc07a9c98c7fd98972a99b53bfad3b653151f2fcba67cff2f8d
>> ETag: 31436915380783143f00299ca6c09253
>> Content-Type: application/octet-stream
>> Content-Length: 0
DEBUG wire: << "HTTP/1.1 403 Forbidden[\r][\n]"
DEBUG wire: << "x-amz-request-id: 849F990DDC1F3684[\r][\n]"
DEBUG wire: << "x-amz-id-2: 6y16TuQeV7CDrXs5s7eHwhrpa1Ymf5zX3IrSuogAqz9N+UN2XdYGL2FCmveqKM2jpGiaek5rUkM=[\r][\n]"
DEBUG wire: << "Content-Type: application/xml[\r][\n]"
DEBUG wire: << "Transfer-Encoding: chunked[\r][\n]"
DEBUG wire: << "Date: Fri, 04 Nov 2016 00:57:48 GMT[\r][\n]"
DEBUG wire: << "Server: AmazonS3[\r][\n]"
DEBUG wire: << "Connection: close[\r][\n]"
DEBUG wire: << "[\r][\n]"
DEBUG DefaultClientConnection: Receiving response: HTTP/1.1 403 Forbidden
<< HTTP/1.1 403 Forbidden
<< x-amz-request-id: 849F990DDC1F3684
<< x-amz-id-2: 6y16TuQeV7CDrXs5s7eHwhrpa1Ymf5zX3IrSuogAqz9N+UN2XdYGL2FCmveqKM2jpGiaek5rUkM=
<< Content-Type: application/xml
<< Transfer-Encoding: chunked
<< Date: Fri, 04 Nov 2016 00:57:48 GMT
<< Server: AmazonS3
<< Connection: close
DEBUG requestId: x-amzn-RequestId: not available
I experienced exactly the same problem and found a solution with the help of this article (other resources point in the same direction). After setting these configuration options, writing to S3 succeeded:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.speculation false
I am using Spark 2.1.1 with Hadoop 2.7. My final spark-submit command looked like this:
spark-submit \
    --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
    --conf spark.hadoop.fs.s3a.endpoint=s3.eu-central-1.amazonaws.com \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
    --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
    --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
    --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
    --conf spark.speculation=false \
    ...
Additionally, I defined these environment variables:
AWS_ACCESS_KEY_ID=****
AWS_SECRET_ACCESS_KEY=****
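If it helps, most of these settings can also be applied programmatically from PySpark rather than on the spark-submit command line. This is only a minimal sketch: the endpoint is the one from my setup, the enableV4 system property still has to be passed as a JVM option at launch, and credentials are assumed to come from the environment variables above.

from pyspark.sql import SparkSession

# Sketch: the same options as the spark-submit command above, set via the builder.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .config("spark.speculation", "false")
    .getOrCreate()
)
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are picked up from the environment,
# as described above.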
I had the same issue and resolved it by upgrading from aws-java-sdk:1.7.4 to aws-java-sdk:1.11.199 and from hadoop-aws:2.7.7 to hadoop-aws:3.0.0.
However, to avoid dependency mismatches when interacting with AWS, I had to rebuild Spark and provide it with my own version of Hadoop 3.0.0.
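For illustration only, the upgraded artifacts can be requested through spark.jars.packages instead of the --packages flag; this sketch assumes, as described above, a Spark build that is already compatible with Hadoop 3.0.0:

from pyspark.sql import SparkSession

# Sketch: pull the newer AWS SDK and hadoop-aws at startup.
# Assumes the Spark distribution itself was rebuilt against Hadoop 3.0.0,
# otherwise classpath mismatches are likely.
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "com.amazonaws:aws-java-sdk:1.11.199,org.apache.hadoop:hadoop-aws:3.0.0")
    .getOrCreate()
)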
I speculate that the root cause is the way the v4 signature algorithm takes in the current timestamp: all Spark executors use the same signature to authenticate their PUT requests, but if one slips outside the window of time allowed by the algorithm, that request, and all further requests, fail, causing Spark to roll back the changes and error out. This explains why calling .coalesce(1) or .repartition(1) always works, while the failure rate climbs in proportion to the number of partitions being written.
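As a point of reference, the single-partition workaround mentioned above would look roughly like this (reusing the dataframe and out_file_path from the question):

# Sketch: force a single output partition so only one task signs and uploads,
# at the cost of writing through a single executor.
dataframe.coalesce(1).write.csv(
    path=out_file_path,
    mode='overwrite',
    compression='gzip',
)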