HTTP Spec: PUT without data transfer, since hash of data is known to server

Question

Does the HTTP/WebDav spec allow this client-server dialog?

client: I want to PUT data to /user1/foo.mkv which has this hash sum: HASH
server: OK, PUT was successful, you don't need to send the data since I already know the data with this hash sum.

Note: This PUT is an initial upload. It is not an update.

If this is possible, much faster file syncing could be implemented.

Use case: The WebDAV server hosts a directory for each user. The favorite video foo.mkv gets uploaded by several users. In this example the favorite video is already stored at this location: /user2/myfoo.mkv. The second and following uploads don't need to send any data, since the server already knows the content. This would reduce a lot of network load.

Preconditions:

Client and server would need to agree on the hash algorithm beforehand.
The server needs to store the hash-value of already known files.

It would be very easy to implement this in a custom client and server. But that's not what I want.

My question: Is there an RFC or other standard that allows such a dialog?

If there is no standard yet, how should I proceed to make this dream come true?

Security consideration

With the above dialog, a client could gain access to the content behind any known hash. Example: an evil client knows that there is a file with the hash sum 1234567.... It could perform the two steps above and then use a GET to download the data.

A way around this is to extend the dialog:

client: I want to PUT data which has this hash sum: HASH
server: OK, PUT would be successful, but to be sure that you have the data, please send me the bytes N up to M. I need this to be sure you have the hash-sum and the data.
client: Bytes N up to M of the data are abcde...
server: OK, your bytes match mine. I trust you. Upload successful, you don't need to send the data any more.
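To make the extended dialog concrete, here is a minimal Python sketch of the byte-range check. The function names and the 16-byte span are made up for illustration; nothing like this is defined by HTTP or WebDAV, and a single short range only shows that the client holds that part of the data.

```python
import secrets


def make_challenge(size, span=16):
    """Server side: pick a random byte range [start, end) inside the file."""
    span = min(span, size)
    start = secrets.randbelow(size - span + 1)
    return start, start + span


def answer_challenge(data, start, end):
    """Client side: return the requested bytes from the local copy."""
    return data[start:end]


def verify_answer(stored, start, end, answer):
    """Server side: compare the answer with the stored copy in constant time."""
    return secrets.compare_digest(stored[start:end], answer)
```

The server would call make_challenge after the hash matched, send N and M to the client, and only accept the PUT if verify_answer returns True.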

How to get this done?

Since it seems that there is no spec yet, this part of the question remains:

How should I proceed to make this dream come true?

user193130 · Accepted Answer

From what you described, it seems like ETags should be used.

ETags were specifically designed to associate a tag (often an MD5 hash, but it can be anything) with a resource's content (and/or location) so you can later tell whether the resource has changed or not.

ETags can be used with PUT requests and are commonly combined with the If-Match header for optimistic concurrency control.

However, your use case is slightly different: you are trying to prevent a PUT to a resource that already has the same content, whereas the If-Match header only allows the PUT when the resource still has the content the client expects.

In your case, you can instead use the If-None-Match header:

The meaning of "If-None-Match: *" is that the method MUST NOT be performed if the representation selected by the origin server (or by a cache, possibly using the Vary mechanism, see section 14.44) exists, and SHOULD be performed if the representation does not exist. This feature is intended to be useful in preventing races between PUT operations.

WebDAV also supports ETags, though how they are used may depend on the implementation:

Note that the meaning of an ETag in a PUT response is not clearly defined either in this document or in RFC 2616 (i.e., whether the ETag means that the resource is octet-for-octet equivalent to the body of the PUT request, or whether the server could have made minor changes in the formatting or content of the document upon storage). This is an HTTP issue, not purely a WebDAV issue.

If you are implementing your own client, I would do something like this:

Client sends a HEAD request to the resource to check the ETag
- If the client sees that it matches what it already has, do not send anything else
- If it doesn't match, then send the PUT request with the If-None-Match header
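The two steps above can be sketched with Python's standard http.client. This is only an illustration: the function name sync_file is made up, and the assumption that the server uses the quoted hex SHA-256 of the content as its ETag is mine, since servers are free to pick any ETag scheme.

```python
import hashlib


def sync_file(conn, path, data):
    """Upload `data` to `path` unless the server's ETag already matches it.

    `conn` is an http.client.HTTPConnection (or anything with the same
    request()/getresponse() interface).
    Returns None if nothing was sent, otherwise the PUT status code.
    """
    # Assumption: the server's ETag is the quoted SHA-256 hex digest.
    local_etag = '"%s"' % hashlib.sha256(data).hexdigest()

    conn.request("HEAD", path)
    head = conn.getresponse()
    head.read()  # drain the (empty) body so the connection can be reused
    if head.status == 200 and head.getheader("ETag") == local_etag:
        return None  # server already has this content at this location

    # If-None-Match makes the PUT fail with 412 if the content matches after all.
    conn.request("PUT", path, body=data, headers={"If-None-Match": local_etag})
    put = conn.getresponse()
    put.read()
    return put.status
```

A caller would create the connection with http.client.HTTPConnection(host) and treat a 412 result as "already there".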

UPDATE

From your updated question, it now seems clear that when a PUT request is received, you want to check ALL resources on the server for the absence of the same content before the request is accepted. That means also checking resources which are in a different location than what was specified as the destination to the PUT request.

AFAIK, there's no existing spec to specifically handle this case. However, the ETag mechanism (and the HTTP protocol) was designed to be generic and flexible enough to handle many cases and this is one of them.

Of course, this just means you can't take advantage of standard HTTP server logic -- you'd need to custom code both the client and server side.

Assumptions

Before I get into possible implementations, there are some assumptions that need to be made.

As mentioned, you need to control both the server and the client
An algorithm needs to be agreed upon for generating the ETag based on the content. This can be MD5, SHA1, SHA2-256, SHA3, a concatenation of a combination of them, etc. I'll just mention the algorithm output as the ETag, but how you do it is up to you.
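As an illustration of the second assumption, here is a small Python sketch that derives an ETag from file content with hashlib. The function name compute_etag, the choice of SHA-256 as default, and the quoted-hex format are assumptions for the example; any scheme works as long as client and server agree on it.

```python
import hashlib


def compute_etag(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks so large videos never have to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    # ETag header values are quoted strings.
    return '"%s"' % digest.hexdigest()
```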

Possible implementations

These are ordered from simplest to most complex, in case the simple one doesn't work for you.

Possible implementation 1

This assumes your server implementation allows you to read the request headers and respond before the entire request is received.

Client computes the ETag for the file/resource to upload.
Client sends a PUT request to the server (location doesn't matter) with the header If-None-Match containing the ETag and continues sending the body normally.
Server checks to see if a resource with the ETag already exists.
Server:
- If ETag already exists, immediately return a 412 response code. Optionally terminate the connection to stop the client from continuing to send the resource (NOTE: the HTTP spec advises against this, though it does not explicitly prohibit it. See note 1 below). Yes, a little bandwidth is wasted, but you wouldn't have to wait for the entire request to finish.
- If ETag doesn't exist, wait for the request to finish normally.
Client:
- If the 412 response is received, interpret it to mean that the resource already exists and the request needs to be aborted, then stop sending data.

Possible implementation 2

This is slightly more complex, but better adheres to the HTTP spec. Also, this MIGHT work if your server architecture doesn't allow you to read the headers before the entire request is received.

Client computes the ETag for the file/resource to upload.
Client sends a PUT request to the server (location doesn't matter) with the header If-None-Match containing the ETag and an Expect: 100-continue header. The request body is NOT yet sent at this point.
Server checks to see if a resource with the ETag already exists.
Server:
- If ETag already exists, return a 412 response.
- If ETag doesn't exist, send a 100 response and wait for the request to finish normally.
Client:
- If the 412 response is received, interpret it to mean that the resource already exists and the request was therefore aborted.
- If the 100 response is received, continue sending the body normally
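The server side of this implementation could look roughly like the following sketch, built on Python's standard http.server, which calls handle_expect_100 after parsing the headers and before reading any body. KNOWN_ETAGS, content_is_known and DedupPutHandler are hypothetical names, the in-memory set stands in for a real index, and treating If-None-Match as "no resource anywhere has this hash" is the custom semantics described above, not standard HTTP behaviour.

```python
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory index of content hashes the server already stores.
KNOWN_ETAGS = {"9f86d081884c7d65"}  # shortened, illustrative value


def content_is_known(if_none_match, known=KNOWN_ETAGS):
    """Return True if the (quoted) ETag from If-None-Match is already stored."""
    if not if_none_match:
        return False
    return if_none_match.strip().strip('"') in known


class DedupPutHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # http.server only honours Expect on HTTP/1.1

    def handle_expect_100(self):
        # Runs once the headers are parsed; the client has not sent the body yet.
        if self.command == "PUT" and content_is_known(self.headers.get("If-None-Match")):
            self.send_response(HTTPStatus.PRECONDITION_FAILED)  # 412
            self.send_header("Content-Length", "0")
            self.send_header("Connection", "close")
            self.end_headers()
            return False  # stop here, do_PUT is never called
        return super().handle_expect_100()  # sends "100 Continue"

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # a real server would store the body and its hash
        self.send_response(HTTPStatus.CREATED)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example quiet
```

It can be started with HTTPServer(("127.0.0.1", 8000), DedupPutHandler).serve_forever(); the port is arbitrary.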

Possible implementation 3

This implementation probably requires the most work but should be broadly compatible with all major libraries / architectures. There's a small risk of another client uploading a file with the same contents in between the two requests though.

Client computes the ETag for the file/resource to upload.
Client sends a HEAD request (no body) to the server at /check-etag/<etag> where <etag> is the ETag. This checks whether the ETag already exists at the server.
Server code at /check-etag/* checks to see if a resource with that ETag already exists.
Server:
- If ETag already exists, return a 200 response.
- If ETag doesn't exist, send a 404 response.
Client:
- If the 200 response is received, interpret it to mean that the resource already exists and do not proceed with a PUT request.
- If the 404 response is received, follow up with a normal PUT request to the intended destination.
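Here is a rough Python sketch of both sides of this implementation, using only the standard library. The handler class, the KNOWN_ETAGS set and the upload_unless_known function are invented for the example; only the /check-etag/ path comes from the steps above.

```python
import http.client
from http.server import BaseHTTPRequestHandler, HTTPServer

KNOWN_ETAGS = {"abc123"}  # hypothetical server-side index of stored hashes
PREFIX = "/check-etag/"


class CheckEtagHandler(BaseHTTPRequestHandler):
    def do_HEAD(self):
        # 200 if some resource with this ETag exists, 404 otherwise.
        known = self.path.startswith(PREFIX) and self.path[len(PREFIX):] in KNOWN_ETAGS
        self.send_response(200 if known else 404)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # a real server would store the body and its hash
        self.send_response(201)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example quiet


def upload_unless_known(host, port, etag, path, data):
    """Client side: ask about the ETag first, PUT only if the server lacks it."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("HEAD", PREFIX + etag)
        check = conn.getresponse()
        check.read()
        if check.status == 200:
            return "already stored"
        conn.request("PUT", path, body=data)
        put = conn.getresponse()
        put.read()
        return "uploaded" if put.status in (200, 201, 204) else "failed"
    finally:
        conn.close()
```

The gap between the HEAD and the PUT is where the race mentioned above can occur.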

Considerations

Although the implementation is up to you, here are some points to consider:

When a resource is added or updated, the ETag and the location should be stored in a database for quick retrieval. It is needlessly inefficient for a server to recompute the hash for every single resource whenever a resource is being uploaded. There should also be an index on the ETag and location fields for quick retrieval.
If two clients upload a resource with the same ETag at the same time, you might want to abort the 2nd one as soon as the 1st one finishes.
Using hashes for the ETag means that there's a possibility of collision (where two resources would have the same hash), though in practice the probability is extremely slim if a good hash is used. Note that MD5 is known to be weak against intentional collision attacks. If you are paranoid, you can concatenate multiple hashes to make a collision even less likely.
In regards to your "security consideration", I still don't see how knowing a hash would lead to retrieval of a resource. The server will only and SHOULD ONLY tell you whether a specific ETag exists or not. Without divulging the location, it's not possible for the client to retrieve the file. And even if the client knows the location, the server SHOULD implement other security controls such as authentication and authorization to restrict access. Using the resource location solely as a way of restricting access is just security by obscurity, especially since, from what you mentioned, the paths seem to follow a pattern by username.
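For the first consideration (storing the ETag and location with an index), a minimal sketch using Python's built-in sqlite3 could look like this. The table name, column names and helper functions are all assumptions for illustration; any database with an index on the ETag column would do.

```python
import sqlite3


def open_index(db_path=":memory:"):
    """Create the (hypothetical) resources table with an index on the ETag."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS resources ("
        " location TEXT PRIMARY KEY,"
        " etag TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_resources_etag ON resources(etag)")
    return conn


def record_upload(conn, location, etag):
    """Store or update the ETag when a resource is added or changed."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO resources(location, etag) VALUES (?, ?)",
            (location, etag),
        )


def etag_exists(conn, etag):
    """Indexed lookup, so no file has to be re-hashed on upload."""
    row = conn.execute(
        "SELECT 1 FROM resources WHERE etag = ? LIMIT 1", (etag,)
    ).fetchone()
    return row is not None
```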

Notes

Note 1: RFC 2616 indicates that a server SHOULD NOT close the connection before it has read the entire request:

If an origin server receives a request that does not include an Expect request-header field with the "100-continue" expectation, the request includes a request body, and the server responds with a final status code before reading the entire request body from the transport connection, then the server SHOULD NOT close the transport connection until it has read the entire request, or until the client closes the connection. Otherwise, the client might not reliably receive the response message.

Also, DO NOT close the connection from the server side without sending any status codes, as the client will most likely retry the request:

If an HTTP/1.1 client sends a request which includes a request body, but which does not include an Expect request-header field with the "100-continue" expectation, and if the client is not directly connected to an HTTP/1.1 origin server, and if the client sees the connection close before receiving any status from the server, the client SHOULD retry the request.

Tags: put, hash, rfc, webdav, deduplication

guettli
