
What are some strategies to invalidate the Dockerfile instruction cache while downloading resources?

Some of our Docker images need to download large binaries, either from the Internet or from a Nexus server that distributes our Java, Node.js, and mobile (Android and iOS) apps. The download is done with either an ADD or a RUN instruction, for instance:

RUN curl -o docker https://get.docker.com/builds/Linux/x86_64/docker-latest

Considering that "docker build" looks at each instruction and decides whether to use the cache depending on the mtime of the file (https://stackoverflow.com/a/26612694/433814), what approach takes advantage of the caching mechanism when building these images and avoids re-downloading an entire binary?

A related problem: if the remote resource changes, Docker keeps using the cached layer and does not download the latest version.

Marcello de Sales asked Feb 12 '23 11:02

2 Answers

Solution

Docker does NOT apply any caching mechanism of its own to the download performed by "RUN curl" or ADD: whenever the step runs, the download is repeated in full. However, Docker does invalidate the cache when, among other things, the mtime of an added file has changed (https://stackoverflow.com/a/26612694/433814, https://github.com/docker/docker/blob/master/pkg/tarsum/versioning.go#L84).

Here's a strategy I've been working on to solve this problem when building images whose dependencies come from file storage or a repository such as Nexus or Amazon S3: retrieve the ETag of the resource, cache it, and modify the mtime of a cache-flag file when the ETag changes (https://gist.github.com/marcellodesales/721694c905dc1a2524bc#file-s3update-py-L18). It follows the approach used for Python (https://stackoverflow.com/a/25307587) and Node.js (http://bitjudo.com/blog/2014/03/13/building-efficient-dockerfiles-node-dot-js/) projects.

Here's what we can do:

  1. Get the ETag of the resource and save it outside of Dockerfile
  2. Use an ADD instruction to add the cacheable file prior to download
    • Docker will check the mtime metadata of the file to decide whether to invalidate the cache.
  3. Use a RUN instruction as usual to download the content
    • If the previous instruction was invalidated, Docker will re-download the file. If not, the cache will be used.
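
The ordering in steps 2 and 3 is what matters: the cache-flag file has to be added before the download. A minimal sketch follows, written as a shell function that emits such a Dockerfile. The base image, the file names docker.etag and /tmp/docker.etag, and the download URL are taken from the build output shown later in this answer; the function name write_dockerfile is hypothetical and only there for illustration.

```shell
#!/bin/sh
# Hypothetical helper: writes a Dockerfile that illustrates the
# instruction order (cache flag first, download second).
write_dockerfile() {
  dir="${1:-.}"
  cat > "$dir/Dockerfile" <<'EOF'
FROM core.registry.docker.corp.intuit.net/runtime/java:7
WORKDIR /opt

# Cache flag: this layer is rebuilt only when docker.etag changes.
ADD docker.etag /tmp/docker.etag

# Instructions after a rebuilt layer are re-run, so the download
# below is repeated only when the ADD above was invalidated.
RUN curl -o docker https://get.docker.com/builds/Linux/x86_64/docker-latest
EOF
}
```

Calling write_dockerfile . creates the Dockerfile in the current directory. Any other instruction that depends on the downloaded binary should come after the RUN line so that it is rebuilt together with it.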

Here's a setup to demo this strategy:

Example

  1. Create a web server that handles HEAD requests and returns an ETag header, as most servers do.

    • This simulates the Nexus or S3 storage of files.
  2. Build an image and verify that the dependent layer will download the resource for the first time

    • Caching the current value of the ETag
  3. Rebuild the image and verify that the dependent layer will use the Cached value.

  4. Change the ETag value returned by the web server handler to simulate a change.

    • In addition, persist the change if and only if the file has changed. In this case, it has.
    • Rebuild the image and verify that the dependent layer will be invalidated, triggering a download.
  5. Rebuild the image again and verify that the cache was used.

1. Node.js server

Suppose you have the following Node.js server serving files. Let's implement a HEAD operation and return a value.

// You'll see the client-side's output on the console when you run it.

var restify = require('restify');

// Server
var server = restify.createServer({
  name: 'myapp',
  version: '1.0.0'
});

server.head("/", function (req, res, next) {
  res.writeHead(200, {'Content-Type': 'application/json; charset=utf-8',
        'ETag': '"{SHA1{465fb0d9b9f143ad691c7c3bcf3801b47284f8555}}"'});
  res.end();
  return next();
});

server.get("/", function (req, res, next) {
  res.writeHead(200, {'Content-Type': 'application/json; charset=utf-8',
        'ETag': '"{SHA1{465fb0d9b9f143ad691c7c3bcf3801b47284f8555}}"'});
  res.write("The file to be downloaded");
  res.end();
  return next();
});

// Listen on the same port that the output and the build script use.
server.listen(8181, function () {
  console.log('%s listening at %s', server.name, server.url);
});

// Client
var client = restify.createJsonClient({
  url: 'http://localhost:8181',
  version: '~1.0'
});

client.head('/', function (err, req, res, obj) {
  if(err) console.log("An error occurred:", err);
  else console.log('HEAD    /   returned headers: %j', res.headers);
});

Executing this will give you:

mdesales@ubuntu [11/27/201411:10:49] ~/dev/icode/fuego/interview (feature/supportLogAuditor *) $ node testserver.js 
myapp listening at http://0.0.0.0:8181
HEAD    /   returned headers: {"content-type":"application/json; charset=utf-8",
            "etag":"\"{SHA1{465fb0d9b9f143ad691c7c3bcf3801b47284f8555}}\"",
            "date":"Thu, 27 Nov 2014 19:10:50 GMT","connection":"keep-alive"}

2. Build an image based on ETag value

Consider the following build script that caches the ETag Header in a file.

#!/bin/sh

# Delete the existing first, and get the headers of the server to a file "headers.txt"
# Grep the ETag to a "new-docker.etag" file
# If the file exists, verify if the ETag has changed and/or move/modify the mtime of the file
# Proceed with the "docker build" as usual
rm -f new-docker.etag
curl -I -D headers.txt http://192.168.248.133:8181/ && \
  grep -o 'ETag[^*]*' headers.txt > new-docker.etag && \
  rm -f headers.txt

if [ ! -f docker.etag ]; then
  cp new-docker.etag docker.etag
else
  # docker.etag holds the previously cached value; new-docker.etag was just fetched
  old=$(cat docker.etag)
  new=$(cat new-docker.etag)
  echo "Old ETag = $old"
  echo "New ETag = $new"
  if [ "$old" != "$new" ]; then
    mv new-docker.etag docker.etag
    touch -t 200001010000.00 docker.etag
  fi
fi

docker build -t platform.registry.docker.corp.intuit.net/container/mule:3.4.1 .
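
The comparison logic can also be isolated into a function, which makes it easier to try out without a server. The sketch below is a variation on the script above, not a copy of it: it reads response headers on standard input, matches the ETag header name case-insensitively (HTTP header names are case-insensitive), strips the carriage return that ends each header line, and rewrites the flag file only when the value differs. The function name update_etag_flag and its exit-status convention are assumptions made for illustration.

```shell
#!/bin/sh
# update_etag_flag FLAG_FILE   (hypothetical helper)
# Reads HTTP response headers on stdin.
# Returns 0 if FLAG_FILE was written because the ETag is new or changed,
#         1 if FLAG_FILE already held the same ETag and was left untouched,
#         2 if the headers contained no ETag at all.
update_etag_flag() {
  flag_file="$1"
  # Keep only the first ETag header line, without the trailing CR.
  new=$(tr -d '\r' | grep -i '^etag:' | head -n 1)
  [ -n "$new" ] || return 2
  if [ -f "$flag_file" ] && [ "$(cat "$flag_file")" = "$new" ]; then
    return 1   # unchanged: leave the file alone so the ADD layer stays cached
  fi
  printf '%s\n' "$new" > "$flag_file"
  return 0
}
```

It could be used as curl -sI http://192.168.248.133:8181/ | update_etag_flag docker.etag, followed by the usual "docker build". Leaving the file untouched in the unchanged case keeps both its content and its mtime stable between builds.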

3. Rebuilding and using cache

Running the build script produces the following output when the cached ETag is still current.

mdesales@ubuntu [11/27/201411:54:08] ~/dev/github-intuit/docker-images/platform/mule-3.4 (master) $ ./build.sh 
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
ETag: "{SHA1{465fb0d9b9f143ad691c7c3bcf3801b47284f8555}}"
Date: Thu, 27 Nov 2014 19:54:16 GMT
Connection: keep-alive

Old ETag = ETag: "{SHA1{465fb0d9b9f143ad691c7c3bcf3801b47284f8555}}"
New ETag = ETag: "{SHA1{465fb0d9b9f143ad691c7c3bcf3801b47284f8555}}"
Sending build context to Docker daemon 51.71 kB
Sending build context to Docker daemon 
Step 0 : FROM core.registry.docker.corp.intuit.net/runtime/java:7
 ---> 3eb1591273f5
Step 1 : MAINTAINER [email protected]
 ---> Using cache
 ---> 9bb8fff83697
Step 2 : WORKDIR /opt
 ---> Using cache
 ---> 3e3c96d96fc9
Step 3 : ADD docker.etag /tmp/docker.etag
 ---> Using cache
 ---> db3f82289475
Step 4 : RUN cat /tmp/docker.etag
 ---> Using cache
 ---> 0d4147a5f5ee
Step 5 : RUN curl -o docker https://get.docker.com/builds/Linux/x86_64/docker-latest
 ---> Using cache
 ---> 6bd6e75be322
Successfully built 6bd6e75be322

4. Simulating the ETag change

Changing the value of the ETag on the server and restarting it simulates an update: the build script updates the cache-flag file, which invalidates the cache. For instance, the ETag was changed to "465fb0d9b9f143ad691c7c3bcf3801b47284f8333". Rebuilding triggers a new download because the ETag file was updated, which Docker detects at the "ADD" instruction. Here, step #5 runs again. Note that in this captured run the "Old ETag" and "New ETag" labels are reversed: the value ending in 8333 is the freshly fetched one.

mdesales@ubuntu [11/27/201411:54:16] ~/dev/github-intuit/docker-images/platform/mule-3.4 (master) $ ./build.sh 
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
ETag: "{SHA1{465fb0d9b9f143ad691c7c3bcf3801b47284f8333}}"
Date: Thu, 27 Nov 2014 19:54:45 GMT
Connection: keep-alive

Old ETag = ETag: "{SHA1{465fb0d9b9f143ad691c7c3bcf3801b47284f8333}}"
New ETag = ETag: "{SHA1{465fb0d9b9f143ad691c7c3bcf3801b47284f8555}}"
Sending build context to Docker daemon 50.69 kB
Sending build context to Docker daemon 
Step 0 : FROM core.registry.docker.corp.intuit.net/runtime/java:7
 ---> 3eb1591273f5
Step 1 : MAINTAINER [email protected]
 ---> Using cache
 ---> 9bb8fff83697
Step 2 : WORKDIR /opt
 ---> Using cache
 ---> 3e3c96d96fc9
Step 3 : ADD docker.etag /tmp/docker.etag
 ---> ac3b200c8cdc
Removing intermediate container 4cf0040dbc43
Step 4 : RUN cat /tmp/docker.etag
 ---> Running in 4dd38d30549a
ETag: "{SHA1{465fb0d9b9f143ad691c7c3bcf3801b47284f8333}}"
 ---> 4fafbeac2180
Removing intermediate container 4dd38d30549a
Step 5 : RUN curl -o docker https://get.docker.com/builds/Linux/x86_64/docker-latest
 ---> Running in de920c7a2e28
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.5M  100 13.5M    0     0  1361k      0  0:00:10  0:00:10 --:--:-- 2283k
 ---> 95aff324da85
Removing intermediate container de920c7a2e28
Successfully built 95aff324da85

5. Reusing the Cache again

Since the ETag hasn't changed, the cache-flag file stays the same and Docker does a very fast build using the cache.

mdesales@ubuntu [11/27/201411:54:56] ~/dev/github-intuit/docker-images/platform/mule-3.4 (master) $ ./build.sh 
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
ETag: "{SHA1{465fb0d9b9f143ad691c7c3bcf3801b47284f8333}}"
Date: Thu, 27 Nov 2014 19:54:58 GMT
Connection: keep-alive

Old ETag = ETag: "{SHA1{465fb0d9b9f143ad691c7c3bcf3801b47284f8333}}"
New ETag = ETag: "{SHA1{465fb0d9b9f143ad691c7c3bcf3801b47284f8333}}"
Sending build context to Docker daemon 51.71 kB
Sending build context to Docker daemon 
Step 0 : FROM core.registry.docker.corp.intuit.net/runtime/java:7
 ---> 3eb1591273f5
Step 1 : MAINTAINER [email protected]
 ---> Using cache
 ---> 9bb8fff83697
Step 2 : WORKDIR /opt
 ---> Using cache
 ---> 3e3c96d96fc9
Step 3 : ADD docker.etag /tmp/docker.etag
 ---> Using cache
 ---> ac3b200c8cdc
Step 4 : RUN cat /tmp/docker.etag
 ---> Using cache
 ---> 4fafbeac2180
Step 5 : RUN curl -o docker https://get.docker.com/builds/Linux/x86_64/docker-latest
 ---> Using cache
 ---> 95aff324da85
Successfully built 95aff324da85

This strategy has been used to build Node.js, Java and other App servers or pre-built dependencies.

Marcello de Sales answered Feb 15 '23 09:02

I use a similar but simpler approach:

Let's say I want to add a binary named mybin that can be downloaded from: http://www.example.com/pub/mybin

I do the following in my Jenkins job:

wget -N http://www.example.com/pub/mybin

And in my Dockerfile I have:

COPY mybin /usr/local/bin/

The option -N downloads the binary only when it has changed on the server. The second time I run the wget job I get:

...
Length: 12262118 (12M) [application/octet-stream]
Server file no newer than local file ‘mybin’ -- not retrieving.

And docker build uses the cache.

If the binary changes on the server (that is, when its timestamp changes), wget downloads the binary again, which invalidates the cache for the COPY command.

reen answered Feb 15 '23 09:02