
How do I add big HTTP files in a Dockerfile and exclude them from image layers?

Our Nexus server provides build artifacts for our Java project including its installer. That installer is really big (>1GB). I would like to retrieve and use it in a Dockerfile.

What I did so far is the following:

FROM debian:jessie
...
RUN apt-get install -y curl libxml-xpath-perl
ENV PROJECT_VERSION x.y.z-SNAPSHOT
...
RUN VERSION=`curl --silent "http://nexus:8081/service/local/artifact/maven/resolve?r=public&g=my.group.id&a=installer&v=${PROJECT_VERSION}&e=sh&c=linux64" | xpath -q -s '' -e '//data/version/text()'` \
    && echo Version:\'${VERSION}\' \
    && curl --silent http://nexus/content/groups/public/my/group/id/installer/${PROJECT_VERSION}/installer-${VERSION}-linux64.sh \
        --create-dirs \
        --output ${INSTALL_DIR}/installer.sh \
    && sh ${INSTALL_DIR}/installer.sh <someArgs> \
    && rm ${INSTALL_DIR}/installer.sh
...

With that approach I am able to:

  • Query Nexus for the latest SNAPSHOT version matching the provided ${PROJECT_VERSION}, which is logged during docker build
  • Use that version to download the corresponding installer binary
  • Execute the installer binary
  • Delete the installer binary immediately after execution to not have it stored within the created Docker image layer

What is missing:

  • Whenever a new installer gets deployed to Nexus I have to build the Docker image with docker build --no-cache. Otherwise Docker is not able to invalidate its cache and re-run the installation step for a newer installer that was meanwhile deployed to Nexus.

So I tried a different approach using the ADD instruction, since it has caching capabilities according to the documentation. But that does not work, because I need to pass ADD a parameter that is set by a previous step querying Nexus for the correct SNAPSHOT version:

FROM debian:jessie
...
RUN apt-get install -y curl libxml-xpath-perl
ENV PROJECT_VERSION x.y.z-SNAPSHOT
...
ADD http://nexus:8081/service/local/artifact/maven/resolve?r=public&g=my.group.id&a=installer&v=${PROJECT_VERSION}&e=sh&c=linux64 ${INSTALL_DIR}/version.xml
RUN cat ${INSTALL_DIR}/version.xml | xpath -q -s '' -e '//data/version/text()' > ${INSTALL_DIR}/version.txt

# FIXME: Somehow do a `cat ${INSTALL_DIR}/version.txt` to set the ENV ${VERSION} variable ?!

ADD http://nexus/content/groups/public/my/group/id/installer/${PROJECT_VERSION}/installer-${VERSION}-linux64.sh ${INSTALL_DIR}/installer.sh
RUN ${INSTALL_DIR}/installer.sh <someArgs> && rm ${INSTALL_DIR}/installer.sh
...

That approach does not work because:

  • It is not possible to set the ${VERSION} environment variable within the Dockerfile to the version stored within the version.txt file.
  • It is not possible to prevent having the installer stored within an image layer.

But at least this would use proper caching to re-use existing image layers for old installer versions and create new ones whenever a new installer version on Nexus gets deployed.
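For reference, the value in version.txt can be read back into a shell variable, but only inside a single RUN step, because the variable does not survive into later Dockerfile instructions. A minimal sketch of that shell logic follows; the temporary directory and the version strings are hypothetical placeholders, not real values from Nexus:

```shell
# Hypothetical stand-ins for ${INSTALL_DIR} and the version.txt file
# written by the earlier RUN step.
INSTALL_DIR=$(mktemp -d)
printf '1.2.3-20221020.051000-7\n' > "${INSTALL_DIR}/version.txt"

# Hypothetical placeholder for the ENV value from the Dockerfile.
PROJECT_VERSION=1.2.3-SNAPSHOT

# Within one shell invocation (one RUN), read the resolved version...
VERSION=$(cat "${INSTALL_DIR}/version.txt")

# ...and use it to build the installer URL in that same invocation.
INSTALLER_URL="http://nexus/content/groups/public/my/group/id/installer/${PROJECT_VERSION}/installer-${VERSION}-linux64.sh"
echo "${INSTALLER_URL}"
```

This works in a RUN, but not for ADD, which is evaluated by Docker itself and never sees shell variables set in a previous RUN.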

So the question is: How do I enable proper caching, cache invalidation and exclusion of the big installer file from the Docker image layers at the same time?

EDIT: I found a way to get the caching of image layers working properly by using another Nexus API:

FROM debian:jessie
...
ENV PROJECT_VERSION x.y.z-SNAPSHOT
...
ADD http://nexus:8081/service/local/artifact/maven/content?r=public&g=my.group.id&a=installer&v=${PROJECT_VERSION}&e=sh&c=linux64 ${INSTALL_DIR}/installer.sh
RUN sh ${INSTALL_DIR}/installer.sh <someArgs> \
    && rm ${INSTALL_DIR}/installer.sh
...

But the problem of having a very big installer file included in the image layers remains, since that code snippet uses the ADD mechanism.

Any ideas about how to benefit from the caching and correct cache invalidation provided by the ADD instruction, while not including the added file in the image's history?

Henrik Sachse asked Oct 20 '22 05:10


1 Answer

I accepted Mykola Gurov's answer because in one of his comments he pointed out an idea that helped me to solve this issue.

Here is what I did to have proper caching and cache invalidation as well as having the big installer file excluded:

FROM debian:jessie
...
RUN apt-get install -y curl
ENV PROJECT_VERSION x.y.z-SNAPSHOT
...
ADD http://nexus:8081/service/local/artifact/maven/resolve?r=public&g=my.group.id&a=installer&v=${PROJECT_VERSION}&e=sh&c=linux64 ${INSTALL_DIR}/installer.xml
RUN curl --silent "http://nexus:8081/service/local/artifact/maven/content?r=public&g=my.group.id&a=installer&v=${PROJECT_VERSION}&e=sh&c=linux64" \
        --output ${INSTALL_DIR}/installer.sh \
    && sh ${INSTALL_DIR}/installer.sh <someArgs> \
    && rm ${INSTALL_DIR}/installer.sh
...

The first ADD downloads the Maven metadata for the requested artifact. That XML file is quite small. ADD caches properly, so whenever the metadata on Nexus has been modified, the cache gets invalidated.

In that case, the ADD and all following instructions are executed without re-using any cached layers.

If the metadata on the server has not changed since the last download, the ADD and the following RUN instruction, which executes curl, are taken from the image layer cache. And within that single RUN it is possible to download, execute and remove the big temporary installer file in one step, without having it stored in any image layer.
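The installer stays out of the image only because it is executed and deleted inside the same RUN, so the file no longer exists when the layer is committed. As a minimal sketch of that execute-and-remove part, the shell function below runs an installer script, deletes it afterwards, and passes the installer's exit status on so that a failed installation still fails the build. The name run_and_remove is hypothetical and not part of the original Dockerfile:

```shell
# run_and_remove (hypothetical helper): run an installer script with the
# given arguments, delete the script afterwards, and return the
# installer's own exit status.
run_and_remove() {
    installer=$1
    shift
    sh "$installer" "$@"
    status=$?
    # Remove the installer whether or not it succeeded.
    rm -f "$installer"
    return "$status"
}
```

In the Dockerfile above, such a function would be called in the same RUN as the curl download, for example as run_and_remove ${INSTALL_DIR}/installer.sh <someArgs>.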

Henrik Sachse answered Oct 22 '22 01:10