_episodes/08-supplement-docker.md

---
title: "Supplement: Creating Docker Images for Workflows"
teaching: 10
exercises: 1
questions:
- "How do I create Docker images from scratch?"
- "What are some best practices for Docker images?"
objectives:
- "Understand how to get started writing Dockerfiles"
keypoints:
- "Docker images contain the initial state of the filesystem for a container"
- "Docker images are made up of layers"
- "Dockerfiles consist of a series of commands to install software into the container"
---

Common Workflow Language supports running tasks inside software
containers.  Software container systems (such as Docker) create an
execution environment that is isolated from the host system, so that
software installed on the host system does not conflict with the
software installed inside the container.

Programs running inside a software container get a different (and
generally restricted) view of the system than processes running
outside the container.  One of the most important and useful features
is that the containerized program has a different view of the file
system.  A program running inside a container, searching for
libraries, modules, configuration files, data files, etc., only sees
the files defined inside the container.

This means that, usually, a given file _path_ refers to _different
actual files_ depending on whether it is resolved inside or outside
the container.  It is also possible to have a file from the host
system appear at some location inside the container, meaning that the
_same file_ appears at _different paths_ depending on whether it is
viewed from inside or outside the container.

The complexity of translating between the container and its host
environment is handled by the Common Workflow Language runner.  As a
workflow author, you only need to worry about the environment _inside_
the container.

# What are Docker images?

The Docker image describes the starting conditions for the container.
Most importantly, this includes the starting layout and contents of the
container's file system.  This file system is typically a lightweight
POSIX environment, providing a standard set of POSIX utilities like
`sh`, `ls`, `cat`, etc., organized into standard POSIX directories
like `/bin` and `/lib`.

The image is made up of multiple "layers".  Each layer modifies the
layer below it by adding, removing or modifying files to produce a new
layer.  This allows lower layers to be re-used by every image that is
built on top of them.

# Writing a Dockerfile

In this example, we will build a Docker image containing the
Burrows-Wheeler Aligner (BWA) by Heng Li.  This is just for
demonstration; in practice you should prefer to use existing
containers from [BioContainers](https://biocontainers.pro/), which
includes `bwa`.

Each instruction in the Dockerfile consists of a COMMAND in all caps,
followed by the parameters of that command.

The first line of the file will specify the base image that we are
going to build from.  As mentioned, images are divided up into
"layers", so this tells Docker what to use for the first layer.

```
FROM debian:10-slim
```

This starts from the lightweight ("slim") Debian 10 Docker image.

Docker images have a special naming scheme.

A bare name like "debian" or "ubuntu" means it is an official Docker
image.  It has an implied prefix of "library", so you may see the
image referred to as "library/debian".  Official images are published
on [Docker Hub](https://hub.docker.com/search?type=image&image_filter=official).

A name with two parts separated by a slash is published on Docker Hub
by someone else.  For example, `amazon/aws-cli` is published by
Amazon.  These can also be found on [Docker Hub](https://hub.docker.com/search?type=image).

A name with three parts separated by slashes means it is published on
a different container registry.  For example,
`quay.io/biocontainers/subread` is published by BioContainers on the
`quay.io` registry.

Following the image name, separated by a colon, is the "tag".  This is
typically the version of the image.  If not provided, the default tag
is "latest".  In this example, the tag is "10-slim", indicating the
slim variant of Debian release 10.

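As a rough illustration of the naming scheme, the three `FROM` lines below should all resolve to the same official image.  The stage names after `AS` are made up for this sketch, and `docker.io` is the registry hostname of Docker Hub, which the text above does not spell out.  A real Dockerfile would use only one of these spellings.

```dockerfile
# Short name: the registry and the "library" prefix are implied.
FROM debian:10-slim AS short

# The implied "library" prefix written out.
FROM library/debian:10-slim AS namespaced

# Fully qualified: registry host, namespace, image name, and tag.
FROM docker.io/library/debian:10-slim AS qualified
```
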
The Dockerfile should also include a MAINTAINER (this is purely
metadata; it is stored in the image but not used for execution).
Note that current Docker releases deprecate `MAINTAINER` in favor of
a `LABEL` instruction, although `MAINTAINER` still works.

```
MAINTAINER Peter Amstutz <peter.amstutz@curii.com>
```

Next is the default user inside the image.  By choosing root,
we can change anything inside the image (but not outside).

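The text above describes choosing the user but does not show the instruction itself.  Presumably it is a `USER` line like the one below; this is an assumption rather than something taken from the original Dockerfile.  The official `debian` images already run as root by default, so the line mainly documents intent.

```dockerfile
# Run the remaining build steps (and, by default, the container) as root.
USER root
```
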
The body of the Dockerfile is a series of `RUN` commands.

Each command is run with `/bin/sh` inside the Docker container.

Each `RUN` command creates a new layer.

The `RUN` command can span multiple lines by using a trailing
backslash.

For the first command, we use `apt-get` to install some packages that
will be needed to compile `bwa`.  The `build-essential` package
installs `gcc`, `make`, etc., and `zlib1g-dev` provides the zlib
headers that `bwa` is compiled against.

```
RUN apt-get update -qy && \
        apt-get install -qy build-essential wget unzip zlib1g-dev
```

Now we do everything else: download the source code of bwa, unzip it,
make it, copy the resulting binary to `/usr/bin`, and clean up.

```
# Install BWA 0.7.17
RUN wget https://github.com/lh3/bwa/archive/v0.7.17.zip && \
        unzip v0.7.17.zip && \
        cd bwa-0.7.17 && \
        make && \
        cp bwa /usr/bin && \
        cd .. && \
        rm -rf bwa-0.7.17 v0.7.17.zip
```

Because each `RUN` command creates a new layer, having the build and
clean up in separate `RUN` commands would mean creating a layer that
includes the intermediate object files from the build.  These would
then be carried around as part of the container image forever, despite
being useless.  By doing the entire build and clean up in one `RUN`
command, only the final state of the file system, with the binary
copied to `/usr/bin`, is committed to a layer.

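For contrast, here is a sketch of the pattern this paragraph warns against.  It is a fragment that assumes the same base image and packages as the build step above, and it is shown only to make the layering problem concrete.

```dockerfile
# Anti-pattern, for illustration only.
# This RUN commits a layer that contains the zip file, the source
# tree, and every intermediate object file from the build.
RUN wget https://github.com/lh3/bwa/archive/v0.7.17.zip && \
        unzip v0.7.17.zip && \
        cd bwa-0.7.17 && \
        make && \
        cp bwa /usr/bin

# This RUN adds a new layer in which those files are gone, but the
# layer above still stores them, so the image does not get smaller.
RUN rm -rf bwa-0.7.17 v0.7.17.zip
```
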
To build a Docker image from a Dockerfile, use `docker build`.

Use the `-t` option to specify the name of the image.  Use `-f` if the
file isn't named exactly `Dockerfile`.  The last argument is the
directory where it will find the `Dockerfile` and any files that are
referenced by `COPY` (described below).

```
docker build -t training/bwa -f Dockerfile.single-stage .
```

> ## Exercise
>
> Create a `Dockerfile` based on this lesson and build it for yourself.
>
{: .challenge}

# Adding files to the image during the build

Using the `COPY` command, you can copy files from the source directory
(this is the directory where your Dockerfile is located) into the image
during the build.  For example, suppose you have a `requirements.txt`
next to the Dockerfile, and the base image already provides `pip`:

```
COPY requirements.txt /tmp/
RUN pip install --requirement /tmp/requirements.txt
```

# Multi-stage builds

As noted, it is good practice to avoid leaving files in the Docker
image that were required to build the program, but not to run it, as
those files are simply useless bloat.  Docker offers a more
sophisticated way to create clean builds by separating the build steps
from the creation of the final container image.  These are called
"multi-stage" builds.

A multi-stage build has multiple `FROM` lines.  Each `FROM` line starts
a separate container build.  The last `FROM` in the file describes the
final container image that will be created.

The key benefit is that the different stages are independent, but you
can copy files from one stage to another.

Here is an example of the bwa build as a multi-stage build.  It is a
little bit more complicated, but the outcome is a smaller image,
because the "build-essential" tools are not included in the final
image.

```
# Build the base image.  This is the starting point for both the build
# stage and the final stage.
# The "AS base" names the image within the Dockerfile.
FROM debian:10-slim AS base
MAINTAINER Peter Amstutz <peter.amstutz@curii.com>

# Install libz, because the bwa binary will depend on it.
# As it happens, this is already included in the base Debian distribution
# because lots of things use libz specifically, but it is good practice
# to explicitly declare that we need it.
RUN apt-get update -qy && \
    apt-get install -qy zlib1g


# This is the builder image.  It has the commands to install the
# prerequisites and then build the bwa binary.
FROM base AS builder
RUN apt-get install -qy build-essential wget unzip zlib1g-dev

# Install BWA 0.7.17
RUN wget https://github.com/lh3/bwa/archive/v0.7.17.zip
RUN unzip v0.7.17.zip
RUN cd bwa-0.7.17 && \
    make && \
    cp bwa /usr/bin


# Build the final image.  It starts from base (where we ensured that
# libz was installed) and then copies the bwa binary from the builder
# image.  The result is that the final image only has the compiled bwa
# binary, but not the clutter from build-essential or from compiling
# the program.
FROM base AS final

# This is the key command: we use the COPY command described earlier,
# but instead of copying from the host, the --from option copies from
# the builder image.
COPY --from=builder /usr/bin/bwa /usr/bin/bwa
```

# Best practices for Docker images

Docker has published guidelines on building efficient images:

https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

Some additional considerations when building images for use with Workflows:

## Store Dockerfiles in git, alongside workflow definitions

Dockerfiles are scripts and should be managed with version control
just like other kinds of code.

## Be specific about software versions

Instead of blindly installing the latest version of a package, or
checking out the `master` branch of a git repository and building from
that, be specific in your Dockerfile about what version of the
software you are installing.  This will greatly aid the
reproducibility of your Docker image builds.

Similarly, be as specific as possible about the version of the base
image you want to use in your `FROM` command.  If you don't specify a
tag, the default tag is called "latest", so the image you get can
change from one build to the next.

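One way to keep the version explicit is sketched below.  It reuses the bwa build from earlier in this lesson; the `ARG` name `BWA_VERSION` is chosen here for illustration and is not part of the original Dockerfile.

```dockerfile
# Pin the base image to a specific release and variant, not "latest".
FROM debian:10-slim

RUN apt-get update -qy && \
    apt-get install -qy build-essential wget unzip zlib1g-dev

# BWA_VERSION is a name made up for this sketch.  Keeping the version
# in one place makes it obvious which release gets built.
ARG BWA_VERSION=0.7.17

# Download a tagged release archive rather than the moving master branch.
RUN wget https://github.com/lh3/bwa/archive/v${BWA_VERSION}.zip && \
    unzip v${BWA_VERSION}.zip && \
    cd bwa-${BWA_VERSION} && \
    make && \
    cp bwa /usr/bin && \
    cd .. && \
    rm -rf bwa-${BWA_VERSION} v${BWA_VERSION}.zip
```
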
## Tag your builds

Use meaningful tags on your own Docker image so you can tell versions
of your Docker image apart as it is updated over time.  These can
reflect the version of the underlying software, or a version you
assign to the Dockerfile itself.  These can be manually assigned
version numbers (e.g. 1.0, 1.1, 1.2, 2.0), timestamps (e.g. YYYYMMDD
like 20220126) or the hash of a git commit.

## Avoid putting reference data into Docker images

Bioinformatics tools often require large reference data sets to run.
These should be supplied externally (as workflow inputs) rather than
added to the container image.  This makes it easy to update reference
data instead of having to build a new Docker image every time, which
is much more time consuming.

## Small scripts can be inputs, too

If you have a small script, e.g. a self-contained single-file Python
script which imports Python modules installed inside the container,
you can supply the script as a workflow input.  This makes it easy to
update the script instead of having to build a new Docker image
every time, which is much more time consuming.

## Don't use ENTRYPOINT

The `ENTRYPOINT` Dockerfile command modifies the command line that is
executed inside the container.  This can produce confusion when the
command line that is supplied to the container and the command that
actually runs are different.

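A small sketch of how this confusion can arise follows.  The `echo` entrypoint is a deliberately silly, made-up example and is not taken from any real image.

```dockerfile
FROM debian:10-slim

# With this line, whatever command line is given to the container is
# appended to "echo wrapped:" as arguments instead of being executed.
ENTRYPOINT ["echo", "wrapped:"]

# For example, starting the container with the command "ls /" would
# print "wrapped: ls /" rather than listing the root directory.
```
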
## Be careful about the build cache

Docker build has a useful feature where if it has a record of the
exact `RUN` command against the exact base layer, it can re-use the
layer from cache instead of re-running it every time.  This is a great
time-saver during development, but can also be a source of
frustration: build steps often download files from the Internet.  If
the file being downloaded changes without the command being used to
download it changing, Docker will reuse the cached step with the old
copy of the file, instead of re-downloading it.  If this happens, pass
`--no-cache` to `docker build` to force it to re-run the steps.
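
As an alternative to discarding the whole cache, a build argument can be used to invalidate only the steps that follow it.  This is a sketch under the assumption that `RUN` steps after an `ARG` see its value as an environment variable, so a changed value should cause a cache miss.  The name `CACHE_BUST` is made up for this example.

```dockerfile
FROM debian:10-slim

# This step does not depend on CACHE_BUST and stays cached.
RUN apt-get update -qy && \
    apt-get install -qy wget

# CACHE_BUST is a hypothetical name.  Passing a new value, for example
# with "--build-arg CACHE_BUST=2", should invalidate the steps below.
ARG CACHE_BUST=1

# The master archive can change while this command text stays the
# same, which is exactly the situation described above.
RUN echo "cache bust: ${CACHE_BUST}" && \
    wget https://github.com/lh3/bwa/archive/master.zip
```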