_episodes/08-supplement-docker.md

   1 ---
   2 title: "Supplement: Creating Docker Images"
   3 teaching: 10
   4 exercises: 1
   5 questions:
   6 - "How do I create Docker images from scratch?"
   7 - "What are some best practices for Docker images?"
   8 objectives:
   9 - ""
  10 keypoints:
  11 - ""
  12 ---
  13
  14 Common Workflow Language supports running tasks inside software
  15 containers.  Software container systems (such as Docker) create an
  16 execution environment that is isolated from the host system, so that
  17 software installed on the host system does not conflict with the
  18 software installed inside the container.
  19
  20 Programs running inside a software container get a different (and
  21 generally restricted) view of the system than processes running
  22 outside the container.  One of the most important and useful features
  23 is that the containerized program has a different view of the file
  24 system.  A program running inside a container, searching for
  25 libraries, modules, configuration files, data files, etc., only sees
  26 the files defined inside the container.
  27
  28 This means that, usually, a given file _path_ refers to _different
  29 actual files_ depending on whether it is viewed from inside or outside
  30 the container.  It is also possible to have a file from the host
  31 system appear at some location inside the container, meaning that the
  32 _same file_ appears at _different paths_ depending on whether it is
  33 viewed from inside or outside the container.
  34
  35 The complexity of translating between the container and its host
  36 environment is handled by the Common Workflow Language runner.  As a
  37 workflow author, you only need to worry about the environment _inside_
  38 the container.
  39
  40 # What are Docker images?
  41
  42 The Docker image describes the starting conditions for the container.
  43 Most importantly, this includes the starting layout and contents of the
  44 container's file system.  This file system is typically a lightweight
  45 POSIX environment, providing a standard set of POSIX utilities like
  46 `sh`, `ls`, `cat`, etc., organized into standard POSIX directories
  47 like `/bin` and `/lib`.
  48
  49 The image is made up of multiple "layers".  Each layer modifies the
  50 layer below it by adding, removing or modifying files to produce a new
  51 layer.  This allows lower layers to be re-used.
  52
  53 # Writing a Dockerfile
  54
  55 In this example, we will build a Docker image containing the
  56 Burrows-Wheeler Aligner (BWA) by Heng Li.  This is just for
  57 demonstration; in practice you should prefer to use existing
  58 containers from [BioContainers](https://biocontainers.pro/), which
  59 includes `bwa`.
  60
  61 Each line of the Dockerfile consists of a COMMAND in all caps,
  62 followed by the parameters of that command.
  63
  64 The first line of the file will specify the base image that we are
  65 going to build from.  As mentioned, images are divided up into
  66 "layers", so this tells Docker what to use for the first layer.
  67
  68
  69 ```
  70 FROM debian:10-slim
  71 ```
  72
  73 This starts from the lightweight ("slim") Debian 10 Docker image.
  74
  75 Docker images have a special naming scheme.
  76
  77 A bare name like "debian" or "ubuntu" means it is an official Docker
  78 image.  It has an implied prefix of "library", so you may see the
  79 image referred to as "library/debian".  Official images are published
  80 on [Docker Hub](https://hub.docker.com/search?type=image&image_filter=official).
  81
  82 A name with two parts separated by a slash is published on Docker Hub
  83 by someone else.  For example, `amazon/aws-cli` is published by
  84 Amazon.  These can also be found on [Docker Hub](https://hub.docker.com/search?type=image).
  85
  86 A name with three parts separated by slashes means it is published on
  87 a different container registry.  For example,
  88 `quay.io/biocontainers/subread` is hosted on the `quay.io` registry.
  89
  90 Following the image name, separated by a colon, is the "tag".  This is
  91 typically the version of the image.  If not provided, the default tag
  92 is "latest".  In this example, the tag is "10-slim", indicating the
  93 slim variant of Debian release 10.
  94
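The naming rules above can be summarised in a short Dockerfile fragment.  This is only a sketch: the `docker.io` registry host name for Docker Hub is not mentioned in this lesson and is added here for illustration, and `<tag>` is a placeholder rather than a real version.

```dockerfile
# Equivalent ways to name the official Debian 10 "slim" image.
# Only one FROM line is active; the others are shown as comments.
FROM debian:10-slim
# FROM library/debian:10-slim
# FROM docker.io/library/debian:10-slim

# The other name shapes from the text (tags are placeholders):
# FROM amazon/aws-cli:<tag>                  two parts: publisher/image
# FROM quay.io/biocontainers/subread:<tag>   three parts: registry/publisher/image
```
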
  95 > ## Best practice
  96 >
  97 > You should always include the tag to refer to a specific image
  98 > version, or you might run into problems when "latest" changes.
  99
 100 The Dockerfile should also include a MAINTAINER (this is purely
 101 metadata; it is stored in the image but not used for execution).
 102
 103 ```
 104 MAINTAINER Peter Amstutz <peter.amstutz@curii.com>
 105 ```
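The current Docker documentation marks `MAINTAINER` as deprecated and suggests recording the same information with `LABEL`.  A minimal sketch of that alternative, assuming the `org.opencontainers.image.authors` label key (a common convention, not something this lesson itself uses):

```dockerfile
# Stores the same contact information as MAINTAINER, as image metadata.
# The label key is an assumption; any key your team agrees on would work.
LABEL org.opencontainers.image.authors="Peter Amstutz <peter.amstutz@curii.com>"
```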
 106
 107 Next is the default user inside the image.  By choosing root,
 108 we can change anything inside the image (but not outside).
 109
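The paragraph above describes choosing the default user; in a Dockerfile this is normally done with the `USER` instruction.  A minimal sketch, assuming the `root` user named in the text:

```dockerfile
# Run the remaining build steps (and, by default, the container) as root.
USER root
```
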
 110 The body of the Dockerfile is a series of `RUN` commands.
 111
 112 Each command is run with `/bin/sh` inside the Docker container.
 113
 114 Each `RUN` command creates a new layer.
 115
 116 The `RUN` command can span multiple lines by using a trailing
 117 backslash.
 118
 119 For the first command, we use `apt-get` to install some packages that
 120 will be needed to compile `bwa`.  The `build-essential` package
 121 installs `gcc`, `make`, etc.
 122
 123 ```
 124 RUN apt-get update -qy && \
 125         apt-get install -qy build-essential wget unzip
 126 ```
 127
 128 Now we do everything else: download the source code of bwa, unzip it,
 129 make it, copy the resulting binary to `/usr/bin`, and clean up.
 130
 131 ```
 132 # Install BWA 0.7.17
 133 RUN wget https://github.com/lh3/bwa/archive/v0.7.17.zip && \
 134         unzip v0.7.17.zip && \
 135         cd bwa-0.7.17 && \
 136         make && \
 137         cp bwa /usr/bin && \
 138         cd .. && \
 139         rm -rf bwa-0.7.17 v0.7.17.zip
 140 ```
 141
 142 Because each `RUN` command creates a new layer, having the build and
 143 clean up in separate `RUN` commands would mean creating a layer that
 144 includes the intermediate object files from the build.  These would
 145 then be carried around as part of the container image forever, despite
 146 being useless.  By doing the entire build and clean up in one `RUN`
 147 command, only the final state of the file system, with the binary
 148 copied to `/usr/bin`, is committed to a layer.
 149
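For contrast, here is a sketch of the layout that the paragraph above warns against.  It is shown only for comparison and is not meant to be used; it reuses the same URL and directory names as the example earlier in this lesson.

```dockerfile
# Anti-pattern: build and clean-up are split across separate RUN
# instructions, so each intermediate state is committed as its own layer.
RUN wget https://github.com/lh3/bwa/archive/v0.7.17.zip && \
        unzip v0.7.17.zip
# The layer committed by this RUN contains the object files from the build.
RUN cd bwa-0.7.17 && make && cp bwa /usr/bin
# This RUN hides the files in a new layer, but the earlier layers still
# store them, so the image does not get any smaller.
RUN rm -rf bwa-0.7.17 v0.7.17.zip
```
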
 150 To build a Docker image from a Dockerfile, use `docker build`.
 151
 152 This command takes the name to use for the image with `-t`, and the
 153 directory in which it should find the `Dockerfile`:
 154
 155 ```
 156 docker build -t training/bwa .
 157 ```
 158
 159 > ## Exercise
 160 >
 161 > Create a `Dockerfile` based on this lesson and build it for yourself.
 162 >
 163 {: .challenge}
 164
 165 # Adding files to the image during the build
 166
 167 Using the `COPY` command, you can copy files from the source directory
 168 (the directory where your Dockerfile is located) into the image
 169 during the build.  For example, if you have a `requirements.txt` next
 170 to your Dockerfile:
 171
 172 ```
 173 COPY requirements.txt /tmp/
 174 RUN pip install --requirement /tmp/requirements.txt
 175 ```
 176
 177 # Best practices for Docker images
 178
 179 Docker has published guidelines on building efficient images:
 180
 181 https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
 182
 183 Some additional considerations when building images for use with Workflows:
 184
 185 ## Store Dockerfiles in git, alongside workflow definitions
 186
 187 Dockerfiles are scripts and should be managed with version control
 188 just like other kinds of code.
 189
 190 ## Be specific about software versions
 191
 192 Instead of blindly installing the latest version of a package, or
 193 checking out the `master` branch of a git repository and building from
 194 that, be specific in your Dockerfile about what version of the
 195 software you are installing.  This will greatly aid the
 196 reproducibility of your Docker image builds.
 197
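As an illustration of the difference, the fragment below contrasts an unpinned checkout with a pinned one.  It is a sketch that assumes `git` has been installed in an earlier step (the `apt-get` line in this lesson does not install it) and uses the `v0.7.17` release tag from the download URL shown earlier.

```dockerfile
# Unpinned: the result changes whenever the default branch moves.
# RUN git clone https://github.com/lh3/bwa.git

# Pinned: always fetches the v0.7.17 release tag, so rebuilding the
# image later starts from the same source code.
RUN git clone --depth 1 --branch v0.7.17 https://github.com/lh3/bwa.git
```
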
 198 ## Tag your builds
 199
 200 Use meaningful tags on the Docker image so you can tell versions of
 201 your Docker image apart as it is updated over time.  These can reflect
 202 the version of the underlying software, or the version of the
 203 Dockerfile itself.  These can be manually assigned version numbers
 204 (e.g. 1.0, 1.1, 1.2, 2.0), timestamps (e.g. YYYYMMDD like 20220126) or
 205 the hash of a git commit.
 206
 207 ## Avoid putting reference data in Docker images
 208
 209 Bioinformatics tools often require large reference data sets to run.
 210 These should be supplied externally (as workflow inputs) rather than
 211 added to the container image.  This makes it easy to update reference
 212 data instead of having to rebuild and re-upload a new Docker image
 213 every time, which is much more time consuming.
 214
 215 ## Small scripts can be inputs, too
 216
 217 If you have a small script, e.g. a self-contained Python script which
 218 relies on modules installed inside the container, but is itself
 219 contained in a single file, you can supply the script as a workflow
 220 input.  This makes it easy to update the script instead of having to
 221 rebuild and re-upload a new Docker image every time, which is much
 222 more time consuming.
 223
 224 ## Don't use ENTRYPOINT
 225
 226 The `ENTRYPOINT` Dockerfile command modifies the command line that is executed
 227 inside the container.  This can result in confusion when the command
 228 line that was supplied to the container and the command that actually
 229 runs are different.
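
The effect can be seen with a deliberately artificial image.  The fragment below is a sketch for illustration only; the `echo` entrypoint and the `ls /` command are made-up examples, not part of this lesson's `bwa` image.

```dockerfile
# Anti-pattern for workflow images, shown only to illustrate the problem.
FROM debian:10-slim
ENTRYPOINT ["echo", "entrypoint says:"]
# With this image, asking the container to run
#     ls /
# actually executes
#     echo "entrypoint says:" ls /
# so "ls" never runs; its name is only passed to echo as an argument.
```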