doc/user/topics/run-command.html.textile.liquid

   1 ---
   2 layout: default
   3 navsection: userguide
   4 title: "run-command reference"
   5 ...
   6
   7 The @run-command@ crunch script enables you to run command-line programs.
   8
   9 h1. Using run-command
  10
  11 The basic @run-command@ process evaluates its inputs, builds a command line, executes the command, and saves the contents of the output directory back to Keep.  For large datasets, @run-command@ can schedule concurrent tasks to execute the wrapped program over a range of inputs (see @task.foreach@ below).
  12
  13 @run-command@ is controlled through the @script_parameters@ section of a pipeline component.  @script_parameters@ is a JSON object consisting of key-value pairs.  There are three categories of keys that are meaningful to run-command:
  14 * The @command@ section, which defines the template used to build the command line of each task
  15 * Special processing directives such as @task.foreach@, @task.cwd@, @task.vwd@, @task.stdin@, and @task.stdout@
  16 * User-defined parameters (everything else)
  17
  18 In the following examples, you can use "dry run mode" to determine the command line that @run-command@ will use without actually running the command.  For example:
  19
  20 <notextile>
  21 <pre><code>~$ <span class="userinput">cd $HOME/arvados/crunch_scripts</span>
  22 ~$ <span class="userinput">./run-command --dry-run --script-parameters '{
  23   "command": ["echo", "hello world"]
  24 }'</span>
  25 run-command: echo hello world
  26 </code></pre>
  27 </notextile>
  28
  29 h2. Command template
  30
  31 The value of the "command" key is a list.  The first item of the list is the actual program to invoke, followed by the command arguments.  The simplest @run-command@ invocation simply runs a program with static parameters.  This example runs "echo" with the first argument "hello world":
  32
  33 <pre>
  34 {
  35   "command": ["echo", "hello world"]
  36 }
  37 </pre>
  38
  39 Running this job will print "hello world" to the job log.
  40
  41 By default, the command will start with the current working directory set to the output directory.  Anything written to the output directory will be saved to Keep when the command is finished.  You can change the default working directory using @task.cwd@ and get the path to the output directory using @$(task.outdir)@ as explained below.
  42
  43 Items in the "command" list may include lists and objects in addition to strings.  Lists are flattened to produce the final command line.  JSON objects are evaluated as list item functions (see below).  For example, the following evaluates to @["echo", "hello", "world"]@:
  44
  45 <pre>
  46 {
  47   "command": ["echo", ["hello", "world"]]
  48 }
  49 </pre>
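
The flattening rule can be pictured with a few lines of Python.  This is an illustrative sketch only, not the @run-command@ source; the helper name @flatten@ is hypothetical, and JSON objects (list item functions) are not handled here.

```python
def flatten(items):
    """Recursively flatten nested lists into one flat argument list."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


command = ["echo", ["hello", "world"]]
print(flatten(command))  # ['echo', 'hello', 'world']
```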
  50
  51 Finally, if "command" is a list of lists, it specifies a Unix pipeline where the standard output of the previous command is piped into the standard input of the next command.  The following example describes the Unix pipeline @cat foo | grep bar@:
  52
  53 <pre>
  54 {
  55   "command": [["cat", "foo"], ["grep", "bar"]]
  56 }
  57 </pre>
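
The pipeline form can be pictured with Python's @subprocess@ module.  This is only a sketch of the idea, not the @run-command@ implementation: @run_pipeline@ is a hypothetical helper, and the two stages are small Python one-liners standing in for @cat foo@ and @grep bar@ so that the example runs without a file named "foo".

```python
import subprocess
import sys


def run_pipeline(commands):
    """Run a list of argv lists as a pipeline.

    Returns the standard output of the last command and the exit code
    of every command, in order.
    """
    procs = []
    prev_stdout = None
    for argv in commands:
        proc = subprocess.Popen(argv, stdin=prev_stdout, stdout=subprocess.PIPE)
        if prev_stdout is not None:
            # Close our copy so the upstream command sees a broken pipe
            # if the downstream command exits early.
            prev_stdout.close()
        prev_stdout = proc.stdout
        procs.append(proc)
    output = procs[-1].communicate()[0]
    codes = [p.wait() for p in procs]
    return output, codes


# Stand-ins for ["cat", "foo"] and ["grep", "bar"].
producer = [sys.executable, "-c", "print('foo'); print('bar')"]
consumer = [sys.executable, "-c",
            "import sys; sys.stdout.writelines(l for l in sys.stdin if 'bar' in l)"]
output, codes = run_pipeline([producer, consumer])
print(output.decode().strip(), codes)
```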
  58
  59 h2. Parameter substitution
  60
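  61 The "command" list can include parameter substitutions.  Substitutions are enclosed in "$(...)" and may contain the name of a user-defined parameter.  Substitutions can be nested and can call one of the functions listed in the table below.  In the following example, the value of "a" is a reference to a file within an Arvados collection; when "command" is evaluated, "$(a)" is replaced with that reference, and "$(file ...)" then evaluates to a local file system path where the file can be accessed: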
  62
  63 <pre>
  64 {
  65   "a": "c1bad4b39ca5a924e481008009d94e32+210/var-GS000016015-ASM.tsv.bz2",
  66   "command": ["echo", "$(file $(a))"]
  67 }
  68 </pre>
  69
  70 table(table table-bordered table-condensed).
  71 |_. Function|_. Action|
  72 |$(file ...)       | Takes a reference to a file within an Arvados collection and evaluates to a file path on the local file system where that file can be accessed by your command.  Will raise an error if the file is not accessible.|
  73 |$(dir ...)        | Takes a reference to an Arvados collection or directory within an Arvados collection and evaluates to a directory path on the local file system where that directory can be accessed by your command.  The path may include a file name, in which case it will evaluate to the parent directory of the file.  Uses Python's os.path.dirname(), so "/foo/bar" will evaluate to "/foo" but "/foo/bar/" will evaluate to "/foo/bar".  Will raise an error if the directory is not accessible. |
  74 |$(basename&nbsp;...)   | Strips the leading directory and trailing file extension from the path provided.  For example, $(basename /foo/bar.baz.txt) will evaluate to "bar.baz".|
  75 |$(glob ...)       | Takes a Unix shell path pattern (supports @*@, @?@, and @[]@) and searches the local filesystem, returning the first match found.  Use together with $(dir ...) to get a local filesystem path for Arvados collections.  For example: $(glob $(dir $(mycollection))/*.bam) will find the first .bam file in the collection specified by the user parameter "mycollection".  If there is more than one match, which one is returned is undefined.  Will raise an error if no matches are found.|
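
The path rules quoted in the table for @$(dir ...)@ and @$(basename ...)@ can be checked directly in Python.  This sketch only demonstrates the string handling; it does not resolve Keep references or touch the file system, and @strip_dir_and_ext@ is a hypothetical helper written for illustration.

```python
import os.path

# os.path.dirname() treats a trailing slash as "already a directory".
print(os.path.dirname("/foo/bar"))   # /foo
print(os.path.dirname("/foo/bar/"))  # /foo/bar


def strip_dir_and_ext(path):
    """Drop the leading directories and the last file extension."""
    return os.path.splitext(os.path.basename(path))[0]


print(strip_dir_and_ext("/foo/bar.baz.txt"))  # bar.baz
```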
  76
  77 h2. List context
  78
  79 Where specified by the documentation, parameters may be evaluated in a "list context".  That means the value will evaluate to a list instead of a string.  Parameter values can be a static list, a path to a file, a path to a directory, or a JSON object describing a list context function.
  80
  81 If the value is a string, it is interpreted as a path.  If the path specifies a regular file, that file will be opened as a text file and produce a list with one item for each line in the file (end-of-line characters will be stripped).  If the path specifies a directory, it will produce a list containing all of the entries in the directory.  Note that parameter expansion is not performed on list items produced this way.
  82
  83 If the value is a static list, it will evaluate each item and return the expanded list.  Each item may be a string (evaluated for parameter substitution), a list (recursively evaluated), or a JSON object (indicating a list function, described below).
  84
  85 If the value is a JSON object, it is evaluated as a list function described below.
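
The string and static-list cases above can be sketched as follows.  This is an approximation under stated assumptions, not the @run-command@ code: @list_context@ is a hypothetical name, directory entries are sorted only to make the output predictable, static list items are returned without further evaluation, and JSON objects (list functions) are left out.

```python
import os
import tempfile


def list_context(value):
    """Evaluate a parameter value to a list (static list, file, or directory)."""
    if isinstance(value, list):
        return list(value)
    if os.path.isfile(value):
        with open(value) as f:
            # One item per line, with end-of-line characters stripped.
            return [line.rstrip("\r\n") for line in f]
    if os.path.isdir(value):
        # Sorting is an assumption made here for predictable output.
        return sorted(os.listdir(value))
    raise ValueError("not a list, file, or directory: %r" % (value,))


# Demo using a temporary directory that holds one text file.
with tempfile.TemporaryDirectory() as tmp:
    names_file = os.path.join(tmp, "names.txt")
    with open(names_file, "w") as f:
        f.write("alice\nbob\n")
    from_file = list_context(names_file)
    from_dir = list_context(tmp)

print(from_file)  # ['alice', 'bob']
print(from_dir)   # ['names.txt']
```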
  86
  87 h2. List functions
  88
  89 When @run-command@ is evaluating a list (such as "command"), in addition to string parameter substitution, you can use list item functions.  In the following functions, you specify the name of a user parameter to act on (@"$(a)"@ in the first example); the value of that user parameter will be evaluated in a list context (as described above) to get the list value.  Alternatively, you can provide the list value directly inline.  As an example, the following two fragments yield the same result:
  90
  91 <pre>
  92 {
  93   "a": ["alice", "bob"],
  94   "command": ["echo", {"foreach": "$(a)",
  95                        "var": "a_var",
  96                        "command": ["--something", "$(a_var)"]}]
  97 }
  98 </pre>
  99
 100 <pre>
 101 {
 102   "command": ["echo", {"foreach": ["alice", "bob"],
 103                        "var": "a_var",
 104                        "command": ["--something", "$(a_var)"]}]
 105 }
 106 </pre>
 107
 108 Note: when you provide the list inline with "foreach" or "index", you must include the "var" parameter to specify the substitution variable name to use when evaluating the command fragment.
 109
 110 You can also nest functions.  This filters @["alice", "bob", "betty"]@ on the regular expression @"b.*"@ to get the list @["bob", "betty"]@, assigns @a_var@ to each value of the list, then expands @"command"@ to get @["--something", "bob", "--something", "betty"]@.
 111
 112 <pre>
 113 {
 114   "command": ["echo", {"foreach": {"filter": ["alice", "bob", "betty"],
 115                                    "regex": "b.*"},
 116                        "var": "a_var",
 117                        "command": ["--something", "$(a_var)"]}]
 118 }
 119 </pre>
 120
 121 h3. foreach
 122
 123 The @foreach@ list item function (not to be confused with the @task.foreach@ directive) expands a command template for each item in the specified user parameter (the value of the user parameter is evaluated in a list context, as described above).  The following example will evaluate "command" to @["echo", "--something", "alice", "--something", "bob"]@:
 124
 125 <pre>
 126 {
 127   "a": ["alice", "bob"],
 128   "command": ["echo", {"foreach": "$(a)",
 129                        "var": "a_var",
 130                        "command": ["--something", "$(a_var)"]}]
 131 }
 132 </pre>
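
As a rough mental model, @foreach@ behaves like the loop below.  The helper @expand_foreach@ is hypothetical and handles only string items; the case where each item is itself a list (as produced by @group@, @extract@, or @batch@) is omitted.

```python
def expand_foreach(items, var, template):
    """Expand `template` once per item, replacing "$(var)" with the item."""
    placeholder = "$(%s)" % var
    expanded = []
    for item in items:
        expanded.extend(arg.replace(placeholder, item) for arg in template)
    return expanded


command = ["echo"] + expand_foreach(
    ["alice", "bob"], "a_var", ["--something", "$(a_var)"])
print(command)  # ['echo', '--something', 'alice', '--something', 'bob']
```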
 133
 134 h3. index
 135
 136 This function extracts a single item from a list.  The value of @index@ is zero-based (i.e. the first item is at index 0, the second item index 1, etc).  The following example will evaluate "command" to @["echo", "--something", "bob"]@:
 137
 138 <pre>
 139 {
 140   "a": ["alice", "bob"],
 141   "command": ["echo", {"list": "$(a)",
 142                        "var": "a_var",
 143                        "index": 1,
 144                        "command": ["--something", "$(a_var)"]}]
 145 }
 146 </pre>
 147
 148 h3. filter
 149
 150 Filter the list so that it only includes items that match a regular expression.  The following example will evaluate "command" to @["echo", "bob"]@:
 151
 152 <pre>
 153 {
 154   "a": ["alice", "bob"],
 155   "command": ["echo", {"filter": "$(a)",
 156                        "regex": "b.*"}]
 157 }
 158 </pre>
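
In Python terms, the filter is close to a list comprehension over @re.match@.  This sketch assumes the expression is matched from the start of each item; whether @run-command@ anchors the match this way is an assumption, and @filter_list@ is a hypothetical name.

```python
import re


def filter_list(items, regex):
    """Keep only the items that match `regex` from the start of the string."""
    pattern = re.compile(regex)
    return [item for item in items if pattern.match(item)]


print(filter_list(["alice", "bob"], "b.*"))           # ['bob']
print(filter_list(["alice", "bob", "betty"], "b.*"))  # ['bob', 'betty']
```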
 159
 160 h3. group
 161
 162 Generate a list of lists, where items are grouped by the text matched by the regular expression's subexpressions.  Items which don't match the regular expression are excluded.  In the following example, the subexpression is @(a?)@, resulting in two groups: strings that contain the letter 'a' and strings that do not.  The following example evaluates to @["echo", "--group", "alice", "carol", "dave", "--group", "bob", "betty"]@:
 163
 164 <pre>
 165 {
 166   "a": ["alice", "bob", "betty", "carol", "dave"],
 167   "b": {"group": "$(a)",
 168         "regex": "[^a]*(a?).*"},
 169   "command": ["echo", {"foreach": "$(b)",
 170                        "var": "b_var",
 171                        "command": ["--group", "$(b_var)"]}]
 172 }
 173 </pre>
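
The grouping step can be sketched as a dictionary keyed by the captured subexpressions.  @group_list@ is a hypothetical helper, and the order of the groups (first appearance) is an assumption of this sketch.

```python
import re


def group_list(items, regex):
    """Group items by the values captured by the regex subexpressions."""
    pattern = re.compile(regex)
    groups = {}
    for item in items:
        match = pattern.match(item)
        if match:  # non-matching items are excluded
            groups.setdefault(match.groups(), []).append(item)
    return list(groups.values())


names = ["alice", "bob", "betty", "carol", "dave"]
print(group_list(names, "[^a]*(a?).*"))
# [['alice', 'carol', 'dave'], ['bob', 'betty']]
```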
 174
 175 h3. extract
 176
 177 Generate a list of lists, where items are split by subexpression match.  Items which don't match the regular expression are excluded.  The following example evaluates to @["echo", "--something", "c", "a", "rol", "--something", "d", "a", "ve"]@:
 178
 179 <pre>
 180 {
 181   "a": ["alice", "bob", "carol", "dave"],
 182   "b": {"extract": "$(a)",
 183         "regex": "(.+)(a)(.*)"},
 184   "command": ["echo", {"foreach": "$(b)",
 185                        "var": "b_var",
 186                        "command": ["--something", "$(b_var)"]}]
 187 }
 188 </pre>
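
The same example can be reproduced with @re.match@ and @groups()@.  As before, @extract_list@ is a hypothetical helper and matching from the start of each item is an assumption.

```python
import re


def extract_list(items, regex):
    """Split each matching item into its captured subexpressions."""
    pattern = re.compile(regex)
    result = []
    for item in items:
        match = pattern.match(item)
        if match:  # non-matching items are excluded
            result.append(list(match.groups()))
    return result


print(extract_list(["alice", "bob", "carol", "dave"], "(.+)(a)(.*)"))
# [['c', 'a', 'rol'], ['d', 'a', 've']]
```

Note that "alice" is excluded because @(.+)@ requires at least one character before the "a".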
 189
 190 h3. batch
 191
 192 Generate a list of lists, where items are split into batches of the given size.  If the list does not divide evenly into batches, the last batch will be short.  The following example evaluates to @["echo", "--something", "alice", "bob", "--something", "carol", "dave"]@:
 193
 194 <pre>
 195 {
 196   "a": ["alice", "bob", "carol", "dave"],
 197   "command": ["echo", {"foreach":{"batch": "$(a)",
 198                                   "size": 2},
 199                        "var": "a_var",
 200                        "command": ["--something", "$(a_var)"]}]
 201 }
 202 </pre>
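
Batching is equivalent to slicing the list in steps of @size@.  @batch_list@ below is a hypothetical helper shown only to illustrate the short last batch.

```python
def batch_list(items, size):
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


print(batch_list(["alice", "bob", "carol", "dave"], 2))
# [['alice', 'bob'], ['carol', 'dave']]
print(batch_list(["alice", "bob", "carol"], 2))
# [['alice', 'bob'], ['carol']]
```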
 203
 204 h2. Directives
 205
 206 Directives alter the behavior of run-command.  All directives are optional.
 207
 208 h3. task.cwd
 209
 210 This directive sets the initial current working directory in which your command will run.  If @task.cwd@ is not specified, the default current working directory is @task.outdir@.
 211
 212 h3. task.ignore_rcode
 213
 214 By Unix convention a task which exits with a non-zero return code is considered failed.  However, some programs (such as @grep@) return non-zero codes for conditions that should not be considered fatal errors.  Set @"task.ignore_rcode": true@ to indicate the task should always be considered a success regardless of the return code.
 215
 216 h3. task.stdin and task.stdout
 217
 218 Provide standard input and standard output redirection.
 219
 220 @task.stdin@ must evaluate to a path to a file to be bound to the standard input stream of the command.  When @command@ describes a Unix pipeline, this goes into the first command.
 221
 222 @task.stdout@ specifies the desired file name in the output directory to save the content of standard output.  When @command@ describes a Unix pipeline, this captures the output of the last command.
 223
 224 h3. task.vwd
 225
 226 Background: Keep collections are read-only, which does not play well with tools that expect to write their outputs alongside their inputs (such as tools that generate indexes closely associated with the original file).  The @run-command@ solution to this is the "virtual working directory".
 227
 228 @task.vwd@ specifies a Keep collection with the starting contents of the directory.  @run-command@ will then populate @task.outdir@ with directories and symlinks to mirror the contents of the @task.vwd@ collection.  Your command will then be able to both access its input files and write its output files in @task.outdir@.  When the command completes, the output collection will merge the output of your command with the contents of the starting collection.  Note that files in the starting collection remain read-only and cannot be altered or deleted.
 229
 230 h3. task.foreach
 231
 232 Using @task.foreach@, you can run your command concurrently over large datasets.
 233
 234 @task.foreach@ takes the names of one or more user-defined parameters.  The values of these parameters are evaluated in a list context.  @run-command@ then generates tasks based on the Cartesian product (i.e. all combinations) of the input lists.  The outputs of all tasks are merged to create the final output collection.  Note that if two tasks output a file in the same directory with the same name, that file will be concatenated in the final output.  In the following example, three tasks will be created for the "echo" command, based on the contents of user parameter "a":
 235
 236 <pre>
 237 {
 238   "command": ["echo", "$(a)"],
 239   "task.foreach": "a",
 240   "a": ["alice", "bob", "carol"]
 241 }
 242 </pre>
 243
 244 This evaluates to the commands:
 245 <notextile>
 246 <pre>
 247 ["echo", "alice"]
 248 ["echo", "bob"]
 249 ["echo", "carol"]
 250 </pre>
 251 </notextile>
 252
 253 You can also specify multiple parameters:
 254
 255 <pre>
 256 {
 257   "a": ["alice", "bob"],
 258   "b": ["carol", "dave"],
 259   "task.foreach": ["a", "b"],
 260   "command": ["echo", "$(a)", "$(b)"]
 261 }
 262 </pre>
 263
 264 This evaluates to the commands:
 265
 266 <pre>
 267 ["echo", "alice", "carol"]
 268 ["echo", "alice", "dave"]
 269 ["echo", "bob", "carol"]
 270 ["echo", "bob", "dave"]
 271 </pre>
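
The Cartesian product can be reproduced with @itertools.product@.  This is a sketch of how the task command lines relate to the input lists, not of how @run-command@ schedules tasks; @task_commands@ is a hypothetical helper that performs plain string substitution only.

```python
import itertools


def task_commands(params, foreach, template):
    """Build one command line per combination of the foreach parameters."""
    names = [foreach] if isinstance(foreach, str) else list(foreach)
    commands = []
    for combo in itertools.product(*(params[name] for name in names)):
        command = []
        for arg in template:
            for name, value in zip(names, combo):
                arg = arg.replace("$(%s)" % name, value)
            command.append(arg)
        commands.append(command)
    return commands


params = {"a": ["alice", "bob"], "b": ["carol", "dave"]}
for command in task_commands(params, ["a", "b"], ["echo", "$(a)", "$(b)"]):
    print(command)
# ['echo', 'alice', 'carol']
# ['echo', 'alice', 'dave']
# ['echo', 'bob', 'carol']
# ['echo', 'bob', 'dave']
```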
 272
 273 h1. Examples
 274
 275 The following is a single task pipeline using @run-command@ to run the bwa alignment tool to align a single paired-end read fastq sample.  The input to this pipeline is the reference genome and a collection consisting of two fastq files for the read pair.
 276
 277 <notextile>{% code 'run_command_simple_example' as javascript %}</notextile>
 278
 279 The following is a concurrent task pipeline using @run-command@ to run the bwa alignment tool to align a set of fastq reads over multiple samples.  The input to this pipeline is the reference genome and a collection consisting of subdirectories for each sample, with each subdirectory containing pairs of fastq files for each set of reads.
 280
 281 <notextile>{% code 'run_command_foreach_example' as javascript %}</notextile>