---
layout: default
navsection: installguide
title: Install the cloud dispatcher
...
{% comment %}
Copyright (C) The Arvados Authors. All rights reserved.

SPDX-License-Identifier: CC-BY-SA-3.0
{% endcomment %}

{% include 'notebox_begin_warning' %}
arvados-dispatch-cloud is only relevant for cloud installations. Skip this section if you are installing an on-premises cluster that will spool jobs to Slurm.
{% include 'notebox_end' %}

# "Introduction":#introduction
# "Create compute node VM image":#create-image
# "Update config.yml":#update-config
# "Install arvados-dispatch-cloud":#install-packages
# "Start the service":#start-service
# "Restart the API server and controller":#restart-api
# "Confirm working installation":#confirm-working

h2(#introduction). Introduction

The cloud dispatch service is for running containers on cloud VMs. It works with Microsoft Azure and Amazon EC2; future versions will also support Google Compute Engine.

The cloud dispatch service can run on any node that can connect to the Arvados API service, the cloud provider's API, and the SSH service on cloud VMs.  It is not resource-intensive, so you can run it on the API server node.

h2(#update-config). Update config.yml

h3. Configure CloudVMs

Add or update the following portions of your cluster configuration file, @config.yml@. Refer to "config.defaults.yml":{{site.baseurl}}/admin/config.html for information about additional configuration options. The @DispatchPrivateKey@ should be the *private* key generated in "the previous section":install-compute-node.html#sshkeypair.

<notextile>
<pre><code>    Services:
      DispatchCloud:
        InternalURLs:
          "http://localhost:9006": {}
    Containers:
      CloudVMs:
        # BootProbeCommand is a shell command that succeeds when an instance is ready for service
        BootProbeCommand: "sudo systemctl status docker"

        <b># --- driver-specific configuration goes here --- see Amazon and Azure examples below ---</b>

      DispatchPrivateKey: |
        -----BEGIN RSA PRIVATE KEY-----
        MIIEpQIBAAKCAQEAqXoCzcOBkFQ7w4dvXf9B++1ctgZRqEbgRYL3SstuMV4oawks
        ttUuxJycDdsPmeYcHsKo8vsEZpN6iYsX6ZZzhkO5nEayUTU8sBjmg1ZCTo4QqKXr
        FJ+amZ7oYMDof6QEdwl6KNDfIddL+NfBCLQTVInOAaNss7GRrxLTuTV7HcRaIUUI
        jYg0Ibg8ZZTzQxCvFXXnjseTgmOcTv7CuuGdt91OVdoq8czG/w8TwOhymEb7mQlt
        lXuucwQvYgfoUgcnTgpJr7j+hafp75g2wlPozp8gJ6WQ2yBWcfqL2aw7m7Ll88Nd
        [...]
        oFyAjVoexx0RBcH6BveTfQtJKbktP1qBO4mXo2dP0cacuZEtlAqW9Eb06Pvaw/D9
        foktmqOY8MyctzFgXBpGTxPliGjqo8OkrOyQP2g+FL7v+Km31Xs61P8=
        -----END RSA PRIVATE KEY-----
    InstanceTypes:
      x1md:
        ProviderType: x1.medium
        VCPUs: 8
        RAM: 64GiB
        IncludedScratch: 64GB
        Price: 0.62
      x1lg:
        ProviderType: x1.large
        VCPUs: 16
        RAM: 128GiB
        IncludedScratch: 128GB
        Price: 1.23
</code></pre>
</notextile>
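
To illustrate how the @InstanceTypes@ entries relate to a container's @runtime_constraints@, the sketch below picks the least expensive configured type that has enough VCPUs and RAM. It is a simplified model under stated assumptions, not the dispatcher's actual code: the @choose_instance_type@ helper and the in-memory dictionary are hypothetical, RAM is written out in bytes, and scratch space, preemptible instances, and any memory the dispatcher reserves for overhead are ignored.

```python
# Simplified model of instance type selection (hypothetical helper, not Arvados code).
GiB = 1024 ** 3

# Mirrors the InstanceTypes example above, with RAM converted to bytes.
INSTANCE_TYPES = {
    "x1md": {"ProviderType": "x1.medium", "VCPUs": 8, "RAM": 64 * GiB, "Price": 0.62},
    "x1lg": {"ProviderType": "x1.large", "VCPUs": 16, "RAM": 128 * GiB, "Price": 1.23},
}


def choose_instance_type(instance_types, vcpus, ram):
    """Return the name of the cheapest type with enough VCPUs and RAM, or None."""
    candidates = [
        (spec["Price"], name)
        for name, spec in instance_types.items()
        if spec["VCPUs"] >= vcpus and spec["RAM"] >= ram
    ]
    return min(candidates)[1] if candidates else None


print(choose_instance_type(INSTANCE_TYPES, vcpus=1, ram=1048576))     # x1md
print(choose_instance_type(INSTANCE_TYPES, vcpus=12, ram=96 * GiB))   # x1lg
print(choose_instance_type(INSTANCE_TYPES, vcpus=32, ram=GiB))        # None
```

A request that no configured type can satisfy (the last call) has nowhere to run, so it is worth checking that your @InstanceTypes@ list covers the largest containers you expect to submit.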

h4. Minimal configuration example for Amazon EC2

The <span class="userinput">ImageID</span> value is the compute node image that was built in "the previous section":install-compute-node.html#aws.

<notextile>
<pre><code>    Containers:
      CloudVMs:
        ImageID: <span class="userinput">ami-01234567890abcdef</span>
        Driver: ec2
        DriverParameters:
          AccessKeyID: XXXXXXXXXXXXXXXXXXXX
          SecretAccessKey: YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
          SecurityGroupIDs:
          - sg-0123abcd
          SubnetID: subnet-0123abcd
          Region: us-east-1
          EBSVolumeType: gp2
          AdminUsername: arvados
</code></pre>
</notextile>

h4. Minimal configuration example for Azure

Using managed disks:

The <span class="userinput">ImageID</span> value is the compute node image that was built in "the previous section":install-compute-node.html#azure.

<notextile>
<pre><code>    Containers:
      CloudVMs:
        ImageID: <span class="userinput">"zzzzz-compute-v1597349873"</span>
        Driver: azure
        # (azure) managed disks: set MaxConcurrentInstanceCreateOps to 20 to avoid timeouts, cf
        # https://docs.microsoft.com/en-us/azure/virtual-machines/linux/capture-image
        MaxConcurrentInstanceCreateOps: 20
        DriverParameters:
          # Credentials.
          SubscriptionID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
          ClientID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
          ClientSecret: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
          TenantID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX

          # Data center where VMs will be allocated
          Location: centralus

          # The resource group where the VM and virtual NIC will be
          # created.
          ResourceGroup: zzzzz
          NetworkResourceGroup: yyyyy   # only if different from ResourceGroup
          Network: xxxxx
          Subnet: xxxxx-subnet-private

          # The resource group where the disk image is stored, only needs to
          # be specified if it is different from ResourceGroup
          ImageResourceGroup: aaaaa

</code></pre>
</notextile>

Azure recommends using managed images. If you plan to start more than 20 VMs simultaneously, Azure recommends using a shared image gallery instead to avoid slowdowns and timeouts during the creation of the VMs.

Using an image from a shared image gallery:

<notextile>
<pre><code>    Containers:
      CloudVMs:
        ImageID: <span class="userinput">"shared_image_gallery_image_definition_name"</span>
        Driver: azure
        DriverParameters:
          # Credentials.
          SubscriptionID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
          ClientID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
          ClientSecret: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
          TenantID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX

          # Data center where VMs will be allocated
          Location: centralus

          # The resource group where the VM and virtual NIC will be
          # created.
          ResourceGroup: zzzzz
          NetworkResourceGroup: yyyyy   # only if different from ResourceGroup
          Network: xxxxx
          Subnet: xxxxx-subnet-private

          # The resource group where the disk image is stored, only needs to
          # be specified if it is different from ResourceGroup
          ImageResourceGroup: aaaaa

          # (azure) shared image gallery: the name of the gallery
          SharedImageGalleryName: "shared_image_gallery_1"
          # (azure) shared image gallery: the version of the image definition
          SharedImageGalleryImageVersion: "0.0.1"

</code></pre>
</notextile>

Using unmanaged disks (deprecated):

The <span class="userinput">ImageID</span> value is the compute node image that was built in "the previous section":install-compute-node.html#azure.

<notextile>
<pre><code>    Containers:
      CloudVMs:
        ImageID: <span class="userinput">"https://zzzzzzzz.blob.core.windows.net/system/Microsoft.Compute/Images/images/zzzzz-compute-osDisk.55555555-5555-5555-5555-555555555555.vhd"</span>
        Driver: azure
        DriverParameters:
          # Credentials.
          SubscriptionID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
          ClientID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
          ClientSecret: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
          TenantID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX

          # Data center where VMs will be allocated
          Location: centralus

          # The resource group where the VM and virtual NIC will be
          # created.
          ResourceGroup: zzzzz
          NetworkResourceGroup: yyyyy   # only if different from ResourceGroup
          Network: xxxxx
          Subnet: xxxxx-subnet-private

          # Where to store the VM VHD blobs
          StorageAccount: example
          BlobContainer: vhds

</code></pre>
</notextile>

Get the @SubscriptionID@ and @TenantID@:

<pre>
$ az account list
[
  {
    "cloudName": "AzureCloud",
    "id": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXX",
    "isDefault": true,
    "name": "Your Subscription",
    "state": "Enabled",
    "tenantId": "YYYYYYYY-YYYY-YYYY-YYYYYYYY",
    "user": {
      "name": "you@example.com",
      "type": "user"
    }
  }
]
</pre>

You will need to create a "service principal" to use as a delegated authority for API access.

<notextile><pre><code>$ az ad app create --display-name "Arvados Dispatch Cloud (<span class="userinput">ClusterID</span>)" --homepage "https://arvados.org" --identifier-uris "https://<span class="userinput">ClusterID.example.com</span>" --end-date 2299-12-31 --password <span class="userinput">Your_Password</span>
$ az ad sp create "<span class="userinput">appId</span>"
(appId is part of the response of the previous command)
$ az role assignment create --assignee "<span class="userinput">objectId</span>" --role Owner --scope /subscriptions/{subscriptionId}/
(objectId is part of the response of the previous command)
</code></pre></notextile>

Now update your @config.yml@ file:

@ClientID@ is the 'appId' value.

@ClientSecret@ is what was provided as <span class="userinput">Your_Password</span>.
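
The mapping from the @az@ output to the credential keys of @DriverParameters@ can be summarized in a short sketch. It assumes the JSON shape shown above for @az account list@ and that the default subscription is the one to use; the @driver_credentials@ helper and the @app-id-placeholder@ value are hypothetical and only serve as an illustration.

```python
import json

# Sample `az account list` output, abbreviated from the example above.
ACCOUNT_LIST = """
[
  {
    "id": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXX",
    "isDefault": true,
    "name": "Your Subscription",
    "tenantId": "YYYYYYYY-YYYY-YYYY-YYYYYYYY"
  }
]
"""


def driver_credentials(account_list_json, app_id, password):
    """Build the credential keys of DriverParameters from az output."""
    accounts = json.loads(account_list_json)
    # Use the default subscription; raises StopIteration if there is none.
    account = next(a for a in accounts if a.get("isDefault"))
    return {
        "SubscriptionID": account["id"],
        "TenantID": account["tenantId"],
        "ClientID": app_id,        # 'appId' from `az ad app create`
        "ClientSecret": password,  # value given to --password
    }


creds = driver_credentials(ACCOUNT_LIST, "app-id-placeholder", "Your_Password")
for key, value in creds.items():
    print(f"{key}: {value}")
```

The printed lines use the same key names as the Azure examples above, so they can be compared directly against your @config.yml@.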

h3. Test your configuration

Run the @cloudtest@ tool to verify that your configuration works. This creates a new cloud VM, confirms that it boots correctly and accepts your configured SSH private key, and shuts it down.

<notextile>
<pre><code>~$ <span class="userinput">arvados-server cloudtest && echo "OK!"</span>
</code></pre>
</notextile>

Refer to the "cloudtest tool documentation":../../admin/cloudtest.html for more information.

{% assign arvados_component = 'arvados-dispatch-cloud' %}

{% include 'install_packages' %}

{% include 'start_service' %}

{% include 'restart_api' %}

h2(#confirm-working). Confirm working installation

On the dispatch node, start monitoring the arvados-dispatch-cloud logs:

<notextile>
<pre><code>~$ <span class="userinput">sudo journalctl -o cat -fu arvados-dispatch-cloud.service</span>
</code></pre>
</notextile>

"Make sure to install the arvados/jobs image.":../install-jobs-image.html

Submit a simple container request:

<notextile>
<pre><code>shell:~$ <span class="userinput">arv container_request create --container-request '{
  "name":            "test",
  "state":           "Committed",
  "priority":        1,
  "container_image": "arvados/jobs:latest",
  "command":         ["echo", "Hello, Crunch!"],
  "output_path":     "/out",
  "mounts": {
    "/out": {
      "kind":        "tmp",
      "capacity":    1000
    }
  },
  "runtime_constraints": {
    "vcpus": 1,
    "ram": 1048576
  }
}'</span>
</code></pre>
</notextile>

This command should return a record with a @container_uuid@ field.  Once @arvados-dispatch-cloud@ polls the API server for new containers to run, you should see it dispatch that same container.
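
If you submit test requests repeatedly, it can be convenient to generate the request body instead of editing JSON by hand. The following sketch rebuilds the request shown above and prints an equivalent @arv@ command line. The @build_container_request@ helper and its defaults are hypothetical; @ram@ is given in bytes, so the @1048576@ in the example corresponds to 1 MiB.

```python
import json
import shlex

MiB = 1024 ** 2


def build_container_request(name, command, vcpus=1, ram=MiB, image="arvados/jobs:latest"):
    """Assemble the same request body as the example above."""
    return {
        "name": name,
        "state": "Committed",
        "priority": 1,
        "container_image": image,
        "command": command,
        "output_path": "/out",
        "mounts": {"/out": {"kind": "tmp", "capacity": 1000}},
        "runtime_constraints": {"vcpus": vcpus, "ram": ram},
    }


request = build_container_request("test", ["echo", "Hello, Crunch!"])
argv = ["arv", "container_request", "create", "--container-request", json.dumps(request)]
# shlex.join quotes the JSON so the printed line can be pasted into a shell.
print(shlex.join(argv))
```

Generating the JSON with @json.dumps@ avoids quoting mistakes when the command contains characters that are special to the shell.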

The @arvados-dispatch-cloud@ API lists queued and running containers and cloud instances. Use your @ManagementToken@ (stored in the shell variable @$token@ in the example below) to query the dispatcher's endpoints. For example, when one container is running:

<notextile>
<pre><code>~$ <span class="userinput">curl -sH "Authorization: Bearer $token" http://localhost:9006/arvados/v1/dispatch/containers</span>
{
  "items": [
    {
      "container": {
        "uuid": "zzzzz-dz642-hdp2vpu9nq14tx0",
        ...
        "state": "Running",
        "scheduling_parameters": {
          "partitions": null,
          "preemptible": false,
          "max_run_time": 0
        },
        "exit_code": 0,
        "runtime_status": null,
        "started_at": null,
        "finished_at": null
      },
      "instance_type": {
        "Name": "Standard_D2s_v3",
        "ProviderType": "Standard_D2s_v3",
        "VCPUs": 2,
        "RAM": 8589934592,
        "Scratch": 16000000000,
        "IncludedScratch": 16000000000,
        "AddedScratch": 0,
        "Price": 0.11,
        "Preemptible": false
      }
    }
  ]
}
</code></pre>
</notextile>

A similar request can be made to the @http://localhost:9006/arvados/v1/dispatch/instances@ endpoint.
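
For scripted checks, the same query can be made without @curl@. The sketch below is one possible approach using only the Python standard library: @fetch@ sends the bearer token to a dispatcher endpoint, and @summarize_containers@ reduces a reply of the shape shown above to one tuple per container. Both helper names are hypothetical, and the base URL assumes the @InternalURLs@ value from the configuration example.

```python
import json
import urllib.request

# Base URL taken from the InternalURLs example above.
DISPATCH_URL = "http://localhost:9006"


def fetch(path, token, base_url=DISPATCH_URL):
    """GET a dispatcher endpoint with the management token and decode the JSON reply."""
    request = urllib.request.Request(
        base_url + path, headers={"Authorization": "Bearer " + token}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def summarize_containers(reply):
    """Return (uuid, state, instance type name) for each item in the reply."""
    return [
        (item["container"]["uuid"], item["container"]["state"], item["instance_type"]["Name"])
        for item in reply.get("items") or []
    ]


# Abbreviated copy of the sample response shown above.
sample = {
    "items": [
        {
            "container": {"uuid": "zzzzz-dz642-hdp2vpu9nq14tx0", "state": "Running"},
            "instance_type": {"Name": "Standard_D2s_v3", "Price": 0.11},
        }
    ]
}
for uuid, state, type_name in summarize_containers(sample):
    print(uuid, state, type_name)

# Against a live dispatcher (requires a valid ManagementToken):
#   reply = fetch("/arvados/v1/dispatch/containers", token)
#   print(summarize_containers(reply))
```

The @fetch@ helper accepts any path, so it can also be pointed at @/arvados/v1/dispatch/instances@; the summary function only applies to the containers reply.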

When the container finishes, the dispatcher will log it.

After the container finishes, you can get the container record by UUID *from a shell server* to see its results:

<notextile>
<pre><code>shell:~$ <span class="userinput">arv get <b>zzzzz-dz642-hdp2vpu9nq14tx0</b></span>
{
 ...
 "exit_code":0,
 "log":"a01df2f7e5bc1c2ad59c60a837e90dc6+166",
 "output":"d41d8cd98f00b204e9800998ecf8427e+0",
 "state":"Complete",
 ...
}
</code></pre>
</notextile>

You can use standard Keep tools to view the container's output and logs from their corresponding fields.  For example, to see the logs from the collection referenced in the @log@ field:

<notextile>
<pre><code>~$ <span class="userinput">arv keep ls <b>a01df2f7e5bc1c2ad59c60a837e90dc6+166</b></span>
./crunch-run.txt
./stderr.txt
./stdout.txt
~$ <span class="userinput">arv-get <b>a01df2f7e5bc1c2ad59c60a837e90dc6+166</b>/stdout.txt</span>
2016-08-05T13:53:06.201011Z Hello, Crunch!
</code></pre>
</notextile>
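
Each line of @stdout.txt@ in the sample above starts with a timestamp, followed by the text the container wrote. If you need to process such logs, a line can be split as sketched below. The @split_log_line@ helper is hypothetical and assumes exactly the timestamp format of the sample (UTC with a trailing @Z@ and at most six fractional digits); logs written with a different precision would need a different parser.

```python
from datetime import datetime, timezone


def split_log_line(line):
    """Split 'TIMESTAMP message' into a timezone-aware datetime and the message."""
    stamp, _, message = line.rstrip("\n").partition(" ")
    # Assumes the 'Z'-suffixed format from the sample, with fractional seconds.
    when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    return when, message


when, message = split_log_line("2016-08-05T13:53:06.201011Z Hello, Crunch!")
print(when.isoformat())
print(message)
```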

If the container does not dispatch successfully, refer to the @arvados-dispatch-cloud@ logs for information about why it failed.