Tom Clegg [Fri, 26 May 2023 14:10:09 +0000 (10:10 -0400)]
20520: Fix unreleased mutex on error importing SSH key.
Any error listing or importing keys (which, luckily, only happens the
first time an arvados-dispatch-cloud process creates a new instance)
would cause the Create() call to fail, and cause all subsequent Create()
calls to hang forever until the service is restarted.
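The failure mode described above can be sketched as follows. This is an illustration of the pattern only, not the code from this commit: the class and the `import_key` callable are hypothetical. The point is that the lock is released on the error path too, so a failed first call cannot block later ones.

```python
import threading


class KeyImporter:
    """Hypothetical stand-in for a driver that imports an SSH key once."""

    def __init__(self, import_key):
        self._lock = threading.Lock()
        self._import_key = import_key  # hypothetical callable; may raise
        self._imported = False

    def create(self):
        # The with-statement releases the lock even when import_key raises,
        # so an error here cannot leave the lock held for later callers.
        with self._lock:
            if not self._imported:
                self._import_key()
                self._imported = True
        return "instance"
```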
Arvados-DCO-1.1-Signed-off-by: Tom Clegg <tom@curii.com>
Lucas Di Pentima [Thu, 18 May 2023 20:31:37 +0000 (17:31 -0300)]
20482: Re-exports VPC's CIDR.
Previously exported as 'vpc_cidr' and removed when preexisting vpc usage
was added. This config data is used on local.params and was mentioned on the
documentation page.
Now, it's exported as 'cluster_int_cidr' and its value is requested from AWS
so that we get the correct one whether the vpc was just created or a
previously existing one is in use.
Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <lucas.dipentima@curii.com>
Brett Smith [Thu, 18 May 2023 12:36:56 +0000 (08:36 -0400)]
12684: Stop retrying 422 responses in PySDK
The original motivation for this was to retry when the API server was
having database connectivity problems. The feeling eight years later is
that things have changed enough that, on balance, this isn't worth
retrying anymore.
I don't think this will have any real impact on current Arvados
software. In the main branch as I write this,
`check_http_response_status` only gets called in five places. Three of
those are in the main `arvados` module for job and task utilities, which
presumably nobody is using anymore. The other two talk to Keep, which
only returns 422 for hash mismatches, where a retry will definitely
never succeed.
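A minimal sketch of the resulting behavior, assuming a simple status-code check. The set contents and the `should_retry` name are illustrative assumptions, not copied from the SDK; the only point taken from this commit is that 422 is no longer treated as retryable.

```python
# Illustrative set of retryable statuses (an assumption, not the SDK's list).
# 422 is deliberately absent: a hash mismatch will not succeed on retry.
RETRYABLE_STATUSES = frozenset({408, 409, 423, 500, 502, 503, 504})


def should_retry(status_code):
    """Return True if a response with this status is worth retrying."""
    return status_code in RETRYABLE_STATUSES
```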
Arvados-DCO-1.1-Signed-off-by: Brett Smith <brett.smith@curii.com>
Brett Smith [Thu, 11 May 2023 15:53:41 +0000 (11:53 -0400)]
Refine PySDK collection walk recipe
Use PurePosixPath to clarify that we're strictly doing path manipulation.
(It will also behave better on Windows, although I'm not sure if the SDK
itself is Windows-ready yet.)
Keep Path objects in the queue to reduce local state.
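The idea can be sketched with a nested dict standing in for a collection (an assumption made so the example runs without the SDK; the real recipe walks collection objects). Each queue entry is a PurePosixPath, so no separate bookkeeping of the current directory is needed.

```python
import collections
import pathlib


def walk_files(tree):
    """Yield a PurePosixPath for every file in a nested dict.

    A dict value represents a directory; any other value represents a file.
    """
    queue = collections.deque([pathlib.PurePosixPath('.')])
    while queue:
        dir_path = queue.popleft()
        # Resolve the queued path back to its node in the tree.
        node = tree
        for part in dir_path.parts:
            node = node[part]
        for name, child in node.items():
            child_path = dir_path / name
            if isinstance(child, dict):
                queue.append(child_path)
            else:
                yield child_path
```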
No issue #
Arvados-DCO-1.1-Signed-off-by: Brett Smith <brett.smith@curii.com>
Lucas Di Pentima [Wed, 10 May 2023 20:38:48 +0000 (17:38 -0300)]
20482: Adds proper compute node instance profile instead of using keepstore's.
We first used keepstore's instance profile because compute nodes run a local
keepstore now.
We also need to give compute nodes permission to change resources related to
the EBS Autoscaler.
Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <lucas.dipentima@curii.com>
Brett Smith [Tue, 9 May 2023 15:16:22 +0000 (11:16 -0400)]
12684: Use mock services in arvfile sparse write tests
Without these mocks, Jenkins seems to spend a lot of time retrying
requests—although weirdly, I don't see that in my own development
environment.
I believe the mocks were always intended to be used, since they're
instantiated and already used in other sparse write tests. To me this
looks like an oversight when the previous tests were adapted to write
new collections.
Arvados-DCO-1.1-Signed-off-by: Brett Smith <brett.smith@curii.com>
Brett Smith [Fri, 5 May 2023 13:49:04 +0000 (09:49 -0400)]
12684: Check for no log case in controller integration tests
Without this guard, tests fail with a message "API endpoint not found,"
which sounds scary and makes you think you broke all of Arvados until
you see the test code is just looking up a collection with an empty
UUID.
And by "you," I mean me.
Arvados-DCO-1.1-Signed-off-by: Brett Smith <brett.smith@curii.com>
Brett Smith [Thu, 4 May 2023 20:21:08 +0000 (16:21 -0400)]
12684: Support num_retries in PySDK client constructors
This lets users set their preferred retry strategy once, rather than in
every call to execute(), which is error-prone. The default num_retries
is 10 because we expect most users to care more about eventual success
than responsiveness. See the added release notes for further discussion
and rationale.
Changes to the rest of the code are mostly about supporting this
consistently. Tests that relied on the old no-default-num_retries
behavior now specify that explicitly.
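A minimal sketch of the pattern, using a hypothetical `SketchClient` class rather than the real PySDK client: the constructor stores a default, and a per-call value overrides it. Only the default of 10 comes from this commit; the retry loop itself is a simplification.

```python
class SketchClient:
    """Hypothetical client; not the PySDK API."""

    def __init__(self, num_retries=10):
        self.num_retries = num_retries

    def execute(self, request, num_retries=None):
        # Fall back to the constructor's default unless the caller overrides it.
        retries = self.num_retries if num_retries is None else num_retries
        for attempt in range(retries + 1):
            try:
                return request()
            except ConnectionError:
                if attempt == retries:
                    raise
```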
Arvados-DCO-1.1-Signed-off-by: Brett Smith <brett.smith@curii.com>
Peter Amstutz [Thu, 4 May 2023 22:34:01 +0000 (18:34 -0400)]
20470: Fix discovery document generation to drop unpublished fields
Now uses the list of API published fields (selectable_attributes) to
generate the discovery doc. This causes some obsolete and nonpublic
fields to disappear from the discovery doc (although they were never
part of the public API in the first place).
The immediate reason to do this is that workbench 1 was using the
discovery document to craft a list of fields to select, but the
changes to the way select works in this branch mean that asking for
unpublished fields now throws an error.
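The filtering step can be sketched as below. This is an illustration only, not the API server's code: `published_properties` and the dict shapes are hypothetical, and only the idea of restricting the schema to selectable_attributes comes from this commit.

```python
def published_properties(all_properties, selectable_attributes):
    """Keep only the schema properties that appear in selectable_attributes."""
    allowed = set(selectable_attributes)
    return {
        name: spec
        for name, spec in all_properties.items()
        if name in allowed
    }
```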
Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <peter.amstutz@curii.com>
1. Proactively close sockets if they seem likely to be stale
2. Wrap the retry logic in a loop
3. Generalize catching `httplib.BadStatusLine` to `httplib.HTTPException`
(which covers all kinds of malformed HTTP responses)
However, #1 functionally obsoletes the exception handlers added in the
earlier commits. Preemptively closing the sockets prevents httplib/2
from trying to reuse stale ones. So these exception handlers, along with
their retry loops, no longer serve their original purpose.
Remove this logic in favor of using the retry logic built into
googleapiclient. That logic is easier to configure and more refined.
Arvados-DCO-1.1-Signed-off-by: Brett Smith <brett.smith@curii.com>
Tom Clegg [Tue, 2 May 2023 21:16:05 +0000 (17:16 -0400)]
20457: Include delayed supervisor containers in overquota metric.
Previously, supervisor containers that had high enough priority to
run, but weren't scheduled because of SupervisorFraction, were not
counted in the containers_over_quota metric. This caused the
"overquota" metric to show a misleading time series as non-supervisor
containers made their way through the queue and the delayed supervisor
containers flapped between "not allocated because quota" (counted) and
"not allocated because SupervisorFraction" (not counted).
With this change, un-mappable supervisors always count toward the
containers_not_allocated_over_quota metric.
This also applies the "unlock if previously locked but now delayed due
to SupervisorFraction" logic to supervisor processes, which was
previously overlooked. This prevents supervisors from staying in
Locked state after being bumped by higher-priority containers.
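A rough sketch of the counting change, under the assumption that each unallocated container carries a reason label. The labels and the `count_overquota` function are hypothetical, not the scheduler's actual data model.

```python
# Hypothetical reason labels for containers that could not be allocated.
QUOTA = "quota"
SUPERVISOR_FRACTION = "supervisor_fraction"


def count_overquota(unallocated_reasons):
    """Count containers delayed by quota or by SupervisorFraction.

    Counting both reasons keeps the metric steady when a supervisor
    container moves from one reason to the other.
    """
    counted = {QUOTA, SUPERVISOR_FRACTION}
    return sum(1 for reason in unallocated_reasons if reason in counted)
```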
Arvados-DCO-1.1-Signed-off-by: Tom Clegg <tom@curii.com>