7713: Node Manager blackholes broken nodes that can't shut down.

We are seeing situations on Azure where some nodes in an UNKNOWN state
cannot be shut down. The API call to destroy them always fails.

There are two related halves to this commit. In the first half,
after a cloud shutdown request fails, ComputeNodeShutdownActor checks
whether the node is broken. If it is, it cancels shutdown retries.
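
The first half can be sketched roughly as below. This is an illustration
only: ShutdownResult, BROKEN, attempt_shutdown, and the two callables are
hypothetical names, not the actual ComputeNodeShutdownActor code.

```python
# Hypothetical sketch; none of these names come from the Node Manager source.
from dataclasses import dataclass
from typing import Callable

BROKEN = "cloud node is broken"


@dataclass
class ShutdownResult:
    success: bool      # did the cloud accept the destroy request?
    retry: bool        # should the caller schedule another attempt?
    reason: str = ""   # why the shutdown was abandoned, if it was


def attempt_shutdown(destroy_node: Callable[[], bool],
                     is_broken: Callable[[], bool]) -> ShutdownResult:
    """Make one destroy call and decide whether retrying is worthwhile."""
    if destroy_node():
        return ShutdownResult(success=True, retry=False)
    if is_broken():
        # The destroy call is expected to keep failing, so stop retrying
        # and report the outcome instead.
        return ShutdownResult(success=False, retry=False, reason=BROKEN)
    # An ordinary failure: the caller may try again later.
    return ShutdownResult(success=False, retry=True)
```

A failed destroy on a broken node yields retry=False with the BROKEN
reason; a failed destroy on a healthy node yields retry=True.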

In the second half, the daemon checks for this shutdown outcome. When
it sees it, the daemon blacklists the broken node and immediately
filters it out of node lists from the cloud. The node is no longer
monitored in any way or counted as a live node, so Node Manager will
boot a replacement for it.
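
A minimal sketch of the daemon side, under the same caveat: NodeTracker
and its methods are hypothetical names chosen for illustration, and the
real daemon tracks far more state than this.

```python
# Hypothetical sketch; not the actual Node Manager daemon.
class NodeTracker:
    def __init__(self, max_nodes):
        self.max_nodes = max_nodes
        self.blacklist = set()   # IDs of nodes declared broken

    def record_shutdown(self, node_id, success, broken):
        """Blacklist a node whose shutdown was abandoned as broken."""
        if not success and broken:
            self.blacklist.add(node_id)

    def live_nodes(self, cloud_listing):
        """Drop blacklisted nodes from a fresh cloud node listing."""
        return [n for n in cloud_listing if n not in self.blacklist]

    def nodes_wanted(self, cloud_listing, demand):
        """How many nodes to boot, counting only non-blacklisted ones."""
        live = len(self.live_nodes(cloud_listing))
        return max(0, min(demand, self.max_nodes) - live)


tracker = NodeTracker(max_nodes=3)
listing = ["a", "b", "c"]
tracker.record_shutdown("b", success=False, broken=True)
live = tracker.live_nodes(listing)
wanted = tracker.nodes_wanted(listing, demand=3)
```

With three nodes listed and one blacklisted, only two count as live, so
one replacement is wanted even though the cloud still reports three.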

This lets Node Manager create cloud nodes above max_nodes, by up to the
number of broken nodes. The overshoot is reasonably bounded for now
because only the Azure driver will ever declare a node broken. Other
clouds will never blacklist nodes this way.