Graceful shutdown

You can configure graceful shutdown as described in the general Graceful shutdown concepts documentation.

Graceful shutdown only works if you have enabled authorization using OPA. See Authorization requirements for details.

Coordinators

By default, coordinators have 15 minutes to terminate gracefully.

The coordinator process receives a SIGTERM signal when Kubernetes wants to terminate the Pod. If the process has still not exited after the graceful shutdown timeout runs out, Kubernetes issues a SIGKILL signal.

When a coordinator is restarted, all currently running queries fail and cannot be recovered after the restart has finished. As of Trino version 442 this cannot be prevented (e.g. by using multiple coordinators).
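
The coordinator graceful shutdown period can be tuned the same way as for workers (see below). The following is a sketch, assuming the coordinators role accepts the same gracefulShutdownTimeout setting as the workers role; the value of 30m is only an example:

apiVersion: trino.stackable.tech/v1alpha1
kind: TrinoCluster
metadata:
  name: trino
spec:
  # ...
  coordinators:
    config:
      gracefulShutdownTimeout: 30m # example value; the default is 15 minutes
    roleGroups:
      default:
        replicas: 1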

Workers

By default, workers have 60 minutes to terminate gracefully.

Trino supports gracefully shutting down workers. The operator always adds a PreStop hook to shut them down gracefully. No additional configuration is needed; this guide is intended for users who need to tweak this mechanism.

The default graceful shutdown period is 1 hour, but it can be configured as follows:

apiVersion: trino.stackable.tech/v1alpha1
kind: TrinoCluster
metadata:
  name: trino
spec:
  # ...
  workers:
    config:
      gracefulShutdownTimeout: 1h
    roleGroups:
      default:
        replicas: 1

Implementation

Once a worker Pod is asked to terminate, the PreStop hook is executed and the following timeline occurs:

  1. The worker goes into SHUTTING_DOWN state.

  2. The worker sleeps for 30 seconds to ensure that the coordinator has noticed the shutdown and stops scheduling new tasks on the worker.

  3. The worker now waits until all tasks running on it are complete. This takes as long as the longest-running query on the worker.

  4. The worker sleeps for 30 seconds to ensure that the coordinator has noticed that all tasks are complete.

  5. The PreStop hook itself never returns; instead, the JVM is shut down by the graceful shutdown mechanism.

  6. If the graceful shutdown does not complete quickly enough (e.g. a query runs longer than the graceful shutdown period), the Pod is killed after <graceful shutdown period> + 30s of step 2 + 30s of step 4 + 10s safety overhead, regardless of whether it has shut down gracefully. This is achieved by setting terminationGracePeriodSeconds on the worker Pods, as illustrated below. Queries still running on the worker will fail and cannot be recovered.
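
For illustration: with the default worker graceful shutdown period of 1 hour, this adds up to 3600s + 30s + 30s + 10s = 3670 seconds. The operator sets this value on the generated worker Pods automatically; the resulting Pod spec roughly contains the following (illustrative excerpt, not something you configure yourself):

# Excerpt of an operator-generated worker Pod (illustrative)
spec:
  terminationGracePeriodSeconds: 3670 # 1h graceful shutdown + 30s + 30s + 10s safety overhead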

As of SDP version 23.7, the secret-operator issues TLS certificates with a lifetime of 24 hours. It also adds an annotation to the Pod indicating that it requires a restart 30 minutes before the certificate expires (after 23.5 hours in this case). Currently, this results in all Pods using HTTPS (both coordinators and workers in a typical setup) being restarted every 23.5 hours.

The TLS certificate lifetime can be configured using podOverrides by setting the annotation secrets.stackable.tech/backend.autotls.cert.lifetime on every secret-operator volume. A sample configuration could look like this:

spec:
  workers:
    podOverrides:
      spec:
        volumes:
          - name: server-tls-mount
            ephemeral:
              volumeClaimTemplate:
                metadata:
                  annotations:
                    secrets.stackable.tech/backend.autotls.cert.lifetime: 14d
          - name: internal-tls-mount
            ephemeral:
              volumeClaimTemplate:
                metadata:
                  annotations:
                    secrets.stackable.tech/backend.autotls.cert.lifetime: 14d

Implications

All queries that take less than the minimum graceful shutdown period across all roleGroups (1 hour by default) are guaranteed not to be disturbed by regular termination of Pods. They can obviously still fail when, for example, a Kubernetes node dies or is rebooted before it is fully drained.

Because of this, the operator automatically restricts the execution time of queries to the minimum graceful shutdown period across all roleGroups by setting the Trino configuration query.max-execution-time=3600s. This causes all queries that take longer than 1 hour to fail with the error message Query failed: Query exceeded the maximum execution time limit of 3600.00s.

If you need to execute queries that take longer than the configured graceful shutdown period, increase the query.max-execution-time property as follows:

spec:
  coordinators:
    configOverrides:
      config.properties:
        query.max-execution-time: 24h

Please keep in mind that queries taking longer than the graceful shutdown period are now subject to failure when a Trino worker is shut down. This can be avoided by using Fault-tolerant execution, which is not yet supported natively. Until native support is added, you have to use configOverrides to enable it.
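
A minimal sketch of such an override is shown below. It assumes the QUERY retry policy, which retries whole queries and, for small result sets, does not require an exchange manager; task-level retries (retry-policy: TASK) additionally need an exchange manager configured, e.g. via an exchange-manager.properties override. Consult the Trino fault-tolerant execution documentation before enabling this in production:

spec:
  coordinators:
    configOverrides:
      config.properties:
        retry-policy: QUERY # retry entire queries when a worker fails
  workers:
    configOverrides:
      config.properties:
        retry-policy: QUERY # keep in sync with the coordinator setting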

Authorization requirements

If you are not using OPA for authorization, the user graceful-shutdown-user is not allowed to gracefully shut down workers. If you need graceful shutdown, you must either use OPA or make sure that graceful-shutdown-user is allowed to gracefully shut down workers (e.g. by using your own authorizer or patching Trino).

If you use OPA to authorize Trino requests, you need to make sure that the user graceful-shutdown-user is authorized to trigger a graceful shutdown of the workers.

If you use the rules provided by Stackable, this permission is granted automatically. If you use your own custom rego rules, you can achieve this by adding the following rule, which grants graceful-shutdown-user the permission to issue a graceful shutdown:

# Allow graceful-shutdown-user to write system information, which is required to trigger a graceful shutdown
allow {
  input.action.operation == "WriteSystemInformation"
  input.context.identity.user == "graceful-shutdown-user"
}

If the user graceful-shutdown-user does not have permission to gracefully shut down a worker, the error message curl: (22) The requested URL returned error: 403 Forbidden is shown in the worker log and the worker shuts down immediately.