Fixed Wall-Time Retries

TLDR

Set ScheduleToCloseTimeout on the Activity call to enforce a hard time budget across all retry attempts. Use this when a business SLA requires the Activity to succeed or fail within a defined window, regardless of how many individual attempts occur.

Overview

The Fixed Wall-Time Retries pattern enforces a maximum total elapsed time across all Activity retry attempts using ScheduleToCloseTimeout. Use it when a business process must succeed or fail within a defined time budget, regardless of how many individual attempts occur.

Problem

StartToCloseTimeout limits how long a single Activity attempt may run before Temporal cancels it and schedules a retry. It does not limit how long retries collectively may run.

An Activity with StartToCloseTimeout=5m and the default retry policy, which places no limit on the number of attempts, can keep running for days: each attempt times out at 5 minutes, then Temporal waits for the backoff delay and tries again, indefinitely.
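To make the scale concrete, the arithmetic can be sketched in plain Python. This is a simplified model, not Temporal code: the function name is hypothetical, and the constants assume Temporal's default retry policy (1-second initial interval, coefficient 2.0, maximum interval of 100 seconds, unlimited attempts).

```python
from datetime import timedelta

# Assumed defaults of Temporal's retry policy: 1 s initial interval,
# coefficient 2.0, maximum interval 100x the initial interval.
INITIAL_INTERVAL = timedelta(seconds=1)
BACKOFF_COEFFICIENT = 2.0
MAXIMUM_INTERVAL = timedelta(seconds=100)


def elapsed_after_timeouts(attempts: int, start_to_close: timedelta) -> timedelta:
    """Wall time consumed when every attempt runs into StartToCloseTimeout."""
    total = timedelta(0)
    wait = INITIAL_INTERVAL
    for attempt in range(attempts):
        total += start_to_close        # the attempt itself times out
        if attempt < attempts - 1:     # backoff before the next attempt
            total += wait
            wait = min(wait * BACKOFF_COEFFICIENT, MAXIMUM_INTERVAL)
    return total


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, elapsed_after_timeouts(n, timedelta(minutes=5)))
```

Under these assumptions, 1000 consecutive timed-out attempts take more than four days, and nothing in the per-attempt timeout stops the sequence there.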

Some business SLAs treat a late result as a failure: a payment must be charged within two minutes, or an authorization check must complete within 30 seconds. In those cases you need a hard outer boundary that Temporal enforces automatically, without requiring the Workflow to track elapsed time itself.

Solution

Set ScheduleToCloseTimeout on the Activity call options. Its clock starts when the Activity is first scheduled and keeps running through every attempt and every backoff delay, regardless of how many attempts have occurred. If the timeout expires during an attempt, that attempt is cancelled. If it expires between retries, the pending retry is abandoned. In both cases Temporal delivers an ActivityError to the Workflow.

With a two-minute budget and a 30-second per-attempt timeout, the sequence is:

  1. The two-minute budget clock starts the moment the Workflow schedules the Activity.
  2. Each attempt runs for up to 30 seconds (StartToCloseTimeout). On failure, Temporal waits for the backoff delay and retries.
  3. Retries continue until either the Activity succeeds or the two-minute budget is exhausted.
  4. When the budget expires, Temporal delivers an ActivityError to the Workflow, which can log, alert, or compensate.
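The steps above can be traced with a small, self-contained timeline model. It is only a sketch of the worst case, where every attempt fails by hitting the per-attempt timeout; the function name is hypothetical, it ignores queueing delay, and it is not how the Temporal server is implemented. The retry values in the usage note (5-second initial interval, coefficient 1.5, 30-second maximum interval) are illustrative.

```python
from datetime import timedelta


def attempt_windows(
    budget: timedelta,
    per_attempt: timedelta,
    initial_interval: timedelta,
    backoff_coefficient: float,
    maximum_interval: timedelta,
) -> list[tuple[timedelta, timedelta]]:
    """Return (start, end) offsets of each attempt, assuming every attempt
    fails by running into the per-attempt timeout."""
    windows = []
    clock = timedelta(0)   # step 1: the budget starts at scheduling time
    wait = initial_interval
    while clock < budget:  # step 3: keep retrying while budget remains
        # The total budget cuts the final attempt short.
        end = min(clock + per_attempt, budget)
        windows.append((clock, end))
        clock = end + wait  # step 2: wait for the backoff delay, then retry
        wait = min(wait * backoff_coefficient, maximum_interval)
    return windows         # step 4: budget exhausted, the Workflow sees an error


if __name__ == "__main__":
    for start, end in attempt_windows(
        budget=timedelta(minutes=2),
        per_attempt=timedelta(seconds=30),
        initial_interval=timedelta(seconds=5),
        backoff_coefficient=1.5,
        maximum_interval=timedelta(seconds=30),
    ):
        print(f"attempt from {start} to {end}")
```

With these inputs the model yields four attempts, the last of which is cut off by the two-minute budget rather than by its own 30-second timeout.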

Implementation

Enforcing a 2-minute SLA

Set both schedule_to_close_timeout (the total budget) and start_to_close_timeout (the per-attempt cap). The retry policy controls the interval between attempts. Temporal stops retrying automatically when the budget runs out.

# workflows.py
from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy
from temporalio.exceptions import ActivityError, TimeoutError, TimeoutType
import activities

@workflow.defn
class PaymentAuthWorkflow:
    @workflow.run
    async def run(self, transaction_id: str) -> str:
        try:
            return await workflow.execute_activity(
                activities.authorize_transaction,
                transaction_id,
                schedule_to_close_timeout=timedelta(minutes=2),  # total budget
                start_to_close_timeout=timedelta(seconds=30),  # per attempt
                retry_policy=RetryPolicy(
                    initial_interval=timedelta(seconds=5),
                    backoff_coefficient=1.5,
                    maximum_interval=timedelta(seconds=30),
                ),
            )
        except ActivityError as e:
            # The underlying failure is chained as the cause of the ActivityError.
            cause = e.__cause__
            if isinstance(cause, TimeoutError) and cause.type == TimeoutType.SCHEDULE_TO_CLOSE:
                workflow.logger.error(
                    "Authorization failed: 2-minute SLA breached",
                    extra={"transaction_id": transaction_id},
                )
            # Re-raise so the Workflow still fails after logging.
            raise

Short SLA without a per-attempt timeout

For tighter budgets, such as a 30-second authorization window, you may omit StartToCloseTimeout and let ScheduleToCloseTimeout act as the only bound. Temporal requires at least one of the two timeouts to be set; ScheduleToCloseTimeout alone satisfies that requirement.

# workflows.py
result = await workflow.execute_activity(
    activities.authorize_transaction,
    transaction_id,
    schedule_to_close_timeout=timedelta(seconds=30),
    retry_policy=RetryPolicy(
        initial_interval=timedelta(seconds=3),
        backoff_coefficient=1.5,
    ),
)

Best practices

  • Set both timeouts for clarity. Use ScheduleToCloseTimeout as the total SLA and StartToCloseTimeout as a per-attempt safety valve. Omitting StartToCloseTimeout means a single slow response can consume the entire budget.
  • Cap MaximumInterval well below the SLA. If MaximumInterval is 2 hours and the SLA is 24 hours, at most about 12 retries fit once the backoff has plateaued, and fewer if the attempts themselves take significant time. Tune the interval so the backoff plateaus at a value that allows meaningful retries within the budget.
  • Handle ActivityError explicitly. When the SLA expires, Temporal delivers an error to the Workflow. Catch it to send an alert, trigger a compensation, or record a breach in an audit log.
  • Distinguish SLA breaches from transient errors. Inspect the cause of the ActivityError to separate an SLA breach from an application failure. In Python, the cause is a TimeoutError whose type is TimeoutType.SCHEDULE_TO_CLOSE; in TypeScript, a TimeoutFailure with TimeoutType.SCHEDULE_TO_CLOSE; in Go and Java, a timeout failure with TIMEOUT_TYPE_SCHEDULE_TO_CLOSE. This lets you log or alert specifically on SLA violations rather than treating all Activity errors the same way.

Common pitfalls

  • Not accounting for ScheduleToStart delay in the budget. ScheduleToCloseTimeout begins when the Activity is first scheduled, which includes the time the task waits in the queue before a Worker picks it up. Under high load or insufficient Worker capacity, tasks can sit in the queue for seconds or minutes before the first attempt starts — consuming SLA budget before any work is done. Provision Workers with enough capacity for peak traffic, or use autoscaling, to keep ScheduleToStart latency negligible relative to the SLA window.
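  • Using StartToCloseTimeout alone for SLA enforcement. StartToCloseTimeout bounds each attempt, not the sequence of attempts. A downstream system that keeps failing or responding slowly triggers one retry after another, each with a fresh per-attempt clock, so the total elapsed time is unbounded.
  • Setting ScheduleToCloseTimeout shorter than StartToCloseTimeout. If the total budget is shorter than a single attempt's maximum, the per-attempt timeout can never fire: the total budget expires first, the attempt in progress is cancelled, and no retry is possible. An Activity that needs longer than the total budget will never complete.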
  • Ignoring the breach in the Workflow. Letting the ActivityError propagate without handling it means SLA breaches go unlogged and uncompensated.
  • Not accounting for backoff delays in the budget. The total time includes both attempt durations and the backoff delays between them. A 1-hour budget with a 30-minute initial interval and coefficient 2.0 leaves room for at most two attempts: the first retry starts about 30 minutes in, and the following 60-minute backoff already exceeds the budget.
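Several of these pitfalls can be caught before deployment with a simple configuration check. The helper below is a hypothetical sketch, not part of any Temporal SDK: the function name, the parameters, and the warning rules are assumptions chosen for illustration, and expected_queue_delay stands for whatever ScheduleToStart latency you observe in your own metrics.

```python
from datetime import timedelta


def check_timeout_config(
    schedule_to_close: timedelta,
    start_to_close: timedelta,
    initial_interval: timedelta,
    expected_queue_delay: timedelta = timedelta(0),
) -> list[str]:
    """Return human-readable warnings; an empty list means no issue was found."""
    warnings = []
    if schedule_to_close < start_to_close:
        warnings.append("total budget is shorter than the per-attempt timeout")
    # Time spent waiting in the task queue counts against the total budget.
    usable = schedule_to_close - expected_queue_delay
    if usable <= timedelta(0):
        warnings.append("expected queue delay consumes the whole budget")
    elif start_to_close + initial_interval >= usable:
        warnings.append(
            "a timed-out first attempt plus one backoff leaves no time for a retry"
        )
    return warnings


if __name__ == "__main__":
    print(check_timeout_config(
        schedule_to_close=timedelta(minutes=2),
        start_to_close=timedelta(seconds=30),
        initial_interval=timedelta(seconds=5),
    ))
```

A check like this could run in a unit test next to the Workflow definition, so that a change to one timeout is validated against the others.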

References