Backlog Controller¶

The Backlog Controller is a controller that automatically scales worker pods based on the number of pending backlog work-items. It ensures optimal resource utilization by dynamically adjusting the number of extension workers in response to workload.

Configuration Example¶

backlog_controller:
  max_replicas: 5
  backlog_items_per_replica: 3
  remove_claim_after_minutes: 30

Top-Level Options¶

Option	Type	Default	Description
`max_replicas`	int	`5`	Maximum number of replicas per extension that the controller scales up to.
`backlog_items_per_replica`	int	`3`	Number of backlog work-items required before increasing replica count.
`remove_claim_after_minutes`	int	`30`	Time after which a claimed work-item is released if still processing.

Configuration Details¶

`max_replicas`¶

The maximum number of replicas (worker pods) the controller will scale up to for each extension. This prevents runaway scaling and ensures cluster resources are not exhausted.

Important: The issue-replicator extension is always limited to a maximum of 1 replica, regardless of this setting, to prevent potential duplicate GitHub issues from concurrent processing.

Example scenarios:

max_replicas: 1 - No scaling, always single worker per extension
max_replicas: 5 - Allow up to 5 concurrent workers per extension
max_replicas: 10 - Higher concurrency for demanding workloads

`backlog_items_per_replica`¶

The threshold that determines when to scale up. The controller increases replicas when:

number_of_pending_items / backlog_items_per_replica > current_replicas

Example with backlog_items_per_replica: 3:

1-3 items pending → 1 replica
4-6 items pending → 2 replicas
7-9 items pending → 3 replicas
And so on, up to max_replicas

Tuning guidance:

Lower values (e.g., 2): More aggressive scaling, faster processing, higher resource usage
Higher values (e.g., 5): More conservative scaling, lower resource usage, potentially slower processing
Consider: The average processing time for work-items when choosing this value

`remove_claim_after_minutes`¶

When a worker claims a backlog work-item, it signals that it’s processing that item. If the claim persists longer than this duration, the controller assumes the worker has stalled or crashed and releases the claim so another worker can process it.

Example scenarios:

Worker pod crashes mid-processing
Worker encounters an unhandled error and hangs
Network issues prevent completion notification

Tuning guidance:

Too low (e.g., 5): Risk of duplicate processing if work-items legitimately take longer
Too high (e.g., 120): Stalled work-items block the queue for extended periods
Recommended: Set to 2-3× the typical work-item processing time

Default of 30 minutes works well for most scanning workloads.