Distributed queues get tricky when you add multiple producers/consumers, retries, and failure handling. This post walks through RDQ's internal structure and control flow so you can reason about reliability, performance, and extension points.
Technologies: Go, Redis, Lua
Code: github.com/aashmanVerma/rdq
Core Building Blocks
- Client (Redis adapter) — wraps Redis with typed helpers and loads Lua scripts for atomic claim/ack/requeue. See pkg/rdq/client.go, claim.go, ack.go, enqueue.go.
- Registry — maps job names to handlers. See pkg/rdq/registry.go.
- Worker Manager — orchestrates polling, concurrency, visibility timeouts, retries, and DLQ moves. See internal/worker/manager.go.
- Metrics Observer — tracks enqueued/claimed/completed/failed/retried/DLQ events and worker lifecycle. See internal/metrics.
- Config — all knobs for Redis, worker behavior, metrics, and health checks live in pkg/rdq/options.go.
Data Model (Job)
Every job stored in Redis has:
type Job struct {
    ID          string
    Name        string
    Queue       string
    Payload     json.RawMessage
    Attempts    int
    MaxAttempts int
    CreatedAt   int64
    Meta        map[string]string
}
The job JSON is kept under jobKey(prefix, id), while queue scheduling uses sorted sets per queue.
Enqueue Path
- User calls Queue.Enqueue(queue, jobName, payload, opts...).
- Options (delay, max attempts, priority placeholder) are applied via EnqueueOptions.
- The client serializes the job and runs a Redis pipeline:
  - SET jobKey -> job JSON
  - ZADD pendingKey(queue) score=now+delay member=jobID
- Metrics: JobEnqueued is recorded.
Key outcome: jobs are sorted by scheduled time; delayed jobs wait in the ZSET until eligible.
Worker Lifecycle
- Queue.Start marks the manager as running and kicks off the health and metrics loops.
- StartWorker(queue, opts...) builds a WorkerConfig (concurrency, visibility timeout, poll interval, retry policy).
- Each worker owns:
  - A ticker that fires every PollInterval.
  - A semaphore-limited goroutine pool of size Concurrency.
Claiming Work
On each tick, up to Concurrency goroutines call ClaimJob(queue, visibilityTimeout). The Lua script:
- Pops the next eligible job ID from the pending ZSET (score <= now).
- Writes the job into a “processing” ZSET with score = now + visibility timeout (so it can be reclaimed if the worker dies).
- Returns the job payload.
Metrics: JobClaimed records claim latency.
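To make the atomic hand-off concrete, here is an in-memory analogue of what the Lua script does in one step (a toy model, not RDQ's actual script): pop the lowest-scored eligible ID from pending and park it in processing with a reclaim deadline.

```go
package main

import (
	"fmt"
	"sort"
)

// queueState is a toy stand-in for the two Redis sorted sets.
type queueState struct {
	pending    map[string]int64 // jobID -> scheduled time
	processing map[string]int64 // jobID -> reclaim deadline
}

// claim mimics the Lua script's logic: find the eligible job with
// the lowest score, move it to processing with deadline
// now+visibility, and return its ID. In Redis this whole step
// executes atomically inside the script.
func (q *queueState) claim(now, visibility int64) (string, bool) {
	var ids []string
	for id, score := range q.pending {
		if score <= now {
			ids = append(ids, id)
		}
	}
	if len(ids) == 0 {
		return "", false
	}
	sort.Slice(ids, func(i, j int) bool { return q.pending[ids[i]] < q.pending[ids[j]] })
	id := ids[0]
	delete(q.pending, id)
	q.processing[id] = now + visibility
	return id, true
}

func main() {
	q := &queueState{
		pending:    map[string]int64{"a": 90, "b": 50, "c": 200},
		processing: map[string]int64{},
	}
	id, ok := q.claim(100, 30)
	fmt.Println(id, ok, q.processing[id]) // "b" wins: lowest eligible score
}
```

The key property the real script provides is that no two workers can ever pop the same ID, because the pop and the move into processing happen in one atomic server-side step.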
Processing and Acknowledging
- The manager looks up the handler from the registry by job.Name.
- If no handler is registered, the job is moved to the DLQ with reason "no handler found".
- The handler runs; on success:
  - Ack removes the job from Redis (processing ZSET + job key).
  - Metrics: JobCompleted.
Failures, Retries, DLQ
If a handler returns an error:
- Increment Attempts.
- If Attempts >= MaxAttempts: move to DLQ with reason "max attempts reached"; metrics: JobMovedToDLQ.
- Else:
  - Compute backoff via RetryPolicy.Backoff(attempt).
  - Re-enqueue the job with the same MaxAttempts.
  - Ack the current processing record so the retry is scheduled cleanly.
  - Metrics: JobRetried.
If the worker crashes or is slow:
- The job stays in the processing ZSET until visibilityTimeout passes, then becomes claimable again — implementing at-least-once delivery.
Metrics and Health
- The metrics observer tracks all queue/worker events; Queue.metricsLoop logs aggregates periodically when enabled.
- The health loop (healthCheckLoop) pings Redis on a schedule and logs failures to surface infra issues early.
Configuration Knobs (practical defaults)
- Redis: address, pool size, idle conns, retries, dial/read/write timeouts, key prefix.
- Worker: concurrency, poll interval, visibility timeout, retry policy (exponential default), max attempts.
- Enqueue: delay, per-job max attempts, priority placeholder.
- Ops: enable metrics, metrics interval, health check period.
Control Flow Summary
- Enqueue -> job JSON stored, ID added to pending ZSET (delayed if needed).
- Poll -> workers tick and try to claim up to Concurrency jobs.
- Claim -> Lua atomically moves the job to the processing ZSET with a visibility timeout.
- Process -> handler executes.
- Ack -> on success, remove from Redis; on failure, retry or DLQ.
- Recover -> if worker dies, visibility timeout expires and job is claimable again.
Extending RDQ
- Swap backoff: use LinearBackoff or ConstantBackoff, or provide your own RetryPolicy.
- Adjust visibility timeouts per worker to match job duration variance.
- Plug in a custom metrics observer to emit to Prometheus/OpenTelemetry.
- Namespace queues with the Redis Prefix to run multiple logical queues on one Redis.
Understanding these internals helps you tune concurrency, backoff, and visibility for your workload while trusting RDQ’s Lua-backed atomicity to keep the queue consistent under load and failure.