
RDQ - Reliable Distributed Queue

Distributed queues get tricky when you add multiple producers/consumers, retries, and failure handling. This post walks through RDQ's internal structure and control flow so you can reason about reliability, performance, and extension points.

Technologies: Go, Redis, Lua
Code: github.com/aashmanVerma/rdq

Core Building Blocks

  • Client (Redis adapter) — wraps Redis with typed helpers and loads Lua scripts for atomic claim/ack/requeue. See pkg/rdq/client.go, claim.go, ack.go, enqueue.go.
  • Registry — maps job names to handlers. See pkg/rdq/registry.go.
  • Worker Manager — orchestrates polling, concurrency, visibility timeouts, retries, and DLQ moves. See internal/worker/manager.go.
  • Metrics Observer — tracks enqueued/claimed/completed/failed/retried/DLQ and worker lifecycle. See internal/metrics.
  • Config — all knobs for Redis, worker behavior, metrics, and health checks live in pkg/rdq/options.go.

Data Model (Job)

Every job stored in Redis has:

type Job struct {
	ID          string
	Name        string
	Queue       string
	Payload     json.RawMessage
	Attempts    int
	MaxAttempts int
	CreatedAt   int64
	Meta        map[string]string
}

The job JSON is kept under jobKey(prefix, id), while queue scheduling uses sorted sets per queue.

Enqueue Path

  1. User calls Queue.Enqueue(queue, jobName, payload, opts...).
  2. EnqueueOptions (delay, max attempts, priority placeholder) are applied.
  3. The client serializes the job and runs a Redis pipeline:
    • SET jobKey -> job JSON
    • ZADD pendingKey(queue) score=now+delay member=jobID
  4. Metrics: JobEnqueued is recorded.

Key outcome: jobs are sorted by scheduled time; delayed jobs wait in the ZSET until eligible.

Worker Lifecycle

  1. Queue.Start marks the manager as running and kicks off the health and metrics loops.
  2. StartWorker(queue, opts...) builds a WorkerConfig (concurrency, visibility timeout, poll interval, retry policy).
  3. Each worker owns:
    • A ticker that fires every PollInterval.
    • A semaphore-limited goroutine pool of size Concurrency.

Claiming Work

On each tick, up to Concurrency goroutines call ClaimJob(queue, visibilityTimeout). The Lua script:

  • Pops the next eligible job ID from the pending ZSET (score <= now).
  • Writes the job into a “processing” ZSET with score = now + visibility timeout (so it can be reclaimed if the worker dies).
  • Returns the job payload.

Metrics: JobClaimed records claim latency.
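The script's effect can be modeled in plain Go over in-memory maps (the real version runs as Lua inside Redis, which is what makes it atomic):

```go
package main

import "fmt"

// claimJob models what the claim Lua script does: pop the earliest eligible
// ID from the pending ZSET (score <= now) and park it in the processing ZSET
// with a reclaim deadline of now + visibility.
func claimJob(pending, processing map[string]float64, now, visibility float64) (string, bool) {
	bestID, bestScore, found := "", 0.0, false
	for id, score := range pending {
		if score <= now && (!found || score < bestScore) {
			bestID, bestScore, found = id, score, true
		}
	}
	if !found {
		return "", false // nothing eligible yet
	}
	delete(pending, bestID)
	processing[bestID] = now + visibility // reclaimable after this deadline
	return bestID, true
}

func main() {
	pending := map[string]float64{"a": 100, "b": 500}
	processing := map[string]float64{}
	id, ok := claimJob(pending, processing, 200, 30)
	fmt.Println(id, ok, processing[id]) // "a" is eligible; reclaim deadline 230
}
```

In Redis, the pending-to-processing move happens in one script execution, so two workers can never claim the same job.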

Processing and Acknowledging

  1. The manager looks up the handler from the registry by job.Name.
  2. If missing, the job is moved to DLQ with reason “no handler found”.
  3. Handler runs; on success:
    • Ack removes the job from Redis (processing ZSET + job key).
    • Metrics: JobCompleted.

Failures, Retries, DLQ

If a handler returns an error:

  • Increment Attempts.
  • If Attempts >= MaxAttempts: move to DLQ with reason “max attempts reached”; metrics: JobMovedToDLQ.
  • Else:
    • Compute backoff via RetryPolicy.Backoff(attempt).
    • Re-enqueue the job with the same MaxAttempts.
    • Ack the current processing record so the retry is scheduled cleanly.
    • Metrics: JobRetried.

If the worker crashes or is slow:

  • The job stays in the processing ZSET until visibilityTimeout passes, then becomes claimable again — implementing at-least-once delivery.

Metrics and Health

  • Metrics observer tracks all queue/worker events; Queue.metricsLoop logs aggregates periodically when enabled.
  • Health loop (healthCheckLoop) pings Redis on a schedule and logs failures to surface infra issues early.

Configuration Knobs (practical defaults)

  • Redis: address, pool size, idle conns, retries, dial/read/write timeouts, key prefix.
  • Worker: concurrency, poll interval, visibility timeout, retry policy (exponential default), max attempts.
  • Enqueue: delay, per-job max attempts, priority placeholder.
  • Ops: enable metrics, metrics interval, health check period.

Control Flow Summary

  1. Enqueue -> job JSON stored, ID added to pending ZSET (delayed if needed).
  2. Poll -> workers tick, try to claim up to Concurrency jobs.
  3. Claim -> Lua atomically moves job to processing ZSET with visibility timeout.
  4. Process -> handler executes.
  5. Ack -> on success, remove from Redis; on failure, retry or DLQ.
  6. Recover -> if worker dies, visibility timeout expires and job is claimable again.

Extending RDQ

  • Swap backoff: use LinearBackoff or ConstantBackoff, or provide your own RetryPolicy.
  • Adjust visibility timeouts per worker to match job duration variance.
  • Plug in custom metrics observer to emit to Prometheus/OpenTelemetry.
  • Namespace queues with Redis Prefix to run multiple logical queues on one Redis.

Understanding these internals helps you tune concurrency, backoff, and visibility for your workload while trusting RDQ’s Lua-backed atomicity to keep the queue consistent under load and failure.