Overview
Every distributed system has inherent challenges. This section consolidates the hard problems encountered in the User Service design and their solutions.
Challenge Summary
| Area | Challenge | Solution | Trade-off |
|---|---|---|---|
| Data | Email uniqueness | GSI + conditional writes | Extra query before insert |
| Data | Primary email invariant | TransactWriteItems | Transaction overhead |
| Data | Concurrent updates | Optimistic locking | Retry on conflict |
| Auth | Cognito-DynamoDB sync | Triggers + compensation | Eventual consistency |
| Auth | Token revocation | Short TTL + status check | Extra DB read for sensitive ops |
| Events | Ordering | Timestamps + idempotency | Complexity in consumers |
| Events | Delivery guarantee | Outbox pattern | Additional infrastructure |
| Scale | GSI hot partitions | Monitor + shard if needed | Complexity if triggered |
Data Layer Challenges
Email Uniqueness
The Problem
DynamoDB has no native UNIQUE constraint. Two concurrent requests could add the same email address.
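One way to close this race is to let the write itself enforce uniqueness. The sketch below only builds the parameters for a DynamoDB `PutItem` call whose condition rejects the write when an item with the same key already exists. The table name `UserEmails`, the `EMAIL#` key format, and the helper name are assumptions for illustration, not the service's actual schema. With boto3, the dict would be passed as `client.put_item(**params)`.

```python
def build_unique_email_put(user_id: str, email: str, table: str = "UserEmails") -> dict:
    """Build PutItem parameters that fail if the email is already claimed.

    Hypothetical schema: one item per email, keyed by the normalized address.
    """
    normalized = email.strip().lower()  # treat Foo@x.com and foo@x.com as the same address
    return {
        "TableName": table,
        "Item": {
            "PK": {"S": f"EMAIL#{normalized}"},
            "userId": {"S": user_id},
        },
        # The write succeeds only when no item with this key exists yet.
        "ConditionExpression": "attribute_not_exists(PK)",
    }


params = build_unique_email_put("u-123", "  Alice@Example.com ")
```

If two requests race on the same address, DynamoDB accepts one write and rejects the other with `ConditionalCheckFailedException`, which the API can surface as a "duplicate email" error.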
Primary Email Invariant
The Problem
Exactly one email must be primary per user. Changing primary requires updating two records atomically.
The solution uses DynamoDB TransactWriteItems, which provides:
- All-or-nothing: If any operation fails, none are applied
- Condition checks: Can verify preconditions atomically
- Up to 100 items per transaction
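A minimal sketch of the primary-email swap as a single transaction follows. It only builds the request in the low-level DynamoDB format that boto3's `client.transact_write_items(**request)` accepts. The `USER#`/`EMAIL#` key layout, the `isPrimary` attribute, and the table name are assumptions, and the sketch assumes every email item carries a boolean `isPrimary`.

```python
def build_primary_swap(user_id: str, old_email: str, new_email: str,
                       table: str = "UserEmails") -> dict:
    """Build a TransactWriteItems request that moves the primary flag atomically."""

    def key(email: str) -> dict:
        # Hypothetical key layout: all of a user's emails share one partition.
        return {"PK": {"S": f"USER#{user_id}"}, "SK": {"S": f"EMAIL#{email}"}}

    flags = {":t": {"BOOL": True}, ":f": {"BOOL": False}}
    return {
        "TransactItems": [
            {   # Demote the current primary, but only if it is still primary.
                "Update": {
                    "TableName": table,
                    "Key": key(old_email),
                    "UpdateExpression": "SET isPrimary = :f",
                    "ConditionExpression": "isPrimary = :t",
                    "ExpressionAttributeValues": flags,
                }
            },
            {   # Promote the new address, but only if it exists and is not primary.
                "Update": {
                    "TableName": table,
                    "Key": key(new_email),
                    "UpdateExpression": "SET isPrimary = :t",
                    "ConditionExpression": "attribute_exists(PK) AND isPrimary = :f",
                    "ExpressionAttributeValues": flags,
                }
            },
        ]
    }


request = build_primary_swap("u-123", "old@example.com", "new@example.com")
```

If either condition fails, DynamoDB cancels the whole transaction, so the user never ends up with zero or two primary emails.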
Concurrent Updates
The Problem
Two requests update the same user simultaneously, causing lost updates.
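The optimistic-locking approach from the summary table can be illustrated without DynamoDB. The sketch below uses an in-memory stand-in whose `update` mimics a conditional write on a `version` attribute; the class, function, and attribute names are illustrative assumptions rather than the service's code.

```python
class ConditionalCheckFailed(Exception):
    """Raised when the stored version differs from the version the caller read."""


class InMemoryUsers:
    """Stand-in for a table that supports version-conditioned updates."""

    def __init__(self):
        self._items = {}

    def put_new(self, user_id, attrs):
        self._items[user_id] = {**attrs, "version": 1}

    def get(self, user_id):
        return dict(self._items[user_id])  # return a copy, like a read from the table

    def update(self, user_id, changes, expected_version):
        current = self._items[user_id]
        if current["version"] != expected_version:
            raise ConditionalCheckFailed(f"expected v{expected_version}, found v{current['version']}")
        self._items[user_id] = {**current, **changes, "version": expected_version + 1}


def update_with_retry(store, user_id, mutate, max_attempts=3):
    """Re-read and re-apply the change whenever another writer got there first."""
    for _ in range(max_attempts):
        snapshot = store.get(user_id)
        try:
            store.update(user_id, mutate(snapshot), snapshot["version"])
            return store.get(user_id)
        except ConditionalCheckFailed:
            continue
    raise ConditionalCheckFailed(f"gave up after {max_attempts} attempts")


store = InMemoryUsers()
store.put_new("u1", {"name": "Ann", "city": "Oslo"})
stale = store.get("u1")                                   # both writers read version 1
store.update("u1", {"city": "Bergen"}, stale["version"])  # first writer wins
try:
    store.update("u1", {"name": "Anna"}, stale["version"])  # second writer is rejected
    conflict_detected = False
except ConditionalCheckFailed:
    conflict_detected = True
result = update_with_retry(store, "u1", lambda user: {"name": "Anna"})
```

Here the retry happens on the server side; the same check also supports returning a 409 and letting the client refresh and retry.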
Authentication Challenges
Cognito-DynamoDB Sync
The Problem
A user registers in Cognito, but creation of the DynamoDB record fails. The user then exists in Cognito but not in the application.
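The "triggers + compensation" approach from the summary table could look roughly like this. The sketch assumes a Cognito post-confirmation Lambda trigger; `create_user_record` and `enqueue_for_retry` are hypothetical callables injected by the caller, and `find_orphans` stands in for the comparison step of a reconciliation job.

```python
def make_post_confirmation_handler(create_user_record, enqueue_for_retry):
    """Build a trigger handler that never loses a confirmed user silently."""

    def handler(event, context=None):
        attrs = event["request"]["userAttributes"]
        record = {"sub": attrs["sub"], "email": attrs.get("email")}
        try:
            create_user_record(record)
        except Exception as exc:
            # Compensation: park the record so a retry or reconciliation job can create it.
            enqueue_for_retry({"record": record, "error": str(exc)})
        return event  # Cognito triggers return the event object they received

    return handler


def find_orphans(cognito_subs, dynamo_subs):
    """Return Cognito user IDs that have no matching application record."""
    return sorted(set(cognito_subs) - set(dynamo_subs))
```

Catching the failure inside the trigger keeps the sign-up flow from surfacing an error to the user, at the cost of a short window in which the application record is missing.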
Token Revocation
The Problem
JWT access tokens cannot be revoked. A suspended user’s token remains valid until expiry.
1. Short TTL: Access tokens expire in 1 hour, limiting the exposure window
2. Status Check: For sensitive operations, verify user status in DynamoDB
3. Global Sign-Out: Invalidate all refresh tokens when a user is suspended
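The status check for sensitive operations can be sketched as a small authorization function. It assumes the JWT signature has already been verified upstream; the operation names, the `ACTIVE` status value, and the `get_user_status` lookup are hypothetical.

```python
import time

# Hypothetical set of operations that justify the extra database read.
SENSITIVE_OPERATIONS = {"change_email", "delete_account"}


def is_request_allowed(claims, operation, get_user_status, now=None):
    """Decide whether a request with already-verified JWT claims may proceed."""
    now = time.time() if now is None else now
    if claims["exp"] <= now:
        return False  # expired tokens are always rejected
    if operation in SENSITIVE_OPERATIONS:
        # Extra read: a suspended user is blocked even though the token is still valid.
        return get_user_status(claims["sub"]) == "ACTIVE"
    return True  # non-sensitive operations rely on the short token TTL alone
```

Non-sensitive requests skip the lookup, which is the trade-off named in the summary table: the extra DB read is paid only where a stale token would do real damage.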
Event Layer Challenges
Event Ordering
The Problem
Events may arrive out of order. user.updated could arrive before user.created.
- Use SQS FIFO queue with MessageGroupId = userId
- Trade-off: Lower throughput (3,000 messages/second with batching)
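Where a FIFO queue is not used, the "timestamps + idempotency" approach from the summary table pushes the work into consumers. The sketch below assumes each event carries a unique `id`, a numeric `timestamp`, and a full `state` snapshot of the user; these field names are assumptions made for illustration.

```python
class UserProjection:
    """Consumer-side view that tolerates duplicate and out-of-order events."""

    def __init__(self):
        self.seen_event_ids = set()
        self.users = {}  # user_id -> (timestamp, state)

    def apply(self, event) -> bool:
        """Apply the event; return False if it was a duplicate or stale."""
        if event["id"] in self.seen_event_ids:
            return False  # idempotency: the same delivery arrived twice
        self.seen_event_ids.add(event["id"])
        current = self.users.get(event["userId"])
        if current is not None and current[0] >= event["timestamp"]:
            return False  # ordering: a newer snapshot has already been applied
        self.users[event["userId"]] = (event["timestamp"], event["state"])
        return True


projection = UserProjection()
updated = {"id": "e2", "type": "user.updated", "userId": "u1", "timestamp": 2, "state": {"name": "Ann B"}}
created = {"id": "e1", "type": "user.created", "userId": "u1", "timestamp": 1, "state": {"name": "Ann"}}
outcomes = [projection.apply(updated), projection.apply(created), projection.apply(updated)]
```

The late `user.created` and the redelivered `user.updated` are both ignored, so the projection keeps the newest snapshot. This only works when events carry full state; delta-style events would need buffering or a FIFO queue instead.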
Guaranteed Delivery
The Problem
DynamoDB write succeeds, but EventBridge publish fails. Event is lost.
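The outbox pattern listed in the summary table addresses this by recording the event together with the data change and publishing it separately. The in-memory sketch below illustrates the idea only; `OutboxStore`, `relay_pending`, and the flaky publisher are hypothetical, and in DynamoDB the combined write would have to be a single transaction.

```python
class OutboxStore:
    """Stand-in for a table holding both user records and pending events."""

    def __init__(self):
        self.users = {}
        self.outbox = []

    def save_user_with_event(self, user, event):
        # Stands in for one atomic write: the user and the event are stored together.
        self.users[user["id"]] = user
        self.outbox.append({"event": event, "published": False})


def relay_pending(store, publish):
    """Publish unpublished events in order; stop at the first failure."""
    published = 0
    for entry in store.outbox:
        if entry["published"]:
            continue
        try:
            publish(entry["event"])
        except Exception:
            break  # the entry stays pending and is retried on the next run
        entry["published"] = True
        published += 1
    return published


attempts, delivered = [], []

def flaky_publish(event):
    attempts.append(event)
    if len(attempts) == 1:
        raise ConnectionError("simulated publish failure")
    delivered.append(event)


store = OutboxStore()
store.save_user_with_event({"id": "u1"}, {"id": "e1", "type": "user.created"})
first_run = relay_pending(store, flaky_publish)   # publish fails, event stays pending
second_run = relay_pending(store, flaky_publish)  # retry succeeds
```

A relay like this delivers at least once: a crash after publishing but before marking the entry would cause a duplicate, which is why consumers still need idempotency.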
Failure Mode Analysis
A comprehensive view of what can go wrong, how we detect it, and how we recover.
| Failure | Detection | Impact | Recovery | RTO |
|---|---|---|---|---|
| Cognito trigger fails | CloudWatch Lambda errors, orphan reconciliation job | User in Cognito but not DynamoDB | DLQ retry + daily reconciliation | Minutes to 24h |
| DynamoDB throttling | ConsumedCapacity metrics, 5xx errors | API requests fail | Auto-scaling (if provisioned) or on-demand absorbs | Seconds |
| DynamoDB unavailable | API errors, health checks | Full service outage | Wait for AWS recovery, no manual action | AWS SLA |
| EventBridge delivery fails | DLQ depth > 0 | Downstream services stale | Manual/automatic DLQ replay | Minutes |
| Lambda cold start spike | Duration p99 increase | Latency spike for users | Provisioned concurrency or wait | Seconds |
| Cognito unavailable | Auth failures, 503 errors | No new logins, existing tokens work | Wait for AWS recovery | AWS SLA |
| Concurrent update conflict | 409 responses, ConditionalCheckFailed | User must retry | Client refresh + retry | Immediate |
| Email uniqueness race | ConditionalCheckFailed logs | Duplicate prevented, user sees error | User retries with different email | Immediate |
| Rate limit exceeded | 429 responses | User temporarily blocked | Wait for window reset | Minutes |
| Invalid JWT | 401 responses | Request rejected | Client re-authenticates | Immediate |
Blast Radius Analysis
Recovery Runbooks
High DLQ Depth
Symptoms: CloudWatch alarm for DLQ depth > 0
Steps:
- Check DLQ messages for error patterns
- If transient (network, throttle): Redrive messages to source queue
- If persistent (code bug): Fix code, deploy, then redrive
- Monitor for successful processing
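The redrive step can be scripted. The sketch below assumes the DLQ is an SQS queue configured as the dead-letter queue of another SQS queue, which is the case the SQS `StartMessageMoveTask` API supports; the helper name and ARN are illustrative. The client is passed in so the function can be exercised without AWS access; in practice it would be `boto3.client("sqs")`.

```python
def redrive_dlq(sqs_client, dlq_arn, max_per_second=None):
    """Start moving messages from a DLQ back to their source queue."""
    params = {"SourceArn": dlq_arn}
    if max_per_second is not None:
        # Optional throttle so the redrive does not overwhelm the consumer.
        params["MaxNumberOfMessagesPerSecond"] = max_per_second
    # With no DestinationArn, SQS returns messages to the queue they came from.
    response = sqs_client.start_message_move_task(**params)
    return response["TaskHandle"]
```

For a persistent failure, run this only after the fix is deployed; otherwise the messages land back in the DLQ.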
Orphaned Cognito Users
Symptoms: Users report they registered but can’t access the app
Steps:
- Check CloudWatch for post-confirmation trigger errors
- Query DynamoDB for user by Cognito sub
- If missing: Manually create DynamoDB record or trigger reconciliation job
- Investigate root cause (DynamoDB throttling, code bug)
Mass Token Revocation Needed
Symptoms: Security incident requiring immediate logout of all users
Steps:
- Cognito: AdminUserGlobalSignOut for affected users (invalidates refresh tokens)
- Access tokens remain valid until expiry (1 hour)
- For immediate block: Deploy Lambda change to check user status on every request
- Consider reducing access token TTL for future incidents
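The sign-out step of this runbook can be scripted for a list of affected users. The sketch takes the Cognito client as a parameter (in practice `boto3.client("cognito-idp")`) and calls its `admin_user_global_sign_out` operation; the helper name and the way failures are collected are assumptions.

```python
def sign_out_users(cognito_client, user_pool_id, usernames):
    """Globally sign out each user; return (username, error) pairs for failures."""
    failed = []
    for username in usernames:
        try:
            cognito_client.admin_user_global_sign_out(
                UserPoolId=user_pool_id,
                Username=username,
            )
        except Exception as exc:
            # Keep going so one bad username does not block the rest of the batch.
            failed.append((username, str(exc)))
    return failed
```

This invalidates refresh tokens only. Already-issued access tokens keep working until they expire, so the status check is still needed for an immediate block.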
What Makes This “Good”
Defense in Depth
Multiple layers of protection: JWT validation, status checks, conditional writes
Explicit Trade-offs
Each solution documents what we gain and what we sacrifice
Failure Handling
Every failure mode has a recovery path: retries, DLQ, compensation
Observable
Structured logging, tracing, and metrics at every decision point
Questions to Ask
When reviewing this design, consider:
- What’s the blast radius? If X fails, what else breaks?
- Can we recover? For every failure, is there a path back to consistency?
- What’s the latency impact? Extra DB reads, transaction overhead, network hops
- Is it worth it? Does the complexity match the business criticality?