System Design: Multi-Tenant URL Shortener with Organization Namespaces
From go/link to go/<company>/link: A Staff Engineer's Guide to Enterprise Short Links
1. The Problem & Why It's Hard
"Design a URL shortener" is perhaps the most common system design interview question. It's so common that many candidates walk in with a memorized answer: hash the URL, store it in a database, redirect on lookup. Done in 15 minutes, right?
Wrong. The interviewer doesn't want you to design bit.ly. They want to see how you handle the hidden complexity underneath a deceptively simple interface.
The interviewer's real question: Can you design a system that's trivially simple on the surface (a key → value lookup) but must handle multi-tenancy, namespace isolation, organizational access control, and extreme read-heavy traffic, all while maintaining sub-10ms redirect latency at millions of requests per second?
The multi-tenant variant, where organizations get their own namespace like go/<company>/docs, transforms a freshman data-structures problem into a staff-level design challenge. Now you need:
Namespace isolation: go/acme/roadmap and go/globex/roadmap are completely different links owned by different organizations
Permission boundaries: An engineer at Acme must not be able to read, create, or discover Globex's links
Tenant-aware caching: A cache miss for Org A must not serve Org B's data
Organizational URL routing: The system must resolve go/<org>/<slug> in a single DNS hop, not a chain of redirects
Staff+ Signal: The real challenge isn't URL shortening; it's building a multi-tenant namespace that behaves like a private knowledge graph per organization while sharing infrastructure for cost efficiency. Go links are closer to an internal knowledge management system than a URL shortener. At Google, go links became so embedded in company culture that losing them would be equivalent to losing internal search.
A Brief History: Why Go Links Matter
In 2005, Eric DeFriez on Google's SysOps team built the first go/ link system. He aliased go to goto.google.com in internal DNS, and built a simple URL shortener behind it. Within weeks of a soft launch, it leaked to the internal "misc" mailing list and usage shot up 100x in 24 hours. Within months: 20,000 shortcuts and 10,000 redirects/day.
The system was later rewritten by Chris Heiser to run on Google standard infrastructure (Bigtable + Borg). By 2010, every Googler used go links daily. When Xooglers left for companies like LinkedIn, Twitter, Stripe, Airbnb, and Netflix between 2012 and 2015, they brought the go link culture with them, often building internal clones within their first weeks.
This history reveals something important: go links are organizational knowledge infrastructure, not just convenience shortcuts. They encode institutional knowledge ("go/oncall" always points to the current on-call rotation) and survive employee turnover. The system design must treat them accordingly.
2. Requirements & Scope
Functional Requirements
Create short link: POST /api/v1/links → creates go/<org>/<slug> → <destination_url>
Redirect: GET /go/<org>/<slug> → HTTP 301/302 to the destination URL
Organization namespaces: Each organization owns an isolated namespace. go/acme/docs and go/globex/docs resolve independently
Global links: Platform-level links like go/help, go/status that work across all orgs
Custom slugs: Human-readable slugs (not random hashes). go/acme/q4-roadmap, not go/acme/a3Fk9
Link management: Update destination, transfer ownership, set expiration, view analytics
Search: Search links within your organization by slug or destination URL
Parameterized links: go/acme/jira/%s → https://acme.atlassian.net/browse/%s (variable substitution)
SSO integration: Authenticate via the organization's IdP (Okta, Azure AD, Google Workspace)
RBAC: Organization admins, link creators, read-only viewers
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Redirect latency (p99) | < 10 ms | Go links replace typing URLs; must feel instant |
| Create latency (p99) | < 200 ms | Interactive creation from browser extension |
| Redirect availability | 99.99% (< 52 min/year) | Broken go links block entire workflows |
| Read:write ratio | ~1000:1 | Links are created once, resolved millions of times |
| Concurrent orgs | 10,000+ | Multi-tenant SaaS serving thousands of companies |
| Links per org | Up to 500K | Large enterprises have deep link libraries |
| Redirects/second (peak) | 500K | All orgs combined during business hours |
Scale Estimation (Back-of-Envelope)
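A quick sketch of the arithmetic, using the numbers from the requirements table. The average-links-per-org and bytes-per-link figures are assumptions chosen to be consistent with the totals quoted later in this doc (100M links, ~50GB cache):

```python
# Back-of-envelope scale estimation.
# ASSUMPTIONS: average of 10K links/org (well under the 500K cap) and
# ~500 bytes per link record (slug + URL + metadata).
ORGS = 10_000
AVG_LINKS_PER_ORG = 10_000
BYTES_PER_LINK = 500
PEAK_REDIRECTS_PER_SEC = 500_000
READ_WRITE_RATIO = 1_000

total_links = ORGS * AVG_LINKS_PER_ORG            # 100M links
cache_bytes = total_links * BYTES_PER_LINK        # full dataset in Redis
writes_per_sec = PEAK_REDIRECTS_PER_SEC / READ_WRITE_RATIO

print(f"total links:     {total_links:,}")        # 100,000,000
print(f"cache size:      {cache_bytes / 1e9:.0f} GB")  # ~50 GB
print(f"peak writes/sec: {writes_per_sec:.0f}")   # ~500
```

The takeaway: the write path is tiny (~500/sec) and the whole dataset fits in a single Redis cluster's memory, which is what makes the "cache everything" strategy later in this doc viable.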
3. Phase 1: Single Machine Solution
For a single organization with a few thousand links, the solution is almost trivially simple.
Implementation Sketch
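A minimal single-machine sketch, using SQLite as a stand-in for the durable store and an in-process LRU dict as the hot cache. Class and table names here are illustrative, not from any real implementation:

```python
import sqlite3
from collections import OrderedDict

class ShortLinkStore:
    """Phase 1: one org, one process, DB + in-memory LRU cache."""

    def __init__(self, cache_size=10_000):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE links (slug TEXT PRIMARY KEY, url TEXT)")
        self.cache = OrderedDict()   # LRU: most recently used at the end
        self.cache_size = cache_size

    def create(self, slug, url):
        self.db.execute("INSERT INTO links VALUES (?, ?)", (slug, url))

    def resolve(self, slug):
        if slug in self.cache:                 # hot path: < 1ms
            self.cache.move_to_end(slug)
            return self.cache[slug]
        row = self.db.execute(
            "SELECT url FROM links WHERE slug = ?", (slug,)).fetchone()
        if row is None:
            return None                        # would render as HTTP 404
        self.cache[slug] = row[0]              # populate cache on miss
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)     # evict least recently used
        return row[0]

store = ShortLinkStore()
store.create("roadmap", "https://docs.example.com/roadmap")
print(store.resolve("roadmap"))  # https://docs.example.com/roadmap
```

Everything after this point in the doc is about what happens when this single process is no longer enough.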
When Does Phase 1 Work?
Single organization, < 10K links
< 100 redirects/second
One server can handle everything
In-memory LRU cache gives < 1ms redirect latency for hot links
When Does Phase 1 Fail?
See next section.
4. Why Naive Fails (The Math)
The naive single-machine solution breaks in three dimensions simultaneously:
Dimension 1: Multi-Tenancy Breaks Everything
Dimension 2: Read Volume Exceeds Single Machine
Dimension 3: Availability Requirements
Staff+ Signal: The availability requirement for go links is higher than for most services because they're used during incident response. If go/runbook doesn't resolve, your MTTR increases. This creates a bootstrap problem: the go link service must be one of the last things to go down and one of the first things to come up. Design it with minimal external dependencies.
| Bottleneck | Single-machine limit | Distributed fix |
|---|---|---|
| Read throughput | ~10K redirects/sec | Redis Cluster + read replicas |
| Tenant isolation | Shared process memory | Per-tenant cache partitions + row-level security |
| Availability | Single point of failure | Multi-node with health checks + failover |
| Cache pollution | One big LRU | Tenant-aware cache with per-org eviction policies |
The tipping point: Once you serve more than ~50 organizations or exceed ~10K redirects/sec, you need distributed caching, tenant isolation at the data layer, and redundancy.
5. Phase 2+: Distributed Architecture
The key architectural insight: A multi-tenant URL shortener is a tenant-namespaced key-value store with an HTTP redirect interface. The redirect path (read) must be as close to a single cache lookup as possible. The management path (write) handles all the complexity (auth, RBAC, namespacing, analytics) and can tolerate higher latency.
How Real Companies Built This
Google (go/ links β 2005-present)
Google's internal go link system was the original. Key architectural choices:
DNS-based routing: go resolves to goto.google.com via internal DNS. On external networks, a BeyondCorp Chrome extension handles the redirection and authentication.
Bigtable backend: Originally simple, the system was rewritten to run on Bigtable + Borg to handle Google-scale usage (tens of thousands of links, millions of redirects daily).
First-come-first-served namespace: No per-team namespacing; all of Google shares one flat namespace. This works for a single company but is the exact problem multi-tenant systems must solve.
What they learned: Go links became so embedded in culture that they're essential infrastructure. Slides, docs, chat messages, and even printed posters reference go links. The system must have near-100% availability.
Bitly (2008-present, 6B decodes/month)
Bitly handles ~6 billion link redirects per month at enterprise scale:
MySQL β Bigtable migration: In 2023, Bitly migrated 80 billion rows of link data from MySQL to Cloud Bigtable. The migration completed in 6 days using concurrent Go scripts processing 26TB of data. They chose Bigtable for on-demand scaling without relational semantics overhead.
Stream-based architecture: Bitly uses an asynchronous processing pipeline where each step performs a single logical transformation on event data β link creation, click tracking, and analytics flow through separate stages.
Multi-tenant via branded domains: Bitly's enterprise product assigns custom domains per customer (e.g., yourcompany.link/xxx), achieving namespace isolation at the DNS level rather than the URL path level.
Short.io (War Story: October 2025 AWS Outage)
Short.io's post-mortem from October 2025 is a cautionary tale for URL shortener design:
Root cause: AWS us-east-1 DNS outage made DynamoDB inaccessible across the entire region. All short links went down.
Cascading failure: Their status page (Atlassian Statuspage) and customer support (Intercom) were both hosted in us-east-1, so they couldn't even communicate the outage.
Key lesson: They had a Cassandra fallback, but it wasn't operationally available during the incident. "We believed such an outage was highly unlikely"; then it happened.
Post-incident changes: Secondary status page on a different provider, Cassandra as active fallback for DynamoDB, regional autonomy for Frankfurt and Sydney clusters.
Staff+ Signal: Short.io's outage reveals a pattern: infrastructure dependencies for URL shorteners create an amplified blast radius. When your go links are down, every other system that's referenced by go links becomes harder to access. Design the redirect path with zero external dependencies beyond your own cache layer. The management path can depend on external auth, but redirects must work even when your IdP is down.
Key Data Structure: Tenant-Namespaced Key-Value
The central abstraction is a composite key: (org_id, slug) → destination_url
6. Core Component Deep Dives
6.1 Redirect Service (The Hot Path)
The redirect service is the most performance-critical component. It handles 99.9% of all traffic and must resolve in < 10ms.
Responsibilities:
Parse the incoming URL: extract org_slug and link_slug from the path
Look up the destination in Redis (primary) or PostgreSQL (fallback)
Handle parameterized links: substitute %s with path parameters
Return an HTTP 301 (permanent) or 302 (temporary) redirect
Emit an analytics event asynchronously
Request Flow:
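The steps above can be sketched as a single function. Redis and the PostgreSQL replica are stand-in dicts here, and the key format follows the composite-key convention used throughout this doc; function names are illustrative:

```python
def resolve_redirect(path, cache, db_replica):
    """Hot path: parse /go/<org>/<slug>, cache-first lookup, 302 redirect."""
    parts = path.strip("/").split("/", 2)        # ["go", org, slug...]
    if len(parts) < 3 or parts[0] != "go":
        return (404, None)
    org_slug, link_slug = parts[1], parts[2]
    key = f"link:{org_slug}:{link_slug}"
    dest = cache.get(key)                        # 1. Redis primary (~1ms)
    if dest is None:
        dest = db_replica.get((org_slug, link_slug))  # 2. PG replica (~10ms)
        if dest is not None:
            cache[key] = dest                    # repopulate on fallback hit
    if dest is None:
        return (404, None)
    return (302, dest)                           # 302: destinations can change

cache = {"link:acme:docs": "https://docs.acme.test/home"}
db = {("acme", "wiki"): "https://wiki.acme.test/"}
print(resolve_redirect("/go/acme/docs", cache, db))  # cache hit
print(resolve_redirect("/go/acme/wiki", cache, db))  # fallback, then cached
```

Note there is no auth call and no synchronous analytics write anywhere in this function; that is the whole point of the hot path.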
Design decisions:
No auth on redirect path: The redirect service does NOT authenticate requests. This is deliberate: it eliminates a network hop to the auth service and removes a failure dependency. If a link should be private, the destination itself should be behind auth (the linked Google Doc, Notion page, etc. has its own access control).
301 vs 302: Use 302 (temporary redirect) by default. 301 (permanent) gets cached by the browser, which means you can't update the destination later. Some orgs may opt into 301 for performance.
Stateless: The redirect service holds no state. Any instance can handle any request. Scales horizontally by adding pods.
Staff+ Signal: The decision to not authenticate on the redirect path is controversial but correct for internal go links. Authentication adds 5-20ms latency per redirect and creates a dependency on your IdP. If Okta goes down, no one can resolve go links. Instead, treat go links as "discoverable but not secret", the same model as internal DNS. If a link must be private, the destination system (Notion, Google Docs, Jira) handles access control. This separates the concerns cleanly and keeps the redirect path dependency-free.
6.2 Management Service (The Cold Path)
Responsibilities:
CRUD operations on links (create, read, update, delete)
Slug uniqueness enforcement within org namespace
RBAC enforcement (who can create/edit/delete links)
Search (find links by slug prefix, destination domain, or tags)
Organization administration (invite members, assign roles)
Cache invalidation (push updates to Redis when links change)
RBAC Model:
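One possible role → permission mapping, consistent with the admin/creator/viewer roles listed in the requirements. The exact role names and permission sets are assumptions:

```python
# Illustrative RBAC table: each role maps to a set of allowed actions.
PERMISSIONS = {
    "viewer": {"read", "search"},
    "editor": {"read", "search", "create", "update_own"},
    "admin":  {"read", "search", "create", "update_own",
               "update_any", "delete", "manage_members"},
}

def can(role, action):
    """Permission check enforced by the management service, never the
    redirect service (redirects are unauthenticated by design)."""
    return action in PERMISSIONS.get(role, set())

print(can("viewer", "create"))  # False
print(can("admin", "delete"))   # True
```

In practice the role comes from the user's org membership row, which SCIM provisioning keeps in sync with the IdP.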
6.3 Parameterized Links (go/acme/jira/%s)
Parameterized links are a killer feature that transforms go links from simple redirects into a lightweight internal API gateway:
Resolution logic:
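A sketch of template resolution: the cache stores the template verbatim (with its %s), and the redirect service substitutes the remaining path segments at request time. The matching strategy (longest template prefix wins) is an assumption:

```python
def resolve_parameterized(org, path, templates):
    """templates maps (org, template_slug) -> destination template.
    'jira/PROJ-123' matches the cached template 'jira/%s'."""
    segments = path.split("/")
    # Try the longest prefix first so 'jira/sprint/%s' beats 'jira/%s'.
    for cut in range(len(segments) - 1, 0, -1):
        key = (org, "/".join(segments[:cut]) + "/%s")
        remainder = "/".join(segments[cut:])
        if key in templates:
            return templates[key].replace("%s", remainder, 1)
    # No template matched: fall back to a plain, non-parameterized link.
    return templates.get((org, path))

templates = {
    ("acme", "jira/%s"): "https://acme.atlassian.net/browse/%s",
    ("acme", "docs"): "https://docs.acme.test/home",
}
print(resolve_parameterized("acme", "jira/PROJ-123", templates))
# https://acme.atlassian.net/browse/PROJ-123
```

Note that only the template lives in the cache; the substituted URL is computed per request and never stored, which is what keeps the cache key space finite.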
Staff+ Signal: Parameterized links create a subtle caching challenge. go/acme/jira/PROJ-123 and go/acme/jira/PROJ-456 both resolve via the template go/acme/jira/%s, but you can't cache the final resolved URL (the key space is infinite). Instead, cache the template (link:acme:jira/%s → https://acme.atlassian.net/browse/%s) and resolve parameters at the redirect service level. This keeps the cache finite while supporting infinite parameter combinations.
6.4 Cache Invalidation Strategy
When a link is updated or deleted, the cache must be invalidated. With Redis Cluster across multiple regions, this requires careful coordination.
Invalidation approaches:
| Approach | Invalidation latency | Consistency | Complexity |
|---|---|---|---|
| Delete on write (sync) | ~5ms | Strong | Low |
| Publish-subscribe (async) | ~50-200ms | Eventual | Medium |
| TTL-based expiry | Up to TTL | Weak | Lowest |
| Hybrid (recommended) | ~5ms local, ~200ms cross-region | Eventual with fast local | Medium |
The recommended approach: synchronously delete from the local Redis on write, then publish an invalidation event to Kafka for cross-region propagation. Set a TTL of 1 hour as a safety net for any missed invalidations.
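The hybrid flow can be sketched in a few lines. Kafka and the two regional Redis instances are stand-ins (a list and dicts), and the function names are illustrative:

```python
def update_link(org, slug, new_url, db, local_cache, event_bus):
    key = f"link:{org}:{slug}"
    db[(org, slug)] = new_url                 # 1. write source of truth
    local_cache.pop(key, None)                # 2. sync local delete (~5ms)
    event_bus.append({"type": "invalidate", "key": key})  # 3. async fan-out

def consume_invalidation(event, remote_cache):
    """Cross-region consumer; applies the delete ~50-200ms later.
    A 1-hour TTL on every cache entry is the safety net if this is missed."""
    remote_cache.pop(event["key"], None)

db = {}
local = {"link:acme:docs": "https://old.acme.test"}
remote = {"link:acme:docs": "https://old.acme.test"}
bus = []

update_link("acme", "docs", "https://new.acme.test", db, local, bus)
consume_invalidation(bus[0], remote)
```

After the update, both regions serve a cache miss for this key and repopulate from the fresh database value on the next redirect.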
6.5 Auth & SSO Integration
Each organization connects their identity provider (IdP) for single sign-on:
SCIM provisioning handles the lifecycle:
When a user joins Acme in Okta → automatically provisioned with the "viewer" role
When a user leaves Acme in Okta → automatically deprovisioned; their created links are transferred to a backup owner
Staff+ Signal: SCIM deprovisioning is the security-critical path most teams forget. When an employee leaves, their go links shouldn't die; they should be transferred to a designated owner or team. But their ability to create or edit links must be revoked immediately. The deprovisioning webhook from the IdP must be processed synchronously (block the response until access is revoked), not queued asynchronously; otherwise there's a window where a terminated employee can still modify links.
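A sketch of a deprovisioning handler following that ordering: revoke first, transfer ownership second, and only then acknowledge the IdP. All data structures and names here are illustrative stand-ins:

```python
def handle_scim_deprovision(user_id, org, sessions, memberships, links,
                            backup_owner):
    """Synchronous SCIM deprovision: revoke before returning 200."""
    # 1. Revoke immediately: kill sessions and org membership so the
    #    terminated user can no longer create or edit links.
    sessions.pop(user_id, None)
    memberships[org].discard(user_id)
    # 2. Transfer ownership so the links themselves keep working.
    for link in links:
        if link["owner"] == user_id:
            link["owner"] = backup_owner
    # 3. Only now acknowledge the IdP webhook.
    return {"status": 200}

sessions = {"u42": "session-token"}
memberships = {"acme": {"u42", "u7"}}
links = [{"slug": "oncall", "owner": "u42"}]
resp = handle_scim_deprovision("u42", "acme", sessions, memberships, links,
                               backup_owner="u7")
print(resp, links[0]["owner"])
```

If step 2 were queued asynchronously the links would still transfer eventually, but step 1 must complete inside the webhook request to close the access window.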
7. The Scaling Journey
Stage 1: Single Tenant MVP (1 org, ~100 redirects/sec)
Single server, single database
In-process LRU cache for hot links
Simple API key auth
Limit: Server crash = cache lost, single point of failure
Stage 2: Multi-Tenant with Redis (~100 orgs, ~10K redirects/sec)
New capabilities at this stage:
Redis as shared cache (survives server restarts)
Multiple redirect service instances behind a load balancer
Tenant isolation via composite cache key: link:{org_slug}:{slug}
SSO integration (SAML/OIDC) for organizational auth
Limit: A single Redis instance maxes out at ~100K ops/sec, and a single PostgreSQL instance serves all fallback reads.
Stage 3: Enterprise Scale (~5,000 orgs, ~200K redirects/sec)
New capabilities at this stage:
Redis Cluster with 3 shards (hash by org_slug for tenant locality)
Kafka for async analytics and cache invalidation
Elasticsearch for link search within orgs
Per-org rate limiting at the API gateway
SCIM provisioning for automated user lifecycle
Limit: Single-region deployment. Cross-continent latency for global orgs.
Staff+ Signal: At this stage, you need a dedicated on-call rotation. The most common alert? Cache hit rate dropping below 95% for a specific org β usually because they're bulk-importing links or running a migration. The runbook should include: (1) check if it's a single noisy tenant, (2) if yes, enable per-org rate limiting on the management path, (3) pre-warm their cache from a DB read replica. Do not let one tenant's bulk operation degrade redirect latency for all tenants.
Stage 4: Global Scale (~10,000+ orgs, ~500K redirects/sec)
Multi-region deployment with per-region Redis clusters and cross-region replication.
Critical design decision: Where do writes go?
All writes go to the primary region (US-East) for PostgreSQL
Redis cache is populated per-region from read replicas
A link created by an engineer in London is written to US-East, then replicated to EU-West Redis within ~200ms
For redirects, each region serves from its local Redis; no cross-region hops on the read path
8. Failure Modes & Resilience
Failure Scenarios
| Failure | Detection | Recovery | Blast radius | User experience |
|---|---|---|---|---|
| Single redirect pod crash | K8s health probe (5s) | Auto-restart, LB routes around | Zero; other pods serve | Invisible |
| Redis node failure | Sentinel/Cluster failover (< 5s) | Replica promoted | 1/N of cached links briefly slower | ~5s of cache misses hitting DB |
| Redis cluster total failure | Health check cascade | Fall back to PostgreSQL read replicas | All redirects hit DB (10-20ms vs 1ms) | Slower but functional |
| PostgreSQL primary failure | Replication lag monitor | Promote replica (< 30s) | No new link creation for ~30s | Creates fail, redirects work (cache) |
| Auth/IdP outage (Okta down) | Auth service health check | Extend existing session TTLs | Management path blocked | Can't create/edit links, redirects work |
| Full region failure | Cross-region health check | GeoDNS failover to secondary region | Minutes of elevated latency | Links work, slightly slower |
The Critical Insight: Redirect Path Independence
The redirect path has exactly two dependencies: Redis (primary) and PostgreSQL read replica (fallback). It does NOT depend on:
Auth service (no authentication on redirects)
External IdP (Okta, Azure AD)
Kafka (analytics are async, can be buffered)
Management service
Elasticsearch
This means: even if your entire management plane is down, users can still resolve go links.
Staff+ Signal: Design the redirect path to survive the failure of everything else. The Short.io post-mortem showed what happens when URL shorteners have too many dependencies: their status page, support tools, and primary data store all shared a blast radius. For go links specifically, the redirect path should have a degradation hierarchy: (1) Redis up → full speed; (2) Redis down → PostgreSQL read replica at 10x latency; (3) both down → serve from a local cache snapshot file baked into the container image, updated hourly. Level 3 serves stale data but never returns a 500.
Cache Warming Strategy
Cold start (new region, new Redis node, or Redis flush) is the most dangerous period:
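One way to warm a cold cache: load the hottest links first (ranked by recent clicks) so the hit rate recovers fastest, writing in batches as a real implementation would pipeline SETs to Redis. Ordering heuristic and batch size are assumptions:

```python
def warm_cache(db_rows, cache, batch_size=1000):
    """db_rows: (org, slug, url, clicks_7d) tuples read from a DB replica.
    Loads hottest links first so the hit rate recovers fastest."""
    ordered = sorted(db_rows, key=lambda r: r[3], reverse=True)
    loaded = 0
    for i in range(0, len(ordered), batch_size):
        batch = ordered[i:i + batch_size]
        for org, slug, url, _clicks in batch:
            cache[f"link:{org}:{slug}"] = url  # pipelined SET in real Redis
        loaded += len(batch)
    return loaded

rows = [
    ("acme", "docs", "https://a.test", 500),
    ("acme", "archive", "https://b.test", 1),
    ("globex", "docs", "https://c.test", 900),
]
cache = {}
print(warm_cache(rows, cache, batch_size=2))  # 3
```

Reading from a replica (not the primary) matters here: warming 100M rows against the primary would compete with live writes.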
9. Data Model & Storage
PostgreSQL Schema
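A sketch of the core tables, run here against SQLite as a stand-in for PostgreSQL. The column set is an assumption; the essential part is the constraint that slug is unique per org, not globally:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orgs (
    id   INTEGER PRIMARY KEY,
    slug TEXT NOT NULL UNIQUE            -- e.g. 'acme'
);
CREATE TABLE links (
    id              INTEGER PRIMARY KEY,
    org_id          INTEGER NOT NULL REFERENCES orgs(id),
    slug            TEXT NOT NULL,       -- e.g. 'q4-roadmap'
    destination_url TEXT NOT NULL,
    owner_id        INTEGER,
    expires_at      TEXT,
    UNIQUE (org_id, slug)                -- namespace-scoped uniqueness
);
""")
db.execute("INSERT INTO orgs (id, slug) VALUES (1, 'acme'), (2, 'globex')")
db.execute("INSERT INTO links (org_id, slug, destination_url) "
           "VALUES (1, 'docs', 'https://a.test')")
db.execute("INSERT INTO links (org_id, slug, destination_url) "
           "VALUES (2, 'docs', 'https://b.test')")   # OK: different org
try:
    db.execute("INSERT INTO links (org_id, slug, destination_url) "
               "VALUES (1, 'docs', 'https://c.test')")
except sqlite3.IntegrityError:
    print("duplicate (org_id, slug) rejected")       # expected
```

In real PostgreSQL you would add row-level security policies keyed on org_id on top of this, as recommended in the trade-offs section.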
Redis Data Model
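The Redis side is just key construction. These key formats follow the composite-key convention used in this doc; the rate-limit key is an illustrative assumption:

```python
def link_key(org, slug):
    """Redirect lookup: link:{org}:{slug} -> destination URL.
    Hashing by org keeps a tenant's links on one shard."""
    return f"link:{org}:{slug}"

def template_key(org, template):
    """Parameterized links store the template verbatim, %s included."""
    return f"link:{org}:{template}"

def rate_key(org):
    """Per-org counter for management-path rate limiting (assumption)."""
    return f"rate:{org}:creates"

print(link_key("acme", "docs"))          # link:acme:docs
print(template_key("acme", "jira/%s"))   # link:acme:jira/%s
```

Every entry carries a 1-hour TTL as the safety net for missed invalidations described earlier.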
Storage Engine Selection
| Store | Role | Why |
|---|---|---|
| PostgreSQL | Source of truth for links, orgs, users, memberships | ACID, joins for RBAC queries, mature ecosystem |
| Redis Cluster | Redirect cache (full dataset), rate limiting, session store | Sub-ms reads, perfect for key-value redirect lookups |
| Kafka | Analytics event bus, cache invalidation events | Durable, ordered, decouples write path from analytics |
| Elasticsearch | Link search within orgs | Full-text search on slugs, destinations, tags |
| S3 | Analytics archives (monthly rollups) | Cheap long-term storage for click data |
10. Observability & Operations
Key Metrics
Redirect path (SLO-critical):
redirect_latency_ms{org, quantile}: p50, p95, p99 redirect time. Alert if p99 > 10ms
redirect_cache_hit_rate{org}: should be > 99%. Drop below 95% = investigate
redirect_total{org, status_code}: 200 (success), 404 (not found), 500 (error)
redirect_rps{org}: requests per second per org. Spike detection for abuse
Management path:
link_creates{org}: creation rate per org. Burst = potential abuse or bulk import
auth_failures{org, idp}: SSO failures per org. Spike = IdP issue or misconfiguration
cache_invalidation_lag_ms: time between link update and cache delete. Alert if > 1s
Tenant health:
org_link_count{org}: total links per org. Approaching plan limit = upsell opportunity
org_active_users{org}: monthly active users per org. Drop = churn risk
noisy_neighbor_score{org}: composite metric of an org's resource consumption relative to plan
Alerting Strategy
| Alert | Window | Severity | First response |
|---|---|---|---|
| Redirect p99 > 10ms | 5 min sustained | P1: page on-call | Check Redis health, cache hit rate |
| Cache hit rate < 95% | 10 min sustained | P2: Slack alert | Identify affected org, check for bulk operations |
| Redis memory > 80% | Threshold | P2: Slack alert | Identify largest orgs, check for leak |
| Redirect 5xx rate > 0.1% | 5 min window | P1: page on-call | Check Redis connectivity, DB fallback |
| Link creation spike > 10x | Per-org, 1 min window | P3: log + rate limit | Likely bulk import; auto-throttle |
| Auth failure rate > 5% | Per-org, 5 min | P2: Slack alert | Check IdP health, SAML cert expiry |
Distributed Tracing
A redirect request trace:
A create request trace:
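The two traces can be sketched as nested span trees. All span names and durations here are illustrative assumptions, chosen only to fit the latency budgets stated earlier (< 10ms redirect, < 200ms create):

```python
# (span_name, duration_ms, child_spans)
redirect_trace = [
    ("redirect.handle", 3.0, [
        ("parse_path", 0.1, []),
        ("redis.GET link:acme:docs", 1.2, []),
        ("emit_analytics (async, fire-and-forget)", 0.2, []),
    ]),
]

create_trace = [
    ("api.create_link", 120.0, [
        ("auth.verify_session", 15.0, []),
        ("rbac.check create", 2.0, []),
        ("pg.INSERT links", 20.0, []),
        ("redis.SET link:acme:new", 2.0, []),
        ("kafka.publish invalidation", 5.0, []),
    ]),
]

def total_ms(trace):
    """Sum of top-level span durations (children are included in parents)."""
    return sum(duration for _name, duration, _children in trace)

print(total_ms(redirect_trace), total_ms(create_trace))
```

The structural contrast is the point: the redirect trace has no auth span and no synchronous analytics span, while the create trace tolerates several dependent calls because it sits on the cold path.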
11. Design Trade-offs
Trade-off 1: Multi-Tenant Data Isolation
| Option | Description | Pros | Cons | When to use |
|---|---|---|---|---|
| Shared schema (tenant_id column) | All orgs in one links table with an org_id column | Simple, efficient, easy to query across orgs for platform analytics | Risk of cross-tenant data leak if queries miss the org_id filter; noisy neighbor | Start here; good enough for 95% of cases |
| Schema per tenant | Each org gets its own PostgreSQL schema (acme.links, globex.links) | Stronger isolation, easy per-org backup/restore | Schema migration across 10K schemas is painful; connection pool overhead | Good for regulated industries |
| Database per tenant | Each org gets its own database | Complete isolation, independent scaling, easy data residency compliance | Operational nightmare at 10K+ tenants; connection management becomes a distributed systems problem | Only for enterprise tier with compliance requirements |
Staff+ Signal: The decision between shared schema and schema-per-tenant is a one-way door at scale. Migrating from shared schema to per-tenant schema with 100M links and zero downtime requires a complex dual-write migration. My recommendation: start with shared schema plus rigorous row-level security policies in PostgreSQL. If a customer requires physical isolation (healthcare, finance), offer database-per-tenant as an enterprise add-on, and charge accordingly, because the operational cost is real.
Trade-off 2: URL Structure
| Option | Example | Pros | Cons |
|---|---|---|---|
| Path-based namespace | go/<org>/<slug> | Single domain, simple routing, easy to add global links | Org slug in URL visible to users |
| Subdomain-based | <org>.go.example.com/<slug> | Clean URLs, DNS-level isolation | Wildcard TLS cert needed, DNS propagation delay for new orgs |
| Custom domain | go.acme.com/<slug> | Branded experience, full DNS control per org | Complex cert management, DNS setup burden on customer |
| Recommended: path-based | go/<org>/<slug> | Simplest to implement and operate | Org slug visible (usually fine for internal tools) |
Trade-off 3: Redirect Caching Strategy
| Strategy | Hit rate | Consistency | Cost |
|---|---|---|---|
| Full dataset in Redis | ~100% (everything cached) | Strong (invalidate on write) | Higher memory (~50GB for 100M links) |
| LRU hot cache | ~90-95% (popular links only) | Eventual (TTL-based) | Lower memory (~5GB) |
| CDN edge cache | ~99% for hot links | Weak (TTL 60s+) | Lowest latency, highest inconsistency |
| Recommended: full dataset | ~100% | Strong + TTL safety net | Redis is cheap; 50GB is ~$200/month |
My Take: At 500 bytes per link and 100M total links, the entire dataset is 50GB. Redis handles this comfortably on a few nodes. Caching everything means cache misses are essentially zero; the only misses are newly created links before cache population (a window of under a second). This dramatically simplifies the system because you never need to worry about a thundering herd on cache miss.
Trade-off 4: Analytics Pipeline
| Approach | Redirect impact | Accuracy | Complexity |
|---|---|---|---|
| Synchronous write (increment on redirect) | Blocks redirect | 100% accurate | Simple but adds redirect latency |
| Async via Kafka | Non-blocking | Near-real-time (seconds of lag) | Medium: Kafka + consumer |
| Batch aggregation (hourly from logs) | Non-blocking; hours of analytics delay | Approximate | Simplest for analytics |
| Recommended: async via Kafka | Non-blocking redirect | Seconds of lag (acceptable) | Moderate |
Trade-off 5: Global Link Resolution Order
When a user types go/docs, should the system check the org namespace first or the global namespace?
| Option | Behavior | Pro | Con |
|---|---|---|---|
| Org-first | Check go/acme/docs, then global go/docs | Org can override global links | Confusing if an org shadows a well-known global link |
| Global-first | Check global go/docs, then go/acme/docs | Consistent platform experience | Org can't customize global names |
| Recommended: org-first with warnings | Org takes priority, but warn admins if they shadow a global link | Flexibility with safety | Slightly more complex |
Staff+ Signal: The resolution order decision has organizational implications beyond engineering. If global links take priority, the platform team controls the namespace and orgs can't override. If org links take priority, orgs have full autonomy but may accidentally shadow important platform links like go/help or go/status. The right answer depends on your organizational model: if you're building an internal tool for one company, global-first is simpler. If you're building a SaaS product, org-first with guardrails (reserved global slugs that can't be shadowed) respects tenant autonomy.
12. Common Interview Mistakes
1. Designing a Generic URL Shortener
What candidates say: "I'll generate a random 7-character hash and store it in a database."
What's wrong: The multi-tenant variant requires namespace isolation, not random hashes. The slug is human-readable (go/acme/roadmap), not random (bit.ly/a3Fk9). This changes the entire data model: you need composite keys, org-scoped uniqueness, and RBAC.
What staff+ candidates say: "Let me first clarify β are these human-readable slugs within organizational namespaces? That changes the key design from hash-based to namespace-scoped composite keys."
2. Putting Auth on the Redirect Path
What candidates say: "Every redirect checks if the user has permission to access this link."
What's wrong: This adds 10-20ms per redirect and creates a dependency on the auth service/IdP for the hottest path. If Okta is down, no redirects work.
What staff+ candidates say: "I'd keep the redirect path auth-free and let the destination system handle access control. The management path (create/edit) requires auth, but redirects should have zero external dependencies beyond cache."
3. Ignoring the Noisy Neighbor Problem
What candidates say: "All orgs share the same Redis cache with a composite key."
What's wrong: A large org importing 500K links can evict other orgs' hot links from cache. A single org running analytics queries can saturate DB read capacity.
What staff+ candidates say: "I'd implement per-org rate limiting on the management path and monitor per-org cache utilization. For Redis, I'd consider hashing by org_slug so each org's links land on the same shard β this gives shard-level isolation and makes per-org metrics easier."
4. Using 301 (Permanent) Redirects by Default
What candidates say: "301 is more efficient β browsers cache it."
What's wrong: Once a browser caches a 301, updating the link's destination has no effect for that user until they clear their cache. For go links where destinations change regularly (e.g., go/acme/oncall points to a rotation that changes weekly), this breaks the system.
What staff+ candidates say: "302 by default because go links are living documents β their destinations change. We can offer 301 as an opt-in for truly permanent links, with a clear warning in the UI."
5. Forgetting Parameterized Links
What candidates say: "Each link maps one slug to one URL."
What's wrong: Parameterized links (go/acme/jira/%s → https://acme.atlassian.net/browse/%s) are the most powerful feature of go links. They transform the system from a URL shortener into an internal routing layer.
6. Not Addressing SSO and Org Lifecycle
What candidates say: "Users create accounts with email and password."
What's wrong: Enterprise go links are B2B SaaS. Organizations authenticate via SSO (SAML/OIDC). User lifecycle is managed via SCIM. Without this, onboarding and offboarding are manual processes that don't scale.
7. Single-Region Without Acknowledging the Trade-off
What candidates say: "I'll deploy everything in us-east-1."
What's wrong: Single-region isn't wrong for an MVP; the problem is that candidates don't acknowledge the trade-off. Short.io's October 2025 outage proved that single-region URL shorteners can go down completely when that region has issues.
What staff+ candidates say: "I'd start single-region and build the redirect service to be region-independent. The data model (org:slug β url) is simple enough that cross-region replication is straightforward when we need it."
13. Interview Cheat Sheet
Time Allocation (45-minute interview)
| Phase | Time | Focus |
|---|---|---|
| Clarify requirements | 5 min | Multi-tenant? Human-readable slugs? Parameterized? SSO? Scale target? |
| High-level design | 10 min | Split redirect path (hot) from management path (cold). Redis cache + PostgreSQL |
| Deep dive #1 | 10 min | Multi-tenant namespace isolation: data model, cache key design, RBAC |
| Deep dive #2 | 8 min | Redirect hot path: < 10ms target, cache strategy, no auth dependency |
| Scale + failure modes | 7 min | Redis Cluster, multi-region, redirect path independence |
| Trade-offs + wrap-up | 5 min | Shared vs. per-tenant schema, 301 vs. 302, resolution order |
Step-by-Step Answer Guide
Clarify: "Is this a multi-tenant system like GoLinks where each organization gets their own namespace? Or a single-company internal tool like Google's go/ links?" β this changes everything
Key insight: "The system is a tenant-namespaced key-value store with an HTTP redirect interface. The redirect path must be as fast as a cache lookup. Everything else can be slower."
Draw two paths: Redirect path (stateless, cache-only, no auth) and management path (stateful, auth required, CRUD operations)
Data model: Composite key (org_id, slug) with a uniqueness constraint per org. Redis key: link:{org}:{slug}. Show the SQL schema.
Multi-tenancy deep dive: Shared schema with an org_id column, row-level security, composite unique index. Explain why not per-tenant schema (operational overhead).
Redirect performance: Full dataset in Redis (~50GB for 100M links). Zero external dependencies on the redirect path. No auth check.
Failure handling: Redis down → fall back to the PostgreSQL read replica. IdP down → redirects still work. Show the dependency diagram.
Scale: Redis Cluster for sharding, multi-region with GeoDNS, async analytics via Kafka
Trade-offs: 302 vs 301, shared vs isolated schema, org-first vs global-first resolution
Observe: Cache hit rate per org, redirect p99, noisy neighbor detection
What the Interviewer Wants to Hear
At L5/Senior:
Functional design works (redirect + create + basic auth)
Redis caching for performance
Basic understanding of multi-tenancy (composite keys)
Common failure mode: "What if Redis goes down?"
At L6/Staff:
Explicit split of redirect path (hot) vs. management path (cold)
Tenant isolation strategy with trade-off reasoning (shared vs. per-tenant schema)
No auth on redirect path (with justification)
Cache invalidation strategy across regions
Noisy neighbor awareness and per-org rate limiting
Parameterized links and their caching implications
At L7/Principal:
Go links as organizational knowledge infrastructure (not just a URL shortener)
SCIM lifecycle management and security implications of offboarding
Multi-region consistency model: what's the acceptable staleness for a link update?
Total cost of ownership: per-tenant database at 10K tenants is an operational team, not a config flag
Evolution strategy: "Phase 1 is shared schema monolith. Phase 2 splits read/write paths. Phase 3 adds per-tenant isolation for enterprise customers. We can stop after any phase."
Written by Michi Meow as a reference for staff-level system design interviews. The multi-tenant go/ links variant tests everything a generic URL shortener does, plus namespace isolation, organizational access control, and the judgment to keep the redirect path dependency-free.