Pre-Deployment Checklist

Verify code quality, secure secrets, back up databases, set up monitoring, and document deployment procedures to prevent production incidents and enable rapid rollback. 5 steps, 45 minutes.


Key Challenge

Most outages trace back to a few causes: unreviewed code, accidentally deployed secrets, missing backups, or missing monitoring. This checklist targets each of those causes directly.

1. Code review: ensure tests pass, no console.logs, no hardcoded values

Before deploying, verify that: (1) all automated tests pass (unit, integration, E2E); (2) there are no console.log calls, debug statements, or commented-out code; (3) there are no hardcoded API keys, passwords, or environment-specific values; (4) code style is consistent (run the linter); (5) dependencies have no known vulnerabilities (run a dependency audit such as npm audit or cargo audit). Additional checks: grep for TODO and FIXME comments (unfinished work), search for environment-specific IP addresses or email addresses embedded in code, and verify that all feature flags default to disabled in production. Common mistakes: deploying with console.log calls (adds overhead and leaks internal details), hardcoded staging API endpoints in production code (requests fail silently or hit the wrong environment), and test code left in production (a security risk). Use pre-commit hooks to catch these automatically.
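Several of the checks above are plain text searches. The TypeScript sketch below shows the idea: it flags console.log calls, debugger statements, TODO/FIXME comments, and literal IPv4 or email addresses, line by line. The rule names and regular expressions are illustrative assumptions, not a standard rule set; in practice a linter rule (for example ESLint's no-console and no-debugger) is the better home for most of them.

```typescript
// Minimal pre-deploy source scan. Rules and regexes are illustrative assumptions.
interface Finding {
  file: string;
  line: number; // 1-based
  rule: string;
  text: string;
}

const RULES: { name: string; pattern: RegExp }[] = [
  { name: "console-log", pattern: /\bconsole\.log\s*\(/ },
  { name: "debugger", pattern: /\bdebugger\b/ },
  { name: "todo-fixme", pattern: /\b(TODO|FIXME)\b/ },
  // Rough IPv4 match; it will also flag version-like strings such as 1.2.3.4.
  { name: "hardcoded-ip", pattern: /\b\d{1,3}(\.\d{1,3}){3}\b/ },
  { name: "hardcoded-email", pattern: /[\w.+-]+@[\w-]+\.[\w.]+/ },
];

function scanSource(file: string, source: string): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, index) => {
    for (const rule of RULES) {
      // Patterns have no /g flag, so test() keeps no lastIndex state between calls.
      if (rule.pattern.test(text)) {
        findings.push({ file, line: index + 1, rule: rule.name, text: text.trim() });
      }
    }
  });
  return findings;
}

// Usage: scan an in-memory sample; a real pre-commit hook would read the staged files.
const sample = [
  'const url = "http://10.0.0.12/api"; // TODO: move to config',
  'console.log("user", user);',
  "export const ok = 1;",
].join("\n");

const findings = scanSource("src/example.ts", sample);
for (const f of findings) {
  console.log(`${f.file}:${f.line} [${f.rule}] ${f.text}`);
}
```

Each line is checked against every rule, so one line can produce several findings (the first sample line is flagged both for its TODO and for its hardcoded IP).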

💡 Pro Tip: Automate the mechanical parts of code review with linting, type checking, and security scanning. Tools: ESLint (JavaScript), mypy (Python), cargo clippy (Rust), Snyk (dependency audit). Fail the CI/CD pipeline if any check fails, so problems are fixed before merge.

2. Secure API keys and secrets: use env vars, never commit to git

Secrets (API keys, database passwords, OAuth tokens) must never be in source code. Management: (1) Use environment variables or a secret management tool (AWS Secrets Manager, HashiCorp Vault, 1Password). (2) Locally, keep secrets in a .env file and add it to .gitignore so it is never committed. (3) The CI/CD pipeline stores secrets securely and injects them at deploy time. (4) Rotate secrets every 30–90 days. (5) Audit secret access logs (who accessed which secret, and when). Example: the database password is stored in the environment variable DATABASE_PASSWORD, and the code reads process.env.DATABASE_PASSWORD. If a credential is exposed, rotate it: generate a new password in the database, update the variable in the deployment system, and restart the service. Automated scanning: tools scan git history for accidentally committed secrets; if one is found, rotate it immediately. AWS credentials, Stripe keys, and JWT signing keys all follow this same pattern.
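A common way to apply this in code is to load every required secret once at startup and fail fast if one is missing, instead of discovering the gap on the first request. The TypeScript sketch below takes the environment as a plain map so it stays testable; in Node you would pass process.env. The function names are hypothetical, and the rotationDue helper's 90-day default simply mirrors the upper end of the rotation window above.

```typescript
// Sketch: fail-fast secret loading plus a rotation-age check (names are hypothetical).
type Env = Record<string, string | undefined>;

function requireSecrets(env: Env, names: string[]): Record<string, string> {
  const missing = names.filter((n) => !env[n]);
  if (missing.length > 0) {
    // Report only the variable names, never the values.
    throw new Error(`Missing required secrets: ${missing.join(", ")}`);
  }
  const secrets: Record<string, string> = {};
  for (const n of names) {
    secrets[n] = env[n] as string;
  }
  return secrets;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// True when a secret is older than the rotation window (30–90 days in the checklist).
function rotationDue(lastRotated: Date, now: Date, maxAgeDays = 90): boolean {
  return now.getTime() - lastRotated.getTime() > maxAgeDays * DAY_MS;
}

// Usage with an in-memory map; in Node, pass process.env instead.
const config = requireSecrets({ DATABASE_PASSWORD: "example-only" }, ["DATABASE_PASSWORD"]);
const due = rotationDue(new Date("2024-01-01T00:00:00Z"), new Date("2024-05-01T00:00:00Z"));
```

Throwing at startup means a misconfigured deploy fails its health check immediately, which the rollback procedure in step 5 can catch.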

💡 Pro Tip: Run a secret scanner in CI/CD to catch accidentally committed secrets before they reach the main branch. Tools: TruffleHog, GitGuardian, git-secrets. If a secret leaks, rotate it immediately, check access logs for misuse, and audit billing for unauthorized charges.
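Secret scanners work largely by matching known credential formats. The sketch below is a drastically simplified version of that idea: a few regular expressions for AWS access key IDs (which begin with AKIA or ASIA), Stripe live secret keys (sk_live_ prefix), PEM private-key headers, and generic password-style assignments, with matches masked in the output. The pattern list is an assumption for illustration; dedicated tools such as TruffleHog and GitGuardian ship far larger detector sets and can also scan git history.

```typescript
// Simplified secret detection; the patterns are illustrative, not exhaustive.
const SECRET_PATTERNS: { name: string; pattern: RegExp }[] = [
  { name: "aws-access-key-id", pattern: /\b(AKIA|ASIA)[0-9A-Z]{16}\b/ },
  { name: "stripe-live-secret-key", pattern: /\bsk_live_[0-9A-Za-z]{10,}/ },
  { name: "private-key-block", pattern: /-----BEGIN (RSA |EC |OPENSSH )?PRIVATE KEY-----/ },
  { name: "password-assignment", pattern: /(password|secret|api_key)\s*[:=]\s*["'][^"']{8,}["']/i },
];

interface SecretHit {
  line: number; // 1-based
  rule: string;
  preview: string; // masked so the scanner's own output does not leak the secret
}

function mask(value: string): string {
  return value.slice(0, 4) + "*".repeat(Math.max(0, value.length - 4));
}

function findSecrets(text: string): SecretHit[] {
  const hits: SecretHit[] = [];
  text.split("\n").forEach((lineText, i) => {
    for (const { name, pattern } of SECRET_PATTERNS) {
      const match = pattern.exec(lineText);
      if (match) {
        hits.push({ line: i + 1, rule: name, preview: mask(match[0]) });
      }
    }
  });
  return hits;
}

// Usage: AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
const hits = findSecrets([
  'const key = "AKIAIOSFODNN7EXAMPLE";',
  'DB_PASSWORD = "hunter2hunter2"',
  "const region = 'us-east-1';",
].join("\n"));
```

A CI job would run this over the diff and fail the build when the hit list is non-empty.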

3. Back up the database: verify backup completeness, test restore

Before deploying, ensure that: (1) a database backup completed successfully in the last 24 hours; (2) the backup is stored in a separate location (a different region, and a different cloud provider if possible); (3) the restore process is tested regularly, so you know the backups aren't corrupted. Example: a MySQL database is backed up daily at 2 AM UTC and stored in S3; every month, the latest backup is restored to a test environment and queried to verify data integrity. If a deploy breaks the database schema or corrupts data, you can roll back by restoring the pre-deploy backup. RPO and RTO: RPO (Recovery Point Objective) is the maximum acceptable data loss (aim for under 1 hour for most apps); RTO (Recovery Time Objective) is the maximum acceptable downtime (aim for under 15 minutes). Note that a daily backup alone allows up to 24 hours of data loss; a sub-hour RPO requires more frequent backups or point-in-time recovery from transaction logs (for MySQL, the binary log). For critical systems: hourly backups replicated across regions and an RTO under 5 minutes, which in practice usually means failing over to a standby replica rather than restoring from backup. Common mistake: a backup exists but was never tested, and the restore fails when it is needed. Test restores at least quarterly, and monthly for critical data.
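The first two pre-deploy conditions above (a successful backup within 24 hours, stored somewhere other than the primary location) are easy to check mechanically. The TypeScript sketch below assumes a hypothetical list of backup records with a completion time, a success flag, and a location label. It also reports the current exposure, meaning how many hours of data would be lost if the database failed right now, which is the number to compare against your RPO.

```typescript
// Sketch of a pre-deploy backup gate. BackupRecord is a hypothetical shape;
// adapt it to whatever your backup tool or provider reports.
interface BackupRecord {
  completedAt: Date;
  succeeded: boolean;
  location: string; // e.g. a region or provider label
}

interface BackupCheck {
  ok: boolean;
  reasons: string[];
  exposureHours: number; // data lost if the primary failed now (compare with RPO)
}

const HOUR_MS = 60 * 60 * 1000;

function checkBackups(
  backups: BackupRecord[],
  primaryLocation: string,
  now: Date,
  maxAgeHours = 24,
): BackupCheck {
  const good = backups.filter((b) => b.succeeded);
  if (good.length === 0) {
    return { ok: false, reasons: ["no successful backup found"], exposureHours: Infinity };
  }
  const latest = good.reduce((a, b) =>
    b.completedAt.getTime() > a.completedAt.getTime() ? b : a,
  );
  const exposureHours = (now.getTime() - latest.completedAt.getTime()) / HOUR_MS;
  const reasons: string[] = [];
  if (exposureHours > maxAgeHours) {
    reasons.push(`latest successful backup is ${exposureHours.toFixed(1)}h old`);
  }
  if (!good.some((b) => b.location !== primaryLocation)) {
    reasons.push("no successful backup stored outside the primary location");
  }
  return { ok: reasons.length === 0, reasons, exposureHours };
}

// Usage: the nightly 2 AM UTC backup from the example, copied to another region.
const result = checkBackups(
  [{ completedAt: new Date("2024-03-10T02:00:00Z"), succeeded: true, location: "eu-west-1" }],
  "us-east-1",
  new Date("2024-03-10T10:00:00Z"),
);
```

Here the deploy gate passes with an 8-hour exposure, which also shows why a daily backup alone cannot meet a sub-hour RPO.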

💡 Pro Tip: Use managed backups (RDS automated backups, Cloud SQL, etc.) for convenience. For self-hosted databases: automate backups, encrypt them, and test restores monthly. Store backups off-site; if the production server is compromised and wiped, backups stored on that server are lost with it.

4. Set up monitoring and alerting before going live

Monitoring must be in place before the deploy. Metrics to track: (1) API latency (p50, p99); (2) error rate (5xx errors, timeouts); (3) database connection pool usage; (4) memory and CPU usage; (5) disk space. Alerting rules: (1) if the error rate exceeds 1%, page the on-call engineer; (2) if p99 latency exceeds 5 seconds, alert; (3) if CPU stays above 80% for 5 minutes, alert. Example tools: New Relic, Datadog, or self-hosted Prometheus. For alerts that do not page immediately, configure escalation: first alert → Slack message; no response after 15 min → email + SMS; after 30 min → page on-call. Logging: log every request with a request ID for tracing, log 500 errors with the full stack trace, and enable the slow query log (queries over 1 second). Without monitoring, you discover issues hours after the deploy, from angry users.
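The alerting rules above can be expressed as a small evaluation function over a window of metrics. The TypeScript sketch below computes p99 with the nearest-rank method and applies the three example thresholds: an error rate above 1% pages, p99 above 5 seconds alerts, and CPU above 80% for five consecutive one-minute samples alerts. The WindowStats shape and the one-sample-per-minute CPU series are assumptions; monitoring tools such as Prometheus express the same rules in their own rule format.

```typescript
// Sketch of the example alert rules; data shapes are hypothetical.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  // Nearest-rank method: the smallest value with at least p% of samples at or below it.
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

interface WindowStats {
  requests: number;
  failures: number; // 5xx responses plus timeouts
  latenciesMs: number[];
  cpuPercentPerMinute: number[]; // oldest first
}

type Severity = "page" | "alert";

function evaluateAlerts(s: WindowStats): { severity: Severity; reason: string }[] {
  const out: { severity: Severity; reason: string }[] = [];
  const errorRate = s.requests > 0 ? s.failures / s.requests : 0;
  if (errorRate > 0.01) {
    out.push({ severity: "page", reason: `error rate ${(errorRate * 100).toFixed(2)}% > 1%` });
  }
  if (s.latenciesMs.length > 0) {
    const p99 = percentile(s.latenciesMs, 99);
    if (p99 > 5000) out.push({ severity: "alert", reason: `p99 ${p99} ms > 5000 ms` });
  }
  const lastFive = s.cpuPercentPerMinute.slice(-5);
  if (lastFive.length === 5 && lastFive.every((c) => c > 80)) {
    out.push({ severity: "alert", reason: "CPU > 80% for 5 min" });
  }
  return out;
}

// Usage: 2% errors and sustained high CPU fire a page and an alert.
const fired = evaluateAlerts({
  requests: 1000,
  failures: 20,
  latenciesMs: [120, 180, 240],
  cpuPercentPerMinute: [70, 85, 90, 92, 88, 95],
});
```

Requiring five consecutive high samples, rather than any single spike, is what keeps the CPU rule from paging on brief bursts.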

💡 Pro Tip: Set up monitoring in staging before the production deploy. Verify that dashboards display correctly and that alerts fire and resolve as expected. Practice incident response: simulate an alert, confirm notifications arrive, and run through the playbook.
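The escalation chain from step 4 (Slack first, email and SMS after 15 minutes without acknowledgement, a page after 30) is easiest to rehearse when it is written down as data. The sketch below encodes that chain and answers which channels should have fired after a given number of unacknowledged minutes; a drill then compares that list with the notifications that actually arrived. The channel names are placeholders.

```typescript
// Escalation policy from the checklist, as data. Channel names are placeholders.
type Channel = "slack" | "email" | "sms" | "page";

const ESCALATION: { afterMinutes: number; channels: Channel[] }[] = [
  { afterMinutes: 0, channels: ["slack"] },
  { afterMinutes: 15, channels: ["email", "sms"] },
  { afterMinutes: 30, channels: ["page"] },
];

// Every channel that should have been notified once an alert has gone
// unacknowledged for the given number of minutes.
function expectedChannels(minutesUnacknowledged: number): Channel[] {
  const out: Channel[] = [];
  for (const step of ESCALATION) {
    if (minutesUnacknowledged >= step.afterMinutes) out.push(...step.channels);
  }
  return out;
}

// Drill usage: compare what should have fired with what the team actually received.
const received: Channel[] = ["slack", "email"]; // e.g. the SMS never arrived
const missing = expectedChannels(20).filter((c) => !received.includes(c));
```

In this drill, missing is ["sms"], pointing at the notification path to fix before a real incident.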

5. Document deployment and rollback procedures

Create a runbook covering: (1) deployment steps (build, test, deploy command, post-deploy checks); (2) how to verify the deployment succeeded (health checks pass, API responds, metrics green); (3) the rollback procedure (how to revert if the deploy breaks production); (4) incident response (what to do if the deployment causes an outage). Example runbook:

Deploy v2.1.0:
(1) Build the Docker image: docker build -t registry/app:v2.1.0 .
(2) Push to the registry: docker push registry/app:v2.1.0
(3) Deploy: kubectl set image deployment/app app=registry/app:v2.1.0
(4) Wait 2 min, then verify health: curl https://api/health → status=ok
(5) Check the error rate dashboard: it should stay below 0.5%.
(6) Monitor for 15–30 min. If the error rate exceeds 2%, roll back: kubectl set image deployment/app app=registry/app:v2.0.9 (or kubectl rollout undo deployment/app).
(7) Verify the rollback succeeded.

This documentation prevents panic during incidents, and new team members can deploy without relying on tribal knowledge.
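The decision in steps (4) to (6) of the example runbook can be written as a pure function, which makes the thresholds explicit and reviewable. The sketch below assumes a hypothetical observation object holding the health-check status and one error-rate sample per minute. It returns rollback if the health check fails or any sample exceeds 2%, healthy if every sample stays below 0.5%, and investigate for anything in between or when there is no data. Collecting the samples (from curl, a dashboard API, etc.) is left out.

```typescript
// Sketch of the runbook's post-deploy decision; the observation shape is hypothetical.
type Verdict = "rollback" | "healthy" | "investigate";

interface PostDeployObservation {
  healthStatus: string; // e.g. the status field returned by GET /health
  errorRatePercent: number[]; // one sample per minute while monitoring
}

function postDeployVerdict(obs: PostDeployObservation): Verdict {
  if (obs.healthStatus !== "ok") return "rollback";
  if (obs.errorRatePercent.some((r) => r > 2)) return "rollback";
  if (obs.errorRatePercent.length === 0) return "investigate"; // no data is not a pass
  if (obs.errorRatePercent.every((r) => r < 0.5)) return "healthy";
  return "investigate"; // between 0.5% and 2%: keep watching and involve a human
}

// The rollback command from the runbook, built from its parts.
function rollbackCommand(deployment: string, container: string, previousImage: string): string {
  return `kubectl set image deployment/${deployment} ${container}=${previousImage}`;
}

const verdict = postDeployVerdict({ healthStatus: "ok", errorRatePercent: [0.2, 0.4, 2.6] });
if (verdict === "rollback") {
  console.log(rollbackCommand("app", "app", "registry/app:v2.0.9"));
}
```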

💡 Pro Tip: Automate rollback with a one-click "red button" that reverts to the previous version. For progressive deploys, shift traffic in stages: canary (5% of traffic) → 25% → 50% → 100%. If the canary's error rate exceeds a threshold, roll back automatically.
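A progressive rollout with automatic rollback is essentially a small state machine. The sketch below walks the stages from the tip (5% → 25% → 50% → 100%), advancing one stage per healthy evaluation and dropping to 0% as soon as the canary's error rate exceeds a threshold. The threshold value and the one-evaluation-per-stage cadence are assumptions; on Kubernetes, tools such as Argo Rollouts or Flagger implement this pattern against live metrics.

```typescript
// Sketch of a staged canary rollout with automatic rollback.
const STAGES = [5, 25, 50, 100]; // percent of traffic on the new version

interface RolloutState {
  stageIndex: number;
  phase: "in-progress" | "complete" | "rolled-back";
}

function step(state: RolloutState, canaryErrorRate: number, threshold: number): RolloutState {
  if (state.phase !== "in-progress") return state;
  if (canaryErrorRate > threshold) return { stageIndex: state.stageIndex, phase: "rolled-back" };
  const next = state.stageIndex + 1;
  if (next >= STAGES.length) return { stageIndex: state.stageIndex, phase: "complete" };
  return { stageIndex: next, phase: "in-progress" };
}

function trafficPercent(state: RolloutState): number {
  return state.phase === "rolled-back" ? 0 : STAGES[state.stageIndex];
}

// Feed one observed error rate per evaluation and return the final state.
function simulate(errorRates: number[], threshold: number): RolloutState {
  let state: RolloutState = { stageIndex: 0, phase: "in-progress" };
  for (const rate of errorRates) state = step(state, rate, threshold);
  return state;
}

const healthy = simulate([0.001, 0.002, 0.001, 0.001], 0.01); // reaches 100%, then complete
const failed = simulate([0.001, 0.05], 0.01); // rolled back during the 25% stage
```

Because a failed state is terminal, a later healthy reading cannot silently resume a rollout that was already rolled back.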


What You'll Have

Code review checklist completed: tests passing, no hardcoded values, no security issues

Secrets management verified: all credentials in env vars, no leaks in git history

Database backup confirmed: completed within 24 hours, restore tested, off-site replicated

Monitoring and alerting configured: dashboards visible, alerts configured and tested

Deployment and rollback runbooks documented for quick incident response

Why This Workflow Works

These five steps make up a minimal baseline for production readiness. Code review catches bugs before they reach users. Secrets management prevents breaches. Backups provide insurance against data loss. Monitoring detects problems quickly. Runbooks enable a quick response. Together, they can cut Mean Time To Recovery (MTTR) from hours to minutes and address the most common causes of production incidents.

FAQs

Should I deploy during business hours or after hours?

Depends on risk and team size. High-risk deploy (major feature, database migration): deploy during business hours with full team present. If issues arise, fix immediately. Low-risk deploy (bug fix, performance improvement): deploy after hours if you have on-call coverage. Never deploy and disappear. If you deploy at midnight and things break, you're the only one awake to fix it. For critical systems: use canary or blue-green deployment to minimize blast radius.

How long should I monitor after deployment?

Monitor for at least 15–30 minutes after a small, low-risk deploy; 1–2 hours after a medium-risk deploy; and 4+ hours after major changes (schema migrations, API contract changes). Watch the error rate, latency, database connection health, and user session counts, and watch the logs for new error patterns. If metrics are green and no new errors appear after 30 minutes, the risk of an immediate catastrophe is low. But stay alert for 24 hours: the changed code may only run on certain user paths.
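Watching logs for new error patterns can be partly automated by normalizing error messages into signatures and diffing the set seen after the deploy against the set seen before it. The sketch below replaces UUIDs and numbers with placeholders so that messages differing only in IDs or timings collapse together; these normalization rules are an assumption and will need tuning for real log formats.

```typescript
// Sketch: find error signatures that appear only after a deploy.
function signature(message: string): string {
  return message
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, "<uuid>") // UUIDs first
    .replace(/\d+/g, "<n>") // then any remaining numbers (IDs, durations, ports)
    .trim();
}

function newErrorPatterns(before: string[], after: string[]): string[] {
  const known = new Set(before.map(signature));
  const fresh = new Set<string>();
  for (const message of after) {
    const sig = signature(message);
    if (!known.has(sig)) fresh.add(sig);
  }
  return Array.from(fresh);
}

const newPatterns = newErrorPatterns(
  ["Timeout after 3000ms on query 42"],
  [
    "Timeout after 5000ms on query 7",
    "TypeError: cannot read property 'id' of undefined",
    "TypeError: cannot read property 'id' of undefined",
  ],
);
```

The timeout is treated as a known pattern because only its numbers changed, while the TypeError is reported once even though it occurred twice.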

What if I can't roll back (e.g., database migration is destructive)?

This is a deploy-planning failure. Design deploys to be reversible: (1) Schema migrations: add the new column first (safe), and drop the old column in a later deploy. Never drop and re-add a column in a single deploy. (2) Data migrations: run the data migration after the deploy succeeds, before removing the old code. (3) Feature flags: hide new features behind flags; if a feature breaks, disable its flag without a deploy. (4) Backward compatibility: new code accepts old API contracts until old clients are gone. If you truly can't roll back and the deploy breaks: (1) fix forward (deploy a patch instead of rolling back); (2) restore from backup and retry (this applies to stateful data services, and it discards any writes made after the backup); (3) repair data manually (expensive and error-prone). Plan every deploy assuming a rollback might be needed.
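The feature-flag technique in point (3) relies on two properties: flags are off unless explicitly enabled, and they can be flipped at runtime without a deploy. The in-memory class below is a minimal sketch of those two properties only. The class and flag names are hypothetical, and a real setup would load overrides from a config service or a flag product so that operators can change them live.

```typescript
// Minimal feature-flag sketch: unknown or unset flags are always off.
class FeatureFlags {
  private readonly known: string[];
  private readonly overrides = new Map<string, boolean>();

  constructor(known: string[]) {
    this.known = known;
  }

  isEnabled(flag: string): boolean {
    return this.overrides.get(flag) === true; // default: disabled
  }

  // In production this would be driven by a config service, not a code change.
  set(flag: string, enabled: boolean): void {
    if (!this.known.includes(flag)) throw new Error(`unknown flag: ${flag}`);
    this.overrides.set(flag, enabled);
  }
}

const flags = new FeatureFlags(["new-checkout"]);

function checkoutPath(): string {
  // New code stays dark until the flag is enabled; disabling it is the kill switch.
  return flags.isEnabled("new-checkout") ? "new-checkout" : "legacy-checkout";
}
```

Rejecting unknown flag names guards against a typo silently leaving a feature in the wrong state.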

How often should I test restore from backup?

At least quarterly (4 times per year). More frequently for high-value data (financial, healthcare): monthly. Process: (1) Restore backup to test environment (not production). (2) Run queries on restored data: select count(*), verify recent data exists. (3) Run application smoke tests against restored data. (4) Record restore time (for RTO calculation). (5) Document any issues and fix. Automated restore testing: every restore is logged with metrics; alerts fire if restore fails. This catches backup corruption before it matters.
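Steps (2) and (4) of that process reduce to a few comparisons once the drill has collected its numbers. The sketch below assumes a hypothetical drill record containing per-table row counts from the source and from the restored copy (the SELECT COUNT(*) results), the newest record timestamp on the restored copy, and the restore start and end times. It returns a list of problems, empty when the drill passes. The one-hour allowance for how far the newest record may lag the backup time is an assumption.

```typescript
// Sketch of restore-drill checks; RestoreDrill is a hypothetical record.
interface RestoreDrill {
  backupTakenAt: Date;
  restoreStartedAt: Date;
  restoreFinishedAt: Date;
  expectedRowCounts: Record<string, number>; // captured when the backup was taken
  restoredRowCounts: Record<string, number>; // SELECT COUNT(*) on the restored copy
  newestRecordAt: Date; // e.g. the max created_at in the restored copy
}

const MINUTE_MS = 60 * 1000;

function verifyRestore(d: RestoreDrill, rtoMinutes: number, maxLagMinutes = 60): string[] {
  const problems: string[] = [];
  for (const table of Object.keys(d.expectedRowCounts)) {
    const expected = d.expectedRowCounts[table];
    const actual: number | undefined = d.restoredRowCounts[table];
    if (actual !== expected) {
      problems.push(`${table}: expected ${expected} rows, restored ${actual === undefined ? "none" : actual}`);
    }
  }
  // Restore duration feeds the RTO calculation.
  const restoreMinutes = (d.restoreFinishedAt.getTime() - d.restoreStartedAt.getTime()) / MINUTE_MS;
  if (restoreMinutes > rtoMinutes) {
    problems.push(`restore took ${restoreMinutes.toFixed(1)} min, over the ${rtoMinutes} min RTO`);
  }
  // Recent data should exist: the newest record should be close to the backup time.
  const lagMinutes = (d.backupTakenAt.getTime() - d.newestRecordAt.getTime()) / MINUTE_MS;
  if (lagMinutes > maxLagMinutes) {
    problems.push(`newest restored record is ${lagMinutes.toFixed(0)} min older than the backup`);
  }
  return problems;
}

const drill: RestoreDrill = {
  backupTakenAt: new Date("2024-03-01T02:00:00Z"),
  restoreStartedAt: new Date("2024-03-05T09:00:00Z"),
  restoreFinishedAt: new Date("2024-03-05T09:12:00Z"),
  expectedRowCounts: { users: 1200, orders: 5400 },
  restoredRowCounts: { users: 1200, orders: 5400 },
  newestRecordAt: new Date("2024-03-01T01:58:00Z"),
};
const drillProblems = verifyRestore(drill, 15); // passes: 12 min restore, 2 min lag
```

Logging the returned list for every drill gives the restore history needed to alert when a restore fails or slows down.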

What's the minimum monitoring I need?

Three metrics: (1) Error rate (red if >1%), (2) Latency p99 (yellow >2s, red >5s), (3) CPU/Memory (red >80%). Minimal: one dashboard, one alert rule. Better: per-endpoint error rates, per-database query latency, dependency health checks. Best: full observability with logs, traces, and metrics. Start minimal (3 metrics, 1 alert), expand as service grows.