Production Monitoring: SLAs, Errors & User Behavior

DevOps · Monitoring · Operations
RJ Lindelof
September 2, 2026 · 8 min read

Your code deployed successfully. Now what? Here's how to build production monitoring that actually tells you what's happening before your users complain.

Deployment is not the finish line. The most dangerous time for any feature is the first 24 hours in production. Without proper monitoring, you're flying blind - learning about problems from angry user emails instead of dashboards. Good monitoring tells you what's wrong before anyone else notices.

SLAs: Setting Realistic Expectations

Service Level Agreements define what "working" means. The difference between uptime targets is bigger than it looks:

  SLA        Annual Downtime    Monthly Downtime
  99%        3.65 days          7.2 hours
  99.9%      8.76 hours         43.8 minutes
  99.99%     52.6 minutes       4.4 minutes
  99.999%    5.26 minutes       26.3 seconds

Each additional nine costs roughly 10x more engineering effort. Pick the SLA your business actually needs, not the one that sounds impressive. Most websites don't need five nines.
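The arithmetic behind the table is simple enough to keep in a helper. A quick sketch (the function name is mine, not a standard API):

```python
def downtime_allowance(sla_percent: float, period_hours: float) -> float:
    """Allowed downtime in minutes for a given SLA over a period.

    E.g. 99.9% over a year (8,760 hours) allows ~525.6 minutes (~8.76 hours),
    matching the table above.
    """
    return period_hours * 60 * (1 - sla_percent / 100)
```

Run it against your proposed SLA before committing to a contract; the monthly number (use `8760 / 12` hours) is usually the one that surprises people.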

Error Monitoring: What the Numbers Mean

4xx Errors: Client Problems

  • 400 Bad Request - Client sent malformed data. Check your validation.
  • 401 Unauthorized - Auth failed. Check token expiration, login flows.
  • 403 Forbidden - Auth worked but permissions denied. Review access control.
  • 404 Not Found - Missing resources. Broken links, deleted content, bad URLs.
  • 429 Too Many Requests - Rate limiting kicked in. Someone's hammering your API.

4xx errors are often user errors, but spikes indicate UX problems or breaking changes.

5xx Errors: Your Problems

  • 500 Internal Server Error - Something crashed. Check logs immediately.
  • 502 Bad Gateway - Upstream service down. Check dependencies.
  • 503 Service Unavailable - Overloaded or maintenance. Scale up or investigate.
  • 504 Gateway Timeout - Upstream too slow. Database queries? External APIs?

5xx errors are always your responsibility. Each one is a user who had a bad experience.
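To catch these in aggregate rather than one log line at a time, track 4xx and 5xx rates over a window of recent responses. A minimal sketch (the metric names are illustrative):

```python
from collections import Counter

def error_rates(status_codes: list[int]) -> dict[str, float]:
    """Fraction of client (4xx) and server (5xx) errors in a window of responses."""
    total = len(status_codes)
    # Bucket each status by its leading digit: 2 -> success, 4 -> client, 5 -> server
    counts = Counter(code // 100 for code in status_codes)
    return {
        "client_error_rate": counts[4] / total,
        "server_error_rate": counts[5] / total,
    }
```

Alert on the server rate crossing a threshold (1% is a common starting point), and watch the client rate for the UX-problem spikes described above.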

Tools That Actually Work

Error Tracking

  • Sentry - Catches exceptions with full stack traces, context, and user info
  • Bugsnag - Similar to Sentry, strong mobile support
  • Rollbar - Real-time error tracking with deployment correlation
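These hosted tools do the heavy lifting, but the core mechanism - capture the full stack trace at the moment of failure, then re-raise - fits in a few lines of stdlib Python. A sketch of the idea only, not any vendor's SDK:

```python
import functools
import logging
import traceback

logger = logging.getLogger("app.errors")

def track_errors(func):
    """Log the full stack trace of any unhandled exception, then re-raise.

    Hosted trackers add context (user, release, breadcrumbs) automatically;
    this shows only the bare capture step.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            logger.error("unhandled error in %s:\n%s",
                         func.__name__, traceback.format_exc())
            raise
    return wrapper
```

The re-raise matters: error tracking observes failures, it doesn't swallow them.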

APM (Application Performance Monitoring)

  • Datadog - Full stack observability, traces, metrics, logs unified
  • New Relic - Deep application insights, database query analysis
  • Dynatrace - AI-powered root cause analysis

Alerting

  • PagerDuty - On-call scheduling, escalation policies, incident management
  • Opsgenie - Alerting with team routing
  • Slack/Teams integrations - For non-critical alerts

Alerting Without Alert Fatigue

The worst monitoring setup: alerts for everything. Your team learns to ignore them, and the real emergency gets lost in noise.

Design alerts with tiers:

  • Page immediately (wake someone up) - Service down, data loss, security incident
  • Urgent (respond within hours) - Error rate spike, degraded performance
  • Normal (next business day) - Elevated warnings, capacity approaching limits
  • Informational (don't alert) - Log for debugging, no action needed

If everything is urgent, nothing is.
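One way to enforce the tiers is to encode them in the alerting pipeline itself, so severity is decided by rule rather than by whoever writes the alert. A sketch with made-up thresholds:

```python
from enum import Enum

class Severity(Enum):
    PAGE = "page"        # wake someone up
    URGENT = "urgent"    # respond within hours
    NORMAL = "normal"    # next business day
    INFO = "info"        # log only, never alert

def classify_alert(service_up: bool, error_rate: float) -> Severity:
    """Map raw signals to an alert tier.

    The thresholds here are illustrative; tune them to your own
    baseline error rate.
    """
    if not service_up:
        return Severity.PAGE
    if error_rate > 0.05:      # 5%+ of requests failing
        return Severity.URGENT
    if error_rate > 0.01:      # elevated, but not on fire
        return Severity.NORMAL
    return Severity.INFO
```

Route each tier to a different channel (pager, on-call chat, ticket queue, logs) and the "everything is urgent" failure mode becomes structurally impossible.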

User Behavior Analytics

Technical metrics tell you what's broken. Behavior analytics tell you what's working (or not).

Heatmaps

Where do users actually click? Scroll? Hover? Heatmaps reveal:

  • CTAs that nobody notices
  • Content users never scroll to
  • Elements users try to click that aren't clickable

Session Replay

Watch recordings of actual user sessions (anonymized). You'll see:

  • Rage clicks when something doesn't respond
  • Confusion navigating your UI
  • Steps where users abandon flows

Tools like FullStory, Hotjar, and LogRocket make this easy.

Funnel Analysis

For any multi-step process (signup, checkout, onboarding):

  • How many users start?
  • Where do they drop off?
  • What's different about users who complete vs. abandon?
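Answering the drop-off question is a small computation once you count users per step. A sketch (the step names are hypothetical):

```python
def funnel_dropoff(step_counts: dict[str, int]) -> dict[str, float]:
    """Fraction of users lost at each transition of an ordered funnel."""
    steps = list(step_counts.items())
    return {
        f"{a} -> {b}": 1 - n_b / n_a
        for (a, n_a), (b, n_b) in zip(steps, steps[1:])
    }
```

For example, `funnel_dropoff({"visit": 1000, "signup": 300, "paid": 120})` reports 70% lost between visit and signup - which tells you where to point the heatmaps and session replays from the previous sections.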

Acting on Data

Metrics without action are just expensive charts. Create feedback loops:

  • Weekly metrics review - What's trending wrong?
  • Error budgets - When reliability drops, prioritize fixes over features
  • Postmortems - After incidents, document what happened and how to prevent it
  • Alerts that create tickets - Don't rely on memory
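An error budget turns the SLA into a spendable quantity: the downtime the SLA allows, minus what you've already burned this period. A sketch (the function name is mine):

```python
def error_budget_remaining(sla_percent: float, period_minutes: float,
                           downtime_minutes: float) -> float:
    """Fraction of the period's error budget still unspent.

    At or below zero, reliability work takes priority over new features.
    """
    budget = period_minutes * (1 - sla_percent / 100)
    return 1 - downtime_minutes / budget
```

With a 99.9% SLA over a ~43,800-minute month, the budget is 43.8 minutes; 21.9 minutes of downtime leaves half the budget, and 43.8 minutes means the feature train stops.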

Postmortems That Actually Improve Things

Most postmortems are blame sessions that produce nothing. Effective postmortems:

  • Are blameless - Focus on systems, not individuals
  • Establish timeline - What happened when?
  • Identify root cause - Why did the system allow this?
  • List action items - Concrete tasks with owners and deadlines
  • Follow up - Did we actually complete those items?

The goal isn't to prevent all failures - it's to prevent the same failure twice.

Starting Point

If you have nothing today, start with:

  1. Error tracking - Know when things break
  2. Uptime monitoring - Know when the site is down
  3. One key business metric - Signups, purchases, whatever matters most
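Uptime monitoring can start as literally a cron job hitting your health endpoint. A minimal stdlib sketch (the URL you pass is a placeholder for your own service):

```python
import urllib.error
import urllib.request

def is_up(url: str, timeout: float = 5.0) -> bool:
    """True if the URL answers with a non-error status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

Hosted checkers improve on this with multi-region probes and alert integration, but a five-line script from a box outside your infrastructure already beats finding out from users.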

You can add sophistication later. First, stop being blind.

Your code isn't done when it deploys. It's done when you can prove it's working. Build monitoring that gives you confidence - and sleep.


About the Author

RJ Lindelof is a technology executive with 35+ years of experience spanning Fortune 500 companies to startups. He doesn't just talk about AI; he implements it to solve real-world business problems. RJ's approach has led to significant improvements in team velocity, code quality, and time-to-market.