Alert State Management in state.json
The state.json file tracks the current health state of all probes in a project, including failure counts, notification timestamps, and cooldown periods. This file is critical for alert management and preventing notification spam.
File Location
```
storage/app/private/uplinkr/<project-name>/state.json
```

File Purpose
The state file serves several key functions:
- Tracks consecutive failures - Counts how many times a probe has failed in a row
- Manages alert cooldowns - Prevents repeated notifications for the same issue
- Records notification history - Timestamps when alerts were last sent
- Monitors latency issues - Tracks consecutive slow responses
- Supports project-level alert aggregation - Probe-level decisions are grouped into one notification per project (and alert config group)
File Structure
Section titled “File Structure”{ "project": "my-project", "probes": { "<method> <url>": { "last_seen_executed_at": "2026-01-30 21:19:00", "consecutive_failures": 0, "consecutive_slow": 0, "last_notified_failure_at": null, "last_notified_slow_at": null, "total_failures": 0 } }, "updated_at": "2026-01-30 21:19:02"}Field Reference
Top-Level Fields
| Field | Type | Description |
|---|---|---|
| `project` | string | Project identifier |
| `probes` | object | State information for each probe (keyed by method + URL) |
| `updated_at` | datetime | Last state update timestamp |
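As an illustration of how a consumer might read this structure, the following Python sketch parses a state document matching the layout above and lists its probe keys (the inline document is an example, not a real project's state):

```python
import json

# Example state document matching the documented structure.
raw = """
{
  "project": "my-project",
  "probes": {
    "GET https://example.com/health": {
      "last_seen_executed_at": "2026-01-30 21:19:00",
      "consecutive_failures": 0,
      "consecutive_slow": 0,
      "last_notified_failure_at": null,
      "last_notified_slow_at": null,
      "total_failures": 0
    }
  },
  "updated_at": "2026-01-30 21:19:02"
}
"""

state = json.loads(raw)
print(state["project"])              # project identifier
print(list(state["probes"].keys()))  # probe keys in "<METHOD> <URL>" form
```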
Probe Key Format
Probes are indexed using the format `<METHOD> <URL>`.
Examples:
"GET https://example.com/health""POST https://api.example.com/status""DELETE https://api.example.com/resource"Probe State Fields
| Field | Type | Description |
|---|---|---|
| `last_seen_executed_at` | datetime | When this probe was last executed |
| `consecutive_failures` | integer | Number of failures in a row (resets on success) |
| `consecutive_slow` | integer | Number of slow responses in a row (resets when fast) |
| `last_notified_failure_at` | datetime \| null | When a failure alert was last sent |
| `last_notified_slow_at` | datetime \| null | When a slow-response alert was last sent |
| `total_failures` | integer | Total lifetime failures for this probe (optional) |
State Lifecycle
Initial State
Section titled “Initial State”When a probe is first added, its state is created with default values:
{ "last_seen_executed_at": null, "consecutive_failures": 0, "consecutive_slow": 0, "last_notified_failure_at": null, "last_notified_slow_at": null, "total_failures": 0}On Successful Check
When a probe succeeds:
- `consecutive_failures` → reset to `0`
- `consecutive_slow` → reset to `0` (if the response was fast)
- `last_seen_executed_at` → updated to the current timestamp
On Failed Check
When a probe fails:
- `consecutive_failures` → incremented by 1
- `total_failures` → incremented by 1 (if tracked)
- `last_seen_executed_at` → updated to the current timestamp
If `consecutive_failures` reaches the `trigger_after_failures` threshold:
- Probe is marked as alertable (if cooldown period has passed)
- `last_notified_failure_at` → updated to the current timestamp
On Slow Response
When a probe exceeds the latency threshold:
- `consecutive_slow` → incremented by 1
- `last_seen_executed_at` → updated to the current timestamp
`consecutive_slow` and `last_notified_slow_at` are tracked in state.json, but slow-response alert decisions are currently not dispatched by the active alert decision flow.
Alert Triggering Logic
Failure Alerts
```
IF consecutive_failures >= trigger_after_failures
AND (last_notified_failure_at is null
     OR time_since(last_notified_failure_at) > cooldown_minutes)
THEN mark_probe_for_alert()
```
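The pseudocode above translates roughly to the following Python check. This is a hedged sketch: `should_alert` and its parameter names are illustrative, and the timestamp format is assumed to match the `YYYY-MM-DD HH:MM:SS` strings shown elsewhere in this document:

```python
from datetime import datetime, timedelta

def should_alert(probe_state: dict, trigger_after_failures: int,
                 cooldown_minutes: int, now: datetime) -> bool:
    """Return True when a failure alert may be sent for this probe."""
    if probe_state["consecutive_failures"] < trigger_after_failures:
        return False
    last = probe_state["last_notified_failure_at"]
    if last is None:
        return True  # never notified before, no cooldown to respect
    last_dt = datetime.strptime(last, "%Y-%m-%d %H:%M:%S")
    return now - last_dt > timedelta(minutes=cooldown_minutes)

# 53 consecutive failures, but the last alert went out ~53 minutes ago,
# so a 120-minute cooldown still suppresses the next one.
now = datetime(2026, 1, 30, 21, 19, 0)
probe = {"consecutive_failures": 53,
         "last_notified_failure_at": "2026-01-30 20:26:01"}
print(should_alert(probe, trigger_after_failures=20, cooldown_minutes=120, now=now))  # False
```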
Slow Response Alerts

Slow-response state is tracked, but alert triggering for slow responses is currently not part of the active alert decision dispatch path.
Notification Dispatch (Project-Level)
```
GROUP all alertable probes by project (+ matching alert configuration)
THEN send one notification per group
```

Project-Level Aggregation Behavior
Alert decisions are still evaluated per probe, using each probe’s state and cooldown timestamps.
Notification delivery is grouped afterwards:
- Multiple failing probes in the same project are sent in one notification
- The grouped message contains a list of affected probes
- Grouping is separated by alert configuration, so channel/cooldown behavior remains consistent
This reduces notification noise while preserving probe-level state tracking.
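The grouping step can be sketched like this (illustrative only: the `(project, config)` tuple key and the tuples in `alertable` are assumptions about how groups are formed, not the actual data model):

```python
from collections import defaultdict

# Alertable probes as (project, alert_config_name, probe_key) tuples.
alertable = [
    ("my-project", "slack-ops", "GET https://example.com/health"),
    ("my-project", "slack-ops", "POST https://api.example.com/status"),
    ("other-project", "email-oncall", "GET https://other.example.com/ping"),
]

groups: dict[tuple[str, str], list[str]] = defaultdict(list)
for project, config, probe_key in alertable:
    groups[(project, config)].append(probe_key)  # one group per project + alert config

# One notification per group, listing all affected probes.
for (project, config), probes in groups.items():
    print(f"[{config}] {project}: {len(probes)} probe(s) failing -> {probes}")
```

Grouping by the alert configuration as well as the project keeps channel and cooldown behavior consistent within each notification.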
Example: Complete State File
Section titled “Example: Complete State File”{ "project": "uplinkr-dev-api-test", "probes": { "GET https://uplinkr.dev/health": { "last_seen_executed_at": "2026-01-30 21:19:00", "consecutive_failures": 53, "consecutive_slow": 0, "last_notified_failure_at": "2026-01-30 20:26:01", "last_notified_slow_at": null, "total_failures": 220 }, "GET https://api-test.uplinkr.dev/health": { "last_seen_executed_at": "2026-01-30 21:19:01", "consecutive_failures": 0, "consecutive_slow": 0, "last_notified_failure_at": null, "last_notified_slow_at": null }, "POST https://api-test.uplinkr.dev/status": { "last_seen_executed_at": "2026-01-30 21:19:01", "consecutive_failures": 54, "consecutive_slow": 0, "last_notified_failure_at": "2026-01-30 20:26:01", "last_notified_slow_at": null, "total_failures": 220 } }, "updated_at": "2026-01-30 21:19:02"}Interpreting State Data
Healthy Probe
Section titled “Healthy Probe”{ "consecutive_failures": 0, "consecutive_slow": 0, "last_notified_failure_at": null, "last_notified_slow_at": null}Interpretation: Probe is functioning normally with no recent issues.
Failing Probe (Not Yet Alerted)
```json
{
  "consecutive_failures": 15,
  "consecutive_slow": 0,
  "last_notified_failure_at": null,
  "last_notified_slow_at": null
}
```

Interpretation: The probe is failing but hasn’t reached the alert threshold yet (e.g., `trigger_after_failures: 20`).
Failing Probe (Already Alerted)
```json
{
  "consecutive_failures": 53,
  "consecutive_slow": 0,
  "last_notified_failure_at": "2026-01-30 20:26:01",
  "last_notified_slow_at": null
}
```

Interpretation: The probe has been failing for 53 consecutive checks. An alert was sent at 20:26, and the cooldown is active.
Slow Response Pattern
```json
{
  "consecutive_failures": 0,
  "consecutive_slow": 12,
  "last_notified_failure_at": null,
  "last_notified_slow_at": "2026-01-30 18:00:00"
}
```

Interpretation: The probe is reachable but responding slowly. An alert was sent at 18:00.
Manual State Management
Resetting State
Section titled “Resetting State”To reset state for a specific probe (e.g., after fixing an issue):
```
# Edit state.json and set:
"consecutive_failures": 0,
"consecutive_slow": 0,
"last_notified_failure_at": null,
"last_notified_slow_at": null
```

Clearing Notification History
To force immediate re-alerting (bypass cooldown):
```
# Set to null:
"last_notified_failure_at": null,
"last_notified_slow_at": null
```
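These manual edits can also be scripted. The sketch below (a hypothetical helper, not a shipped tool) resets the counters and clears both notification timestamps for a single probe in one pass:

```python
import json
from pathlib import Path

def reset_probe(state_path: Path, probe_key: str) -> None:
    """Reset failure counters and clear cooldown timestamps for one probe."""
    state = json.loads(state_path.read_text())
    probe = state["probes"][probe_key]
    probe["consecutive_failures"] = 0
    probe["consecutive_slow"] = 0
    probe["last_notified_failure_at"] = None  # bypasses cooldown on the next failure
    probe["last_notified_slow_at"] = None
    state_path.write_text(json.dumps(state, indent=2))

# Example invocation (path and probe key are illustrative):
# reset_probe(Path("storage/app/private/uplinkr/my-project/state.json"),
#             "GET https://example.com/health")
```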
Relationship with Alert Configuration

The state file works in conjunction with the alert configuration in settings.json:
From settings.json:
```json
{
  "trigger_after_failures": 20,
  "cooldown_minutes": 120,
  "latency_threshold_ms": 1000,
  "trigger_after_slow": 10
}
```

Applied to state.json:
- `consecutive_failures` compared against `trigger_after_failures`
- `last_notified_failure_at` checked against `cooldown_minutes`
- Response time compared against `latency_threshold_ms`
- `consecutive_slow` compared against `trigger_after_slow`
Common Scenarios
Scenario 1: Flapping Service
State pattern:
{ "consecutive_failures": 0, "total_failures": 150}Analysis: Service is currently healthy but has experienced many failures over time. Consider investigating root cause of instability.
Scenario 2: Persistent Outage
State pattern:
{ "consecutive_failures": 200, "last_notified_failure_at": "2026-01-30 12:00:00"}Analysis: Service has been down for 200 consecutive checks. Alert was sent but cooldown prevents spam. Urgent attention needed.
Scenario 3: Performance Degradation
State pattern:
{ "consecutive_failures": 0, "consecutive_slow": 50, "last_notified_slow_at": "2026-01-30 15:00:00"}Analysis: Service is reachable but performance is degraded. May indicate resource constraints or scaling issues.
Best Practices
Monitoring State Health
Regularly check state files for:
- High `consecutive_failures` counts
- High `consecutive_slow` counts
- Probes that never succeed (`total_failures` keeps growing)
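A routine check like the one above is easy to automate. This sketch (the thresholds and helper name are illustrative) flags probes whose counters look unhealthy in an already-loaded state document:

```python
import json

def unhealthy_probes(state: dict, max_failures: int = 10, max_slow: int = 10) -> list[str]:
    """Return probe keys whose counters exceed the given thresholds."""
    flagged = []
    for key, probe in state["probes"].items():
        if probe.get("consecutive_failures", 0) > max_failures:
            flagged.append(key)  # persistently failing probe
        elif probe.get("consecutive_slow", 0) > max_slow:
            flagged.append(key)  # persistently slow probe
    return flagged

state = {"probes": {
    "GET https://example.com/health": {"consecutive_failures": 53, "consecutive_slow": 0},
    "GET https://example.com/ping": {"consecutive_failures": 0, "consecutive_slow": 0},
}}
print(unhealthy_probes(state))  # flags only the failing probe
```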
Alert Tuning
Adjust thresholds in settings.json based on state patterns:
- If you are getting too many alerts, increase `trigger_after_failures`
- If alerts repeat too often, increase `cooldown_minutes`
- If slow responses are being missed, decrease `latency_threshold_ms`
State Backup
Include state.json in backups to preserve:
- Historical notification timestamps
- Failure count trends
- Current alert cooldown states
Related Topics
- Storage Structure - Overall storage architecture
- Project Files - Alert configuration in settings.json
- Probe Data - Probe execution results that update state
- Configuration - System-wide alert settings