
Alert State Management in state.json

The state.json file tracks the current health state of all probes in a project, including failure counts, notification timestamps, and cooldown periods. This file is critical for alert management and preventing notification spam.

```text
storage/app/private/uplinkr/<project-name>/state.json
```

The state file serves several key functions:

  1. Tracks consecutive failures - Counts how many times a probe has failed in a row
  2. Manages alert cooldowns - Prevents repeated notifications for the same issue
  3. Records notification history - Timestamps when alerts were last sent
  4. Monitors latency issues - Tracks consecutive slow responses
  5. Supports project-level alert aggregation - Probe-level decisions are grouped into one notification per project (and alert config group)
```json
{
  "project": "my-project",
  "probes": {
    "<method> <url>": {
      "last_seen_executed_at": "2026-01-30 21:19:00",
      "consecutive_failures": 0,
      "consecutive_slow": 0,
      "last_notified_failure_at": null,
      "last_notified_slow_at": null,
      "total_failures": 0
    }
  },
  "updated_at": "2026-01-30 21:19:02"
}
```
| Field | Type | Description |
| --- | --- | --- |
| project | string | Project identifier |
| probes | object | State information for each probe (keyed by method + URL) |
| updated_at | datetime | Last state update timestamp |

Probes are indexed using `<METHOD> <URL>`.

Examples:

```text
"GET https://example.com/health"
"POST https://api.example.com/status"
"DELETE https://api.example.com/resource"
```
| Field | Type | Description |
| --- | --- | --- |
| last_seen_executed_at | datetime | When this probe was last executed |
| consecutive_failures | integer | Number of failures in a row (resets on success) |
| consecutive_slow | integer | Number of slow responses in a row (resets when fast) |
| last_notified_failure_at | datetime \| null | When the failure alert was last sent |
| last_notified_slow_at | datetime \| null | When the slow-response alert was last sent |
| total_failures | integer | Total lifetime failures for this probe (optional) |

When a probe is first added, its state is created with default values:

```json
{
  "last_seen_executed_at": null,
  "consecutive_failures": 0,
  "consecutive_slow": 0,
  "last_notified_failure_at": null,
  "last_notified_slow_at": null,
  "total_failures": 0
}
```

When a probe succeeds:

  • consecutive_failures → reset to 0
  • consecutive_slow → reset to 0 (if response was fast)
  • last_seen_executed_at → updated to current timestamp

When a probe fails:

  • consecutive_failures → incremented by 1
  • total_failures → incremented by 1 (if tracked)
  • last_seen_executed_at → updated to current timestamp

If consecutive_failures reaches the trigger_after_failures threshold:

  • Probe is marked as alertable (if cooldown period has passed)
  • last_notified_failure_at → updated to current timestamp

When a probe exceeds latency threshold:

  • consecutive_slow → incremented by 1
  • last_seen_executed_at → updated to current timestamp
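
Taken together, the success, failure, and slow-response transitions above can be condensed into a single update function. The sketch below is illustrative only — the function name, signature, and timestamp format are assumptions, not Uplinkr's actual implementation:

```python
from datetime import datetime

def update_probe_state(state: dict, success: bool, slow: bool = False) -> dict:
    """Apply one check result to a probe's state entry (illustrative sketch)."""
    # Every check, pass or fail, refreshes the last-seen timestamp.
    state["last_seen_executed_at"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    if success:
        # Success always resets the failure streak.
        state["consecutive_failures"] = 0
        if slow:
            # Reachable but slow: the slow streak grows instead.
            state["consecutive_slow"] = state.get("consecutive_slow", 0) + 1
        else:
            # Fast response resets the slow streak too.
            state["consecutive_slow"] = 0
    else:
        state["consecutive_failures"] += 1
        # total_failures is optional, so tolerate its absence.
        state["total_failures"] = state.get("total_failures", 0) + 1
    return state
```

Note that `consecutive_failures` resets on any success, while `consecutive_slow` only resets when the successful response was also fast.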

consecutive_slow and last_notified_slow_at are tracked in state.json, but slow-response alert decisions are currently not dispatched by the active alert decision flow.

```text
IF consecutive_failures >= trigger_after_failures
AND (last_notified_failure_at is null
     OR time_since(last_notified_failure_at) > cooldown_minutes)
THEN mark_probe_for_alert()
```
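
The same decision rule in Python terms — a minimal sketch, not the actual implementation; `should_alert` and its parameter names are hypothetical:

```python
from datetime import datetime, timedelta

def should_alert(state: dict, trigger_after_failures: int,
                 cooldown_minutes: int, now: datetime) -> bool:
    """Return True if this probe's failure state warrants a notification."""
    # Below the threshold: never alert.
    if state["consecutive_failures"] < trigger_after_failures:
        return False
    last = state.get("last_notified_failure_at")
    # Never notified before: alert immediately.
    if last is None:
        return True
    # Otherwise alert only once the cooldown window has elapsed.
    last_dt = datetime.strptime(last, "%Y-%m-%d %H:%M:%S")
    return now - last_dt > timedelta(minutes=cooldown_minutes)
```

With `trigger_after_failures: 20` and `cooldown_minutes: 120`, a probe at 53 consecutive failures that was notified 53 minutes ago stays quiet; the same probe with a null timestamp alerts at once.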

Slow-response state is tracked, but alert triggering for slow responses is currently not part of the active alert decision dispatch path.

```text
GROUP all alertable probes by project (+ matching alert configuration)
THEN send one notification per group
```

Alert decisions are still evaluated per probe, using each probe’s state and cooldown timestamps.

Notification delivery is grouped afterwards:

  • Multiple failing probes in the same project are sent in one notification
  • The grouped message contains a list of affected probes
  • Grouping is separated by alert configuration, so channel/cooldown behavior remains consistent

This reduces notification noise while preserving probe-level state tracking.
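
A minimal sketch of this grouping step (the `group_alertable_probes` helper and its tuple input shape are hypothetical, shown only to illustrate the project + alert-configuration key):

```python
from collections import defaultdict

def group_alertable_probes(decisions):
    """Bucket alertable probes so one notification is sent per
    (project, alert configuration) pair."""
    groups = defaultdict(list)
    for project, alert_config, probe_key in decisions:
        # Probes sharing a project and alert config land in the same message.
        groups[(project, alert_config)].append(probe_key)
    return dict(groups)
```

Two failing probes on the same project and channel produce a single grouped message; a probe routed to a different alert configuration gets its own.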

```json
{
  "project": "uplinkr-dev-api-test",
  "probes": {
    "GET https://uplinkr.dev/health": {
      "last_seen_executed_at": "2026-01-30 21:19:00",
      "consecutive_failures": 53,
      "consecutive_slow": 0,
      "last_notified_failure_at": "2026-01-30 20:26:01",
      "last_notified_slow_at": null,
      "total_failures": 220
    },
    "GET https://api-test.uplinkr.dev/health": {
      "last_seen_executed_at": "2026-01-30 21:19:01",
      "consecutive_failures": 0,
      "consecutive_slow": 0,
      "last_notified_failure_at": null,
      "last_notified_slow_at": null
    },
    "POST https://api-test.uplinkr.dev/status": {
      "last_seen_executed_at": "2026-01-30 21:19:01",
      "consecutive_failures": 54,
      "consecutive_slow": 0,
      "last_notified_failure_at": "2026-01-30 20:26:01",
      "last_notified_slow_at": null,
      "total_failures": 220
    }
  },
  "updated_at": "2026-01-30 21:19:02"
}
```
```json
{
  "consecutive_failures": 0,
  "consecutive_slow": 0,
  "last_notified_failure_at": null,
  "last_notified_slow_at": null
}
```

Interpretation: Probe is functioning normally with no recent issues.

```json
{
  "consecutive_failures": 15,
  "consecutive_slow": 0,
  "last_notified_failure_at": null,
  "last_notified_slow_at": null
}
```

Interpretation: Probe is failing but hasn’t reached the alert threshold yet (e.g., trigger_after_failures: 20).

```json
{
  "consecutive_failures": 53,
  "consecutive_slow": 0,
  "last_notified_failure_at": "2026-01-30 20:26:01",
  "last_notified_slow_at": null
}
```

Interpretation: Probe has been failing for 53 consecutive checks. Alert was sent at 20:26, and cooldown is active.

```json
{
  "consecutive_failures": 0,
  "consecutive_slow": 12,
  "last_notified_failure_at": null,
  "last_notified_slow_at": "2026-01-30 18:00:00"
}
```

Interpretation: Probe is reachable but responding slowly. Alert was sent at 18:00.

To reset state for a specific probe (e.g., after fixing an issue):

Edit state.json and set the probe's fields to:

```json
"consecutive_failures": 0,
"consecutive_slow": 0,
"last_notified_failure_at": null,
"last_notified_slow_at": null
```
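
The same reset can be scripted. This is a minimal sketch assuming the state file layout shown above; `reset_probe_state` is a hypothetical helper, not a built-in Uplinkr command, and any JSON-aware tool works equally well:

```python
import json

def reset_probe_state(state_path: str, probe_key: str) -> None:
    """Clear failure/slow counters and cooldown timestamps for one probe."""
    with open(state_path) as f:
        state = json.load(f)
    # probe_key is the "<METHOD> <URL>" index, e.g. "GET https://example.com/health".
    probe = state["probes"][probe_key]
    probe.update({
        "consecutive_failures": 0,
        "consecutive_slow": 0,
        "last_notified_failure_at": None,
        "last_notified_slow_at": None,
    })
    with open(state_path, "w") as f:
        json.dump(state, f, indent=2)
```

Only the targeted probe entry is touched; other probes and the project-level fields are written back unchanged.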

To force immediate re-alerting (bypass cooldown):

Set the notification timestamps to null:

```json
"last_notified_failure_at": null,
"last_notified_slow_at": null
```

The state file works in conjunction with the alert configuration in settings.json:

```json
{
  "trigger_after_failures": 20,
  "cooldown_minutes": 120,
  "latency_threshold_ms": 1000,
  "trigger_after_slow": 10
}
```
  • consecutive_failures compared against trigger_after_failures
  • last_notified_failure_at checked against cooldown_minutes
  • Response time compared against latency_threshold_ms
  • consecutive_slow compared against trigger_after_slow
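
These comparisons can be sketched as one evaluation step. Illustrative only — `evaluate_probe` is a hypothetical name, the cooldown comparison against `last_notified_failure_at` is omitted here for brevity, and (as noted above) slow-response results are tracked but not currently dispatched:

```python
def evaluate_probe(state: dict, settings: dict, response_time_ms: float) -> dict:
    """Compare one probe's state and latest response time against the
    alert settings from settings.json."""
    return {
        # consecutive_failures vs trigger_after_failures
        "failure_threshold_reached":
            state["consecutive_failures"] >= settings["trigger_after_failures"],
        # latest response time vs latency_threshold_ms
        "response_is_slow":
            response_time_ms > settings["latency_threshold_ms"],
        # consecutive_slow vs trigger_after_slow
        "slow_threshold_reached":
            state.get("consecutive_slow", 0) >= settings["trigger_after_slow"],
    }
```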

State pattern:

```json
{
  "consecutive_failures": 0,
  "total_failures": 150
}
```

Analysis: Service is currently healthy but has experienced many failures over time. Consider investigating root cause of instability.

State pattern:

```json
{
  "consecutive_failures": 200,
  "last_notified_failure_at": "2026-01-30 12:00:00"
}
```

Analysis: Service has been down for 200 consecutive checks. Alert was sent but cooldown prevents spam. Urgent attention needed.

State pattern:

```json
{
  "consecutive_failures": 0,
  "consecutive_slow": 50,
  "last_notified_slow_at": "2026-01-30 15:00:00"
}
```

Analysis: Service is reachable but performance is degraded. May indicate resource constraints or scaling issues.

Regularly check state files for:

  • High consecutive_failures counts
  • High consecutive_slow counts
  • Probes that never succeed (total_failures keeps growing)
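
This routine check can be automated. A sketch under the state schema above — `find_unhealthy_probes` and the default limits are assumptions, not part of Uplinkr:

```python
def find_unhealthy_probes(state: dict, failure_limit: int = 10,
                          slow_limit: int = 10) -> list:
    """Return probe keys whose failure or slow streaks exceed the limits."""
    flagged = []
    for probe_key, probe in state["probes"].items():
        if (probe.get("consecutive_failures", 0) >= failure_limit
                or probe.get("consecutive_slow", 0) >= slow_limit):
            flagged.append(probe_key)
    return flagged
```

Running this over each project's state.json surfaces probes worth investigating before adjusting thresholds.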

Adjust thresholds in settings.json based on state patterns:

  • If getting too many alerts, increase trigger_after_failures
  • If alerts repeat too often, increase cooldown_minutes
  • If missing slow responses, decrease latency_threshold_ms

Include state.json in backups to preserve:

  • Historical notification timestamps
  • Failure count trends
  • Current alert cooldown states