
Alert State Management in state.json

The state.json file tracks the current health state of all probes in a project, including failure counts, notification timestamps, and cooldown periods. This file is critical for alert management and preventing notification spam.

```text
storage/app/private/uplinkr/<project-name>/state.json
```

The state file serves several key functions:

  1. Tracks consecutive failures - Counts how many times a probe has failed in a row
  2. Manages alert cooldowns - Prevents repeated notifications for the same issue
  3. Records notification history - Timestamps when alerts were last sent
  4. Monitors latency issues - Tracks consecutive slow responses
  5. Supports project-level alert aggregation - Probe-level decisions are grouped into one notification per project (and alert config group)
```json
{
  "project": "my-project",
  "probes": {
    "<method> <url>": {
      "last_seen_executed_at": "2026-01-30 21:19:00",
      "consecutive_failures": 0,
      "consecutive_slow": 0,
      "last_notified_failure_at": null,
      "last_notified_slow_at": null,
      "total_failures": 0
    }
  },
  "updated_at": "2026-01-30 21:19:02"
}
```
| Field | Type | Description |
| --- | --- | --- |
| project | string | Project identifier |
| probes | object | State information for each probe (keyed by method + URL) |
| updated_at | datetime | Last state update timestamp |

Probes are indexed using `<METHOD> <URL>`.

Examples:

```text
"GET https://example.com/health"
"POST https://api.example.com/status"
"DELETE https://api.example.com/resource"
```
| Field | Type | Description |
| --- | --- | --- |
| last_seen_executed_at | datetime | When this probe was last executed |
| consecutive_failures | integer | Number of failures in a row (resets on success) |
| consecutive_slow | integer | Number of slow responses in a row (resets when fast) |
| last_notified_failure_at | datetime \| null | When the failure alert was last sent |
| last_notified_slow_at | datetime \| null | When the slow-response alert was last sent |
| total_failures | integer | Total lifetime failures for this probe (optional) |

When a probe is first added, its state is created with default values:

```json
{
  "last_seen_executed_at": null,
  "consecutive_failures": 0,
  "consecutive_slow": 0,
  "last_notified_failure_at": null,
  "last_notified_slow_at": null,
  "total_failures": 0
}
```

When a probe succeeds:

  • consecutive_failures → reset to 0
  • consecutive_slow → reset to 0 (if response was fast)
  • last_seen_executed_at → updated to current timestamp

When a probe fails:

  • consecutive_failures → incremented by 1
  • total_failures → incremented by 1 (if tracked)
  • last_seen_executed_at → updated to current timestamp

If consecutive_failures reaches the trigger_after_failures threshold:

  • Probe is marked as alertable (if cooldown period has passed)
  • last_notified_failure_at → updated to current timestamp

When a probe exceeds latency threshold:

  • consecutive_slow → incremented by 1
  • last_seen_executed_at → updated to current timestamp
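
Taken together, the success, failure, and slow-response transitions above can be condensed into a single update function. The sketch below is illustrative only — the function name, signature, and timestamp format are assumptions, not Uplinkr's actual implementation:

```python
from datetime import datetime

def update_probe_state(state: dict, success: bool, slow: bool = False) -> dict:
    """Apply one check result to a probe's state entry (illustrative sketch)."""
    # Every check, pass or fail, refreshes the last-seen timestamp.
    state["last_seen_executed_at"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    if success:
        # Success always resets the failure streak.
        state["consecutive_failures"] = 0
        if slow:
            # Reachable but slow: the slow streak grows instead.
            state["consecutive_slow"] = state.get("consecutive_slow", 0) + 1
        else:
            # Fast response resets the slow streak too.
            state["consecutive_slow"] = 0
    else:
        state["consecutive_failures"] += 1
        # total_failures is optional, so tolerate its absence.
        state["total_failures"] = state.get("total_failures", 0) + 1
    return state
```

Note that `consecutive_failures` resets on any success, while `consecutive_slow` only resets when the successful response was also fast.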

consecutive_slow and last_notified_slow_at are tracked in state.json, but slow-response alert decisions are currently not dispatched by the active alert decision flow.

```text
IF consecutive_failures >= trigger_after_failures
AND (last_notified_failure_at is null
     OR time_since(last_notified_failure_at) > cooldown_minutes)
THEN mark_probe_for_alert()
```
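
The same decision rule in Python terms — a minimal sketch, not the actual implementation; `should_alert` and its parameter names are hypothetical:

```python
from datetime import datetime, timedelta

def should_alert(state: dict, trigger_after_failures: int,
                 cooldown_minutes: int, now: datetime) -> bool:
    """Return True if this probe's failure state warrants a notification."""
    # Below the threshold: never alert.
    if state["consecutive_failures"] < trigger_after_failures:
        return False
    last = state.get("last_notified_failure_at")
    # Never notified before: alert immediately.
    if last is None:
        return True
    # Otherwise alert only once the cooldown window has elapsed.
    last_dt = datetime.strptime(last, "%Y-%m-%d %H:%M:%S")
    return now - last_dt > timedelta(minutes=cooldown_minutes)
```

With `trigger_after_failures: 20` and `cooldown_minutes: 120`, a probe at 53 consecutive failures that was notified 53 minutes ago stays quiet; the same probe with a null timestamp alerts at once.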

Slow-response state is tracked, but alert triggering for slow responses is currently not part of the active alert decision dispatch path.

```text
GROUP all alertable probes by project (+ matching alert configuration)
THEN send one notification per group
```

Alert decisions are still evaluated per probe, using each probe’s state and cooldown timestamps.

Notification delivery is grouped afterwards:

  • Multiple failing probes in the same project are sent in one notification
  • The grouped message contains a list of affected probes
  • Grouping is separated by alert configuration, so channel/cooldown behavior remains consistent

This reduces notification noise while preserving probe-level state tracking.
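
A minimal sketch of this grouping step (the `group_alertable_probes` helper and its tuple input shape are hypothetical, shown only to illustrate the project + alert-configuration key):

```python
from collections import defaultdict

def group_alertable_probes(decisions):
    """Bucket alertable probes so one notification is sent per
    (project, alert configuration) pair."""
    groups = defaultdict(list)
    for project, alert_config, probe_key in decisions:
        # Probes sharing a project and alert config land in the same message.
        groups[(project, alert_config)].append(probe_key)
    return dict(groups)
```

Two failing probes on the same project and channel produce a single grouped message; a probe routed to a different alert configuration gets its own.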

```json
{
  "project": "uplinkr-dev-api-test",
  "probes": {
    "GET https://uplinkr.dev/health": {
      "last_seen_executed_at": "2026-01-30 21:19:00",
      "consecutive_failures": 53,
      "consecutive_slow": 0,
      "last_notified_failure_at": "2026-01-30 20:26:01",
      "last_notified_slow_at": null,
      "total_failures": 220
    },
    "GET https://api-test.uplinkr.dev/health": {
      "last_seen_executed_at": "2026-01-30 21:19:01",
      "consecutive_failures": 0,
      "consecutive_slow": 0,
      "last_notified_failure_at": null,
      "last_notified_slow_at": null
    },
    "POST https://api-test.uplinkr.dev/status": {
      "last_seen_executed_at": "2026-01-30 21:19:01",
      "consecutive_failures": 54,
      "consecutive_slow": 0,
      "last_notified_failure_at": "2026-01-30 20:26:01",
      "last_notified_slow_at": null,
      "total_failures": 220
    }
  },
  "updated_at": "2026-01-30 21:19:02"
}
```
```json
{
  "consecutive_failures": 0,
  "consecutive_slow": 0,
  "last_notified_failure_at": null,
  "last_notified_slow_at": null
}
```

Interpretation: Probe is functioning normally with no recent issues.

```json
{
  "consecutive_failures": 15,
  "consecutive_slow": 0,
  "last_notified_failure_at": null,
  "last_notified_slow_at": null
}
```

Interpretation: Probe is failing but hasn’t reached the alert threshold yet (e.g., trigger_after_failures: 20).

```json
{
  "consecutive_failures": 53,
  "consecutive_slow": 0,
  "last_notified_failure_at": "2026-01-30 20:26:01",
  "last_notified_slow_at": null
}
```

Interpretation: Probe has been failing for 53 consecutive checks. Alert was sent at 20:26, and cooldown is active.

```json
{
  "consecutive_failures": 0,
  "consecutive_slow": 12,
  "last_notified_failure_at": null,
  "last_notified_slow_at": "2026-01-30 18:00:00"
}
```

Interpretation: Probe is reachable but responding slowly. Alert was sent at 18:00.

To reset state for a specific probe (e.g., after fixing an issue):

Edit state.json and set the probe's fields to:

```json
"consecutive_failures": 0,
"consecutive_slow": 0,
"last_notified_failure_at": null,
"last_notified_slow_at": null
```
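
The same reset can be scripted. This is a minimal sketch assuming the state file layout shown above; `reset_probe_state` is a hypothetical helper, not a built-in Uplinkr command, and any JSON-aware tool works equally well:

```python
import json

def reset_probe_state(state_path: str, probe_key: str) -> None:
    """Clear failure/slow counters and cooldown timestamps for one probe."""
    with open(state_path) as f:
        state = json.load(f)
    # probe_key is the "<METHOD> <URL>" index, e.g. "GET https://example.com/health".
    probe = state["probes"][probe_key]
    probe.update({
        "consecutive_failures": 0,
        "consecutive_slow": 0,
        "last_notified_failure_at": None,
        "last_notified_slow_at": None,
    })
    with open(state_path, "w") as f:
        json.dump(state, f, indent=2)
```

Only the targeted probe entry is touched; other probes and the project-level fields are written back unchanged.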

To force immediate re-alerting (bypass cooldown):

Set the notification timestamps to null:

```json
"last_notified_failure_at": null,
"last_notified_slow_at": null
```

The state file works in conjunction with the alert configuration in settings.json:

```json
{
  "trigger_after_failures": 20,
  "cooldown_minutes": 120,
  "latency_threshold_ms": 1000,
  "trigger_after_slow": 10
}
```
  • consecutive_failures compared against trigger_after_failures
  • last_notified_failure_at checked against cooldown_minutes
  • Response time compared against latency_threshold_ms
  • consecutive_slow compared against trigger_after_slow
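
These comparisons can be sketched as one evaluation step. Illustrative only — `evaluate_probe` is a hypothetical name, the cooldown comparison against `last_notified_failure_at` is omitted here for brevity, and (as noted above) slow-response results are tracked but not currently dispatched:

```python
def evaluate_probe(state: dict, settings: dict, response_time_ms: float) -> dict:
    """Compare one probe's state and latest response time against the
    alert settings from settings.json."""
    return {
        # consecutive_failures vs trigger_after_failures
        "failure_threshold_reached":
            state["consecutive_failures"] >= settings["trigger_after_failures"],
        # latest response time vs latency_threshold_ms
        "response_is_slow":
            response_time_ms > settings["latency_threshold_ms"],
        # consecutive_slow vs trigger_after_slow
        "slow_threshold_reached":
            state.get("consecutive_slow", 0) >= settings["trigger_after_slow"],
    }
```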

State pattern:

```json
{
  "consecutive_failures": 0,
  "total_failures": 150
}
```

Analysis: Service is currently healthy but has experienced many failures over time. Consider investigating root cause of instability.

State pattern:

```json
{
  "consecutive_failures": 200,
  "last_notified_failure_at": "2026-01-30 12:00:00"
}
```

Analysis: Service has been down for 200 consecutive checks. Alert was sent but cooldown prevents spam. Urgent attention needed.

State pattern:

```json
{
  "consecutive_failures": 0,
  "consecutive_slow": 50,
  "last_notified_slow_at": "2026-01-30 15:00:00"
}
```

Analysis: Service is reachable but performance is degraded. May indicate resource constraints or scaling issues.

Regularly check state files for:

  • High consecutive_failures counts
  • High consecutive_slow counts
  • Probes that never succeed (total_failures keeps growing)
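
This routine check can be automated. A sketch under the state schema above — `find_unhealthy_probes` and the default limits are assumptions, not part of Uplinkr:

```python
def find_unhealthy_probes(state: dict, failure_limit: int = 10,
                          slow_limit: int = 10) -> list:
    """Return probe keys whose failure or slow streaks exceed the limits."""
    flagged = []
    for probe_key, probe in state["probes"].items():
        if (probe.get("consecutive_failures", 0) >= failure_limit
                or probe.get("consecutive_slow", 0) >= slow_limit):
            flagged.append(probe_key)
    return flagged
```

Running this over each project's state.json surfaces probes worth investigating before adjusting thresholds.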

Adjust thresholds in settings.json based on state patterns:

  • If getting too many alerts, increase trigger_after_failures
  • If alerts repeat too often, increase cooldown_minutes
  • If missing slow responses, decrease latency_threshold_ms

Include state.json in backups to preserve:

  • Historical notification timestamps
  • Failure count trends
  • Current alert cooldown states