Many IoT platforms still store online state as a single field such as online = true/false. That looks convenient until the system has to support different device classes, unstable networks, command delivery, fleet search, and operations alerts. At that point one field starts answering four different questions at once:
- is there a network session right now
- is the device still alive on schedule
- when did the platform last observe valid activity
- did the device go away cleanly or through an abnormal disconnect
The core conclusion of this article is: device online state should not be modeled as a single field. A safer design combines four distinct signals: Connectivity for session presence, Heartbeat for expected liveness, Last Seen for the most recent valid activity, and LWT or an equivalent disconnect event for abnormal break detection. The state that operations and alarms use should be a derived judgment built from those signals, not any one of them alone.
Once those concerns are collapsed into a single value, the platform usually creates the same failures:
- the session is gone but the UI still shows the device as online
- heartbeat is missing while sporadic telemetry makes the state flap
- MQTT disconnects are treated as generic silence because no abnormal signal is captured
- low-power devices are misclassified as unstable because they were never meant to be always online
Definition Block
In this article, **device online state** does not mean broker connectivity alone and does not mean the latest timestamp in a table. It means the platform's current operational judgment about whether the device can be treated as communicable, observable, and dependable right now.
Decision Block
If your system supports alarms, fleet search, command delivery, operations triage, or SLA reporting, do not expose a raw boolean as the source of truth for device status. Store the underlying signals separately and derive operational states such as `online`, `suspect_offline`, `offline`, and `stale` from them. Otherwise network state, device liveness, and data freshness will be mixed into one misleading concept.
1. Why online state is not a single field
1.1 Different layers are asking different questions
The same word online means different things depending on who asks:
- the connectivity layer cares whether an MQTT, TCP, WebSocket, or cellular session exists
- the device platform cares whether the device is still alive on the expected schedule
- the operations console cares whether commands are likely to work now
- the business layer cares whether the reported state is trustworthy enough for alerts or automation
If the platform keeps only one online field, those layers are forced to share one answer even though they are not asking the same question.
1.2 A usable status model needs object, condition, and consequence
A meaningful state judgment should always clarify:
- object: session presence, device liveness, or data freshness
- condition: based on heartbeat timeout, disconnect event, lack of data, or LWT
- consequence: affects dashboard display, alarms, command routing, or ticket escalation
Without those dimensions, online state becomes whatever the last component happened to write.
2. What the four signals actually do
| Signal | What it answers | Common source | Limitation |
|---|---|---|---|
| Connectivity | Is there an active session now | MQTT session, TCP link, cellular PDP, WebSocket | Session presence does not prove application liveness |
| Heartbeat | Is the device alive on the expected cadence | periodic ping, app-level keepalive, state report | Poor cadence design creates false alarms |
| Last Seen | When did the platform last observe valid activity | telemetry, ACK, event, heartbeat | It shows recent observation, not guaranteed current availability |
| LWT | Did the session break abnormally | MQTT LWT, broker disconnect event, session loss | Only available on some transports and cannot replace liveness logic |
These are layered signals, not substitutes:
- `Connectivity` answers whether the session still exists
- `Heartbeat` answers whether the device is behaving alive on schedule
- `Last Seen` answers when the platform most recently observed activity
- `LWT` adds evidence about abnormal disconnect behavior
```mermaid
flowchart LR
T("Telemetry / ACK / Heartbeat"):::green --> LS("Last Seen"):::blue
S("Connect / Disconnect Events"):::orange --> C("Connectivity"):::blue
H("Application Liveness"):::violet --> HB("Heartbeat"):::blue
L("Abnormal Disconnect Signal"):::red --> LWT("LWT / Session Lost"):::blue
LS --> G("State Aggregator"):::slate
C --> G
HB --> G
LWT --> G
G --> O("Derived State\nonline / suspect / offline / stale"):::amber
classDef blue fill:#EAF4FF,stroke:#2563EB,color:#16324F,stroke-width:2px;
classDef green fill:#ECFDF3,stroke:#16A34A,color:#14532D,stroke-width:2px;
classDef orange fill:#FFF7ED,stroke:#EA580C,color:#7C2D12,stroke-width:2px;
classDef violet fill:#F5F3FF,stroke:#7C3AED,color:#4C1D95,stroke-width:2px;
classDef red fill:#FEF2F2,stroke:#DC2626,color:#7F1D1D,stroke-width:2px;
classDef slate fill:#F8FAFC,stroke:#64748B,color:#1F2937,stroke-width:2px;
classDef amber fill:#FFFBEB,stroke:#D97706,color:#78350F,stroke-width:2px;
```

3. How to combine those signals into a stable model
3.1 Store raw signals first, derive status second
A practical minimum field set looks like this:
- `connectivity_state`
- `heartbeat_at`
- `last_seen_at`
- `disconnect_reason`
- `last_lwt_at`
- `derived_online_state`
- `derived_state_reason`
The first five are observations. The last two are platform judgments.
That split matters because thresholds, device classes, and alert rules always change over time. If raw observations are lost, later tuning turns into guesswork.
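As a minimal sketch, that split can be captured in a single record. The field names follow the list above; the types, defaults, and the `DeviceStatus` name itself are assumptions for illustration:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DeviceStatus:
    # Raw observations: written by ingest paths, never rewritten by policy.
    connectivity_state: str = "unknown"        # "connected" | "disconnected" | "unknown"
    heartbeat_at: Optional[datetime] = None    # last application-level heartbeat
    last_seen_at: Optional[datetime] = None    # last valid activity of any kind
    disconnect_reason: Optional[str] = None    # e.g. "clean", "lwt", "timeout"
    last_lwt_at: Optional[datetime] = None     # last abnormal-disconnect signal

    # Derived judgments: recomputed whenever observations or policy change.
    derived_online_state: str = "unknown"      # online / suspect_offline / offline / stale
    derived_state_reason: str = ""             # why the judgment was made
```

Keeping the two groups in one record but treating them differently (observations are append-only facts, judgments are recomputable) is what makes later threshold tuning possible.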
3.2 Use Connectivity for session state, not for full liveness
Connectivity should capture events such as:
- MQTT client connected or disconnected
- TCP session established or closed
- WebSocket connected or closed
- cellular link up or down
It is useful for:
- realtime command eligibility
- current session counts
- broker or gateway connectivity alarms
It should not alone define whether the device is operationally online. A session can exist while the application loop is frozen, the sensor task is dead, or the device is stuck in a degraded mode.
3.3 Use Heartbeat for liveness, not for transport
Heartbeat should be an application-level choice made per device class. A safer pattern is:
- set cadence per device type instead of one global interval
- include lightweight runtime context such as `device_time`, `boot_id`, or `firmware_version`
- evaluate timeout as `expected interval x tolerance factor` instead of one hardcoded value
Examples:
- mains-powered devices may use a 60-second heartbeat and enter suspicion after 3 missed periods
- battery devices may use a 15-minute or 1-hour heartbeat and should not share the same threshold
- low-bandwidth or satellite devices may rely on business activity rather than constant keepalive
The key judgment is this: heartbeat exists to model liveness correctly, not to force all devices into the same rhythm.
3.4 Use Last Seen as observation freshness, not as proof of presence
Last Seen is valuable because nearly any valid activity can refresh it:
- heartbeat
- telemetry
- event report
- command acknowledgement
- configuration reply
It is especially useful for:
- operations triage
- identifying silent devices over a time range
- correlating data freshness with connectivity alarms
But it cannot answer whether the device is online now. A device that sent a temperature reading 20 minutes ago may already have lost its session and should not be treated as currently available.
3.5 Use LWT to preserve abnormal disconnect semantics
The value of LWT is not that it replaces heartbeat. Its value is that it can mark a disconnect as abnormal immediately. That changes:
- alert severity
- retry behavior
- session cleanup decisions
- operator interpretation of the incident
But LWT is only one signal:
- it depends on protocol support
- it does not cover every network path
- it cannot tell whether the device is still logically alive but temporarily unreachable
So LWT should be treated as evidence, not as the entire online model.
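On the platform side, handling an LWT (or broker "session lost") message can be as small as recording the evidence; the device side registers the will at connect time (for example via an MQTT client's will-registration call such as paho-mqtt's `will_set`). The handler below is a sketch under those assumptions; note that it deliberately does not flip the derived state by itself:

```python
from datetime import datetime, timezone

def on_lwt(record: dict, observed_at: datetime) -> None:
    """Preserve LWT / session-lost evidence for one device.

    This only records the abnormal break. It does not decide offline on its
    own, because the device may be logically alive but temporarily unreachable;
    the aggregator weighs this evidence against heartbeat and Last Seen.
    """
    record["last_lwt_at"] = observed_at
    record["disconnect_reason"] = "lwt"
    record["connectivity_state"] = "disconnected"

rec = {"connectivity_state": "connected"}
on_lwt(rec, datetime(2024, 1, 1, tzinfo=timezone.utc))
```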
4. A practical derived state machine
A useful minimum derived state set is:
- `online`: session and liveness are within policy
- `suspect_offline`: one or more signals are drifting but the device is not yet confirmed offline
- `offline`: disconnect or timeout evidence has crossed the hard threshold
- `stale`: the device is quiet for a long time by design and should not be treated as a realtime participant
```mermaid
stateDiagram-v2
[*] --> online
online --> suspect_offline: missed heartbeat window\nor unstable connectivity
online --> offline: LWT triggered\nor explicit disconnect + timeout
suspect_offline --> online: heartbeat recovered\nor session restored
suspect_offline --> offline: timeout exceeded
offline --> online: new session + fresh heartbeat
offline --> stale: expected low-frequency silence
stale --> online: new valid activity
```

Different consequences should attach to different derived states:
- `suspect_offline` is usually a warning or yellow status
- `offline` is where hard alarms, command suppression, and SLA effects should happen
- `stale` belongs to a separate low-frequency view rather than the same failure bucket
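The transitions above can be evaluated as a pure function over the raw signals. The thresholds here are placeholder globals for readability; a real system would look them up per device class (as in section 3.3), and the function name and reason strings are assumptions:

```python
from datetime import datetime, timedelta, timezone

SUSPECT_AFTER = timedelta(minutes=3)    # assumed soft threshold
OFFLINE_AFTER = timedelta(minutes=10)   # assumed hard threshold
STALE_AFTER   = timedelta(days=1)       # assumed low-frequency boundary

def derive_state(rec: dict, now: datetime) -> tuple[str, str]:
    """Combine raw signals into (derived_online_state, derived_state_reason).

    Expects the connect handler to clear disconnect_reason on reconnect,
    otherwise stale LWT evidence would pin the device offline.
    """
    # Hard evidence first: an abnormal break dominates.
    if rec.get("disconnect_reason") == "lwt":
        return "offline", "lwt_triggered"
    last_seen = rec.get("last_seen_at")
    if last_seen is None:
        return "offline", "no_activity_observed"
    silence = now - last_seen
    if silence > STALE_AFTER:
        return "stale", "long_silence_by_design"
    if silence > OFFLINE_AFTER:
        return "offline", "activity_timeout"
    # Soft drift: session or liveness is slipping but nothing is confirmed.
    if silence > SUSPECT_AFTER or rec.get("connectivity_state") != "connected":
        return "suspect_offline", "heartbeat_or_connectivity_drift"
    return "online", "session_and_liveness_within_policy"
```

Returning the reason alongside the state is what later lets an operator see why a device was marked offline instead of just that it was.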
5. The most common modeling mistakes
5.1 Treating broker connectivity as device health
If a gateway fronts many child devices, broker connectivity may only prove that the upstream tunnel exists. It does not prove that each child device is alive.
5.2 Treating every message as heartbeat
Some devices report only on exception. Some messages are batched or replayed. If every activity is treated as heartbeat, the platform mistakes delayed data for current health.
5.3 Using one timeout for the entire fleet
This is the fastest way to create noisy alarms. Device power mode, network type, reporting strategy, cost constraints, and business criticality vary too much for one global threshold to stay credible.
5.4 Storing the state without the reason
An operator needs to know whether the device was marked offline because:
- the connection dropped
- heartbeat timed out
- LWT fired
- last activity went stale
- the device is expected to be low frequency
That is why derived_state_reason matters as much as the derived state itself.
6. When you can keep it simpler
You can simplify if:
- the fleet is very small
- there is one device type and one stable transport
- online state is only a convenience label
- alarms, search, and command routing do not depend on it
You should not simplify once the system needs to:
- find offline devices in bulk
- distinguish brief jitter from real failure
- set different thresholds per device class
- explain why commands failed
- connect status with alarms and tickets
At that point a single-field model usually costs more later than building the correct state model now.
7. A practical implementation checklist
If you are rebuilding online state, start with these five moves:
- store `connectivity`, `heartbeat`, `last_seen`, and `lwt` separately
- configure timeout policy per device class rather than globally
- expose `derived_online_state` plus a clear reason field
- let fleet search filter by both derived state and raw timestamps
- make command routing aware of derived state, but do not let the command system reuse one raw boolean as truth
The final judgment is: the most reliable online state in IoT is not a field that somebody last wrote to true. It is a derived model that can explain the signal source, the timing rule, and the operational consequence. Heartbeat, Connectivity, Last Seen, and LWT all matter, but they should never impersonate one another.