fix(mcu): harden checkSystemHealth() watchdog against cold-start + stale-ts

checkSystemHealth()'s internal watchdog (pre-fix step 9) had two linked
defects that, combined with the previous commit's escalation of
ERROR_WATCHDOG_TIMEOUT to Emergency_Stop(), would false-latch AERIS-10:

  1. Cold-start false trip:
       static uint32_t last_health_check = 0;
       if (HAL_GetTick() - last_health_check > 60000) { trip; }
     On the first call, last_health_check == 0, so the subtraction
     against a seeded-zero sentinel exceeds 60 000 ms as soon as the MCU
     has been up >60 s -- normal after the ADAR1000 / AD9523 / ADF4382
     init sequence -- and the watchdog trips spuriously.

  2. Stale timestamp after early returns:
       last_health_check = HAL_GetTick();   // at END of function
     Every earlier sub-check (IMU, BMP180, GPS, PA Idq, temperature) has
     an `if (fault) return current_error;` path that skips the update.
     After ~60 s of transient faults, the next clean call compares
     against a long-stale last_health_check and trips.

With ERROR_WATCHDOG_TIMEOUT now escalating to Emergency_Stop(), either
failure mode would cut the RF rails on a perfectly healthy system.

Fix: move the watchdog check to function ENTRY. A dedicated cold-start
branch seeds the timestamp on the first call without checking. On every
subsequent call, the elapsed delta is captured first and
last_health_check is updated BEFORE any sub-check runs, so early returns
no longer leave a stale value. 32-bit tick-wrap semantics are preserved
because the subtraction remains on uint32_t.

Add test_gap3_health_watchdog_cold_start.c covering cold-start, paced
main-loop, stall detection, boundary (exactly 60 000 ms), recovery
after trip, and 32-bit HAL_GetTick() wrap -- wired into tests/Makefile
alongside the existing gap-3 safety tests.
This commit is contained in:
3aLaee
2026-04-15 20:35:49 +02:00
parent 0b25db08b5
commit 35539ea934
3 changed files with 163 additions and 10 deletions
@@ -65,7 +65,8 @@ TESTS_STANDALONE := test_bug12_pa_cal_loop_inverted \
test_gap3_temperature_max \
test_gap3_idq_periodic_reread \
test_gap3_emergency_state_ordering \
test_gap3_overtemp_emergency_stop
test_gap3_overtemp_emergency_stop \
test_gap3_health_watchdog_cold_start
# Tests that need platform_noos_stm32.o + mocks
TESTS_WITH_PLATFORM := test_bug11_platform_spi_transmit_only
@@ -78,7 +79,7 @@ ALL_TESTS := $(TESTS_WITH_REAL) $(TESTS_MOCK_ONLY) $(TESTS_STANDALONE) $(TESTS_W
.PHONY: all build test clean \
$(addprefix test_,bug1 bug2 bug3 bug4 bug5 bug6 bug7 bug8 bug9 bug10 bug11 bug12 bug13 bug14 bug15) \
test_gap3_estop test_gap3_iwdg test_gap3_temp test_gap3_idq test_gap3_order \
test_gap3_overtemp
test_gap3_overtemp test_gap3_wdog
all: build test
@@ -167,6 +168,9 @@ test_gap3_emergency_state_ordering: test_gap3_emergency_state_ordering.c
test_gap3_overtemp_emergency_stop: test_gap3_overtemp_emergency_stop.c
$(CC) $(CFLAGS) $< -o $@
test_gap3_health_watchdog_cold_start: test_gap3_health_watchdog_cold_start.c
$(CC) $(CFLAGS) $< -o $@
# Tests that need platform_noos_stm32.o + mocks
$(TESTS_WITH_PLATFORM): %: %.c $(MOCK_OBJS) $(PLATFORM_OBJ)
$(CC) $(CFLAGS) $(INCLUDES) $< $(MOCK_OBJS) $(PLATFORM_OBJ) -o $@
@@ -254,6 +258,9 @@ test_gap3_order: test_gap3_emergency_state_ordering
test_gap3_overtemp: test_gap3_overtemp_emergency_stop
./test_gap3_overtemp_emergency_stop
test_gap3_wdog: test_gap3_health_watchdog_cold_start
./test_gap3_health_watchdog_cold_start
# --- Clean ---
clean: