From 00bee40bd15c06bb32750bcb6cee9e6c0c386700 Mon Sep 17 00:00:00 2001
From: Lewis Diamond <git@lewisdiamond.com>
Date: Mon, 19 May 2025 21:53:46 +0000
Subject: [PATCH] Update iris

---
 iris.md | 165 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 160 insertions(+), 5 deletions(-)

diff --git a/iris.md b/iris.md
index fa1d127..28f8069 100644
--- a/iris.md
+++ b/iris.md
@@ -1,6 +1,161 @@
+# Embedded software update system
+
+Iris' software ships as part of third-party firmware. The lack of control over
+the firmware upgrade schedule and flow greatly limits Iris' ability to iterate
+quickly and ship new features. This document proposes an architecture to
+decouple Iris' application updates from full firmware upgrades, enabling fast,
+safe, and continuous delivery.
+
+## Assumptions
+
+1. Devices run a Linux-based OS
+2. Applications have read-write access to non-volatile storage
+3. A fixed entry-point is launched and supervised by the host system
+4. Applications have no access to host storage outside of their sandbox
+5. Applications have restricted privileges and cannot escalate them
+
+## Architecture overview
+
+The goal of this design is to limit the impact of the static nature of embedded
+device firmware. Firmware is often a complete OS image meant to be written to
+the internal flash memory and mounted as a read-only file system. This provides
+a level of safety by offering a reset mechanism that clears all writable memory
+(volatile and non-volatile) to recover from errors. However, firmware updates
+tend to be very involved processes, which limits the ability to provide
+automatic updates. To avoid this problem, we can create an out-of-band update
+mechanism specific to Iris' software. For this, we need write access to
+non-volatile[1] memory in order to store the application, and a supervisor
+process (like s6-supervise) to ensure it always runs.
+
+Besides the application itself, the supervisor will run an update manager
+process responsible for local updates as well as health monitoring in order to
+trigger rollbacks in case of issues.
+
 ```mermaid
-graph TD;
-    A(stuff)-->B[one];
-    A-->C[two];
-    A-->D[three];
-```
\ No newline at end of file
+block-beta
+  columns 9
+  block:ApplicationBlock:9
+    up["update daemon"]
+    space
+    application
+  end
+  block:Storage:3
+    columns 1
+    storageA["Active version"]
+    storageB["Fallback version"]
+  end
+  space:3
+  supervisor:3 
+  block:host:9
+    columns 1
+    runtime
+    OS
+  end
+  supervisor --"runs"--> up
+  supervisor --"runs"--> application
+  up -- "monitors" --> application
+  supervisor --"reads"--> Storage
+  up --"updates"--> Storage
+```
+
+### Update daemon
+
+The update daemon will run as a separate process and will periodically query the
+update service to see if a newer version is available. If an update is
+available, it will fetch the data, validate its integrity via a cryptographic
+signature, and write it to local storage. In order to prevent faulty updates
+from breaking the application, at least one other version of the software is
+kept. The update daemon will monitor the health of the application and revert
+updates that result in failed health checks. This can be achieved in different
+ways depending on the system's features, but they all essentially boil down to
+adding a level of indirection on top of the application. For example, symbolic
+links or an overlay file system can be used. With symbolic links, the update
+daemon would change the target of a link to update or revert.
+
+### Update service
+
+The update service will implement incremental roll-outs in order to limit the
+blast radius of updates. This can be done by using a unique identifier for
+devices. This identifier can be sent to the service as part of the request for
+update data such that a tailored response can be returned. By hashing the device
+id and update information (for example the software version), devices can be
+placed in buckets and the service can deterministically target a subset of the
+fleet. Multiple requests for update data by the same device must return the same
+response for a given stage of the deployment. Such a roll-out can be managed
+using readily available open source CI/CD software.
+
+#### Roll-out
+
+Standard industry practice should be followed: pre-production environments run
+automated tests and canary devices provide an early signal. Customers should
+also have the option to designate devices as "non-production" such that they
+can be selected for updates before production ones.
+
+Update timing, that is, when the application should be restarted, depends on the
+nature of the application itself and how it affects the functionality of the
+overall system. Application metrics can be used to determine the device's status
+and make a decision on when a restart can be triggered. Additional information
+such as user-provided update schedules can help limit negative impact.
+
+If bandwidth limits are problematic, devices on the local network can act as
+caches and serve the update payload to their neighbors. Diff-based updates can
+also further reduce the amount of data transferred.
+
+### Update monitoring
+
+While rolling out updates, it's important to closely monitor the fleet and stop
+the roll-out as soon as issues are detected. The application should be
+instrumented (using OpenTelemetry for example). The metrics will be used by the
+update daemon to initiate local rollbacks in case of unhealthy signals as well
+as by the backend service for fleet-wide rollbacks.
+
+# Stereo-vision and device-specific calibration
+
+Note: The problems listed in this section are not part of my expertise. Consider
+this a preliminary assessment based on limited research.
+
+## Assumptions
+
+1. No hardware is available to assist (GPS, laser, etc.) and only video data can
+   be used
+2. In the field, things like checkerboards can't be used
+3. The mapping from zoom and distance to focus is linear
+4. Variance between units of a model is limited
+
+## Per-model calibration
+
+First, assuming samples are available, model-specific focus mappings can be
+built for supported cameras. Within a controlled environment where different
+easy-to-detect objects are placed at known distances, we can sample a range of
+values for zoom and focus. The correctness of the focus can be assessed
+automatically with sharpness measures such as the Brenner gradient or the
+variance of the Laplacian, which can be computed with tools like OpenCV.
+
+From the set of points gathered, we can use a linear regression to build
+a function with zoom and distance as inputs and the correct focus level as
+the output.
+
+## Stereo vision
+
+In order to leverage stereo vision, the relative position of the cameras must be
+obtained. OpenCV includes several [tools][2] for this purpose. While this is
+outside my area of expertise, my assessment is that feature detection and
+matching (with ORB, for example) could stand in for a checkerboard when multiple
+cameras point towards the same target. The essential matrix and the relative
+pose recovered from it can then be used to perform triangulation.
+
+### Follow up
+
+Other techniques for positioning a camera in 3D space exist in a related
+problem space: Simultaneous Localization and Mapping (SLAM) and Structure from
+Motion (SfM). For example, bundle adjustment is a commonly used approach. These
+techniques apply only to moving sensors (which can be cameras). Could
+a variation of such a solution be applicable to PTZ cameras?
+
+# Links
+
+[1]: Supporting volatile storage only would be possible if the local network is
+always assumed to include an instance of the update service or if offline
+mode is not necessary.
+
+[2]: https://docs.opencv.org/4.x/d9/d0c/group__calib3d.html "OpenCV 3d calibration"