From 00bee40bd15c06bb32750bcb6cee9e6c0c386700 Mon Sep 17 00:00:00 2001 From: Lewis Diamond Date: Mon, 19 May 2025 21:53:46 +0000 Subject: [PATCH] Update iris --- iris.md | 165 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 160 insertions(+), 5 deletions(-) diff --git a/iris.md b/iris.md index fa1d127..28f8069 100644 --- a/iris.md +++ b/iris.md @@ -1,6 +1,161 @@ +# Embedded software update system + +Iris' software is shipped as part of 3rd party firmwares. The lack of control +over the firmware upgrade schedule and flow greatly limits its ability to +quickly iterate and ship new features. This document proposes an architecture to +decouple Iris' application updates from full firmware upgrades, enabling fast, +safe, and continuous delivery. + +## Assumptions + +1. Devices run a Linux-based OS +2. Applications have read-write access to non-volatile storage +3. A fixed entry-point is launched and supervised by the host system +4. Applications have no access to the host storage outside of its sandbox +5. Applications have restricted privilege and cannot escalate + +## Architecture overview + +The goal of this design is to limit the impact of embedded device's firmware +static nature. Firmwares often are a complete OS image meant to be written to +the internal flash memory and mounted as read-only file systems. This provides +a level of safety by providing a reset mechanism to clear all writable memory +(volatile and non-volatile) and recover from errors. However, firmware updates +tend to be very involved processes and limit the ability to provide automatic +updates. In order to avoid this problem, we can create an out-of-band update +mechanism specific to Iris' software. For this, we need write access to +non-volatile[1] memory in order to store the application and a supervisor +process (like s6-supervise) to ensure it always runs. 
+ +Besides the application itself, the supervisor will run an update manager +process responsible for local updates as well as health monitoring in order to +trigger rollbacks in case of issues. + ```mermaid -graph TD; - A(stuff)-->B[one]; - A-->C[two]; - A-->D[three]; -``` \ No newline at end of file +block-beta + columns 9 + block:ApplicationBlock:9 + up["update daemon"] + space + application + end + block:Storage:3 + columns 1 + storageA["Active version"] + storageB["Fallback version"] + end + space:3 + supervisor:3 + block:host:9 + columns 1 + runtime + OS + end + supervisor --"runs"--> up + supervisor --"runs"--> application + up -- "monitors" --> application + supervisor --"reads"--> Storage + up --"updates"--> Storage +``` + +### Update daemon + +The update daemon will run as a separate process and will periodically query the +update service to see if a newer version is available. If an +update is available, it will fetch the data, validate its integrity via a +cryptographic signature and write it to local storage. In order to prevent +faulty updates from breaking the application, at least one other version of the +software is kept. The update daemon will monitor the health of the application +and revert updates that result in failed health checks. This can be achieved in +different ways depending on the system features, all of which essentially boil down to +adding a level of indirection on top of the application. For example, symbolic +links or an overlay file system can be used. The update daemon would change the +target of a symlink to update or revert. + +### Update service + +The update service will implement incremental roll-outs in order to limit the +blast radius of updates. This can be done by using a unique identifier for +devices. This identifier can be sent to the service as part of the request for +update data such that a tailored response can be returned. 
By hashing the device +id and update information (for example the software version), devices can be +placed in buckets and the service can deterministically target a subset of the +fleet. Multiple requests for update data by the same device must return the same +response at a given point in the deployment. Such a roll-out can be managed using +readily available open source CI/CD software. + +#### Roll-out + +Industry-standard practices should be followed: pre-production +environments run automated tests and canary devices provide an early signal. +Customers should also have the option to designate devices as "non-production" +such that they can be selected before production ones. + +Update timing, that is, when the application should be restarted, depends on the +nature of the application itself and how it affects the functionality of the +overall system. Application metrics can be used to determine the device's status +and decide when a restart can be triggered. Additional information +such as user-provided update schedules can help limit negative impact. + +If bandwidth limits are problematic, local devices can be leveraged as caches +and serve the update payload to neighbors. Also, diff-based updates can be +leveraged to further reduce the amount of data transferred. + +### Update monitoring + +While rolling out updates, it's important to closely monitor the fleet and stop +the roll-out as soon as issues are detected. The application should be +instrumented (using OpenTelemetry for example). The metrics will be used by the +update daemon to initiate local rollbacks in case of unhealthy signals as well +as by the backend service for fleet-wide rollbacks. + +# Stereo-vision and device-specific calibration + +Note: The problems listed in this section are not part of my expertise. Consider +this a preliminary assessment based on limited research. + +## Assumptions + +1. 
No hardware is available to assist (GPS, laser, etc.) and only video data can + be used +2. In the field, things like checkerboards can't be used +3. The zoom+distance to focus mapping function is linear +4. Variance between units of a model is limited + +## Per-model calibration + +First, assuming samples are available, model-specific focus mappings can be +built for supported cameras. Within a controlled environment where different +easy-to-detect objects are placed at known distances, we can then sample a range +of values for zoom and focus. The correctness of the focus can be assessed +automatically using tools like OpenCV, which supports different techniques such as the +Brenner gradient, the Laplace operator, etc. + +From the set of points gathered, we can use a linear regression to build +a function with the zoom + distance as an input and the correct focus level as +the output. + +## Stereo vision + +In order to leverage stereo vision, the relative position of cameras must be +obtained. OpenCV includes several [tools][2] for this purpose. While this is +outside of my field of competence, my assessment would be that feature +detection such as ORB could be leveraged to allow multiple cameras to point +towards the same target without using a checkerboard, and the essential matrix +and pose can then be used to perform triangulation. + +### Follow up + +Other techniques for positioning a camera in 3D space exist and address +a related problem space: Simultaneous Localization and Mapping / Structure from +Motion. For example, bundle adjustment is a commonly used approach. This +applies only to moving sensors (which can be cameras). Could a variation of +such a solution be applicable to PTZ? + +# Links + +[1]: Support for volatile storage only would be possible if the local network is +always assumed to include an instance of the update service or if the offline +mode is not necessary. + +[2]: OpenCV 3d calibration