Update iris

2025-05-19 21:53:46 +00:00
parent 8ea860789f
commit 00bee40bd1
1 changed files with 160 additions and 5 deletions
--- a/iris.md
+++ b/iris.md
@@ -1,6 +1,161 @@
 # Embedded software update system
 Iris' software ships as part of third-party firmware. Because Iris does not
 control the firmware upgrade schedule or flow, its ability to iterate quickly
 and ship new features is severely limited. This document proposes an
 architecture that decouples Iris' application updates from full firmware
 upgrades, enabling fast, safe, and continuous delivery.
 ## Assumptions
 1. Devices run a Linux-based OS
 2. Applications have read-write access to non-volatile storage
 3. A fixed entry-point is launched and supervised by the host system
 4. Applications have no access to the host storage outside of their sandbox
 5. Applications have restricted privileges and cannot escalate them
 ## Architecture overview
 The goal of this design is to limit the impact of the static nature of embedded
 device firmware. Firmware is often a complete OS image meant to be written to
 the internal flash memory and mounted as a read-only file system. This provides
 a level of safety by offering a reset mechanism that clears all writable memory
 (volatile and non-volatile) to recover from errors. However, firmware updates
 tend to be very involved processes, which limits the ability to provide
 automatic updates. To avoid this problem, we can create an out-of-band update
 mechanism specific to Iris' software. This requires write access to
 non-volatile[1] memory to store the application, and a supervisor
 process (like s6-supervise) to ensure it always runs.
 Besides the application itself, the supervisor will run an update daemon
 responsible for local updates as well as for health monitoring, so that it can
 trigger rollbacks in case of issues.
 ```mermaid
-graph TD;
+block-beta
-    A(stuff)-->B[one];
+  columns 9
-    A-->C[two];
+  block:ApplicationBlock:9
-    A-->D[three];
+    up["update daemon"]
    space
    application
  end
  block:Storage:3
    columns 1
    storageA["Active version"]
    storageB["Fallback version"]
  end
  space:3
  supervisor:3 
  block:host:9
    columns 1
    runtime
    OS
  end
  supervisor --"runs"--> up
  supervisor --"runs"--> application
  up -- "monitors" --> application
  supervisor --"reads"--> Storage
  up --"updates"--> Storage
 ```
 ### Update daemon
 The update daemon will run as a separate process and will periodically query the
 update service to see if a newer version is available. If an update is
 available, it will fetch the data, validate its integrity via a cryptographic
 signature and write it to local storage. To prevent faulty updates from
 breaking the application, at least one other version of the software is kept.
 The update daemon will monitor the health of the application and revert updates
 that result in failed health checks. The mechanism depends on the features of
 the system, but it essentially boils down to adding a level of indirection on
 top of the application. For example, symbolic links or an overlay file system
 can be used: the update daemon would change the target of a symlink to update
 or revert.
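 The symlink variant could be sketched as follows. This is a minimal sketch,
 not a definitive implementation: the link names `current` and `fallback`, the
 helper names, and the per-version directory layout are assumptions made for
 illustration, and signature validation is left out.

```python
import os


def activate(link_path: str, version_dir: str) -> None:
    """Atomically point link_path at version_dir."""
    tmp_link = link_path + ".tmp"
    # Remove a stale temporary link left behind by an interrupted update.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(version_dir, tmp_link)
    # On POSIX systems, renaming over the old link replaces it in one step,
    # so the supervisor never observes a missing entry point.
    os.replace(tmp_link, link_path)


def update(root: str, new_version: str) -> None:
    """Make new_version active and keep the previous one as the fallback."""
    current = os.path.join(root, "current")    # hypothetical link name
    fallback = os.path.join(root, "fallback")  # hypothetical link name
    if os.path.islink(current):
        activate(fallback, os.readlink(current))
    activate(current, new_version)


def rollback(root: str) -> None:
    """Point "current" back at the fallback version."""
    current = os.path.join(root, "current")
    fallback = os.path.join(root, "fallback")
    activate(current, os.readlink(fallback))
```

 With this layout the supervisor always launches the entry point through
 `current`, and a revert only rewrites one link instead of copying files.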
 ### Update service
 The update service will implement incremental roll-outs in order to limit the
 blast radius of updates. This can be done by using a unique identifier for
 devices. This identifier can be sent to the service as part of the request for
 update data such that a tailored response can be returned. By hashing the device
 ID and update information (for example the software version), devices can be
 placed in buckets and the service can deterministically target a subset of the
 fleet. Multiple requests for update data by the same device must return the same
 response for a given deployment progress. Such a roll-out can be managed using
 readily available open source CI/CD software.
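 The bucketing idea could look like the sketch below. The bucket count, the
 choice of SHA-256, and the function names are assumptions made for
 illustration; the design only requires that the mapping is deterministic.

```python
import hashlib

BUCKETS = 100  # assumption: roll-out progress is expressed in whole percent


def bucket(device_id: str, version: str) -> int:
    """Map a device and a version to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{device_id}:{version}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def is_targeted(device_id: str, version: str, rollout_percent: int) -> bool:
    """Return True if the device should receive `version` at this progress."""
    return bucket(device_id, version) < rollout_percent
```

 Because the version is part of the hash input, a device that lands in an early
 bucket for one release does not necessarily land in an early bucket for the
 next one. Raising `rollout_percent` only ever adds devices to the targeted set.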
 #### Roll-out
 Industry-standard practices should be followed: pre-production environments
 run automated tests and canary devices provide an early signal. Customers
 should also have the option to designate devices as "non-production" so that
 they are selected before production ones.
 Update timing, that is, when the application should be restarted, depends on
 the nature of the application itself and how it affects the functionality of
 the overall system. Application metrics can be used to determine the device's
 status and decide when a restart can be triggered. Additional information such
 as user-provided update schedules can help limit negative impact.
 If bandwidth limits are problematic, local devices can be leveraged as caches
 and serve the update payload to neighbors. Also, diff-based updates can be
 leveraged to further reduce the amount of data transferred.
 ### Update monitoring
 While rolling out updates, it's important to closely monitor the fleet and stop
 the roll-out as soon as issues are detected. The application should be
 instrumented (using OpenTelemetry for example). The metrics will be used by the
 update daemon to initiate local rollbacks in case of unhealthy signals as well
 as by the backend service for fleet-wide rollbacks.
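 Both rollback decisions could be sketched as simple threshold rules. The
 thresholds and names below are hypothetical; real values would have to come
 from the application's metrics.

```python
from dataclasses import dataclass


@dataclass
class RollbackPolicy:
    """Local rule: roll back after N consecutive failed health checks."""

    max_consecutive_failures: int = 3  # hypothetical threshold
    failures: int = 0

    def observe(self, healthy: bool) -> bool:
        """Record one health check; return True when a rollback is due."""
        self.failures = 0 if healthy else self.failures + 1
        return self.failures >= self.max_consecutive_failures


def should_halt_rollout(updated: int, unhealthy: int,
                        max_ratio: float = 0.05) -> bool:
    """Fleet-wide rule: halt when too many updated devices are unhealthy."""
    if updated == 0:
        return False
    return unhealthy / updated > max_ratio
```

 The update daemon would feed `RollbackPolicy.observe` from its local health
 checks, while the backend service would evaluate `should_halt_rollout` on
 aggregated metrics before widening the roll-out.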
 # Stereo-vision and device-specific calibration
 Note: The problems listed in this section are not part of my expertise. Consider
 this a preliminary assessment based on limited research.
 ## Assumptions
 1. No hardware is available to assist (GPS, laser, etc.) and only video data can
   be used
 2. In the field, calibration targets such as checkerboards can't be used
 3. The mapping from zoom and distance to focus is linear
 4. Variance between units of a model is limited
 ## Per-model calibration
 First, assuming samples are available, model-specific focus mappings can be
 built for supported cameras. In a controlled environment where easy-to-detect
 objects are placed at known distances, we can sample a range of zoom and focus
 values. The sharpness of each sample can be assessed automatically with focus
 measures such as the Brenner gradient or the Laplace operator, using tools
 like OpenCV.
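 To make the two focus measures concrete, here is a dependency-free sketch that
 operates on a grayscale image stored as a list of rows. It is for illustration
 only; in practice the same quantities would more likely be computed on real
 frames with an image library (for example `cv2.Laplacian` in OpenCV).

```python
from statistics import pvariance


def brenner(image: list[list[float]]) -> float:
    """Brenner gradient: sum of squared differences two pixels apart."""
    return sum(
        (row[x + 2] - row[x]) ** 2
        for row in image
        for x in range(len(row) - 2)
    )


def laplacian_variance(image: list[list[float]]) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels."""
    responses = [
        image[y - 1][x] + image[y + 1][x] + image[y][x - 1] + image[y][x + 1]
        - 4 * image[y][x]
        for y in range(1, len(image) - 1)
        for x in range(1, len(image[0]) - 1)
    ]
    return pvariance(responses)
```

 For both measures a higher score means a sharper image, so the best focus
 value for a given zoom and distance is the one that maximizes the score.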
 From the set of points gathered, we can use a linear regression to build
 a function that takes the zoom and distance as inputs and returns the correct
 focus level.
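 Under the linearity assumption, the regression amounts to a least-squares fit
 of a plane. The sketch below solves the normal equations directly without
 third-party dependencies; the function names and the sample layout
 `(zoom, distance, focus)` are assumptions made for illustration.

```python
def solve(a: list[list[float]], b: list[float]) -> list[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_focus(samples: list[tuple[float, float, float]]) -> list[float]:
    """Least-squares fit of focus = a * zoom + b * distance + c."""
    rows = [(zoom, dist, 1.0) for zoom, dist, _ in samples]
    targets = [focus for _, _, focus in samples]
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    return solve(xtx, xty)


def predict_focus(weights: list[float], zoom: float, distance: float) -> float:
    """Evaluate the fitted mapping for one zoom and distance."""
    a, b, c = weights
    return a * zoom + b * distance + c
```

 The samples must cover several distinct zoom and distance values, otherwise
 the system is singular. The residuals of the fit also give a quick check of
 whether the linearity assumption holds for a given camera model.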
 ## Stereo vision
 In order to leverage stereo vision, the relative position of the cameras must
 be obtained. OpenCV includes several [tools][2] for this purpose. While this is
 outside of my field of competence, my assessment is that feature detection
 such as ORB could be used to match points between cameras pointed at the same
 target, without a checkerboard. The essential matrix and relative pose
 estimated from these matches can then be used to perform triangulation.
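 For reference, the standard relation behind the essential matrix is given
 below. The notation is chosen here and is not part of the original proposal:
 $x_1$ and $x_2$ are the normalized image coordinates of the same scene point in
 the two cameras, and $R$, $t$ are the rotation and translation of the second
 camera relative to the first.

```latex
x_2^{\top} E \, x_1 = 0,
\qquad
E = [t]_{\times} R,
\qquad
[t]_{\times} =
\begin{pmatrix}
0 & -t_z & t_y \\
t_z & 0 & -t_x \\
-t_y & t_x & 0
\end{pmatrix}
```

 Decomposing $E$ yields $t$ only up to scale, so triangulated distances are
 relative unless one known length (for example the distance between the two
 cameras) is available to fix the scale.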
 ### Follow up
 Other techniques for positioning a camera in 3D space exist in a related
 problem space: Simultaneous Localization and Mapping (SLAM) and Structure from
 Motion (SfM). For example, bundle adjustment is a commonly used approach. These
 techniques apply only to moving sensors (which can be cameras). Could a
 variation of such a solution be applicable to PTZ cameras?
 # Links
 [1]: Support for volatile storage only would be possible if the local network is
 always assumed to include an instance of the update service or if the offline
 mode is not necessary.
 [2]: OpenCV camera calibration and 3D reconstruction (calib3d) <https://docs.opencv.org/4.x/d9/d0c/group__calib3d.html>