Update iris

2025-05-19 21:53:46 +00:00
parent 8ea860789f
commit 00bee40bd1

163
iris.md

@@ -1,6 +1,161 @@
# Embedded software update system
Iris' software is shipped as part of third-party firmware. The lack of control
over the firmware upgrade schedule and flow greatly limits how quickly Iris can
iterate and ship new features. This document proposes an architecture to
decouple Iris' application updates from full firmware upgrades, enabling fast,
safe, and continuous delivery.
## Assumptions
1. Devices run a Linux-based OS
2. Applications have read-write access to non-volatile storage
3. A fixed entry-point is launched and supervised by the host system
4. Applications have no access to the host storage outside of their sandbox
5. Applications have restricted privileges and cannot escalate them
## Architecture overview
The goal of this design is to limit the impact of the static nature of embedded
device firmware. Firmware is often a complete OS image written to the internal
flash memory and mounted as a read-only file system. This adds a level of
safety by providing a reset mechanism that clears all writable memory
(volatile and non-volatile) to recover from errors. However, firmware updates
tend to be very involved processes, which limits the ability to provide
automatic updates. To avoid this problem, we can create an out-of-band update
mechanism specific to Iris' software. This requires write access to
non-volatile[1] memory to store the application, and a supervisor process
(like s6-supervise) to ensure it always runs.
Besides the application itself, the supervisor will run an update daemon
responsible for local updates and for health monitoring, so that it can
trigger rollbacks when issues arise.
```mermaid
block-beta
columns 9
block:ApplicationBlock:9
up["update daemon"]
space
application
end
block:Storage:3
columns 1
storageA["Active version"]
storageB["Fallback version"]
end
space:3
supervisor:3
block:host:9
columns 1
runtime
OS
end
supervisor --"runs"--> up
supervisor --"runs"--> application
up -- "monitors" --> application
supervisor --"reads"--> Storage
up --"updates"--> Storage
```
### Update daemon
The update daemon will run as a separate process and will periodically query the
update service to see if a newer version is available. If an update is
available, it will fetch the data, validate its integrity via cryptographic
signature, and write it to local storage. To prevent faulty updates from
breaking the application, at least one other version of the software is kept.
The update daemon will monitor the health of the application and revert updates
that result in failed health checks. How this is achieved depends on the
features of the system, but it essentially boils down to adding a level of
indirection on top of the application. For example, symbolic links or an
overlay file system can be used: the update daemon would change the target of
a symlink to update or revert.
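
The symlink indirection could look like the following minimal sketch. The
directory layout (`versions/<version>` next to a `current` link) and the
function names are assumptions made for illustration, not part of the proposed
design. The new link is created under a temporary name and renamed over the old
one, so the supervisor never observes a missing link.

```python
import os
import tempfile
from pathlib import Path


def activate_version(root: Path, version: str, link_name: str = "current") -> None:
    """Atomically point root/link_name at root/versions/<version>."""
    target = root / "versions" / version
    if not target.is_dir():
        raise FileNotFoundError(f"version not installed: {target}")
    link = root / link_name
    tmp = root / f".{link_name}.tmp"
    # Clean up a leftover temporary link from an interrupted switch.
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    # Relative target so the tree can be relocated as a whole.
    os.symlink(Path("versions") / version, tmp)
    # Renaming over the existing link is atomic on POSIX file systems.
    os.replace(tmp, link)


def active_version(root: Path, link_name: str = "current") -> str:
    """Return the version name the link currently points at."""
    return Path(os.readlink(root / link_name)).name


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    for v in ("1.0.0", "1.1.0"):
        (root / "versions" / v).mkdir(parents=True)
    activate_version(root, "1.1.0")   # update
    activate_version(root, "1.0.0")   # revert to the fallback version
    print(active_version(root))
```

Updating and reverting are the same operation with a different target, which
keeps the rollback path as well exercised as the update path.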
### Update service
The update service will implement incremental roll-outs in order to limit the
blast radius of updates. This can be done by using a unique identifier for
devices. This identifier can be sent to the service as part of the request for
update data such that a tailored response can be returned. By hashing the device
id and update information (for example the software version), devices can be
placed in buckets and the service can deterministically target a subset of the
fleet. Multiple requests for update data by the same device must return the same
response for a given deployment progress. Such a roll-out can be managed using
readily available open-source CI/CD software.
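
As a rough illustration of the bucketing idea, the sketch below hashes the
device id together with the version and maps the result onto 100 buckets. The
function names, the choice of SHA-256, and the bucket count are assumptions for
the example only.

```python
import hashlib


def rollout_bucket(device_id: str, version: str, buckets: int = 100) -> int:
    """Map a (device, version) pair to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(f"{device_id}:{version}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def is_targeted(device_id: str, version: str, rollout_percent: int) -> bool:
    """True if the device falls within the current roll-out percentage."""
    return rollout_bucket(device_id, version) < rollout_percent
```

Because the bucket depends only on the device id and the version, repeated
requests from the same device get the same answer for a given roll-out
percentage, and raising the percentage only ever adds devices. Including the
version in the hash means a different subset of the fleet goes first for each
release.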
#### Roll-out
Standard industry practice should be followed: pre-production environments run
automated tests and canary devices provide an early signal. Customers should
also have the option to designate devices as "non-production" so that they
receive updates before production devices.
Update timing, that is, when the application should be restarted, depends on the
nature of the application itself and how it affects the functionality of the
overall system. Application metrics can be used to determine the device's status
and make a decision on when a restart can be triggered. Additional information
such as user-provided update schedules can help limit negative impact.
If bandwidth limits are problematic, local devices can be leveraged as caches
and serve the update payload to neighbors. Also, diff-based updates can be
leveraged to further reduce the amount of data transferred.
### Update monitoring
While rolling out updates, it's important to closely monitor the fleet and stop
the roll-out as soon as issues are detected. The application should be
instrumented (using OpenTelemetry for example). The metrics will be used by the
update daemon to initiate local rollbacks in case of unhealthy signals as well
as by the backend service for fleet-wide rollbacks.
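
One possible shape for the local rollback decision is a sliding window over
recent health checks, sketched below. The class name, window size, and failure
threshold are hypothetical; the document does not prescribe how health signals
are aggregated.

```python
from collections import deque


class HealthWindow:
    """Track recent health checks and signal when a rollback is warranted."""

    def __init__(self, size: int = 5, max_failures: int = 3) -> None:
        # Only the most recent `size` results are kept.
        self.results: deque[bool] = deque(maxlen=size)
        self.max_failures = max_failures

    def record(self, healthy: bool) -> bool:
        """Record one check; return True if a rollback should be triggered."""
        self.results.append(healthy)
        failures = sum(1 for ok in self.results if not ok)
        return failures >= self.max_failures
```

The same aggregation could run in the backend over fleet-wide metrics to decide
when to halt or revert a roll-out.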
# Stereo-vision and device-specific calibration
Note: The problems listed in this section are not part of my expertise. Consider
this a preliminary assessment based on limited research.
## Assumptions
1. No hardware is available to assist (GPS, laser, etc.) and only video data can
be used
2. In the field, calibration targets such as checkerboards cannot be used
3. The function mapping zoom and distance to focus is linear
4. Variance between units of the same model is limited
## Per-model calibration
First, assuming samples are available, model-specific focus mappings can be
built for supported cameras. Within a controlled environment where
easy-to-detect objects are placed at known distances, we can sample a range
of values for zoom and focus. The correctness of the focus can be assessed
automatically with tools like OpenCV, using sharpness measures such as the
Brenner gradient or the Laplace operator.
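
To make the sharpness measure concrete, here is a minimal pure-Python sketch of
the Brenner gradient over a grayscale image given as rows of pixel intensities.
The function names are illustrative; in practice the same computation would
more likely be done on arrays with OpenCV or NumPy.

```python
def brenner_gradient(image: list[list[int]]) -> int:
    """Sum of squared differences between pixels two columns apart.

    Higher values indicate stronger edges, i.e. a sharper image.
    """
    total = 0
    for row in image:
        for x in range(len(row) - 2):
            diff = row[x + 2] - row[x]
            total += diff * diff
    return total


def best_focus(frames_by_focus: dict[float, list[list[int]]]) -> float:
    """Return the focus setting whose frame scores the highest sharpness."""
    return max(frames_by_focus, key=lambda f: brenner_gradient(frames_by_focus[f]))
```

Sweeping the focus at a fixed zoom and known distance and keeping the setting
returned by `best_focus` yields one calibration sample.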
From the set of points gathered, we can use a linear regression to build
a function that takes the zoom and distance as inputs and returns the correct
focus level as the output.
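
A minimal sketch of such a fit is shown below, assuming zoom and distance are
treated as two separate inputs and the model is
`focus = a * zoom + b * distance + c`. It solves the least-squares normal
equations in pure Python; the function names are hypothetical, and a numerical
library would normally be used instead.

```python
def _solve(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(vector)
    aug = [list(matrix[i]) + [vector[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("samples do not determine a unique fit")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_focus_model(samples: list[tuple[float, float, float]]) -> tuple[float, ...]:
    """Least-squares fit of (zoom, distance, focus) samples; returns (a, b, c)."""
    ata = [[0.0] * 3 for _ in range(3)]
    aty = [0.0] * 3
    for zoom, distance, focus in samples:
        row = (zoom, distance, 1.0)  # trailing 1.0 models the intercept c
        for i in range(3):
            aty[i] += row[i] * focus
            for j in range(3):
                ata[i][j] += row[i] * row[j]
    return tuple(_solve(ata, aty))


def predict_focus(model: tuple[float, ...], zoom: float, distance: float) -> float:
    """Evaluate the fitted linear model for a given zoom and distance."""
    a, b, c = model
    return a * zoom + b * distance + c
```

If the residuals of the fit turn out to be large, that would be a sign that the
linearity assumption does not hold for a given camera model.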
## Stereo vision
To leverage stereo vision, the relative position of the cameras must be
obtained. OpenCV includes several [tools][2] for this purpose. While this is
outside my area of expertise, my assessment is that feature detection such as
ORB could be used to match points between cameras pointed at the same target,
without a checkerboard. The essential matrix and relative pose can then be
used to perform triangulation.
### Follow up
Other techniques for positioning a camera in 3D space exist in a related
problem space: Simultaneous Localization and Mapping (SLAM) and Structure from
Motion (SfM). For example, bundle adjustment is a commonly used approach. This
applies only to moving sensors (which can be cameras). Could a variation of
such a solution be applicable to PTZ cameras?
# Links
[1]: Support for volatile storage only would be possible if the local network is
always assumed to include an instance of the update service or if the offline
mode is not necessary.
[2]: OpenCV 3d calibration <https://docs.opencv.org/4.x/d9/d0c/group__calib3d.html>