Update iris
165
iris.md
@@ -1,6 +1,161 @@
# Embedded software update system

Iris' software ships as part of third-party firmware. The lack of control over
the firmware upgrade schedule and flow greatly limits our ability to iterate
quickly and ship new features. This document proposes an architecture that
decouples Iris' application updates from full firmware upgrades, enabling fast,
safe, and continuous delivery.

## Assumptions

1. Devices run a Linux-based OS
2. Applications have read-write access to non-volatile storage
3. A fixed entry-point is launched and supervised by the host system
4. Applications have no access to host storage outside of their sandbox
5. Applications run with restricted privileges and cannot escalate them

## Architecture overview

The goal of this design is to limit the impact of the static nature of embedded
device firmware. A firmware is often a complete OS image written to internal
flash memory and mounted as a read-only file system. This provides a level of
safety: a reset clears all writable memory (volatile and non-volatile) and lets
the device recover from errors. However, firmware updates tend to be very
involved processes, which limits the ability to provide automatic updates. To
avoid this problem, we can create an out-of-band update mechanism specific to
Iris' software. This requires write access to non-volatile[1] memory to store
the application, and a supervisor process (like s6-supervise) to ensure it
always runs.

Besides the application itself, the supervisor will run an update daemon,
a process responsible for local updates as well as health monitoring, so that
rollbacks can be triggered when issues arise.

```mermaid
block-beta
  columns 9
  block:ApplicationBlock:9
    up["update daemon"]
    space
    application
  end
  block:Storage:3
    columns 1
    storageA["Active version"]
    storageB["Fallback version"]
  end
  space:3
  supervisor:3
  block:host:9
    columns 1
    runtime
    OS
  end
  supervisor --"runs"--> up
  supervisor --"runs"--> application
  up -- "monitors" --> application
  supervisor --"reads"--> Storage
  up --"updates"--> Storage
```

### Update daemon

The update daemon will run as a separate process and will periodically query the
update service to check whether a newer version is available. If an update is
available, it will fetch the data, validate its integrity via a cryptographic
signature, and write it to local storage. To prevent faulty updates from
breaking the application, at least one other version of the software is kept.
The update daemon will monitor the health of the application and revert updates
that result in failed health checks. This can be achieved in different ways
depending on the system's features, but they all boil down to adding a level of
indirection on top of the application. For example, symbolic links or an
overlay file system can be used. With symbolic links, the update daemon would
change the target of a symlink to update or revert.
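
The symlink approach could be sketched as follows. This is a minimal
illustration, assuming a POSIX file system where renaming over an existing
path is atomic. The `current` link name, the `versions/` layout, and the
`activate_version` helper are hypothetical and not part of the design above.

```python
import os
import tempfile


def activate_version(link_path, version_dir):
    """Atomically point `link_path` at `version_dir`.

    A temporary symlink is created next to the final one and renamed over
    it. On POSIX, rename replaces the destination atomically, so the
    supervisor never observes a missing link.
    """
    link_dir = os.path.dirname(os.path.abspath(link_path))
    tmp_link = os.path.join(link_dir, "." + os.path.basename(link_path) + ".tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # leftover from an interrupted switch
    os.symlink(version_dir, tmp_link)
    os.replace(tmp_link, link_path)


def demo():
    """Update from 1.0.0 to 1.1.0, then revert; returns both link targets."""
    with tempfile.TemporaryDirectory() as root:
        v1 = os.path.join(root, "versions", "1.0.0")  # hypothetical layout
        v2 = os.path.join(root, "versions", "1.1.0")
        os.makedirs(v1)
        os.makedirs(v2)
        current = os.path.join(root, "current")

        activate_version(current, v1)
        activate_version(current, v2)  # update
        after_update = os.path.basename(os.readlink(current))
        activate_version(current, v1)  # revert after failed health checks
        after_revert = os.path.basename(os.readlink(current))
        return after_update, after_revert
```

Keeping the fallback version on disk makes a revert the same operation as an
update: only the link target changes, and the supervisor restarts the
entry-point through the link.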

### Update service

The update service will implement incremental roll-outs in order to limit the
blast radius of updates. This can be done by using a unique identifier for
each device. The identifier is sent to the service as part of the request for
update data so that a tailored response can be returned. By hashing the device
id together with update information (for example the software version),
devices can be placed in buckets and the service can deterministically target
a subset of the fleet. Multiple requests for update data by the same device
must return the same response for a given deployment progress. Such a roll-out
can be managed using readily available open source CI/CD software.
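
The bucketing described above might look like the following sketch. The choice
of SHA-256, the bucket count of 100, and the function names are assumptions
made for illustration; the design only requires that the mapping is
deterministic.

```python
import hashlib

BUCKETS = 100  # assumed granularity: each bucket is roughly 1% of the fleet


def bucket_for(device_id, version):
    """Map a (device, version) pair to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{device_id}:{version}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def is_targeted(device_id, version, rollout_percent):
    """True if the device should be offered `version` at this progress."""
    return bucket_for(device_id, version) < rollout_percent
```

Because a device's bucket is fixed for a given version, raising
`rollout_percent` only ever adds devices to the targeted set, and repeated
requests at the same progress get the same answer. Including the version in
the hash means each release picks a different early subset of the fleet.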

#### Roll-out

Industry-standard practices should be followed: pre-production environments
run automated tests and canary devices provide an early signal. Customers
should also have the option to designate devices as "non-production" so that
they are selected before production ones.

Update timing, that is, when the application should be restarted, depends on
the nature of the application itself and how it affects the functionality of
the overall system. Application metrics can be used to determine the device's
status and decide when a restart can be triggered. Additional information such
as user-provided update schedules can help limit negative impact.

If bandwidth limits are problematic, local devices can be leveraged as caches
that serve the update payload to their neighbors. Diff-based updates can also
be used to further reduce the amount of data transferred.

### Update monitoring

While rolling out updates, it's important to closely monitor the fleet and stop
the roll-out as soon as issues are detected. The application should be
instrumented (using OpenTelemetry, for example). The metrics will be used by
the update daemon to initiate local rollbacks when signals are unhealthy, and
by the backend service to initiate fleet-wide rollbacks.
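
On the device side, the local rollback decision could be as simple as counting
failed health checks over a sliding window. The sketch below is illustrative
only: the `HealthMonitor` class, the window size, and the failure threshold
are hypothetical, and a real health signal would be derived from the
application's metrics.

```python
from collections import deque


class HealthMonitor:
    """Tracks recent health-check results and decides when to roll back."""

    def __init__(self, window=5, max_failures=3):
        self.results = deque(maxlen=window)  # keeps the most recent checks only
        self.max_failures = max_failures

    def record(self, healthy):
        """Store the outcome of one health check."""
        self.results.append(healthy)

    def should_roll_back(self):
        """True once the window holds at least `max_failures` failed checks."""
        failures = sum(1 for ok in self.results if not ok)
        return failures >= self.max_failures
```

Using a window rather than a single check avoids reverting an update because
of one transient failure.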

# Stereo-vision and device-specific calibration

Note: The problems listed in this section are outside my area of expertise.
Consider this a preliminary assessment based on limited research.

## Assumptions

1. No hardware is available to assist (GPS, laser, etc.) and only video data can
   be used
2. In the field, calibration targets such as checkerboards can't be used
3. The function mapping zoom and distance to focus is linear
4. Variance between units of a model is limited

## Per-model calibration

First, assuming sample units are available, model-specific focus mappings can
be built for supported cameras. In a controlled environment where
easy-to-detect objects are placed at known distances, we can sample a range of
values for zoom and focus. The quality of the focus can be assessed
automatically with sharpness measures such as the Brenner gradient or the
variance of the Laplacian, which can be computed with tools like OpenCV.
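
As an illustration of such a sharpness measure, here is a pure-Python sketch
of the variance of the Laplacian. It is a stand-in for what would normally be
done with an image library (OpenCV provides `cv2.Laplacian` for the filtering
step). The function names and the nested-list image representation are
assumptions made to keep the example self-contained.

```python
def laplacian_variance(image):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    `image` is a 2-D list of grayscale intensities. Sharper images have
    stronger edges, so their Laplacian response varies more.
    """
    height, width = len(image), len(image[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                image[y - 1][x] + image[y + 1][x]
                + image[y][x - 1] + image[y][x + 1]
                - 4 * image[y][x]
            )
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def best_focus(samples):
    """Return the focus setting whose image has the highest sharpness score.

    `samples` maps a focus setting to the image captured at that setting.
    """
    return max(samples, key=lambda focus: laplacian_variance(samples[focus]))
```

For each zoom level and object distance, the focus setting returned by
`best_focus` becomes one data point of the calibration set.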

From the set of points gathered, we can use linear regression to build
a function that takes the zoom and distance as inputs and returns the correct
focus level.
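
Under the linearity assumption stated earlier, the regression could be
sketched as an ordinary least-squares fit of `focus = a*zoom + b*distance + c`.
The function names and the tuple layout of the samples are hypothetical; in
practice a numerical library would likely replace the hand-written solver.

```python
def fit_focus_model(samples):
    """Least-squares fit of focus = a*zoom + b*distance + c.

    `samples` is a list of (zoom, distance, focus) tuples. Returns (a, b, c).
    """
    # Build the normal equations (X^T X) w = X^T y for rows [zoom, distance, 1].
    xtx = [[0.0] * 3 for _ in range(3)]
    xty = [0.0] * 3
    for zoom, distance, focus in samples:
        row = (zoom, distance, 1.0)
        for i in range(3):
            xty[i] += row[i] * focus
            for j in range(3):
                xtx[i][j] += row[i] * row[j]

    # Solve the 3x3 system by Gauss-Jordan elimination with partial pivoting.
    aug = [xtx[i] + [xty[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("samples do not vary zoom and distance independently")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return tuple(aug[i][3] / aug[i][i] for i in range(3))


def predict_focus(model, zoom, distance):
    """Evaluate the fitted plane at a given zoom and distance."""
    a, b, c = model
    return a * zoom + b * distance + c
```

If the measured points turn out not to lie close to a plane, the same
procedure applies with extra input terms, at the cost of needing more samples
per camera model.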

## Stereo vision

In order to leverage stereo vision, the relative position of the cameras must
be obtained. OpenCV includes several [tools][2] for this purpose. While this is
outside my field of competence, my assessment is that feature detection such as
ORB could be used to match points between cameras aimed at the same target,
without a checkerboard. The essential matrix and the relative pose recovered
from it can then be used to perform triangulation.
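
For reference, the relation that the essential matrix encodes can be written
as below. The notation is an assumption introduced here: $K_1, K_2$ are the
camera intrinsic matrices, $p_1, p_2$ a pair of matched pixel coordinates in
homogeneous form, and $(R, t)$ the rotation and translation taking points from
the first camera frame to the second.

```latex
% Normalized image coordinates of a matched point pair
x_i = K_i^{-1} p_i, \qquad i \in \{1, 2\}

% Epipolar constraint satisfied by every correct match
x_2^{\top} E \, x_1 = 0, \qquad E = [t]_{\times} R

% Projection matrices used to triangulate a 3D point X from the match
P_1 = K_1 \, [\, I \mid 0 \,], \qquad P_2 = K_2 \, [\, R \mid t \,], \qquad p_i \sim P_i X
```

Note that the essential matrix only determines $t$ up to scale, so absolute
distances require one extra known measurement, such as the baseline between
the cameras.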

### Follow up

Other techniques for positioning a camera in 3D space exist in a related
problem space: Simultaneous Localization and Mapping (SLAM) and Structure from
Motion. For example, bundle adjustment is a commonly used approach. These
techniques apply only to moving sensors (which can be cameras). Could
a variation of such a solution be applicable to PTZ cameras?

# Links

[1]: Support for volatile storage only would be possible if the local network is
always assumed to include an instance of the update service, or if offline
mode is not required.

[2]: OpenCV 3D calibration <https://docs.opencv.org/4.x/d9/d0c/group__calib3d.html>