Update iris
163
iris.md
@@ -1,6 +1,161 @@
# Embedded software update system

Iris' software ships as part of third-party firmware images. Because Iris does
not control the firmware upgrade schedule or flow, its ability to iterate
quickly and ship new features is greatly limited. This document proposes an
architecture that decouples Iris' application updates from full firmware
upgrades, enabling fast, safe, and continuous delivery.

## Assumptions

1. Devices run a Linux-based OS
2. Applications have read-write access to non-volatile storage
3. A fixed entry-point is launched and supervised by the host system
4. Applications have no access to the host storage outside of their sandbox
5. Applications have restricted privileges and cannot escalate them

## Architecture overview

The goal of this design is to limit the impact of the static nature of embedded
device firmware. Firmware is often a complete OS image meant to be written to
internal flash memory and mounted as a read-only file system. This provides
a level of safety by offering a reset mechanism that clears all writable memory
(volatile and non-volatile) to recover from errors. However, firmware updates
tend to be very involved processes, which limits the ability to provide
automatic updates. To avoid this problem, we can create an out-of-band update
mechanism specific to Iris' software. For this, we need write access to
non-volatile[1] memory to store the application, and a supervisor process (like
s6-supervise) to ensure it always runs.

Besides the application itself, the supervisor will run an update manager
process (the update daemon) responsible for local updates as well as health
monitoring, so that rollbacks can be triggered in case of issues.

```mermaid
block-beta
  columns 9
  block:ApplicationBlock:9
    up["update daemon"]
    space
    application
  end
  block:Storage:3
    columns 1
    storageA["Active version"]
    storageB["Fallback version"]
  end
  space:3
  supervisor:3
  block:host:9
    columns 1
    runtime
    OS
  end
  supervisor --"runs"--> up
  supervisor --"runs"--> application
  up -- "monitors" --> application
  supervisor --"reads"--> Storage
  up --"updates"--> Storage
```

### Update daemon

The update daemon will run as a separate process and will periodically query the
update service to check whether a newer version is available. If an update is
available, it will fetch the data, validate its integrity via a cryptographic
signature, and write it to local storage. To prevent faulty updates from
breaking the application, at least one other version of the software is kept.
The update daemon will monitor the health of the application and revert updates
that result in failed health checks. This can be achieved in different ways
depending on the system's features, but they all boil down to adding a level of
indirection on top of the application. For example, symbolic links or an overlay
file system can be used. With symbolic links, the update daemon would change the
target of a symlink to update or revert.
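The symlink-based indirection could look roughly like the sketch below. It is
an illustration only: the function names (`activate`, `switch_version`) and the
directory layout are assumptions, not part of the proposal. It relies on the
fact that renaming over an existing path is atomic on POSIX file systems, so
the supervisor never sees a missing link.

```python
import os
import tempfile
from typing import Optional


def activate(link_path: str, target_dir: str) -> None:
    """Atomically point the symlink at link_path to target_dir."""
    link_dir = os.path.dirname(os.path.abspath(link_path))
    # The scratch directory lives next to the link, hence on the same file
    # system, which is required for the rename to be atomic.
    scratch = tempfile.mkdtemp(dir=link_dir)
    tmp_link = os.path.join(scratch, "next")
    try:
        os.symlink(target_dir, tmp_link)
        # rename() replaces the old symlink itself; it does not follow it.
        os.replace(tmp_link, link_path)
    finally:
        if os.path.lexists(tmp_link):
            os.unlink(tmp_link)
        os.rmdir(scratch)


def switch_version(link_path: str, new_dir: str) -> Optional[str]:
    """Activate new_dir and return the previous target, kept for rollback."""
    previous = os.readlink(link_path) if os.path.islink(link_path) else None
    activate(link_path, new_dir)
    return previous
```

A rollback is then the same operation in reverse: calling `activate` with the
value previously returned by `switch_version`.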

### Update service

The update service will implement incremental roll-outs in order to limit the
blast radius of updates. This can be done by using a unique identifier for each
device. The identifier is sent to the service as part of the request for update
data so that a tailored response can be returned. By hashing the device id
together with update information (for example, the software version), devices
can be placed in buckets and the service can deterministically target a subset
of the fleet. Multiple requests for update data by the same device must return
the same response for a given deployment progress. Such a roll-out can be
managed using readily available open source CI/CD software.
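The bucketing scheme can be sketched as follows. The bucket count, the choice
of SHA-256, and the function names are assumptions made for illustration; the
only property that matters is that the same device id and version always map
to the same bucket.

```python
import hashlib

BUCKETS = 100  # assumed granularity: one bucket per percent of the fleet


def rollout_bucket(device_id: str, version: str) -> int:
    """Map a (device, version) pair to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{device_id}:{version}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def is_targeted(device_id: str, version: str, rollout_percent: int) -> bool:
    """Return True if the device is included at the given roll-out progress."""
    return rollout_bucket(device_id, version) < rollout_percent
```

Because the version is part of the hash input, a different subset of devices
goes first for each release. Raising `rollout_percent` only ever adds devices,
so a device that already received the update keeps getting the same answer.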

#### Roll-out

Industry-standard practices should be followed: pre-production environments run
automated tests, and canary devices provide an early signal. Customers should
also have the option to designate devices as "non-production" so that they are
selected before production ones.

Update timing, that is, when the application should be restarted, depends on the
nature of the application itself and how it affects the functionality of the
overall system. Application metrics can be used to determine the device's status
and decide when a restart can be triggered. Additional information such as
user-provided update schedules can help limit negative impact.
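One way to combine a user-provided schedule with an application metric is
sketched below. The `MaintenanceWindow` type and the `busy` flag are
hypothetical; which metric indicates that the device is busy depends on the
application.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class MaintenanceWindow:
    """User-provided daily window during which restarts are acceptable."""

    start: time
    end: time

    def contains(self, now: time) -> bool:
        if self.start <= self.end:
            return self.start <= now < self.end
        # The window wraps past midnight, e.g. 22:00 to 04:00.
        return now >= self.start or now < self.end


def can_restart(now: time, window: MaintenanceWindow, busy: bool) -> bool:
    """Allow a restart only inside the window and while the device is idle."""
    return window.contains(now) and not busy
```

The update daemon would evaluate `can_restart` after staging a new version and
defer the switch until it returns `True`.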

If bandwidth limits are problematic, local devices can act as caches and serve
the update payload to their neighbors. Diff-based updates can also be used to
further reduce the amount of data transferred.

### Update monitoring

While rolling out updates, it's important to closely monitor the fleet and stop
the roll-out as soon as issues are detected. The application should be
instrumented (using OpenTelemetry, for example). The metrics will be used by the
update daemon to initiate local rollbacks in case of unhealthy signals, as well
as by the backend service for fleet-wide rollbacks.
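The local rollback decision could be as simple as counting failed health checks
over a sliding window. The sketch below is one possible policy; the class name
and the default thresholds are assumptions, not values taken from this design.

```python
from collections import deque


class HealthMonitor:
    """Tracks recent health checks and signals when a rollback is warranted."""

    def __init__(self, window: int = 5, max_failures: int = 3) -> None:
        # Only the most recent `window` results are kept.
        self.results = deque(maxlen=window)
        self.max_failures = max_failures

    def record(self, healthy: bool) -> bool:
        """Record one check; return True when a rollback should be triggered."""
        self.results.append(healthy)
        failures = sum(1 for ok in self.results if not ok)
        return failures >= self.max_failures
```

Using a window rather than a single failed check avoids rolling back on a
transient error, at the cost of reacting a few checks later.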

# Stereo-vision and device-specific calibration

Note: The problems discussed in this section are outside my area of expertise.
Consider this a preliminary assessment based on limited research.

## Assumptions

1. No hardware is available to assist (GPS, laser, etc.) and only video data can
   be used
2. In the field, calibration targets such as checkerboards can't be used
3. The function mapping zoom and distance to focus is linear
4. Variance between units of a model is limited

## Per-model calibration

First, assuming sample units are available, model-specific focus mappings can be
built for supported cameras. In a controlled environment where different
easy-to-detect objects are placed at known distances, we can sample a range of
values for zoom and focus. The correctness of the focus can be assessed
automatically using tools like OpenCV, whose primitives make it straightforward
to compute focus measures such as the Brenner gradient or the variance of the
Laplacian.
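As an illustration of a focus measure, here is a minimal pure-Python version of
the Brenner gradient on a grayscale image given as rows of pixel intensities.
A real pipeline would more likely use OpenCV or NumPy arrays for speed; the
function names and the image representation here are assumptions.

```python
from typing import Mapping, Sequence

Image = Sequence[Sequence[float]]


def brenner_gradient(image: Image) -> float:
    """Sum of squared differences between pixels two columns apart.

    Sharper images have stronger edges and therefore a higher score.
    """
    total = 0.0
    for row in image:
        for x in range(len(row) - 2):
            diff = row[x + 2] - row[x]
            total += diff * diff
    return total


def best_focus(samples: Mapping[int, Image]) -> int:
    """Return the focus setting whose captured image scores highest."""
    return max(samples, key=lambda focus: brenner_gradient(samples[focus]))
```

Scores are only comparable between images of the same scene at the same
resolution, which is the case when sweeping the focus setting in a fixed setup.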

From the set of points gathered, we can use a linear regression to build
a function that takes the zoom and distance as inputs and returns the correct
focus level.
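Under the linearity assumption, the fit amounts to an ordinary least-squares
problem for `focus = a * zoom + b * distance + c`. The sketch below solves the
normal equations directly in pure Python to stay self-contained; the function
names and the sample format are assumptions, and a library such as NumPy would
be the more usual choice.

```python
from typing import Sequence, Tuple

Model = Tuple[float, float, float]


def fit_focus_model(samples: Sequence[Tuple[float, float, float]]) -> Model:
    """Least-squares fit from (zoom, distance, focus) samples."""
    # Build the normal equations (X^T X) w = X^T y for rows [zoom, distance, 1].
    xtx = [[0.0] * 3 for _ in range(3)]
    xty = [0.0] * 3
    for zoom, distance, focus in samples:
        row = (zoom, distance, 1.0)
        for i in range(3):
            xty[i] += row[i] * focus
            for j in range(3):
                xtx[i][j] += row[i] * row[j]

    # Solve the 3x3 system by Gauss-Jordan elimination with partial pivoting.
    aug = [xtx[i] + [xty[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("samples do not determine a unique model")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    a, b, c = (aug[i][3] / aug[i][i] for i in range(3))
    return a, b, c


def predict_focus(model: Model, zoom: float, distance: float) -> float:
    """Evaluate the fitted model for a given zoom level and distance."""
    a, b, c = model
    return a * zoom + b * distance + c
```

The fit needs samples that vary in both zoom and distance; otherwise the system
is singular and the function raises `ValueError`.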

## Stereo vision

In order to leverage stereo vision, the relative position of the cameras must be
obtained. OpenCV includes several [tools][2] for this purpose. While this is
outside my area of expertise, my assessment is that feature detection (ORB, for
example) could be used to match points between multiple cameras pointed towards
the same target, without using a checkerboard. The essential matrix and the
relative pose recovered from it can then be used to perform triangulation.
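To make the last step concrete: once the relative pose is known, each matched
feature defines one viewing ray per camera, and triangulation estimates where
the two rays meet. The sketch below uses the simple midpoint method in pure
Python rather than OpenCV's `cv2.triangulatePoints`. It assumes camera centers
and ray directions are already expressed in a common coordinate frame, and all
names are illustrative.

```python
from typing import Sequence, Tuple

Vec = Sequence[float]


def _dot(u: Vec, v: Vec) -> float:
    return sum(a * b for a, b in zip(u, v))


def triangulate_midpoint(c1: Vec, d1: Vec, c2: Vec, d2: Vec) -> Tuple[float, ...]:
    """Midpoint of the shortest segment joining two viewing rays.

    c1 and c2 are camera centers; d1 and d2 are ray directions.
    """
    w = [a - b for a, b in zip(c1, c2)]
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; the point cannot be triangulated")
    # Parameters of the closest points c1 + s*d1 and c2 + t*d2.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return tuple((u + v) / 2.0 for u, v in zip(p1, p2))
```

With noisy feature matches the rays rarely intersect exactly, which is why the
midpoint of the closest approach is used as the estimate.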

### Follow up

Other techniques for positioning a camera in 3D space exist in a related problem
space: Simultaneous Localization and Mapping (SLAM) and Structure from Motion
(SfM). Bundle adjustment, for example, is a commonly used approach. These
techniques apply only to moving sensors (which can be cameras). Could
a variation of such a solution be applicable to PTZ cameras?

# Links

[1]: Supporting volatile storage only would be possible if the local network is
always assumed to include an instance of the update service, or if offline
mode is not necessary.

[2]: https://docs.opencv.org/4.x/d9/d0c/group__calib3d.html "OpenCV 3D calibration"