Embedded software update system
Iris' software is shipped as part of third-party firmware. The lack of control over the firmware upgrade schedule and flow greatly limits Iris' ability to iterate quickly and ship new features. This document proposes an architecture to decouple Iris' application updates from full firmware upgrades, enabling fast, safe, and continuous delivery.
Assumptions
- Devices run a Linux-based OS
- Applications have read-write access to non-volatile storage
- A fixed entry-point is launched and supervised by the host system
- Applications have no access to the host storage outside of their sandbox
- Applications have restricted privilege and cannot escalate
Architecture overview
The goal of this design is to limit the impact of the static nature of embedded device firmware. Firmware is often shipped as a complete OS image that is written to internal flash memory and mounted as a read-only file system. This provides a measure of safety: clearing all writable memory (volatile and non-volatile) acts as a reset mechanism to recover from errors. However, firmware updates tend to be very involved processes, which limits the ability to deliver automatic updates. To avoid this problem, we can create an out-of-band update mechanism specific to Iris' software. This requires write access to non-volatile[1] memory to store the application, and a supervisor process (such as s6-supervise) to ensure it is always running.
Besides the application itself, the supervisor will run an update daemon process responsible for local updates as well as health monitoring, so that rollbacks can be triggered when issues arise.
block-beta
columns 9
block:ApplicationBlock:9
up["update daemon"]
space
application
end
space:9
block:Storage:3
columns 1
storageA["Active version"]
storageB["Fallback version"]
end
space:3
supervisor:3
block:host:9
columns 1
runtime
OS
end
supervisor --"runs"--> up
supervisor --"runs"--> application
up -- "monitors" --> application
supervisor --"reads"--> Storage
up --"updates"--> Storage
Update daemon
The update daemon will run as a separate process and will periodically query the update service to see whether a newer version is available. If an update is available, it will fetch the data, validate its integrity via a cryptographic signature, and write it to local storage. To prevent a faulty update from breaking the application, at least one other version of the software is kept. The update daemon will monitor the health of the application and revert updates that result in failed health checks. How this is achieved depends on the features available on the system, but it essentially boils down to adding a level of indirection on top of the application: for example, a symbolic link or an overlay file system. The update daemon would change the target of a symlink to update or revert.
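A minimal sketch of the symlink indirection, assuming a hypothetical /data/iris directory inside the application sandbox and versioned install directories (the paths and helper names are illustrative, not a prescribed layout):

import os
import shutil

# Hypothetical layout: each version is unpacked into its own directory and the
# supervisor starts the entry point through the "current" symlink.
BASE_DIR = "/data/iris"            # assumed sandbox-writable, non-volatile storage
CURRENT_LINK = os.path.join(BASE_DIR, "current")
FALLBACK_LINK = os.path.join(BASE_DIR, "fallback")

def install_version(version: str, payload_path: str) -> str:
    """Unpack an already-validated payload into a version-specific directory."""
    target = os.path.join(BASE_DIR, f"app-{version}")
    shutil.unpack_archive(payload_path, target)
    return target

def switch_symlink(link: str, target: str) -> None:
    """Repoint a symlink by creating a temporary link and renaming it (atomic on POSIX)."""
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, link)

def activate(version_dir: str) -> None:
    """Promote a new version; the previously active version becomes the fallback."""
    if os.path.lexists(CURRENT_LINK):
        switch_symlink(FALLBACK_LINK, os.readlink(CURRENT_LINK))
    switch_symlink(CURRENT_LINK, version_dir)

def rollback() -> None:
    """Revert to the fallback version after failed health checks."""
    if os.path.lexists(FALLBACK_LINK):
        switch_symlink(CURRENT_LINK, os.readlink(FALLBACK_LINK))

The supervisor only ever launches the entry point through the "current" link, so switching or reverting versions never requires changing the host configuration.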
Update service
The update service will implement incremental roll-outs in order to limit the blast radius of updates. This can be done by assigning a unique identifier to each device. The identifier is sent to the service as part of the request for update data so that a tailored response can be returned. By hashing the device id together with update information (for example the software version), devices can be placed in buckets and the service can deterministically target a subset of the fleet. Multiple requests for update data by the same device must return the same response for a given deployment progress. Such roll-outs can be managed using readily available open source CI/CD software.
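A minimal sketch of the deterministic bucketing, assuming the device id and target version are hashed into 100 buckets and deployment progress is expressed as a percentage (names and parameters are illustrative):

import hashlib

def bucket_for(device_id: str, version: str, buckets: int = 100) -> int:
    """Deterministically map a device/version pair to a bucket in [0, buckets)."""
    digest = hashlib.sha256(f"{device_id}:{version}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def is_update_offered(device_id: str, version: str, rollout_percent: int) -> bool:
    """The same device always gets the same answer for a given deployment progress."""
    return bucket_for(device_id, version) < rollout_percent

# Example: offer version 2.3.0 to 10% of the fleet.
print(is_update_offered("device-1234", "2.3.0", rollout_percent=10))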
Roll-out
Industry-standard practices should be followed: pre-production environments run automated tests and canary devices provide an early signal. Customers should also have the option to designate devices as "non-production" so that they are selected before production ones.
Update timing, that is, when the application should be restarted, depends on the nature of the application itself and how it affects the functionality of the overall system. Application metrics can be used to determine the device's status and decide when a restart can be triggered. Additional information such as user-provided update schedules can help limit negative impact.
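A minimal sketch of such a restart gate, assuming a hypothetical idleness metric (active streams) and a user-provided maintenance window; the actual signal depends on the application:

from datetime import datetime, time

def in_maintenance_window(now: datetime, start: time, end: time) -> bool:
    """User-provided schedule, e.g. 02:00-04:00 local time."""
    return start <= now.time() < end

def can_restart(active_streams: int, now: datetime) -> bool:
    """Only restart when the device is idle and inside the allowed window.

    active_streams stands in for whichever application metric best reflects
    whether a restart would be disruptive on this particular system.
    """
    return active_streams == 0 and in_maintenance_window(now, time(2, 0), time(4, 0))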
If bandwidth limits are problematic, local devices can act as caches and serve the update payload to their neighbors. Diff-based updates can also be used to further reduce the amount of data transferred.
Update monitoring
While rolling out updates, it is important to closely monitor the fleet and stop the roll-out as soon as issues are detected. The application should be instrumented (using OpenTelemetry, for example). The resulting metrics will be used by the update daemon to initiate local rollbacks on unhealthy signals, and by the backend service for fleet-wide rollbacks.
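A minimal instrumentation sketch using the OpenTelemetry Python API; the metric names and attributes are assumptions, and a real deployment would also configure an SDK/exporter so that both the update daemon and the backend service can consume the data:

from opentelemetry import metrics

# Without an SDK and exporter configured these calls are no-ops.
meter = metrics.get_meter("iris.application")

# Hypothetical health signals: restart count and failed-operation count.
restart_counter = meter.create_counter(
    "iris.app.restarts", description="Number of application restarts"
)
error_counter = meter.create_counter(
    "iris.app.errors", description="Number of failed operations"
)

def record_restart(version: str) -> None:
    restart_counter.add(1, {"app.version": version})

def record_error(version: str, kind: str) -> None:
    error_counter.add(1, {"app.version": version, "error.kind": kind})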
Stereo-vision and device-specific calibration
Note: The problems listed in this section are outside my area of expertise. Consider this a preliminary assessment based on limited research.
Assumptions
- No hardware is available to assist (GPS, laser, etc) and only video data can be used
- In the field, things like checkerboards can't be used
- The mapping from zoom and distance to focus is linear
- Variance between units of a model is limited
Per-model calibration
First, assuming sample units are available, model-specific focus mappings can be built for supported cameras. In a controlled environment where easy-to-detect objects are placed at known distances, we can sample a range of zoom and focus values. The correctness of the focus can be assessed automatically with sharpness measures such as the Brenner gradient or the Laplacian operator, which can be computed using tools like OpenCV.
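A minimal sketch of an automated focus-quality score, assuming the variance of the Laplacian computed with OpenCV over a region of interest around the known object:

import cv2
import numpy as np

def focus_score(frame_bgr: np.ndarray, roi: tuple[int, int, int, int]) -> float:
    """Higher values indicate a sharper (better focused) region of interest.

    roi is (x, y, width, height) around the known calibration object.
    """
    x, y, w, h = roi
    gray = cv2.cvtColor(frame_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

# During calibration: for a given (zoom, distance), sweep focus values and keep
# the focus level that maximizes focus_score on the captured frame.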
From the set of points gathered, we can use a linear regression to build a function that takes the zoom and distance as input and returns the correct focus level as output.
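A minimal sketch of that regression, assuming the linearity assumption above holds and using an ordinary least-squares fit (the sample values are illustrative only):

import numpy as np

# Samples gathered in the controlled environment: (zoom, distance, best focus).
samples = np.array([
    # zoom, distance_m, focus
    [1.0,  2.0, 0.12],
    [1.0, 10.0, 0.48],
    [4.0,  2.0, 0.20],
    [4.0, 10.0, 0.71],
])  # illustrative values only

X = np.column_stack([samples[:, 0], samples[:, 1], np.ones(len(samples))])
y = samples[:, 2]
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_focus(zoom: float, distance: float) -> float:
    """Predicted focus level for a given zoom and estimated distance."""
    a, b, c = coeffs
    return a * zoom + b * distance + c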
Stereo vision
In order to leverage stereo vision, the relative position of the cameras must be obtained. OpenCV includes several tools[2] for this purpose. While this is outside my field of competence, my assessment is that feature detection such as ORB could be leveraged to let multiple cameras point at the same target without using a checkerboard, and that the essential matrix and recovered pose could then be used to perform triangulation.
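A minimal sketch of this idea with OpenCV, assuming both cameras are already intrinsically calibrated (shared camera matrix K) and observe the same scene; this is a preliminary outline, not a validated pipeline:

import cv2
import numpy as np

def relative_pose_and_points(img_a, img_b, K):
    """Estimate the relative pose between two views and triangulate matched points.

    img_a, img_b: grayscale images of the same scene from the two cameras.
    K: 3x3 intrinsic camera matrix (assumed identical for both cameras here).
    """
    orb = cv2.ORB_create(nfeatures=2000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)

    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])

    # Essential matrix and relative pose (rotation R, translation t, up to scale).
    E, mask = cv2.findEssentialMat(pts_a, pts_b, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, mask = cv2.recoverPose(E, pts_a, pts_b, K, mask=mask)

    # Triangulate correspondences into 3D points (up to scale).
    P_a = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P_b = K @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P_a, P_b, pts_a.T, pts_b.T)
    pts3d = (pts4d[:3] / pts4d[3]).T
    return R, t, pts3d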
Follow up
Other techniques that allow positioning a camera in 3D space address a related problem space: Simultaneous Localization and Mapping and Structure from Motion. Bundle adjustment, for example, is a commonly used approach. These techniques apply to moving sensors (which can be cameras). Could a variation of such a solution be applicable to PTZ cameras?
Links
[1]: Supporting devices with volatile storage only would be possible if the local network can always be assumed to include an instance of the update service, or if offline mode is not required.
[2]: OpenCV 3D calibration: https://docs.opencv.org/4.x/d9/d0c/group__calib3d.html