Day 2 Operations

As a Platform Specialist, you now have control over the Edge Devices installed across the Edge perimeter, and thanks to Flight Control you can roll out new Device versions or add Applications at scale in a declarative way.

OS Configuration

In the case of an image-based OS, it is usually best practice to include OS-level configuration in the OS image for maximum consistency and repeatability. To update configuration, a new OS image should be created and devices updated to that new image.

However, there are scenarios where this is impractical: for example, when configuration is missing from the image, needs to be specific to a device (such as a dynamically generated UUID), needs to be updatable at runtime without updating the OS image and rebooting, or contains secrets or keys that you do not want to publish with the image. For these cases, Flight Control allows users to declare a set of configuration files that shall be present on the device's file system.

Conceptually, this set of configuration files can be thought of as an additional, dynamic layer on top of the OS image's layers. The Flight Control Agent applies updates to this layer transactionally, ensuring that either all files have been successfully updated in the file system or all have been returned to their pre-update state. Further, if the user updates both a device's OS and its configuration set at the same time, the Flight Control Agent first updates the OS, then applies the specified configuration set on top.

After the Flight Control Agent has updated the configuration on disk, this configuration still needs to be activated. That means running services need to reload the new configuration into memory for it to become effective. If the update involves a reboot, systemd restarts services in the right order with the new configuration automatically. If the update does not involve a reboot, many services can detect changes to their configuration files and reload them automatically. When a service does not support this, you can use Device Lifecycle Hooks to specify rules such as "if configuration file X has changed, run command Y".
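As an illustration of the idea only (the file location and exact schema below are assumptions; check the Device Lifecycle Hooks documentation for the authoritative format), such a rule could look roughly like this:

# e.g. /etc/flightctl/hooks.d/afterupdating/my-service.yaml (path and schema are assumptions)
- if:
  - path: /etc/my-service/config.yaml
    op: [updated]
  run: systemctl reload my-service.service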

Users can specify a list of configuration sets, in which case the Flight Control Agent applies the sets in sequence, on top of each other, so that in case of conflict the "last one wins" (see the short sketch after the provider list below).

Configuration can come from multiple sources, called "configuration providers" in Flight Control. Flight Control currently supports the following configuration providers:

  • Git Config Provider: Fetches device configuration files from a Git repository.

  • Kubernetes Secret Provider: Fetches a Secret from a Kubernetes cluster and writes its content to the device’s file system.

  • HTTP Config Provider: Fetches device configuration files from an HTTP(S) endpoint.

  • Inline Config Provider: Allows specifying device configuration files inline in the device manifest without querying external systems.
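As a minimal sketch of the inline provider and of the "last one wins" ordering described above, a device or fleet spec could carry two configuration sets that write the same (purely illustrative) file; the set listed last determines the final content on disk:

      config:
      - inline:
        - content: "setting=default"
          path: /etc/example-app/app.conf
        name: base-config
      - inline:
        - content: "setting=site-override"
          path: /etc/example-app/app.conf
        name: site-override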

In our case, the configuration we need to add is the pull secret for MicroShift, which must be placed under /etc/crio/openshift-pull-secret.

You can see how to get your pull secret on this page.

Since the Device is managed under a Fleet, we need to update the Fleet template to include this configuration file (you might also want to add more labels so that, for this specific case, you select only the Devices that run MicroShift).

We can proceed either through the CLI (from your Workstation) or through the GUI.

$ flightctl get fleet/my-fleet -o yaml > fleet.yaml

apiVersion: v1alpha1
kind: Fleet
metadata:
  labels:
    env: test
  name: my-fleet
spec:
  selector:
    matchLabels:
      gpu: "true"
      microshift: "true"
  template:
    metadata:
      labels:
        fleet: my-fleet
    spec:
      applications: []
      config:
      - inline:
        - content: '{"auths":{"cloud.openshift.com":{"auth":"b3....Yw","email":"luferrar@redhat.com"}}}'
          path: /etc/crio/openshift-pull-secret
        name: pull-secret

and apply the above

$ flightctl apply -f fleet.yaml
platform flight control configuration file

After checking that the Local Edge Device has received the pull secret, we can start MicroShift again, this time with more success.

$ cat /etc/crio/openshift-pull-secret
...
$ sudo systemctl enable microshift --now

After 5 to 10 minutes (the time needed to download the container images), you should be able to see the pods running on the Local Edge Device:

$ sudo oc --kubeconfig /var/lib/microshift/resources/kubeadmin/kubeconfig get pods -A

NAMESPACE                  NAME                                       READY   STATUS    RESTARTS        AGE
kube-system                csi-snapshot-controller-6885679877-k28jq   1/1     Running   0               5d1h
kube-system                csi-snapshot-webhook-896bb5c65-nqp5t       1/1     Running   0               5d1h
openshift-dns              dns-default-9fbt7                          2/2     Running   0               4m35s
openshift-dns              node-resolver-rlkm6                        1/1     Running   0               5d1h
openshift-ingress          router-default-9f776c7d-xwwhm              1/1     Running   0               5d1h
openshift-ovn-kubernetes   ovnkube-master-thcz6                       4/4     Running   1 (4m43s ago)   5d1h
openshift-ovn-kubernetes   ovnkube-node-nkrwc                         1/1     Running   1 (4m44s ago)   5d1h
openshift-service-ca       service-ca-5d57956667-m2hlc                1/1     Running   0               5d1h
openshift-storage          lvms-operator-7f544467bc-pc752             1/1     Running   0               5d1h
openshift-storage          vg-manager-pqdvr                           1/1     Running   0               3m27s

Now that Microshift is up and running you might want to also monitor the status of this specific service across your Fleet.

platform flight control service1
platform flight control service2

We are now ready to deploy the Applications!

Application Deployment

You can deploy, update, or undeploy applications on a device by updating the list of applications in the device’s specification. The next time the agent checks in, it learns of the change in the specification, downloads any new or updated application packages and images from an OCI-compatible registry, and deploys them to the appropriate application runtime or removes them from that runtime.

At the moment the following runtimes and formats are supported:

  • Podman: podman-compose

  • Podman: Quadlet

  • MicroShift: Kubernetes manifests

Let’s recap for a minute, before deploying anything, what we would like the final application workflow to be.

ai deploy object detection webcam

On the local Nvidia Edge Device we should be running:

  • Inference Server app

  • Camera Stream Manager app

  • Actuator app

In the Cloud, on OpenShift Cluster:

  • Dashboard Backend app

  • Dashboard Frontend app

Assuming you haven't already deployed these in the AI Specialist App Deployment section, we will start with the Cloud deployment, since we need the URL of the Cloud App so that the alert app on the Edge can send its notifications to it.

Cloud Applications

You can follow these steps from your Workstation:

  1. Create a new OpenShift Project (USERNAME-test)

  2. Deploy the backend using the following YAML manifests (you can use the + icon on the top right corner of the OpenShift Console to paste them)

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: object-detection-dashboard-backend
      labels:
        app: object-detection-dashboard
        app.kubernetes.io/part-of: Dashboard
        app.openshift.io/runtime: "python"
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: object-detection-dashboard
          component: backend
      template:
        metadata:
          labels:
            app: object-detection-dashboard
            component: backend
        spec:
          containers:
          - name: backend
            image: quay.io/luisarizmendi/object-detection-dashboard-backend:v1
            ports:
            - containerPort: 5005
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: object-detection-dashboard-backend
      labels:
        app: object-detection-dashboard
    spec:
      selector:
        app: object-detection-dashboard
        component: backend
      ports:
      - protocol: TCP
        port: 5005
        targetPort: 5005
      type: ClusterIP
    ---
    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      name: object-detection-dashboard-backend
      labels:
        app: object-detection-dashboard
    spec:
      to:
        kind: Service
        name: object-detection-dashboard-backend
      port:
        targetPort: 5005
  3. Create the frontend application. This time you cannot just copy and paste the manifests below, since you need to set a value for the BACKEND_API_BASE_URL environment variable in the Deployment manifest. You can get the Backend URL from the Networking > Routes menu in the OpenShift Console (it will look something like http://object-detection-dashboard-backend-user99-test.apps.cluster-hkr2j.hkr2j.sandbox1307.opentlc.com)

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: object-detection-dashboard-frontend
      labels:
        app: object-detection-dashboard
        app.kubernetes.io/part-of: Dashboard
        app.openshift.io/runtime: "nodejs"
      annotations:
        app.openshift.io/connects-to: '[{"apiVersion":"apps/v1","kind":"Deployment","name":"object-detection-dashboard-backend"}]'
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: object-detection-dashboard
          component: frontend
      template:
        metadata:
          labels:
            app: object-detection-dashboard
            component: frontend
        spec:
          containers:
          - name: frontend
            image: quay.io/luisarizmendi/object-detection-dashboard-frontend:v1
            ports:
            - containerPort: 3000
            env:
            - name: BACKEND_API_BASE_URL
              value: HERE-YOUR-BACKEND-API-BASE-URL-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-DONT-FORGET-TO-COMPLETE
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: object-detection-dashboard-frontend
      labels:
        app: object-detection-dashboard
    spec:
      selector:
        app: object-detection-dashboard
        component: frontend
      ports:
      - protocol: TCP
        port: 3000
        targetPort: 3000
      type: ClusterIP
    ---
    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      name: object-detection-dashboard-frontend
      labels:
        app: object-detection-dashboard
    spec:
      to:
        kind: Service
        name: object-detection-dashboard-frontend
      port:
        targetPort: 3000

Go to Routes in the "Administrator" view in the OpenShift Console and take note of the Backend and Frontend URLs.
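If you prefer the CLI, you can also read the route hostnames with oc (assuming you are logged in to the cluster and working in your USERNAME-test project):

$ oc get route object-detection-dashboard-backend -o jsonpath='{.spec.host}{"\n"}'
$ oc get route object-detection-dashboard-frontend -o jsonpath='{.spec.host}{"\n"}'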

You can now test access to the dashboard by navigating to the Dashboard Frontend URL.

The Dashboard application does not use TLS, so the URL must start with http:// and not https://; otherwise you will get an "Application is not available" message even though the pod is already running.

Navigating there in your browser, you should see something like this:

platform cloud dashboard

Edge Applications

To add a new application to a device, we will modify the Fleet Device Spec template. Since we are targeting MicroShift as the platform, we need to add Kubernetes manifests to the Fleet template (you can see why that is here).

So we apply the same logic we saw above for defining OS configuration (manifests are just configuration files on the file system).

You will be able to find all the pre-built Edge apps in the Edge Container Registry under:

  • 192.168.2.200:5000/gitea/object-detection-stream-manager:prod

  • 192.168.2.200:5000/gitea/object-detection-inference-server:prod

  • 192.168.2.200:5000/gitea/object-detection-action:prod

Given that the kustomize files require a complex directory structure, we will use the Git Config Provider option (mentioned above) in Flight Control.

Example 1. Git Config Provider

You can store device configuration in a Git repository such as GitHub or GitLab and let Flight Control synchronize it to the device’s file system by adding a Git Config Provider.

The Git Config Provider takes the following parameters:

  • Repository: the name of a Repository resource defined in Flight Control.

  • TargetRevision: the branch, tag, or commit of the repository to checkout.

  • Path: the subdirectory of the repository that contains the configuration.

  • MountPath: (optional) the directory in the device’s file system to write the configuration to. Defaults to the file system root /.

The Repository resource definition tells Flight Control the Git repository to connect to and which protocol and access credentials to use. It needs to be set up once (see Setting Up Repositories) and can then be used to configure multiple devices or fleets.

Now we can start by creating this Git repository on your GitHub account (we could also have used the private Edge Container Registry host if we had full control over the network; something to bear in mind in case you want to store secrets inside the Git repo).

You can find the full folder structure, with Kubernetes manifest files, for the three Edge apps (and the NVIDIA device plugin) here.

We can now define the Git repo inside Flight Control.

The repo that you find above needs to be forked and customized for your own use case before being used as a source in Flight Control. Specifically, you need to change:

  • the image location in all deployments

  • the ALIVE_ENDPOINT and ALERT_ENDPOINT in the action deployment, so that it points to the Dashboard previously deployed in the Cloud

platform flight control repo1
platform flight control repo2
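For reference, if you prefer the CLI over the GUI shown above, a Repository resource pointing at your fork might look roughly like this (the URL is a placeholder for your fork, and the exact field names should be double-checked against the Flight Control documentation):

apiVersion: v1alpha1
kind: Repository
metadata:
  name: object-detection
spec:
  type: git
  url: https://github.com/<your-github-user>/<your-fork>.git

$ flightctl apply -f repository.yaml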

Now we can modify the Fleet template accordingly and apply the modification once again from your Workstation:

$ flightctl get fleet/my-fleet -o yaml > fleet.yaml
...
    spec:
      ...
      config:
      - gitRef:
          mountPath: /etc/microshift/manifests.d/
          path: /
          repository: object-detection
          targetRevision: main
        name: microshift-manifests

...

You should now see something similar in the Flight Control interface:

platform flight control apps device

You should also see that the new manifests have landed on your Edge Device (if they haven't, try restarting flightctl-agent.service):

$ sudo ls /etc/microshift/manifests.d/

action	inference-server  stream-manager nvidia

Now go ahead and restart MicroShift so that it picks up the new manifests and applies them automatically through kustomize.
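A restart through systemd is enough for MicroShift to load the manifests from /etc/microshift/manifests.d/ (this assumes MicroShift was enabled earlier, as done above):

$ sudo systemctl restart microshift

Finally, check whether the applications are running correctly on MicroShift: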

$ sudo oc --kubeconfig /var/lib/microshift/resources/kubeadmin/kubeconfig get all -n object-detection

NAME                                                                READY   STATUS    RESTARTS   AGE
pod/object-detection-action-deployment-54c6b68b47-h9swt             1/1     Running   0          30m
pod/object-detection-inference-server-deployment-7f99bd57dc-vh6pm   1/1     Running   0          30m
pod/object-detection-stream-manager-deployment-5c95454654-wcpnb     1/1     Running   0          25m

NAME                                                TYPE           CLUSTER-IP      EXTERNAL-IP       PORT(S)          AGE
service/object-detection-inference-server-service   LoadBalancer   10.43.220.243   192.168.122.117   8080:31583/TCP   30m
service/object-detection-stream-manager-service     NodePort       10.43.168.194   <none>            5000:32160/TCP   30m

NAME                                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/object-detection-action-deployment             1/1     1            1           30m
deployment.apps/object-detection-inference-server-deployment   1/1     1            1           30m
deployment.apps/object-detection-stream-manager-deployment     1/1     1            1           30m

NAME                                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/object-detection-action-deployment-54c6b68b47             1         1         1       30m
replicaset.apps/object-detection-inference-server-deployment-7f99bd57dc   1         1         1       30m
replicaset.apps/object-detection-stream-manager-deployment-5c95454654     1         1         1       30m

Integrated Applications Test

If the two previous deployments worked fine, you should see a Device pop up in the Device Monitoring Dashboard, like this:

platform day2 integrated1

We can now interact with the Inference Server and see how the situation changes when we move the camera and focus it on your face.

To be able to reach the Streaming app you might need to open the related firewall port on the Local Edge Device (in this case we are opening the NodePort we defined in the stream-manager service definition):

$ sudo firewall-cmd --zone=public --add-port=32160/tcp --permanent
$ sudo firewall-cmd --reload
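You can quickly verify that the port is now open (assuming the default public zone):

$ sudo firewall-cmd --zone=public --list-ports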

Now navigate to the IP of the Local Edge Device on port 32160, under the video_stream path (for example, http://<local-edge-device-ip>:32160/video_stream).

Since we are not wearing any helmet at the moment, this should be the result:

platform day2 integrated2
platform day2 integrated3

Should the Edge Device disconnect from the Cloud, you would see something like this:

platform day2 integrated4

OS Update

You can update a device's OS by updating the target OS image name or version in the device's specification. The next time the agent checks in, it learns of the requested update and automatically starts downloading and verifying the new OS version in the background. It then schedules the actual system update to be performed according to the update policy. When the time comes to update, it installs the new version alongside the running one and reboots into it.

In this case we are going to generate a new image that includes the cockpit-ostree package, which will allow us to graphically visualize the OSTree update status of the system in Cockpit.

$ flightctl get fleet/my-fleet -o yaml > fleet.yaml
...
    spec:
      updatePolicy:
        downloadSchedule:
          at: "* 9-16 * * *"
        updateSchedule:
          at: "0 9 * * 6"
      applications: []
...

In this case we have decided to download updates every day during working hours, and to reboot into the updated version only on Saturdays, because we know that shift is going to be covered by the Ops team.
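For reference, the two schedules use standard cron syntax (the fields are minute, hour, day of month, month, day of week):

* 9-16 * * *   -> every minute between 09:00 and 16:59, every day (the download window)
0 9 * * 6      -> at 09:00 on Saturdays (the update/reboot window)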

Let’s now add the required package

FROM {container-registry-gitea}/{id}/nvidia:0.0.2

RUN dnf -y install cockpit-ostree; \
    dnf -y clean all

and build and push the new image to the Local Edge Registry (again using either the Shared Edge Builder or the VM on your Local Edge Device).

$ sudo podman build -t 192.168.2.200:5000/ID/nvidia:0.0.2 .
$ sudo podman push 192.168.2.200:5000/ID/nvidia:0.0.2
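Optionally, you can confirm that the tag has landed in the registry. Assuming the registry exposes the standard OCI/Docker v2 API over plain HTTP without authentication, a quick check could be:

$ curl -s http://192.168.2.200:5000/v2/ID/nvidia/tags/list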

We can now change the Fleet definition so that it points to the new OS image reference.

$ flightctl get fleet/my-fleet -o yaml > fleet.yaml
...
    spec:
      applications: []
      os:
        image: {container-registry-gitea}/{id}/nvidia:0.0.2
      config:

and wait for the agent to take care of downloading the new image.

You should also be able to see that happening in the Dashboard.

platform flight control os update1
platform flight control os update2

Mind you, it might take a while for the agent to download the image (should the agent time out, you can simply restart the service with systemctl restart flightctl-agent.service). After a while you should see that the Device is ready for reboot.

platform flight control os update3

We will trigger the reboot manually. After that you should be able to see a new section inside Cockpit.

SELinux might be blocking your Cockpit interface; you can temporarily switch it to permissive mode with setenforce 0.
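If you hit this, the quickest workaround is:

$ sudo setenforce 0

and you can restore enforcing mode with sudo setenforce 1 when you are done.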
platform flight control os update4

Device Monitoring

One important element of Edge Device management is preventive monitoring and maintenance: most devices are truly remote, and losing one without prior notice may mean losing visibility over a whole remote site.

You can set up monitors for device resources and define alerts when the utilization of these resources crosses a defined threshold in Flight Control. When the agent alerts the Flight Control service, the service sets the device status to "degraded" or "error" (depending on the severity level) and may suspend the rollout of updates and alert the user as a result.

Note this is not meant to replace an observability solution.

Resource monitors take the following parameters:

  • MonitorType: the resource to monitor. Currently supported resources are "CPU", "Memory", and "Disk".

  • SamplingInterval: the interval at which the monitor samples utilization, specified as a positive integer followed by a time unit ('s' for seconds, 'm' for minutes, 'h' for hours).

  • AlertRules: a list of alert rules.

  • Path: (Disk monitor only) the absolute path to the directory to monitor. Utilization reflects the filesystem containing the path, similar to df, even if it’s not a mount point.

Alert rules take the following parameters:

  • Severity: the alert rule’s severity level out of "Info", "Warning", or "Critical". Only one alert rule is allowed per severity level and monitor.

  • Duration: the duration over which resource utilization is measured and averaged when sampling, specified as a positive integer followed by a time unit ('s' for seconds, 'm' for minutes, 'h' for hours). Must be larger than the sampling interval.

  • Percentage: the utilization threshold that triggers the alert, as percentage value (range 0 to 100 without the "%" sign).

  • Description: a human-readable description of the alert. This is useful for adding details that might help with debugging.
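For illustration, a Disk monitor built from the parameters above could look like the following sketch (the monitored path and the thresholds are just examples):

      resources:
      - monitorType: Disk
        samplingInterval: 30s
        path: /sysroot
        alertRules:
        - severity: Warning
          duration: 30m
          percentage: 75
          description: Disk space usage is high, consider cleaning up unused images.
        - severity: Critical
          duration: 10m
          percentage: 90
          description: Disk space usage is critically high.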

We are going to add a simple CPU monitor to the local NVIDIA Device and set its threshold so low that it is guaranteed to trigger alerts. But first, let's examine the Device definition inside Flight Control.

$ flightctl get device/cfq3nqurpqqhc91rs4sunh4a133dg3rlnntq9r7kfqr61rtmud60 -o yaml

apiVersion: v1alpha1
kind: Device
metadata:
  annotations:
    device-controller/renderedVersion: "6"
    fleet-controller/renderedTemplateVersion: "2025-01-31T11:05:54.073774434Z"
    fleet-controller/templateVersion: "2025-01-31T11:05:54.073774434Z"
  creationTimestamp: "2025-01-31T10:41:41.451373Z"
  generation: 2
  labels:
    alias: nvidia-agx-vm
    gpu: "true"
    location: home
  name: cfq3nqurpqqhc91rs4sunh4a133dg3rlnntq9r7kfqr61rtmud60
  owner: Fleet/my-fleet
  resourceVersion: "64"
spec:
  applications: []
  config: []
  os:
    image: osbuild.lmf.openshift.es:5000/lmf/nvidia:0.0.2
status:
  applications: []
  applicationsSummary:
    info: No application workloads are defined.
    status: Healthy
  conditions:
  - lastTransitionTime: "2025-01-31T10:43:55.27613807Z"
    message: 'Updated to desired renderedVersion: 2'
    reason: Updated
    status: "False"
    type: Updating
  - lastTransitionTime: "2025-01-31T10:41:41.498099133Z"
    message: ""
    reason: Valid
    status: "True"
    type: SpecValid
  config:
    renderedVersion: "2"
  integrity:
    summary:
      status: ""
  lastSeen: "2025-01-31T10:44:54.403698984Z"
  lifecycle:
    status: Unknown
  os:
    image: osbuild.lmf.openshift.es:5000/lmf/nvidia:0.0.2
    imageDigest: sha256:cf1221f4fc7d3618be3542fa5f55d4495c499d59b22a60c8c6ee64c7645a167f
  resources:
    cpu: Healthy
    disk: Healthy
    memory: Healthy
  summary:
    info: Did not check in for more than 5 minutes
    status: Unknown
  systemInfo:
    architecture: arm64
    bootID: |
      3aac7f9e-2998-452e-bfc4-a1728914d279
    operatingSystem: linux
  updated:
    info: The device has been updated to the latest device spec.
    status: UpToDate

As we can see in the Status, the Resources are all healthy (CPU, Disk, Memory).

Since the Device is managed inside a Fleet, we need to update the Fleet template:

$ flightctl get fleet/my-fleet -o yaml > fleet.yaml

We can include a simple monitoring snippet

      resources:
      - alertRules:
        - description: CPU Usage high, check for running processes!
          duration: 10m
          percentage: 1
          severity: Warning
        monitorType: CPU
        samplingInterval: 5s

and apply the modified YAML configuration again:

$ flightctl apply -f fleet.yaml

You should now see something changing in the Resource status section

platform flight control monitoring
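You can also verify this from the CLI: once the threshold has been exceeded for the configured duration, the cpu entry under status.resources should no longer report Healthy (device name as in the earlier example):

$ flightctl get device/cfq3nqurpqqhc91rs4sunh4a133dg3rlnntq9r7kfqr61rtmud60 -o yaml | grep -A 3 'resources:'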
