Low-latency video streaming with GStreamer on HoloLens 2

Several of our cases, including our work for IKEA (Altitude Picking) and BD (Medical Imaging), stream video from external cameras into a HoloLens 2, giving the wearer insight into what the camera sees. Whilst designing these solutions, we quickly concluded that most off-the-shelf products couldn’t offer what we were looking for, so we landed on GStreamer, an open-source multimedia framework.

This article walks you through our journey towards low-latency video streaming on the HoloLens 2. Note that it is deeply technical; readers not interested in the details of software development, video streaming, hardware acceleration, and color formats might want to skip it.

Case Introduction

As this article will mainly focus on two cases we’ve built, a very short introduction to the cases themselves is in order.

The first case is BD Medical Imaging, which allows a surgeon to see what a laparoscopic camera sees, inside the HoloLens 2, during surgery. Originally the surgeon viewed this footage on a separate monitor, looking sideways or otherwise away from the patient whilst moving the camera. The HoloLens 2 provided an opportunity here, since it allows the surgeon to keep both the patient and the camera being manipulated in view, reducing the risk of hurting the patient.

The second case is IKEA (Altitude Picking), where a reach truck driver moves a fork upwards to pick items off a shelf. Since the shelves can be high up (9 meters is not uncommon), it is hard for the driver to see exactly what is happening with the fork. Our solution places a camera high up on the fork, giving the driver a view from equally high up, reducing the chances of damaging the shelves or the products and speeding up their workflow.

Latency

It quickly became apparent that not only was video streaming necessary, it also had to have very low latency: under 200 milliseconds, preferably under 150 milliseconds, and ideally even lower. The main reason is physical danger: the wearer of the HoloLens 2 must be able to react quickly to changing circumstances. Picture a surgeon operating on a patient with the video delayed by even a second ‒ in that time the surgeon might already have hurt the patient without noticing it.

Note also that this is end-to-end latency, meaning the time between something happening in the real world and the wearer of the HoloLens 2 seeing it must be minimized in its entirety.

Technological Constraints

The problem is not easy to begin with, and the technical constraints we need to take into account make it harder still.

UWP

The HoloLens 2 runs Windows 10 Holographic, a special variant of Windows 10 that, unlike desktop Windows, only allows code to run in a UWP (Universal Windows Platform) context.

Windows 10 is a very common operating system, but UWP often seems to be overlooked, or simply not tested, by software libraries. This is not surprising: it is a more constrained environment, traditional non-UWP applications are still by far the most common, and UWP is not a hard requirement on desktop Windows ‒ even the matching Microsoft Store requirement was lifted recently. We regularly feel this pain in other software libraries too, where we occasionally contribute fixes for UWP.

ARM64

The HoloLens 2 is powered by the Snapdragon 850 Mobile Compute Platform, meaning it uses the ARM64 processor architecture. ARM64 is common nowadays in smartphones and low-cost boards such as the Raspberry Pi, but those devices usually run different operating systems: iOS, Android, or Linux directly. Those ecosystems are either common and mature or ‒ in the case of Linux ‒ backed by a strong multi-architecture open-source community, so fixes for these platforms have either already been contributed or are straightforward to make.

Unity

At Mr. Watts we build our HoloLens 2 user interfaces with the Unity engine and the MRTK, so the solution we choose should preferably work with Unity ‒ if not out of the box, then through a custom integration.

Network Streaming

Next to this, for both cases the camera cannot be attached directly to the HoloLens 2: either the USB cable would become too long to maintain the bandwidth needed for the data transfer, or no wires may be present at all to avoid obstructing the wearer. This means streaming must happen over the network, by ethernet cable or Wi-Fi, and that we need a device that can capture the data from the camera and transmit it.

Additionally, for the IKEA case, a stereoscopic 3D camera was preferred so the HoloLens 2 could offer the stereoscopic depth perception natural to humans, yielding a combined resolution of 3840×1080 (stereo 1080p) at 30 FPS. This means more bandwidth and a stream that is more taxing to decode, process, and render ‒ and thus higher latency if we don’t take precautions.
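
To get a feel for the numbers: uncompressed YUY2 at that resolution uses 2 bytes per pixel, so 3840 × 1080 × 2 bytes × 30 frames/s ≈ 249 MB/s, or roughly 2 Gbit/s, before any protocol overhead. That is far more than a Wi-Fi link can reliably sustain, which is why efficient encoding on the sender is not optional.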

Hardware Video Decoding

As if the Windows UWP ARM64 requirement weren’t enough, handling video efficiently requires help from the hardware ‒ meaning hardware video acceleration. Without it, processing happens on the CPU, which is less suited to the task and would prevent us from descending to the beckoning depths of the low latency we so desperately crave.

Handling video efficiently is also a problem we must solve for both sender and receiver, as they both influence the end-to-end latency. This opens another can of worms:

  • We need to use appropriate APIs on both sides to facilitate hardware video encoding and decoding.
    • For the Nvidia Jetson Nano, this could mean FFmpeg or GStreamer.
    • For the HoloLens 2, this means the video acceleration parts of DirectX 11/12, or alternatively Windows Media Foundation.
  • We need a codec for the stream that can be handled efficiently on both sides.
    • For the Nvidia Jetson Nano, this means h.264 or h.265 (HEVC). Encoding VP8 and VP9 could not be accelerated at the time we were investigating this.
    • For the HoloLens 2, this means h.264 or h.265 (HEVC).
  • We need a video format that can be handled efficiently on both sides.
    • For the Nvidia Jetson Nano, the camera output was YUY2, but the Jetson can efficiently convert to most common formats if necessary.
    • For the HoloLens 2, we weren’t exactly sure what this meant as it depends on what is exposed in DirectX, supported by the GPU, and what the library we’d choose supports.
  • We need to avoid copying data on the CPU as much as possible.

Towards a Solution

With the above requirements in mind, we started evaluating several off-the-shelf solutions. For the camera capturing bit, we quickly landed on the Nvidia Jetson Nano, an ARM64 device running Ubuntu Linux. We chose it for its video acceleration capabilities, which are discussed in more detail below.

For the HoloLens 2, several readily available proprietary software solutions, as well as some free and open-source ones, didn’t make the cut because they couldn’t do ARM64 UWP. Others we tried streamed video just fine but got nowhere near the latency we needed ‒ likely due to poor format negotiation and falling back to shuffling buffers around on the CPU ‒ and didn’t allow much customization. It was obvious that this was going to be a tough problem to solve.

We started looking at the libraries behind media players and frameworks that have been doing their jobs well for a long time, such as VLC and GStreamer. It turned out VLC offers libVLC for this, but we ended up preferring GStreamer: not long before we started investigating, GStreamer 1.18 had been released, shipping binaries for ARM64 UWP thanks to work contributed to GStreamer by Centricular and demoed at the GStreamer Conference 2019. This seemed like a very promising starting point.

GStreamer was especially interesting because it could cover both the sender and the receiver side. Nvidia already provides so-called “elements” for GStreamer that encode camera data efficiently, and GStreamer provides additional elements to transmit that data over the network afterwards. We figured that, on the HoloLens 2 side, a reverse GStreamer “pipeline” could then ingest said data, decode it, and hand it to Unity for rendering.

Serendipitously, we also landed on mrayGStreamerUnity (by Yamen Saraiji) to integrate GStreamer into Unity: a Unity C++ plugin that bridges GStreamer’s C API to Unity’s underlying rendering back end ‒ Direct3D 11 in our case. We also knew that GStreamer can deal with Direct3D 11 directly on Windows. Great! Then we have everything we need, right? Unfortunately, we don’t.

Headaches

On the Nvidia Jetson Nano, we managed to get a working hardware-accelerated GStreamer pipeline relatively quickly. This pipeline captures camera frames, encodes them with h.264, and transmits the data over the network.
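
For illustration, a minimal sender along these lines can be built with gst_parse_launch. The element names below (v4l2src, nvvidconv, nvv4l2h264enc, rtph264pay, udpsink) are stock GStreamer elements on the Jetson, but the caps, properties, and addresses are placeholders rather than our exact production pipeline:

    #include <gst/gst.h>

    int main(int argc, char *argv[]) {
      GError *error = NULL;

      gst_init(&argc, &argv);

      /* Capture YUY2 frames, convert and encode them on the Jetson's hardware
       * blocks, and send the resulting h.264 stream out as RTP over UDP. */
      GstElement *pipeline = gst_parse_launch(
          "v4l2src device=/dev/video0 ! video/x-raw,format=YUY2 ! "
          "nvvidconv ! video/x-raw(memory:NVMM) ! "
          "nvv4l2h264enc insert-sps-pps=true iframeinterval=30 ! "
          "h264parse ! rtph264pay config-interval=-1 ! "
          "udpsink host=192.168.1.10 port=5000 sync=false",
          &error);
      if (pipeline == NULL) {
        g_printerr("Failed to build pipeline: %s\n", error->message);
        return 1;
      }

      gst_element_set_state(pipeline, GST_STATE_PLAYING);
      g_main_loop_run(g_main_loop_new(NULL, FALSE)); /* run until interrupted */
      return 0;
    }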

On the HoloLens 2 side, the GStreamer Unity plugin mentioned above looked like it would do most of the heavy lifting, but it proved problematic because UWP restricts the way GStreamer can load its plugins. We knew GStreamer itself had UWP support, but we still needed to ship the libraries with the Unity application in such a way that GStreamer and the GStreamer Unity plugin could pick them up, without violating their license. This took us some time to figure out and stabilize.
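
As a rough idea of what that involves (a minimal sketch, not our exact mechanism, and assuming environment variables behave as expected under UWP): before initializing GStreamer inside the application, it needs to be pointed at the plugin DLLs shipped with the Unity build. The plugin_dir parameter below is hypothetical and has to resolve to wherever those DLLs end up inside the installed application package:

    #include <glib.h>
    #include <gst/gst.h>

    /* Sketch: tell GStreamer where to scan for its plugin DLLs before
     * initializing it; plugin_dir is a hypothetical path inside the
     * installed UWP application package. */
    static void init_gstreamer(const gchar *plugin_dir)
    {
      g_setenv("GST_PLUGIN_PATH", plugin_dir, TRUE);
      gst_init(NULL, NULL); /* scans the path and builds the plugin registry */
    }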

Also, the GStreamer Unity plugin works by taking I420 data provided by a GStreamer pipeline, copying it into a Unity (RGB) texture, and applying a fragment shader to convert the I420 data to RGB. The latter part is clever, as it offloads the color format conversion to the GPU. It also avoids sending RGB data over the network; doing so would allow dropping the shader, but RGB is bulkier than YUV formats such as I420, NV12, and YUY2. That is also one of the reasons cameras tend to output YUV formats: they usually can’t saturate the precision RGB allows, so YUV lowers the bandwidth of their output.

As such, a conversion to I420 was part of most of the example pipelines, to ensure the incoming data always had the I420 format the plugin expects. Letting the sender send I420 should then allow dropping this conversion, right? Nope: the colors were now incorrect ‒ a (re)negotiation seemed to be happening on the HoloLens 2, which did not like I420. This also meant that the I420 conversion we had removed from the pipeline on the HoloLens 2 was not a no-op: it made processing the video more taxing and thereby increased the latency, so it had to go.

Fun Negotiating Color Formats

Digging into the supported formats of the GStreamer decoding elements available on Windows revealed that the one we were using does not support all color formats, and that which ones it does support depends on the GPU.
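
The command-line tool gst-inspect-1.0 shows this information, and the same can be queried programmatically. Below is a sketch that prints the pad template caps of a decoder element; we assume here that the decoder is d3d11h264dec from GStreamer’s d3d11 plugin (introduced in 1.18), and the formats offered at runtime can still be narrower depending on the GPU:

    #include <gst/gst.h>

    int main(int argc, char *argv[]) {
      gst_init(&argc, &argv);

      GstElementFactory *factory = gst_element_factory_find("d3d11h264dec");
      if (factory == NULL) {
        g_printerr("Decoder element not found in this GStreamer build\n");
        return 1;
      }

      /* Each pad template advertises the caps (formats) the element is
       * willing to negotiate on that pad. */
      const GList *templates = gst_element_factory_get_static_pad_templates(factory);
      for (const GList *l = templates; l != NULL; l = l->next) {
        GstStaticPadTemplate *tmpl = (GstStaticPadTemplate *) l->data;
        GstCaps *caps = gst_static_pad_template_get_caps(tmpl);
        gchar *caps_str = gst_caps_to_string(caps);

        g_print("%s pad '%s': %s\n",
                tmpl->direction == GST_PAD_SRC ? "src" : "sink",
                tmpl->name_template, caps_str);

        g_free(caps_str);
        gst_caps_unref(caps);
      }

      gst_object_unref(factory);
      return 0;
    }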

Yes, each GStreamer element can decide which formats it supports and negotiates, as if it weren’t hard enough already to find a suitable combination ticking all the boxes. Long story short: the format the HoloLens 2 wanted was NV12. This meant going from YUY2 (provided by the camera) to NV12 (desired by the HoloLens 2), but at least that conversion could be done by the Nvidia Jetson, which is well suited to tasks like this.
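
With that settled, the receiving pipeline on the HoloLens 2 takes roughly the following shape. The tail of the real pipeline is owned by the Unity plugin, so an appsink stands in for it here, and the port, caps, and properties are illustrative rather than our exact configuration:

    #include <gst/gst.h>

    /* Sketch of the receiver: take RTP/h.264 from the network, depacketize,
     * decode in hardware via Direct3D 11, and hand NV12 frames to the
     * application (in our case, the GStreamer Unity plugin). */
    static GstElement *build_receiver(void)
    {
      return gst_parse_launch(
          "udpsrc port=5000 caps=\"application/x-rtp,media=video,"
          "encoding-name=H264,clock-rate=90000,payload=96\" ! "
          "rtpjitterbuffer latency=50 ! "             /* small jitter buffer to cap latency */
          "rtph264depay ! h264parse ! "
          "d3d11h264dec ! video/x-raw,format=NV12 ! "
          "appsink name=videosink sync=false",
          NULL);
    }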

After changing our sender pipeline to provide NV12 and removing the I420 conversion from the receiver, we had a new problem: the GStreamer Unity plugin still assumed all incoming data was I420, causing crashes. No problem, we’d just fix that… and… nope, the colors were still incorrect. We were now dropping NV12 pixel data into an RGB Unity texture instead of I420, and the I420 shader was still operating on it. To fix this, we had to write a new fragment shader that converts NV12 to RGB.
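
To illustrate what such a shader has to do, here is the conversion for a single pixel expressed as plain C. The real implementation is an HLSL fragment shader running on the GPU, and the exact coefficients depend on the colorimetry of the stream; the ones below are the common BT.601 limited-range integer approximation:

    #include <stdint.h>

    /* NV12 stores a full-resolution Y plane followed by a half-resolution,
     * interleaved UV plane; one U/V pair is shared by a 2x2 block of luma
     * samples. Convert the sample at (x, y) to 8-bit RGB. */
    static void nv12_pixel_to_rgb(const uint8_t *y_plane, const uint8_t *uv_plane,
                                  int stride, int x, int y,
                                  uint8_t *r, uint8_t *g, uint8_t *b)
    {
      int Y = y_plane[y * stride + x];
      int U = uv_plane[(y / 2) * stride + (x / 2) * 2 + 0];
      int V = uv_plane[(y / 2) * stride + (x / 2) * 2 + 1];

      int c = Y - 16, d = U - 128, e = V - 128;
      int R = (298 * c + 409 * e + 128) >> 8;
      int G = (298 * c - 100 * d - 208 * e + 128) >> 8;
      int B = (298 * c + 516 * d + 128) >> 8;

      *r = (uint8_t) (R < 0 ? 0 : R > 255 ? 255 : R);
      *g = (uint8_t) (G < 0 ? 0 : G > 255 ? 255 : G);
      *b = (uint8_t) (B < 0 ? 0 : B > 255 ? 255 : B);
    }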

It Works! But...

And there it was! A sweet, correctly colored video stream. But… whilst latency was already under 200 milliseconds, the stream was not particularly smooth, and we wondered whether the situation could be improved further.

What followed was a slew of optimizations: avoiding lock contention in the multi-threaded parts, plus micro-optimizations to avoid or reduce memory copies wherever possible. And finally: goal achieved! Around 100 milliseconds of latency for our stereo 1080p video stream on the HoloLens 2.

But Can We Do More?

We’re very happy with the latency we managed to achieve for these cases, and even more so that we were able to do it using free and open-source software such as GStreamer and mrayGStreamerUnity. We’d like to thank everyone who worked on making these projects awesome.

Whilst the latency we achieved is already nice, we’re confident that, given more time, we could squeeze out more by optimizing further. Some ideas that come to mind:

  • Switch to h.265 (HEVC). At the time we built our cases this did not result in improved performance, possibly because these paths in GStreamer had not been optimized yet as much as h.264, or due to missing hardware decoders on the HoloLens 2. (The situation might have changed in the meantime.)
  • Try to avoid the CPU-side copy of the NV12 data (decoded from the h.264 frames and received from GStreamer) into a Unity texture, by letting GStreamer decode directly into a DirectX texture and sharing that texture with Unity.

Summary

Whilst this article was very technical, we hope you enjoyed reading it. If you work for a company looking to build a similar case, or are a developer building similar solutions, we hope this provides some insight and ideas on how to achieve the same, as we realize this information is not easy to come by in the still-young mixed reality ecosystem of today.

Need help or more information on developing video streaming solutions, using GStreamer, or integrating GStreamer with Unity? We also offer consulting services.