Practical immersive volumetric video for VR and virtual production, with layered depth images and Stable Diffusion

Lifecast Incorporated
Apr 6, 2023

Lifecast makes software for practical immersive volumetric video, for virtual reality and virtual production, with a dash of AI-generated art. We are excited to announce a big step forward, the LDI3 format. Today we are unpacking what that means, showing some demos, and going into the technical details.

Here are the demos. You can watch them on a 2D screen in a browser, on a phone, or access these URLs from a VR web browser on Quest 2/Quest Pro to experience them in VR (no software install is required):

If you are interested in creating videos like this with Lifecast, for VR or virtual production, please contact info@lifecastvr.com. We present results here as they come out of our software, with no manual post production (although that is possible and can further improve results).

Lifecast’s video player has always been open source under the MIT license. WebXR and Unreal Engine 5 versions of the player for the new LDI3 format introduced in this article are on Lifecast’s Public GitHub repo.

In virtual reality photos and video, there is a spectrum of formats with different degrees of immersion and correctness in 3D rendering (which is important for a comfortable viewing experience). Some are 2D, and others are stereoscopic pseudo-3D (a different image for the left and right eye, wrapped around a sphere or half-sphere). VR videos and photos typically respond to rotation of a user’s head (3 degrees of freedom, a.k.a. 3DOF), but do not respond to moving side to side, forward and backward, or up and down. In contrast, VR games and a tiny fraction of VR videos support 6 degrees of freedom (6DOF), which allows a user to move in all directions and see correct 3D images regardless of how they move. 3DOF VR video formats provide a suboptimal experience for viewers if they move their head at all, or look anywhere other than directly at the horizon, and these issues can cause motion sickness, double vision, and eye strain. 6DOF VR video mitigates these issues, but creating 6DOF VR video is much harder. Even big tech companies like Meta and Google have thus far not delivered a practical solution. It is hard because it requires a photorealistic 3D model of every frame of video to be estimated from available sensors, compressed, and efficiently rendered in realtime.

An illustration of 6 degrees of freedom (6DOF). [Source]

Virtual production is a powerful new tool for 2D filmmaking. The idea is to have a 3D environment rendered on a huge LED wall behind the actors. The 3D environment is usually modeled in Unreal Engine or Unity. If the film camera moves, the image on the LED wall needs to respond accordingly, which is exactly the same problem as rendering 6DOF for VR. It is time consuming and expensive to make environments that look photorealistic in Unreal and Unity. Lifecast started out making software for virtual reality, but we learned from talking with film-industry professionals that the same technology for 6DOF video is a cost-effective and efficient way of creating photorealistic 3D environments for virtual production. Photogrammetry is similar, but applicable only to static scenes, whereas video allows the virtual environments to feel more alive. In this article we mostly explain our progress for VR. We are bringing the same volumetric video technology to virtual production as well.

An LED wall used for virtual production in “The Mandalorian”. [source]

Lifecast uses the terms “volumetric” and “6DOF” interchangeably. This terminology will offend some, while for others it conveys the idea clearly. In our view, the most precise use of the term volumetric is for 3D scene representations which assign some value to each point in 3D space, such as voxels or neural radiance fields (NeRF). However, it has also become common to refer to RGBD (color + depth map) images and video as “volumetric”, and for volumetric video software to operate on one or more streams of RGBD video. The important thing here is that RGBD allows for 6DOF rendering. What we are unveiling today is like RGBD on steroids; for the scholars, it’s an “inflated equiangular layered depth image”.

What does it mean to be immersive? Some VR videos have 180 degree field of view (half a sphere), while others cover a full sphere. We believe that half a sphere is enough to be immersive, though others insist on a full 360 degree sphere. We are focusing on 180 degree content right now because it offers a favorable set of tradeoffs when considering the entire system, from cameras capturing the video, to processing and compression, to realtime playback.

When done well, 6DOF can be more immersive than 3DOF because the user can move, and correct 3D rendering is more immersive than stereoscopic pseudo-3D. However, 6DOF can also have different visual artifacts that reduce immersion. Overcoming these artifacts by creating a photorealistic 3D model of every frame of video, and being able to render that in real time on limited hardware, is an open problem in computer vision and graphics.

Video is harder than photos. Existing techniques such as photogrammetry and NeRF can produce 6DOF 3D representations of a static scene from a large number of images from different points of view. However, the basic formulations do not work for parts of a scene that move, and extending them to video is non-trivial and involves tradeoffs with practicality. For example, prior work from Meta and Google on immersive volumetric video uses custom camera arrays with 24 or 46 cameras, in order to have many images of the scene, all captured at the same moment in time. Unfortunately, working with this many cameras isn’t very practical.

Facebook’s (now Meta’s) prototype Surround360 x24, with 24 cameras.
Google’s 46 camera array from “Immersive light field video with a layered mesh representation,” Broxton et al.

Academic publications provide an inspiring window into the future, but so far volumetric video for VR hasn’t become mainstream because it isn’t practical to create, edit, or watch. Lifecast believes the elements of a practical solution include:

  • It is possible to capture anything, anywhere, not just in a controlled environment.
  • It is possible to capture using reliable off-the-shelf cameras.
  • The amount of data captured is not prohibitively large.
  • The data can be processed into a volumetric representation in a reasonable amount of time.
  • The volumetric video can be edited using existing tools such as Adobe Premiere.
  • The video can be compressed efficiently and streamed over the internet.
  • The player runs in real time on the most popular, widely available mobile VR devices, which have relatively little GPU power compared with desktop VR systems.
  • The player runs on the web (not just standalone applications).
  • The player can be mixed into Unreal and Unity projects.

Lifecast’s software is designed with all of these goals in mind. Our approach is to work within the limitations of current hardware, and use more machine learning.

Recently, the Canon EOS R5 with dual fisheye lens has emerged as a category-redefining VR camera which can capture cinematic quality VR180 footage in 8K resolution. We developed a new pipeline for processing volumetric video which works with this camera, or any other VR180 camera (some other top-notch VR180 cameras include the FM Duo by FXG, and the K2 Pro by Z-Cam).

A Canon EOS R5 with dual fisheye VR lenses, capable of filming cinematic quality 8K VR180 video, portable, and available off-the-shelf.

Other approaches to volumetric video, such as light stages, use many cameras (sometimes hundreds) facing inward to capture a detailed 3D model of a person. Such systems have many uses, but they cannot capture fully immersive scenes on their own, and are not applicable to filming volumetric video in any location. RGBD depth sensors such as the Azure Kinect have limited capabilities outdoors, and insufficient field of view and resolution for VR. Lifecast makes volumetric video using VR180 cameras, which have sufficient resolution and field of view for VR, but require more machine learning to process the data. Our first-generation pipeline for converting VR180 video to volumetric/6DOF shipped over a year ago. Since then, we have improved the visual quality of the results significantly.

To create a practical immersive volumetric video format, Lifecast works within several constraints. Video should be compressed using a standard format such as h264, h265, or ProRes, and stored in a standard container such as .mp4 or .mov. This is in contrast with representations which store large amounts of auxiliary data, such as triangle meshes alongside some video data, or entirely custom formats. One good reason to work with standard video formats is that they can be edited using existing video software like Premiere, After Effects, and Resolve. Another is that modern computers have energy-efficient dedicated chips for decoding high resolution video, and it is bad for battery life, if it is possible at all, to decode that many pixels on a general-purpose CPU or GPU. Lifecast’s solution stores the volumetric video in a standard video file, using a carefully crafted encoding.

Even with desktop GPUs, there is a limit on the maximum resolution of video that can be decoded. On a Quest 2 or Quest Pro, the limit is 5760x5760 (that’s 33 million pixels per frame at 30 or 60 frames per second), and the limits aren’t much higher even on the most powerful desktop GPU. For volumetric video, we must spend our pixels wisely.

Even in 2D, it has always been a challenge to make VR videos and photos look clear because we have to stretch a limited number of pixels to cover a sphere, or half a sphere. There just aren’t enough pixels to go around. A “projection” is a particular formula for wrapping a rectangular image around a sphere. For example, the most widely used projection for VR video and photos is equirectangular (which is also used to make maps of the earth).

An equirectangular projection map of the earth. [source]
An equirectangular projection 360 photo.
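Concretely, the equirectangular projection is just a linear map from longitude and latitude to pixel coordinates. A minimal sketch (our own variable names, for illustration only):

```python
import numpy as np

def equirect_to_pixel(lon, lat, width, height):
    """Map longitude/latitude in radians to pixel coordinates in an
    equirectangular image covering the full sphere.
    lon is in [-pi, pi], lat is in [-pi/2, pi/2]."""
    u = (lon + np.pi) / (2 * np.pi) * width   # linear in longitude
    v = (np.pi / 2 - lat) / np.pi * height    # linear in latitude
    return u, v
```

Because every row of pixels spans the same range of longitudes, regions near the poles get far more pixels per unit of solid angle than the middle of the image, which matters for VR180 as discussed below.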

The VR180 format for VR videos and photos has recently become popular because it provides a good set of tradeoffs in resolution, field of view, and ease of stitching. VR180 consists of one image for the left eye, and one for the right eye, in equirectangular projection, but each eye gets 180 degrees (half a sphere) instead of a whole sphere.

A VR180 photo consists of one image for the left eye and one for the right, in equirectangular projection, with 180 degree field of view. Photo by Thomas Hübner.

Lifecast’s software takes VR180 videos and photos as input, and enhances them into our volumetric representation using machine learning and computer vision. As part of this transformation, we also modify the projection, so we can provide higher visual quality while working within the video pixel budget.

The VR industry has standardized on equirectangular projection for VR180, but unfortunately equirectangular projection puts more pixels at the edge and fewer pixels in the middle of the scene, which is the opposite of what we want. VR180 only produces correct 3D when the user is facing directly forward and looking at the horizon; as the user looks farther toward the edges, the stereoscopic rendering becomes increasingly incorrect, so it is a shame to spend most of the pixels on that part of the scene.

Instead of equirectangular, Lifecast’s LDI3 format uses equiangular (a.k.a. f-theta projection), and we “inflate” the equiangular projection, to put even more detail in the middle. The images below illustrate the different projections. With the Lifecast format, we can also choose to trade FOV for pixel density by zooming in on the equiangular projection.

Equirectangular projection. This is half of a VR180 photo (just the part corresponding to the right eye). Notice the small couple in the center of the image. Photo by Thomas Hübner.
The same image in “inflated” equiangular projection. Notice that the center is magnified, while the edges are squished.
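To make this concrete, here is a rough sketch of how an f-theta style mapping with an extra radial remap can concentrate pixels in the middle. The inflation exponent below is purely illustrative; Lifecast’s actual inflation function is not spelled out in this article.

```python
import numpy as np

def inflated_equiangular_ray(u, v, size, fov=np.pi, inflate=1.5):
    """Map an output pixel (u, v) in a size x size image to a unit ray.

    Plain equiangular (f-theta): the angle from the optical axis is
    proportional to the radial distance from the image center.
    "Inflation" (illustrative only): raising the normalized radius to a
    power greater than 1 makes the angle grow slowly near the center, so
    the middle of the scene gets more pixels than the edges.
    """
    x = 2.0 * u / size - 1.0            # normalized coordinates in [-1, 1]
    y = 2.0 * v / size - 1.0
    r = np.sqrt(x * x + y * y)          # normalized radius from the center
    theta = (fov / 2.0) * r ** inflate  # angle from the forward axis
    phi = np.arctan2(y, x)              # angle around the forward axis
    return np.array([np.sin(theta) * np.cos(phi),   # unit ray, +z forward
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])
```

Resampling the source VR180 image along rays like these produces the inflated projection; with inflate = 1 this reduces to ordinary equiangular.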

So far we have seen what these different projections look like, but to better illustrate why this matters, we will crop the center 512x512 pixels out of a 4400x4400 equirectangular image (one eye of “8K” VR180 resolution), and out of a 1920x1920 inflated equiangular image.

The center 512x512 pixels from the original equirectangular VR180 image, which is 8800x4400 (i.e. slightly above “8K” resolution).
The center 512x512 pixels from the inflated equiangular projection at 1920x1920 resolution.

The two images above look nearly identical. By using a better projection, we squeeze an effective 8K pixel density in the middle (and less at the edges), with close to 180 degree field of view, into 1920x1920 pixels. This is important because we can fit 9 of these into a 5760x5760 video file, which means we can do 3 layers, each with a depth map and alpha channel.

Neural radiance fields (NeRF) have recently emerged as the new state of the art in reconstructing 3D scenes from multiple images, and providing photorealistic rendering of novel views. However, the original NeRF formulation has many limitations which make it unsuitable for practical use with video. Some limitations include:

  • Needs many images of a static scene or many cameras to capture the same moment
  • Slow processing to estimate a NeRF
  • Not compressible using standard video formats
  • Not trivial to render in real time

Some of these limitations have been overcome independently, but solving them all at once remains a challenge. We are taking inspiration from NeRF in some ways, but we remain focused on what is practical to deploy today.

“Layered depth images” by Jonathan Shade, Steven J. Gortler, Li-wei He, and Richard Szeliski, SIGGRAPH 1998, introduces a foundation that we build upon to achieve a practical volumetric video representation. A layered depth image (LDI) consists of a collection of RGBDA layers, each of which has a color (RGB), a depth map (D), and an alpha channel (A). Advanced readers may notice the similarities in the rendering equations of NeRF and LDIs; each essentially blends a sequence of colors from samples along a ray, according to an alpha value. Attempts to accelerate NeRF often involve sampling fewer points and choosing those points efficiently by skipping empty space or predicting where important samples are likely to be. LDI can be thought of as an efficient way of approximating such a process.
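Written out, the correspondence looks like this (these are the standard textbook forms of the two rendering equations, not quotes from either paper):

$$
C_{\mathrm{LDI}} \;=\; \sum_{i=1}^{L} \alpha_i \, c_i \prod_{j<i} (1 - \alpha_j)
\qquad\qquad
C_{\mathrm{NeRF}} \;=\; \sum_{i=1}^{N} \bigl(1 - e^{-\sigma_i \delta_i}\bigr)\, c_i \prod_{j<i} e^{-\sigma_j \delta_j}
$$

Here c_i and α_i are the color and alpha of layer i where a ray crosses that layer’s depth surface (layers ordered front to back), while σ_i and δ_i are NeRF’s predicted density and sample spacing along the ray. Substituting α_i = 1 − exp(−σ_i δ_i) makes the two expressions identical in form. The practical difference is that an LDI fixes a tiny number of samples per ray (L = 3 for LDI3) at pre-computed depths, which is what makes realtime rendering on a mobile GPU feasible.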

The new Lifecast LDI3 format is an LDI with three layers, in inflated equiangular projection. We have increased the number of layers from two to three because more layers can better represent complex scenes and allow for appropriate in-painting within each layer. Going up to four layers would be an unfavorable tradeoff in image quality given a 5760x5760 resolution limit. We tested three layers a year ago on a Quest 2, and it wasn’t realtime, but since then Meta has shipped some optimizations (particularly for WebXR) which make three-layer rendering possible. In particular, fixed foveated rendering is a necessary optimization at this point (although it can result in pixelation artifacts in the user’s peripheral vision). We appreciate these efforts and look forward to further optimizations in Meta’s WebXR stack improving the experience for users of Lifecast’s VR video player.

Previously, Lifecast released our 2-layer 6DOF format, which is essentially an LDI with 2 layers, in equiangular projection (not inflated), but the alpha channels are not stored explicitly in the video; instead they are computed in realtime on the GPU as we decompress the video. The advantage of not storing an alpha channel explicitly is that we don’t have to spend any pixels on that, and we can use those pixels to make the images look clearer instead. The disadvantage is that there is only so much a mobile GPU can do in realtime to compute alpha channels. With LDI3, we now store the alpha channels in the video, which means we can spend more time offline to compute better alpha channels, and then render them at low cost in realtime.

Lifecast’s 2-layer format. The top row is the foreground layer, and the bottom row is the background layer. The left column is an image, and the right column is a depth map.
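To illustrate what “computing alpha channels in realtime” could involve, here is a toy sketch that derives a soft foreground alpha from discontinuities in a decoded depth map. This is a hypothetical illustration of the general idea, not Lifecast’s actual shader logic (which runs on the GPU and is not detailed in this article).

```python
import numpy as np

def alpha_from_depth_edges(depth, edge_scale=0.05):
    """Toy example: make the foreground layer transparent near large depth
    discontinuities so the background layer shows through at object edges.
    NOT Lifecast's actual method; purely to illustrate the tradeoff of
    computing alpha at playback time versus storing it in the video."""
    gy, gx = np.gradient(depth)           # depth gradients per pixel
    edge = np.sqrt(gx ** 2 + gy ** 2)     # discontinuity strength
    return np.clip(1.0 - edge / edge_scale, 0.0, 1.0)
```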

Lifecast’s 2-layer format was designed to address a challenging issue with 6DOF videos and photos. When you give users the ability to move, they try to look behind objects into places where the camera never saw, and they expect to see something plausible.

The most basic way of handling this problem is to ignore it, which is the approach taken in some prior work on RGBD video for VR. Wherever a large discontinuity in depth occurs (at the edge of an object), the foreground and background are connected with “streaky triangles,” so there is no possibility of looking behind the object. This artifact is distracting for many users, and reduces immersion.

With the 2-layer format, we created some of the first volumetric videos to run on the Quest 2 which go beyond streaky triangles. We used machine learning to imagine what is behind objects, by in-painting a background layer and its depth map. With the LDI3 format, and several enhancements to our pipeline, the AI now does a much better job of filling in the missing data.

This is a frame in the new Lifecast LDI3 format:

Lifecast’s LDI3 format. Top row: foreground layer. Middle row: middle layer. Bottom row: background layer. Left column: images. Middle column: depth maps with 12-bit error correcting code. Right column: alpha channels.

Each row stores one layer’s color (RGB), depth map (D), and alpha channel (A). The middle column looks more complicated than a typical depth map; this is because we use a special encoding to improve depth map fidelity in LDI3.
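As a concrete picture of how the nine 1920x1920 tiles are packed into a 5760x5760 frame, a decoded frame can be sliced apart like this (a sketch that just follows the layout described above, not the player’s actual code):

```python
import numpy as np

LAYERS = ["foreground", "middle", "background"]   # rows, top to bottom
CHANNELS = ["rgb", "depth_ecc", "alpha"]          # columns, left to right

def unpack_ldi3_frame(frame):
    """Slice a decoded 5760x5760 LDI3 frame into its 3x3 grid of
    1920x1920 tiles. Decoding the 12-bit depth from the `depth_ecc`
    tile is a separate step (see the sketch further below)."""
    assert frame.shape[0] == 5760 and frame.shape[1] == 5760
    tile = 5760 // 3
    out = {}
    for row, layer in enumerate(LAYERS):
        out[layer] = {}
        for col, channel in enumerate(CHANNELS):
            out[layer][channel] = frame[row * tile:(row + 1) * tile,
                                        col * tile:(col + 1) * tile]
    return out
```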

Most videos on the web use 8-bit compression, which means there are at most 256 possible shades of grey. This is a problem when storing depth maps for LDIs, because there can only be 256 distinct depth values, which causes smooth surfaces to look like a jagged staircase. We need to be able to store depth maps with more than 8 bits of accuracy to overcome this problem.

8-bit depth maps result in staircase artifacts on smooth surfaces.
12-bit depth maps mitigate this artifact.
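The arithmetic behind the staircase artifact is simple: 8 bits give 256 depth levels, while 12 bits give 4096, so the quantization steps are 16x smaller. A quick back-of-the-envelope sketch:

```python
import numpy as np

# A smooth depth ramp from 1 m to 10 m across one row of pixels.
depth = np.linspace(1.0, 10.0, 2000)

def quantize(d, bits, d_min=1.0, d_max=10.0):
    """Round depth to 2**bits evenly spaced levels. (For illustration only;
    real encodings often quantize inverse depth or disparity instead.)"""
    levels = 2 ** bits - 1
    q = np.round((d - d_min) / (d_max - d_min) * levels)
    return q / levels * (d_max - d_min) + d_min

err_8 = np.max(np.abs(quantize(depth, 8) - depth))    # ~1.8 cm max error
err_12 = np.max(np.abs(quantize(depth, 12) - depth))  # ~1.1 mm max error
```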

10-bit video is used for high-dynamic-range (HDR) color. What we want is like HDR for depth maps. There are video codecs with more bits, but they are not compatible with all browsers and devices. In our tests, a Quest 2 could not decode a 10-bit video in realtime, in WebXR, at the full 5760x5760 resolution. 10 bits is good for color, but ideally we want even more bits for depth.

So we developed an error-correcting code that allows us to store 12-bit depth maps in an 8-bit video file, in a way that is robust to lossy compression. This is why there are 3 smaller copies of something resembling a depth map in the middle column of the LDI3 format. Further details are beyond the scope of this article, but you can see the difference it makes by comparing the two videos above.
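Lifecast’s exact encoding is not described here, but the general family of techniques is easy to illustrate: split the 12-bit value across more than one 8-bit channel with some overlap, so that small compression errors in one channel can be corrected using the other. The sketch below is a simplified two-channel example of that idea, not Lifecast’s code (which, as the frame layout above shows, uses three depth-like tiles):

```python
import numpy as np

def encode_depth_12bit(d12):
    """Split 12-bit depth values (numpy integer array, 0..4095) into two
    8-bit channels that overlap by 4 bits. Illustrative only."""
    coarse = d12 >> 4         # top 8 bits
    fine = d12 & 0xFF         # bottom 8 bits (wraps every 256)
    return coarse, fine

def decode_depth_12bit(coarse, fine):
    """Recover the 12-bit values. Errors of up to about +/-7 in `coarse`
    (e.g. from lossy compression) are corrected completely, because `fine`
    pins down the low bits; errors in `fine` pass through un-amplified."""
    estimate = coarse.astype(np.int32) * 16
    wraps = np.round((estimate - fine) / 256.0).astype(np.int32)
    return np.clip(fine + 256 * wraps, 0, 4095)
```

Round-tripping all 4096 possible values recovers them exactly, and the decode still succeeds when the coarse channel is perturbed by a few units, which is the property that matters after lossy video compression.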

Now we will return to the problem of rendering plausible imagery when users look behind objects into parts of the scene that the camera never saw. The LDI3 format’s precomputed alpha channels are part of the solution. Another part is in our pipeline for processing VR180 input to produce the LDI3 output. This is where we mix in some ideas from NeRF and Stable Diffusion.

First, we solve an optimization problem to decompose the scene into 3 layers, obtain alpha channels, and determine the appropriate context for in-painting in each layer. We developed a novel method for this which uses a neural multi-resolution hash map, similar to NVIDIA’s Instant NGP (currently one of the fastest implementations of NeRF).

Decomposing a scene into 3 layers using a neural multi-resolution hash map. Red = background, green = middle, blue = foreground.
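The decomposition model itself is not published, but the “multi-resolution hash map” building block is the same one popularized by Instant NGP: each resolution level hashes grid-cell coordinates into a small table of learned feature vectors and interpolates them. A minimal 2D sketch of just that encoding (our own simplified illustration, not Lifecast’s model):

```python
import numpy as np

class MultiResHashEncoding:
    """Minimal 2D multi-resolution hash encoding in the spirit of
    Instant NGP (Mueller et al. 2022). Illustration only: Lifecast's
    layer-decomposition network is not described in this article."""

    def __init__(self, n_levels=8, table_size=2**14, n_features=2,
                 base_res=16, growth=1.5, seed=0):
        rng = np.random.default_rng(seed)
        self.resolutions = [int(base_res * growth ** l) for l in range(n_levels)]
        # One small learnable feature table per resolution level.
        self.tables = [rng.normal(0.0, 1e-2, (table_size, n_features))
                       for _ in range(n_levels)]
        self.table_size = table_size

    def _hash(self, ij):
        # Spatial hash of integer cell coordinates: XOR the coordinates
        # multiplied by large primes, then reduce modulo the table size.
        ij = ij.astype(np.uint64)
        h = (ij[:, 0] * np.uint64(1)) ^ (ij[:, 1] * np.uint64(2654435761))
        return (h % np.uint64(self.table_size)).astype(np.int64)

    def encode(self, xy):
        """xy: [N, 2] coordinates in [0, 1]. Returns the concatenation of
        bilinearly interpolated features from every level."""
        feats = []
        for res, table in zip(self.resolutions, self.tables):
            pos = xy * res
            i0 = np.floor(pos).astype(np.int64)   # lower cell corner
            t = pos - i0                          # fractional position in cell
            f = 0.0
            for dx in (0, 1):
                for dy in (0, 1):
                    corner = i0 + np.array([dx, dy])
                    w = ((t[:, 0] if dx else 1 - t[:, 0]) *
                         (t[:, 1] if dy else 1 - t[:, 1]))[:, None]
                    f = f + w * table[self._hash(corner)]
            feats.append(f)
        return np.concatenate(feats, axis=-1)
```

In a setup like this, a small MLP on top of the encoding could predict per-pixel layer assignments or alphas, with the hash tables and MLP weights being the quantities that get optimized.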

Our earlier 2-layer format and pipeline used an AoT GAN for in-painting. Stable Diffusion didn’t exist when we first developed the 2-layer format. In our new pipeline for LDI3, we use Stable Diffusion for in-painting, and it brings a significant improvement in quality. Results for in-painting with AoT GAN and Stable Diffusion are compared below (these images are what the AI imagines behind the foreground):

In-painting with AoT GAN in the Lifecast 2-layer pipeline.
In-painting with Stable Diffusion in the Lifecast LDI3 pipeline. Notice the more plausible texture on the tile floor, and that the lamp has been completely removed.
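For readers who want to experiment with the same general idea, the open source diffusers library exposes an off-the-shelf Stable Diffusion in-painting pipeline. The sketch below is generic usage of that library, not Lifecast’s pipeline; the checkpoint, file names, and prompt are placeholders.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Load a publicly available Stable Diffusion in-painting checkpoint.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

# background.png: the layer with the foreground removed.
# mask.png: white where the foreground used to be (the region to fill in).
image = Image.open("background.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="empty room with a tile floor, photorealistic",  # placeholder prompt
    image=image,
    mask_image=mask,
).images[0]
result.save("background_inpainted.png")
```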

In conclusion, the Lifecast LDI3 format and rendering pipeline are a new milestone for practical immersive volumetric video. Significant upgrades include:

  • Inflated equiangular projection for effective 8K resolution in the center of the scene.
  • 12-bit depth maps using a compression-robust error-correcting code.
  • In-painting using Stable Diffusion.
  • A method for decomposing a scene into 3 layers using a neural multi-resolution hash map.

The future of practical immersive volumetric video for VR and virtual production is bright. To create content in LDI3 format using VR180 cameras, reach out to info@lifecastvr.com.

(Note that Lifecast’s VR180 to 6DOF converter, available on lifecastvr.com, generates the 2-layer format; creating LDI3 video is currently available only by working directly with Lifecast.)

LDI3 is not only for photorealistic reconstructions of the real world from VR180 cameras. We can also create LDI3 scenes from a text prompt, any 2D photo captured with a regular camera, or any 2D image from any source, using Stable Diffusion to imagine the scene and/or fill in missing details. This is live and free to use on holovolo.tv (created by Lifecast). Holovolo.tv is kind of like a Swiss Army knife for immersive volumetric media. In addition to viewing AI-generated 3D scenes in VR, it can also export to Unreal or Unity, as Facebook 3D photos, or an OBJ 3D model. We are constantly iterating to improve these capabilities.

Connect with Lifecast: Facebook | Mailing List | Twitter | GitHub | YouTube


Lifecast Incorporated

Forrest Briggs - CEO @ Lifecast Inc. Ph.D. in machine learning. Worked on 3D VR cameras at Facebook, self-driving cars at Lyft, and robots at Google X.