
Zero123++ - Generate 3D-consistent multi-view images from a single image

If you're a CXO, founder, or investor, follow me on LinkedIn & Twitter, or join my newsletter on my website. I share the latest AI research, simplified, along with tactical advice on building AI products.

Practical Uses

  1. Gaming studios and game designers can use this to generate 3D assets from a single 2D image.

  2. VR/AR startups can use this to generate 3D avatars of people and also 3D assets from 2D images.

  3. Camera app startups can use this to output 3D pictures and avatars.

  4. Text-to-image startups can use this to produce more consistent 3D images.

Pre-requisite definitions

Diffusion Models

A type of generative model that produces diverse outputs similar to, but not exact copies of, the training data. Random noise is added to training samples, and the model iteratively learns to reverse this process, going from pure noise to a fully denoised image.
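The forward (noising) half of this process can be sketched in a few lines. This is a minimal illustration, not code from the Zero123++ release; the names (`forward_diffuse`, `alpha_bar_t`) are chosen here for clarity.

```python
import numpy as np

# Minimal sketch of the diffusion forward process: noise is blended into a
# clean sample x0 according to a cumulative signal coefficient alpha_bar in
# [0, 1]; near 1 the sample is almost clean, near 0 it is almost pure noise.
def forward_diffuse(x0, alpha_bar_t, rng):
    eps = rng.standard_normal(x0.shape)                             # Gaussian noise
    xt = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return xt, eps   # the model learns to predict eps (or v) and undo this step

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))                                    # stand-in "image"
xt, eps = forward_diffuse(x0, alpha_bar_t=0.5, rng=rng)
```

Training then amounts to asking the network to recover `eps` (or a related target) from `xt`, one noise level at a time.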

Noise Scheduler

Noise schedulers define how much noise is added to the data at each timestep of the diffusion process during training, and how it is removed step by step during sampling.
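Concretely, a schedule is just a sequence of per-step noise amounts (betas) whose cumulative product tracks how much signal survives. The beta range below is an assumed, commonly used default, not the exact Stable Diffusion values.

```python
import numpy as np

# Illustrative linear schedule: betas set how much noise each of T timesteps
# adds; alpha_bar (cumulative product of 1 - beta) is the remaining signal.
T = 1000
betas = np.linspace(1e-4, 2e-2, T)        # assumed beta range
alpha_bar = np.cumprod(1.0 - betas)       # ~1 at t=0 (clean), ~0 at t=T (noise)
```

Schedulers differ only in how this beta sequence is shaped, which is exactly what the "Linear noise scheduler" section below is about.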

Reference Attention

Running the denoising UNet model on an extra reference image and appending the self-attention key and value matrices from the reference image to the corresponding attention layers when denoising the model input.
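At the level of a single attention layer, this can be sketched as follows. This is a shape-level illustration, not the actual UNet code, and placing the `scale` on the reference keys is an assumption made here to stand in for Zero123++'s scaling of the reference signal.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Keys/values from the reference image's self-attention are appended to the
# target's own keys/values, so target tokens can also attend to the reference.
def reference_attention(q, k, v, k_ref, v_ref, scale=1.0):
    k_all = np.concatenate([k, scale * k_ref], axis=0)   # (n + n_ref, d)
    v_all = np.concatenate([v, v_ref], axis=0)
    attn = softmax(q @ k_all.T / np.sqrt(q.shape[-1]))   # (n, n + n_ref)
    return attn @ v_all                                  # (n, d)

rng = np.random.default_rng(1)
q = k = v = rng.standard_normal((8, 16))        # target tokens
k_ref = v_ref = rng.standard_normal((8, 16))    # reference-image tokens
out = reference_attention(q, k, v, k_ref, v_ref)
```

Because attention is permutation-based rather than pixel-aligned, the reference can condition the output without forcing any spatial correspondence.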

How this model works

The Stable Diffusion 2 model is fine-tuned (using the techniques in the next section) into a new multi-view-consistent base diffusion model.

It implements various local and global conditioning techniques to make maximal use of the priors in pretrained Stable Diffusion.

How this model is different

Enhanced modeling

The essence of generating consistent multi-view images is the correct modeling of the joint distribution of multiple images.

Zero-1-to-3 models the conditional marginal distribution of each image separately and independently, which ignores the correlations between multi-view images.

Zero123++ instead takes the simplest possible approach: it tiles the six target images in a 3 × 2 layout into a single frame, so the model generates all views jointly and captures their correlations, yielding more consistent results.
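The tiling itself is straightforward. The sketch below (illustrative function name and image sizes, not the authors' code) packs six views into one 3 × 2 frame so a single denoising pass sees all of them.

```python
import numpy as np

# Pack six generated views (H x W x C each) into one frame, row-major,
# so the diffusion model denoises them jointly and its attention layers
# can relate all views at once.
def tile_views(views, rows=3, cols=2):
    assert len(views) == rows * cols
    row_imgs = [np.concatenate(views[r * cols:(r + 1) * cols], axis=1)
                for r in range(rows)]
    return np.concatenate(row_imgs, axis=0)

# Six dummy 32x32 "views", each filled with its own index for checking layout.
views = [np.full((32, 32, 3), i, dtype=np.uint8) for i in range(6)]
frame = tile_views(views)   # a single (96, 64, 3) frame
```

After generation, the frame is simply split back into its six tiles.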

Fixed elevation and relative azimuth angles

Zero-1-to-3 is trained on camera poses with azimuth and elevation angles relative to the input view. This, however, requires knowing the elevation angle of the input view (which is inconsistent across the dataset) to determine the relative poses between novel views. Pipelines like One-2-3-45 and DreamGaussian therefore added an elevation-estimation module, but this introduces extra error into the pipeline.

Zero123++ uses fixed absolute elevation angles and relative azimuth angles as the novel view (six) poses, eliminating the orientation ambiguity without requiring additional elevation estimation.

The six poses consist of interleaving elevations of 30° downward and 20° upward, combined with azimuths that start at 30° and increase by 60° for each subsequent pose.

Linear noise scheduler

The original noise schedule for Stable Diffusion (the scaled-linear schedule) emphasizes local details but has very few steps at low Signal-to-Noise Ratio (SNR). These low-SNR steps occur in the early denoising stage, which is crucial for determining the global low-frequency structure of the content.

A reduced number of steps in this stage, either during training or inference, can lead to greater structural variation. This setup, while suitable for single-image generation, limits the model’s ability to ensure the global consistency between multiple views.

Also, Zero-1-to-3 trains on low-resolution images, which effectively modifies the noise schedule: high-resolution images (with a higher degree of redundancy between nearby pixels) appear less noisy than low-resolution images under the same absolute level of independent noise.

This also leads to instability issues in the Zero-1-to-3 approach.

Zero123++ uses a linear noise schedule to avoid these issues.
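The difference between the two schedules can be checked numerically. The beta ranges below are common Stable Diffusion defaults and are an assumption of this sketch.

```python
import numpy as np

def alpha_bar(betas):
    return np.cumprod(1.0 - betas)   # remaining signal after each step

def snr(ab):
    return ab / (1.0 - ab)           # signal-to-noise ratio per step

T, b0, b1 = 1000, 0.00085, 0.012
scaled_linear = np.linspace(b0 ** 0.5, b1 ** 0.5, T) ** 2   # SD's original schedule
linear = np.linspace(b0, b1, T)

# The linear schedule ends at a lower SNR, i.e. it spends more steps in the
# low-SNR regime that determines global low-frequency structure.
snr_scaled_end = snr(alpha_bar(scaled_linear))[-1]
snr_linear_end = snr(alpha_bar(linear))[-1]
```

More low-SNR steps give the model more opportunities to lock in a globally consistent layout across the six tiled views before refining details.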

This shift introduces another challenge: adapting the pretrained model to the new schedule. Fortunately, the Stable Diffusion 2 v-prediction model is quite robust to swapping the schedule.
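For context, the v-prediction parameterization (Salimans & Ho, 2022) has the model predict v = √(ᾱ)·ε − √(1−ᾱ)·x₀ rather than the noise ε itself, a target that stays well-scaled across the whole noise range, which is one reason it tolerates a schedule swap. A small numpy sketch of the target and its inverse:

```python
import numpy as np

# v-prediction target: v = sqrt(alpha_bar)*eps - sqrt(1 - alpha_bar)*x0.
def v_target(x0, eps, alpha_bar):
    s, c = np.sqrt(alpha_bar), np.sqrt(1.0 - alpha_bar)
    return s * eps - c * x0

# Given the noisy sample xt and a predicted v, the clean sample is
# recovered exactly: x0 = sqrt(alpha_bar)*xt - sqrt(1 - alpha_bar)*v.
def recover_x0(xt, v, alpha_bar):
    s, c = np.sqrt(alpha_bar), np.sqrt(1.0 - alpha_bar)
    return s * xt - c * v

rng = np.random.default_rng(2)
x0, eps, ab = rng.standard_normal(10), rng.standard_normal(10), 0.3
xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps   # forward-noised sample
```

The recovery identity holds at every noise level, unlike ε-prediction, which degenerates as ᾱ → 0.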

Scaled Reference Attention for local conditioning

Zero-1-to-3 concatenates the input image (single view) with the noisy inputs along the feature dimension for local image conditioning. This imposes an incorrect pixel-wise spatial correspondence between the input and the target image.

Zero123++ instead uses a scaled version of Reference Attention to provide proper local conditioning, eliminating this issue.

FlexDiffuse for global conditioning

In original Stable Diffusion, global conditioning comes solely from text embeddings. Stable Diffusion employs CLIP as the text encoder, and performs cross-attention between model latents and per-token CLIP text embeddings.

Zero123++ uses a trainable variant of the linear guidance mechanism introduced in FlexDiffuse to incorporate global image conditioning into the model while minimizing the extent of fine-tuning.
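The linear guidance idea can be sketched as below. The ramp initialization follows FlexDiffuse's description; the dimensions and placeholder values are assumptions of this illustration.

```python
import numpy as np

# The global CLIP image embedding is added to each text-token embedding with
# a trainable per-token weight w_i, initialized as a linear ramp i/L, so
# later tokens receive more image guidance than earlier ones at init.
L, d = 77, 1024                          # CLIP token count / embed dim (assumed)
text_tokens = np.zeros((L, d))           # placeholder text embeddings
image_embed = np.ones(d)                 # placeholder global image embedding

w = np.arange(1, L + 1) / L              # trainable weights, linear-ramp init
conditioned = text_tokens + w[:, None] * image_embed[None, :]
```

Only the weights `w` need fine-tuning, so the pretrained cross-attention layers are disturbed as little as possible.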

How this model was trained

Zero123++ was trained (using the techniques above) on Objaverse data rendered with random HDRI environment lighting.

How this model was evaluated

Outputs of this model were compared with those from Zero-1-to-3 XL and SyncDreamer, both qualitatively (side by side, after background removal) and quantitatively (LPIPS score on the validation split).


Qualitative Analysis of Zero123++

Quantitative Analysis of Zero123++

Text to Image Multi-View
