
PIXART-α — Building a cost-effective text-to-image (T2I) model

Practical Uses

  1. Startups can build their own text-to-image models with accuracy comparable to the best commercial models, such as Stable Diffusion, at roughly 1% of the cost and time.

  2. Companies serious about environmental friendliness can cut the carbon footprint of training text-to-image models by about 99%, helping them meet carbon-neutrality targets while achieving near state-of-the-art accuracy.

How this model works

The T2I model is initialized from a low-cost class-conditional model to reduce learning cost.

The model is then pre-trained on information-rich text-image pair data and fine-tuned on images of superior aesthetic quality to produce high-quality generations.

How this model is different

Model Initialization

Current T2I models (Stable Diffusion, DALL-E 2, Imagen, RAPHAEL) are primarily either transformer or diffusion models trained on text-image pair data.

Here, instead, a Diffusion Transformer (DiT) model is used, which allows text conditions to be injected through cross-attention modules while streamlining the class-condition branch, thereby improving computational efficiency.
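The text-condition injection described above can be sketched as a single cross-attention step in which latent image tokens query text tokens. This is a minimal single-head illustration; the token counts, dimensions, and weights below are placeholders, not the model's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img_tokens, txt_tokens, Wq, Wk, Wv):
    # Queries come from image tokens; keys/values come from text tokens,
    # so the text condition is injected into the image pathway.
    Q = img_tokens @ Wq                      # (N_img, d)
    K = txt_tokens @ Wk                      # (N_txt, d)
    V = txt_tokens @ Wv                      # (N_txt, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaled dot-product attention
    return softmax(scores, axis=-1) @ V      # (N_img, d)

rng = np.random.default_rng(0)
d = 16
img = rng.standard_normal((64, d))    # 64 latent image tokens (illustrative)
txt = rng.standard_normal((120, d))   # 120 text tokens, matching PIXART-α's length
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = cross_attention(img, txt, Wq, Wk, Wv)
print(out.shape)  # (64, 16): one text-conditioned feature per image token
```

The output keeps the image-token shape, so these modules slot into each transformer block without changing the backbone's interface.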

In addition, a re-parameterization technique allows the original class-conditional model's parameters (pre-trained on ImageNet) to be loaded directly. The previously learned knowledge of natural image distributions thus gives the T2I Transformer a reasonable initialization and accelerates its training.
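A schematic of this initialization idea: copy every backbone weight that matches the class-conditional checkpoint, and initialize the new text-branch output projections to zero so the model initially behaves like the pre-trained backbone. The parameter names and the zero-init rule here are assumptions for illustration, not the paper's exact re-parameterization scheme.

```python
def reparameterize_init(t2i_params, class_cond_params):
    """Initialize a T2I model from a class-conditional checkpoint (sketch)."""
    init = dict(t2i_params)
    for name, w in class_cond_params.items():
        if name in init:                       # shared backbone weights: copy directly
            init[name] = w
    for name in init:
        if "cross_attn.out_proj" in name:      # new text branch (hypothetical key name):
            init[name] = [0.0] * len(init[name])  # zero output so it starts as a no-op
    return init

backbone = {"blocks.0.attn.w": [1.0, 2.0]}
t2i = {"blocks.0.attn.w": [0.0, 0.0],
       "blocks.0.cross_attn.out_proj": [0.5, 0.5]}
merged = reparameterize_init(t2i, backbone)
print(merged["blocks.0.attn.w"])           # [1.0, 2.0] — inherited from the backbone
print(merged["blocks.0.cross_attn.out_proj"])  # [0.0, 0.0] — text branch starts silent
```

Starting the text branch at zero means early training can focus on learning alignment without destroying the inherited image prior.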

Rich Dataset

Datasets used to train current models (such as LAION) also suffer from limited information, containing only partial descriptions of objects. A severe long-tail effect is observed as well, with a large number of nouns appearing at very low frequencies.

These deficiencies reduce training efficiency and necessitate millions of iterations to learn text-image alignment.

Here, instead, an auto-labelling pipeline is used: a state-of-the-art vision-language model (LLaVA) generates captions on the SAM dataset, which is advantageous for its rich collection of objects.

This yields a high-information-density text-image pair dataset, which reduces the learning cost of text-image alignment.
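The pipeline reduces to a simple loop: feed each image to the vision-language model and keep the (image, caption) pair. The captioner below is a stand-in stub, since the actual LLaVA prompt and API are not shown in this article; the file paths and return string are illustrative only.

```python
def caption_with_vlm(image_path):
    """Stand-in for a LLaVA call; in practice the VLM would be prompted
    to describe every object in the image in detail."""
    return f"a dense description of {image_path}"  # hypothetical output

def build_pairs(image_paths, captioner=caption_with_vlm):
    # Auto-label each image, producing information-dense text-image pairs
    # for efficient alignment training.
    return [(path, captioner(path)) for path in image_paths]

pairs = build_pairs(["sam/0001.jpg", "sam/0002.jpg"])
print(len(pairs))  # 2
```

Because labelling is fully automatic, the dataset can scale with compute rather than with human annotation effort.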

How this model was trained

  1. DiT-XL/2 was used as the base network architecture.

  2. The T5 large language model (4.3B Flan-T5-XXL) was used as the text encoder for conditional feature extraction.

  3. The length of extracted text tokens was set to 120 (vs. the usual 77), since the captions curated in PIXART-α are much denser.

  4. A pre-trained, frozen VAE from LDM was used to capture the latent features of input images.

  5. Images were resized and center-cropped to a uniform size before being fed into the VAE. The multi-aspect augmentation introduced in SDXL was also used to enable image generation at arbitrary aspect ratios.

  6. The AdamW optimizer was utilized with a weight decay of 0.03 and a constant 2e-5 learning rate.

  7. The final model was trained on 64 V100 GPUs for approximately 22 days.
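The preprocessing and optimizer settings listed above can be sketched as plain configuration code. Function names, the padding id, and the crop size below are illustrative assumptions; only the 120-token length and the AdamW hyperparameters come from the source.

```python
MAX_TOKENS = 120       # vs. the usual 77, to fit the denser curated captions
PAD_ID = 0             # hypothetical padding token id

def pad_or_truncate(token_ids, max_len=MAX_TOKENS, pad_id=PAD_ID):
    # Pad short captions and cut long ones to a fixed length.
    return (token_ids + [pad_id] * max_len)[:max_len]

def center_crop_box(width, height, size):
    # Coordinates of a size x size crop taken from the image center
    # (left, top, right, bottom), as used before feeding the VAE.
    left = (width - size) // 2
    top = (height - size) // 2
    return left, top, left + size, top + size

# Optimizer settings from the source: AdamW, weight decay 0.03, constant lr 2e-5.
OPTIMIZER = {"name": "AdamW", "lr": 2e-5, "weight_decay": 0.03}

ids = pad_or_truncate(list(range(130)))   # a 130-token caption gets truncated
box = center_crop_box(640, 480, 256)
print(len(ids))  # 120
print(box)       # (192, 112, 448, 368)
```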

How this model was evaluated

This model was evaluated on three metrics:

  1. Fréchet Inception Distance (FID) on the MS-COCO dataset

  2. Compositionality on T2I-CompBench

  3. Human-preference rate on user study
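For intuition on the first metric: FID is the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The one-dimensional case below shows the formula's behavior; the full metric applies the multivariate form (with mean vectors and covariance matrices) to high-dimensional features.

```python
def frechet_distance_1d(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two 1-D Gaussians.

    1-D special case of the FID formula
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * (S1 @ S2)^(1/2)).
    """
    return (mu1 - mu2) ** 2 + sigma1 ** 2 + sigma2 ** 2 - 2 * sigma1 * sigma2

# Identical distributions score 0; the distance grows as statistics diverge,
# so lower FID means generated images match the real distribution better.
print(frechet_distance_1d(0.0, 1.0, 0.0, 1.0))  # 0.0
print(frechet_distance_1d(0.0, 1.0, 1.0, 2.0))  # 2.0
```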


The training cost was observed to be merely 1% of that of a larger SOTA model such as RAPHAEL.

Extensive experiments demonstrated that PIXART-α excels in image quality, artistry, and semantic control.

[Figure: FID learning-cost evaluation]

[Figure: Compositionality evaluation on T2I-CompBench]

[Figure: User study on 300 fixed prompts]
