Installation Guide

This guide will help you install and set up DreamOmni2 on your local machine. The process involves cloning the repository, installing dependencies, downloading model weights, and running inference scripts.

System Requirements

Before installing DreamOmni2, ensure your system meets these requirements:

  • Operating System: Linux, macOS, or Windows with WSL2
  • Python: Version 3.8 or higher
  • CUDA: Version 11.0 or higher (for GPU support)
  • GPU: NVIDIA GPU with at least 16GB VRAM recommended
  • Disk Space: At least 20GB free space for models and dependencies
  • RAM: 16GB minimum, 32GB recommended
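
A quick way to sanity-check most of these requirements from a terminal (assuming the NVIDIA driver and Python 3 are already installed) is:

python3 --version    # should report 3.8 or higher
nvidia-smi           # shows the driver version, supported CUDA version, and available VRAM
df -h .              # free disk space in the current directory
free -h              # installed RAM (Linux)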

Step 1: Clone the Repository

First, clone the DreamOmni2 repository from GitHub to your local machine:

git clone https://github.com/dvlab-research/DreamOmni2
cd DreamOmni2

The first command creates a local copy of the project; the second changes into the project directory. The repository contains all necessary code files, configuration, and documentation.

Step 2: Install Dependencies

Install all required Python packages using pip:

pip install -r requirements.txt

The requirements file contains all necessary dependencies including PyTorch, transformers, diffusers, and other supporting libraries. This step may take several minutes depending on your internet connection and whether you already have some packages installed.
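
To keep these packages separate from your system Python, you can optionally install them inside a virtual environment first (the environment name .venv below is just an example):

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt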

Step 3: Download Model Weights

Download the pre-trained model weights from Hugging Face. Create a models directory and download the weights:

huggingface-cli download --resume-download --local-dir-use-symlinks False xiabs/DreamOmni2 --local-dir ./models

This command downloads all model files to the ./models directory. The download is several gigabytes in size, so ensure you have a stable internet connection and sufficient disk space. The --resume-download flag allows you to resume if the download is interrupted.
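
If you prefer to download from Python rather than the CLI, the huggingface_hub library provides an equivalent call. This is a one-liner sketch; it assumes huggingface_hub is installed, which it normally is as a dependency of transformers and diffusers:

python3 -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='xiabs/DreamOmni2', local_dir='./models')"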

Step 4: Verify Installation

After installation, verify that everything is set up correctly by checking the directory structure:

ls -la

You should see directories including dreamomni2 (source code), models (downloaded weights), example_input (sample inputs), and various Python scripts for inference.
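
You can also confirm that PyTorch was installed with GPU support before running any scripts:

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"

If this prints False, see the CUDA Not Available section under Troubleshooting.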

Running Multimodal Instruction-based Editing

To perform image editing tasks, use the inference_edit.py script. Here is a basic example:

python3 inference_edit.py \
  --input_img_path "example_input/edit_tests/src.jpg" "example_input/edit_tests/ref.jpg" \
  --input_instruction "Make the woman from the second image stand on the road in the first image." \
  --output_path "example_input/edit_tests/edit_res.png"

Important: For editing tasks, always place the image to be edited in the first position of the --input_img_path argument; any additional reference images follow it. The model uses this ordering to determine which image to modify and which images provide reference information.
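
If the script accepts more than one reference image, as the plural wording above suggests, a multi-reference call would keep the image to edit in the first position. The file names and instruction below are purely illustrative:

python3 inference_edit.py \
  --input_img_path "my_photo.jpg" "jacket_ref.jpg" "color_ref.jpg" \
  --input_instruction "Replace the jacket in the first image with the jacket from the second image, using the color from the third image." \
  --output_path "multi_ref_edit.png"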

Running Multimodal Instruction-based Generation

To compose entirely new images from reference images and an instruction, use the inference_gen.py script:

python3 inference_gen.py \
  --input_img_path "example_input/gen_tests/img1.jpg" "example_input/gen_tests/img2.jpg" \
  --input_instruction "In the scene, the character from the first image stands on the left, and the character from the second image stands on the right. They are shaking hands against the backdrop of a spaceship interior." \
  --output_path "example_input/gen_tests/gen_res.png" \
  --height 1024 \
  --width 1024

The generation script creates entirely new images based on your instructions and reference images. You can specify custom output dimensions using the --height and --width parameters. Higher resolutions require more GPU memory and processing time.
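
For example, if 1024x1024 exceeds your GPU's memory, dropping to a smaller resolution such as 768x768 is a reasonable first step; the exact limit depends on your hardware, and the instruction below is only a placeholder:

python3 inference_gen.py \
  --input_img_path "example_input/gen_tests/img1.jpg" "example_input/gen_tests/img2.jpg" \
  --input_instruction "Your instruction here." \
  --output_path "example_input/gen_tests/gen_res_768.png" \
  --height 768 \
  --width 768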

Running Web Demo

DreamOmni2 includes interactive web demos for both editing and generation. To run the editing demo:

CUDA_VISIBLE_DEVICES=0 python web_edit.py \
  --vlm_path PATH_TO_VLM \
  --edit_lora_path PATH_TO_EDIT_LORA \
  --server_name "0.0.0.0" \
  --server_port 7860

To run the generation demo:

CUDA_VISIBLE_DEVICES=1 python web_generate.py \
  --vlm_path PATH_TO_VLM \
  --gen_lora_path PATH_TO_GENERATION_LORA \
  --server_name "0.0.0.0" \
  --server_port 7861

Replace PATH_TO_VLM, PATH_TO_EDIT_LORA, and PATH_TO_GENERATION_LORA with the actual paths to your model files. The demos will be accessible through a web browser at the specified server address and port.
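
If the demo is running on a remote machine, one common way to reach it from your local browser is SSH port forwarding (the user and host names below are placeholders):

ssh -L 7860:localhost:7860 user@remote-server
# then open http://localhost:7860 locally (use 7861 for the generation demo)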

Command Line Parameters

Here are the key parameters you can customize:

  • --input_img_path: One or more paths to input images, separated by spaces
  • --input_instruction: Text description of the desired output or editing operation
  • --output_path: Where to save the generated or edited image
  • --height: Output image height in pixels (generation only)
  • --width: Output image width in pixels (generation only)
  • --server_name: Server address for web demos (default: 0.0.0.0)
  • --server_port: Port number for web demos (default: 7860 or 7861)
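
The scripts may also accept options not listed here. Assuming they are built with Python's argparse (an assumption, not confirmed by this guide), the standard help flag will print everything they support:

python3 inference_edit.py --help
python3 inference_gen.py --help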

Troubleshooting

Out of Memory Errors

If you encounter GPU memory errors, try reducing the output resolution or processing fewer reference images at once. If the scripts support them, memory-saving options such as mixed-precision inference can also help.
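
It can also help to watch GPU memory while a script runs, so you can see how close you are to the limit:

watch -n 1 nvidia-smi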

CUDA Not Available

Ensure you have installed the GPU version of PyTorch and that your CUDA drivers are up to date. Check compatibility between your CUDA version, GPU driver, and PyTorch version.
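
To see which CUDA build of PyTorch is installed and whether it matches your driver, compare the output of these two commands:

python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
nvidia-smi    # reports the highest CUDA version the installed driver supports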

Download Failures

If model downloads fail, check your internet connection and ensure you have sufficient disk space. The --resume-download flag allows you to continue interrupted downloads.

Import Errors

If you see module import errors, verify that all dependencies were installed correctly. Reinstalling the packages listed in requirements.txt or checking for version conflicts usually resolves the problem.
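
A clean reinstall of the pinned requirements, followed by pip's built-in consistency check, covers most of these cases:

pip install --force-reinstall -r requirements.txt
pip check    # reports packages with broken or conflicting dependencies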

Additional Resources

  • GitHub Repository: github.com/dvlab-research/DreamOmni2
  • Research Paper: arxiv.org/abs/2510.06679
  • Model Weights: huggingface.co/xiabs/DreamOmni2
  • Issues and Support: GitHub Issues page

Tip: Start with the example inputs provided in the repository to familiarize yourself with the model's capabilities before using your own images. This helps you understand the expected input formats and instruction styles.