While physics-grounded 3D motion synthesis has seen significant progress, current methods face critical limitations. They typically rely on pre-reconstructed 3D Gaussian Splatting (3DGS) representations, while physics integration depends on either inflexible, manually defined physical attributes or unstable, optimization-heavy guidance from video models. To overcome these challenges, we introduce PhysGM, a feed-forward framework that jointly predicts a 3D Gaussian representation and its physical properties from a single image, enabling immediate physical simulation and high-fidelity 4D rendering. We first establish a base model by jointly optimizing for Gaussian reconstruction and probabilistic physics prediction. The model is then refined with physically plausible reference videos to enhance both rendering fidelity and physics prediction accuracy. We adopt Direct Preference Optimization (DPO) to align its simulations with reference videos, circumventing Score Distillation Sampling (SDS) optimization, which requires back-propagating gradients through the complex differentiable simulation and rasterization. To facilitate training, we introduce PhysAssets, a new dataset of over 24,000 3D assets annotated with physical properties and corresponding guiding videos. Experimental results demonstrate that our method effectively generates high-fidelity 4D simulations from a single image in one minute. This represents a significant speedup over prior works while delivering realistic rendering results.
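The DPO objective referenced above avoids SDS-style gradients through the simulator by instead comparing model log-likelihoods on preferred versus rejected samples against a frozen reference model. A minimal scalar sketch of the standard DPO loss is below; the function name, the `beta` value, and the toy inputs are illustrative, not from the paper.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l: log-likelihood of the winning / losing sample
    under the current model; ref_logp_* : same under the frozen
    reference model. beta controls the strength of the KL constraint.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin), written via log1p(exp(-x)) for stability
    return math.log1p(math.exp(-margin))
```

When the current model prefers the winner more strongly than the reference does, the margin is positive and the loss shrinks; note that no gradient ever flows through a simulator or rasterizer, only through the model's own log-likelihoods.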
Architecture of PhysGM. The model conditions on one or more input views and their corresponding camera parameters, which are processed by a U-Net encoder to produce a shared latent representation z. This latent is then decoded by two parallel heads: (1) a Gaussian Head that predicts the initial 3D Gaussian scene parameters ψ, and (2) a Physics Head that predicts a distribution over the object's physical properties θ. The sampled parameters (ψ, θ) initialize a Material Point Method (MPM) simulator to generate the final dynamic sequence. The entire architecture is trained in a two-stage paradigm: first, supervised pre-training on ground-truth data establishes a robust generative prior; then a DPO-based fine-tuning stage ranks candidate simulations against a ground-truth video, aligning the model with physically plausible results.
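The encoder-plus-two-heads data flow described in the caption can be sketched shapes-only, as below. All dimensions, the 14-value per-Gaussian parameterization, and the choice of (log Young's modulus, Poisson's ratio, density) for θ are assumptions for illustration; the real model's U-Net and heads are learned networks, stubbed here with random outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(views):
    """Stand-in for the U-Net encoder: input views -> shared latent z.
    Latent size 256 is a hypothetical choice."""
    n, h, w, c = views.shape
    return rng.standard_normal(256)

def gaussian_head(z, num_gaussians=4096):
    """Predicts per-Gaussian parameters psi: position (3), rotation
    quaternion (4), scale (3), opacity (1), RGB (3) = 14 per Gaussian."""
    return rng.standard_normal((num_gaussians, 14))

def physics_head(z):
    """Predicts a Gaussian distribution over physical properties theta,
    here (log Young's modulus, Poisson's ratio, density) as an example."""
    mean = np.array([9.0, 0.3, 1000.0])
    std = np.array([1.0, 0.05, 100.0])
    return mean, std

views = rng.standard_normal((1, 128, 128, 3))   # a single input view
z = encode(views)                               # shared latent
psi = gaussian_head(z)                          # 3DGS scene parameters
mean, std = physics_head(z)
theta = mean + std * rng.standard_normal(3)     # reparameterized sample
# (psi, theta) would then initialize the MPM simulator.
```

Sampling θ via the reparameterized mean-plus-noise form mirrors the probabilistic physics prediction: the simulator consumes a concrete draw, while the head itself outputs a distribution.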
@misc{lv2025physgmlargephysicalgaussian,
  title={PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis},
  author={Chunji Lv and Zequn Chen and Donglin Di and Weinan Zhang and Hao Li and Wei Chen and Changsheng Li},
  year={2025},
  eprint={2508.13911},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.13911},
}