Abstract
4D content generation has recently attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the difficulty of modeling spatio-temporal distributions and the scarcity of 4D training data.
In this paper, we present AnimateAnyMesh, the first feed-forward framework that enables efficient text-driven animation of arbitrary 3D meshes. Our approach leverages a novel DyMeshVAE architecture that effectively compresses and reconstructs dynamic mesh sequences by disentangling spatial and temporal features while preserving local topological structures.
To enable high-quality text-conditioned generation, we employ a Rectified Flow-based training strategy in the compressed latent space. Additionally, we contribute the DyMesh Dataset, containing over 4M diverse dynamic mesh sequences with text annotations. Experimental results demonstrate that our method generates semantically accurate and temporally coherent mesh animations in a few seconds, significantly outperforming existing approaches in both quality and efficiency.
Our work marks a substantial step toward making 4D content creation more accessible and practical. All data, code, and models will be publicly released.