Publication Detail

Open-vocabulary scene understanding based on 3D Gaussian Splatting (3DGS) has shown promise for applications such as embodied agents and object localization. By integrating open-vocabulary embeddings into spatial 3D Gaussians, these models enable a more comprehensive understanding of scenes. However, existing methods often suffer from misalignment caused by the gap between the RGB and language modalities, which leads to incorrect interpretations of similar-looking objects. To address this issue, we propose a cross-modal integration approach that aligns multiple representations through spatial Gaussian positioning. We introduce Push-Pull alignment in Gaussian Splatting (PPGS), a novel bimodal framework that bridges the RGB and language modalities through cohesive representation fields. Leveraging the illumination-invariant properties of language embeddings, we design a bridge module that uses the geometrically grounded positions of the Gaussians as a direct bridge between the two modalities. This module significantly enhances cross-modal alignment, improves rendering fidelity, and ensures accurate language feature embeddings. Furthermore, our framework dynamically adjusts gradients based on the distinct optimization requirements of RGB and language during joint learning, ensuring stable and efficient convergence. Comprehensive experiments demonstrate that PPGS achieves superior language query accuracy and enhanced visual quality compared to existing language-embedded representations, with mean Intersection over Union (mIoU) increasing by 6% and Peak Signal-to-Noise Ratio (PSNR) showing gains over mainstream methods, all in only 50% of the training time.
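
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of their standard textbook definitions. It is not the evaluation code used for PPGS. The function names, the flat-list image representation, the `max_value` parameter, and the convention of scoring an empty union as 1.0 are all assumptions made for illustration. The abstract does not specify the evaluation protocol (queries, thresholds, datasets).

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-length
    sequences of pixel intensities (hypothetical flat-list layout)."""
    if len(rendered) != len(reference) or len(rendered) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def iou(pred_mask, gt_mask):
    """Intersection over Union of two binary masks given as flat sequences."""
    if len(pred_mask) != len(gt_mask):
        raise ValueError("masks must have equal length")
    intersection = sum(1 for p, g in zip(pred_mask, gt_mask) if p and g)
    union = sum(1 for p, g in zip(pred_mask, gt_mask) if p or g)
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(mask_pairs):
    """Average IoU over (predicted, ground-truth) mask pairs,
    e.g. one pair per text query."""
    scores = [iou(pred, gt) for pred, gt in mask_pairs]
    if not scores:
        raise ValueError("need at least one mask pair")
    return sum(scores) / len(scores)
```

Under these definitions, a higher PSNR indicates a rendering closer to the reference image, and a higher mIoU indicates that the masks retrieved for language queries overlap more with the ground-truth regions.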