Abstract
---
Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources such as images. Intuitively, feeding multiple modalities of data to vision transformers could improve performance, yet the intra-modal attentive weights may also be diluted, which could in turn undermine the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes them with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of the inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features while keeping the single-modal transformer architecture largely intact. Extensive experiments on a variety of homogeneous and heterogeneous modalities demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point clouds and images.
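The abstract only sketches the token-substitution idea, so the snippet below gives a minimal, hypothetical PyTorch-style illustration of one such fusion step: a learned per-token score flags uninformative tokens of one modality, and those tokens are replaced by a linear projection of the spatially aligned tokens from the other modality. The module name, scoring head, threshold, and projection layer are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TokenFusionSketch(nn.Module):
    """Illustrative sketch of inter-modal token substitution (not the official code).

    Tokens of modality A whose learned importance score falls below a threshold
    are replaced by a projection of the aligned tokens from modality B.
    """

    def __init__(self, dim: int, threshold: float = 0.02):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # per-token importance
        self.proj = nn.Linear(dim, dim)                              # inter-modal projection
        self.threshold = threshold

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        # tokens_a, tokens_b: (batch, num_tokens, dim), assumed spatially aligned
        s = self.score(tokens_a)                     # (batch, num_tokens, 1)
        mask = (s < self.threshold).float()          # 1 where a token of modality A is uninformative
        fused = (1.0 - mask) * tokens_a + mask * self.proj(tokens_b)
        return fused

# usage sketch with hypothetical RGB and depth patch tokens
if __name__ == "__main__":
    fuse = TokenFusionSketch(dim=256)
    rgb_tokens = torch.randn(2, 196, 256)
    depth_tokens = torch.randn(2, 196, 256)
    out = fuse(rgb_tokens, depth_tokens)
    print(out.shape)  # torch.Size([2, 196, 256])
```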
Year | DOI | Venue
---|---|---
2022 | 10.1109/CVPR52688.2022.01187 | IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

DocType | Citations | PageRank
---|---|---
Conference | 0 | 0.34

References | Authors
---|---
0 | 6
Name | Order | Citations | PageRank
---|---|---|---
Yikai Wang | 1 | 0 | 0.68 |
Xinghao Chen | 2 | 0 | 0.34 |
Le-le Cao | 3 | 27 | 5.54 |
Wen-bing Huang | 4 | 167 | 18.91 |
Fuchun Sun | 5 | 2377 | 225.80 |
Yunhe Wang | 6 | 113 | 22.76 |