REFERENCES

1. Koehn, P.; Och, F. J.; Marcu, D. Statistical phrase-based translation. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. 2003. pp. 127-33. https://aclanthology.org/N03-1017. (accessed 26 Sep 2025).

2. Kalchbrenner, N.; Blunsom, P. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, USA. Association for Computational Linguistics; 2013. pp. 1700-9. https://aclanthology.org/D13-1176/. (accessed 26 Sep 2025).

3. Sutskever, I.; Vinyals, O.; Le, Q. V. Sequence to sequence learning with neural networks. arXiv 2014, arXiv:1409.3215. Available online: https://doi.org/10.48550/arXiv.1409.3215. (accessed 26 Sep 2025).

4. Vaswani, A.; Shazeer, N.; Parmar, N.; et al. Attention is all you need. arXiv 2017, arXiv:1706.03762. Available online: https://doi.org/10.48550/arXiv.1706.03762. (accessed 26 Sep 2025).

5. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2016, arXiv:1409.0473. Available online: https://doi.org/10.48550/arXiv.1409.0473. (accessed 26 Sep 2025).

6. Pang, Y.; Lin, J.; Qin, T.; Chen, Z. Image-to-image translation: methods and applications. IEEE. Trans. Multimedia. 2021, 24, 3859-81.

7. Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; et al. Generative adversarial networks. arXiv 2014, arXiv:1406.2661. Available online: https://doi.org/10.48550/arXiv.1406.2661. (accessed 26 Sep 2025).

8. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. Available online: https://doi.org/10.48550/arXiv.1411.1784. (accessed 26 Sep 2025).

9. Zhu, J. Y.; Park, T.; Isola, P.; Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv 2020, arXiv:1703.10593. Available online: https://doi.org/10.48550/arXiv.1703.10593. (accessed 26 Sep 2025).

10. Kingma, D. P.; Welling, M. Auto-encoding variational bayes. arXiv 2022, arXiv:1312.6114. Available online: https://doi.org/10.48550/arXiv.1312.6114. (accessed 26 Sep 2025).

11. Isola, P.; Zhu, J. Y.; Zhou, T.; Efros, A. A. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, USA. July 21-26, 2017. IEEE; 2017. pp. 5967-76.

12. Wang, T. C.; Liu, M. Y.; Zhu, J. Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-resolution image synthesis and semantic manipulation with conditional GANs. arXiv 2018, arXiv:1711.11585. Available online: https://doi.org/10.48550/arXiv.1711.11585. (accessed 26 Sep 2025).

13. Gatys, L. A.; Ecker, A. S.; Bethge, M. A neural algorithm of artistic style. arXiv 2015, arXiv:1508.06576. Available online: https://doi.org/10.48550/arXiv.1508.06576. (accessed 26 Sep 2025).

14. Gatys, L. A.; Ecker, A. S.; Bethge, M. Image style transfer using convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA. June 27-30, 2016. IEEE; 2016. pp. 2414-23.

15. Nie, D.; Trullo, R.; Lian, J.; et al. Medical image synthesis with deep convolutional adversarial networks. IEEE. Trans. Biomed. Eng. 2018, 65, 2720-30.

16. Shi, Z.; Mettes, P.; Zheng, G.; Snoek, C. Frequency-supervised MR-to-CT image synthesis. arXiv 2021, arXiv:2107.08962. Available online: https://doi.org/10.48550/arXiv.2107.08962. (accessed 26 Sep 2025).

17. Shao, X.; Zhang, W. SPatchGAN: a statistical feature based discriminator for unsupervised image-to-image translation. arXiv 2021, arXiv:2103.16219. Available online: https://doi.org/10.48550/arXiv.2103.16219. (accessed 26 Sep 2025).

18. Wang, L.; Chae, Y.; Yoon, K. J. Dual transfer learning for event-based end-task prediction via pluggable event to image translation. arXiv 2021, arXiv:2109.01801. Available online: https://doi.org/10.48550/arXiv.2109.01801. (accessed 26 Sep 2025).

19. Yu, J.; Du, S.; Xie, G.; et al. SAR2EO: a high-resolution image translation framework with denoising enhancement. arXiv 2023, arXiv:2304.04760. Available online: https://doi.org/10.48550/arXiv.2304.04760. (accessed 26 Sep 2025).

20. Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: semantic propositional image caption evaluation. arXiv 2016, arXiv:1607.08822. Available online: https://doi.org/10.48550/arXiv.1607.08822. (accessed 26 Sep 2025).

21. Li, S.; Tao, Z.; Li, K.; Fu, Y. Visual to text: survey of image and video captioning. IEEE. Trans. Emerg. Top. Comput. Intell. 2019, 3, 297-312.

22. Żelaszczyk, M.; Mańdziuk, J. Cross-modal text and visual generation: a systematic review. Part 1: image to text. Inf. Fusion. 2023, 93, 302-29.

23. He, X.; Deng, L. Deep learning for image-to-text generation: a technical overview. IEEE. Signal. Process. Mag. 2017, 34, 109-16.

24. Indurthi, S.; Zaidi, M. A.; Lakumarapu, N. K.; et al. Task aware multi-task learning for speech to text tasks. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Canada. June 06-11, 2021. IEEE; 2021. pp. 7723-7.

25. Gállego, G. I.; Tsiamas, I.; Escolano, C.; Fonollosa, J. A. R.; Costa-jussà, M. R. End-to-end speech translation with pre-trained models and adapters: UPC at IWSLT 2021. arXiv 2021, arXiv:2105.04512. Available online: https://doi.org/10.48550/arXiv.2105.04512. (accessed 26 Sep 2025).

26. Wang, X.; Qiao, T.; Zhu, J.; Hanjalic, A.; Scharenborg, O. Generating images from spoken descriptions. IEEE/ACM. Trans. Audio. Speech. Lang. Process. 2021, 29, 850-65.

27. Ning, H.; Zheng, X.; Yuan, Y.; Lu, X. Audio description from image by modal translation network. Neurocomputing 2021, 423, 124-34.

28. Parmar, G.; Park, T.; Narasimhan, S.; Zhu, J. Y. One-step image translation with text-to-image models. arXiv 2024, arXiv:2403.12036. Available online: https://doi.org/10.48550/arXiv.2403.12036. (accessed 26 Sep 2025).

29. Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A. Y. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning, Bellevue, USA. 2011. pp. 689-96. https://people.csail.mit.edu/khosla/papers/icml2011_ngiam.pdf. (accessed 26 Sep 2025).

30. Bengio, Y.; Courville, A.; Vincent, P. Representation learning: a review and new perspectives. IEEE. Trans. Pattern. Anal. Mach. Intell. 2013, 35, 1798-828.

31. Yu, H.; Gui, L.; Madaio, M.; Ogan, A.; Cassell, J.; Morency, L. P. Temporally selective attention model for social and affective state recognition in multimedia content. In Proceedings of the 25th ACM International Conference on Multimedia. Association for Computing Machinery; 2017. pp. 1743-51.

32. Siriwardhana, S.; Reis, A.; Weerasekera, R.; Nanayakkara, S. Jointly fine-tuning “BERT-like” self supervised models to improve multimodal speech emotion recognition. arXiv 2020, arXiv:2008.06682. Available online: https://doi.org/10.48550/arXiv.2008.06682. (accessed 26 Sep 2025).

33. Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. arXiv 2021, arXiv:2102.04830. Available online: https://doi.org/10.48550/arXiv.2102.04830. (accessed 26 Sep 2025).

34. Lai, S.; Hu, X.; Xu, H.; Ren, Z.; Liu, Z. Multimodal sentiment analysis: a survey. arXiv 2023, arXiv:2305.07611. Available online: https://doi.org/10.48550/arXiv.2305.07611. (accessed 26 Sep 2025).

35. Le, H.; Sahoo, D.; Chen, N. F.; Hoi, S. C. H. Multimodal transformer networks for end-to-end video-grounded dialogue systems. arXiv 2019, arXiv:1907.01166. Available online: https://doi.org/10.48550/arXiv.1907.01166. (accessed 26 Sep 2025).

36. Tsai, Y. H. H.; Liang, P. P.; Zadeh, A.; Morency, L. P.; Salakhutdinov, R. Learning factorized multimodal representations. arXiv 2019, arXiv:1806.06176. Available online: https://doi.org/10.48550/arXiv.1806.06176. (accessed 26 Sep 2025).

37. Pham, H.; Liang, P. P.; Manzini, T.; Morency, L. P.; Póczos, B. Found in translation: learning robust joint representations by cyclic translations between modalities. In Proceedings of the AAAI Conference on Artificial Intelligence, Volume 33. 2019. pp. 6892-9.

38. Shang, C.; Palmer, A.; Sun, J.; et al. VIGAN: missing view imputation with generative adversarial networks. In 2017 IEEE International Conference on Big Data (Big Data), Boston, USA. December 11-14, 2017. IEEE; 2017. pp. 766-75.

39. Zhang, C.; Cui, Y.; Han, Z.; Zhou, J. T.; Fu, H.; Hu, Q. Deep partial multi-view learning. IEEE. Trans. Pattern. Anal. Mach. Intell. 2020, 44, 2402-15.

40. Zhou, T.; Canu, S.; Vera, P.; Ruan, S. Feature-enhanced generation and multi-modality fusion based deep neural network for brain tumor segmentation with missing MR modalities. Neurocomputing 2021, 466, 102-12.

41. Liu, Z.; Zhou, B.; Chu, D.; Sun, Y.; Meng, L. Modality translation-based multimodal sentiment analysis under uncertain missing modalities. Inf. Fusion. 2024, 101, 101973.

42. Lu, Z. Translation-based multimodal learning. Master’s thesis, Oakland University, 2024. https://www.secs.oakland.edu/~li4/research/student/MasterThesis_Lu2024.pdf. (accessed 26 Sep 2025).

43. Cordts, M.; Omran, M.; Ramos, S.; et al. The cityscapes dataset for semantic urban scene understanding. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA. June 27-30, 2016. IEEE; 2016. pp. 3213-23.

44. Tyleček, R.; Šára, R. Spatial pattern templates for recognition of objects with regular structure. In Weickert, J., Hein, M., Schiele, B.; editors. Pattern Recognition. GCPR 2013. Lecture Notes in Computer Science, vol 8142. Springer; 2013. pp. 364-74.

45. Leong, C.; Rovito, T.; Mendoza-Schrock, O.; et al. Unified coincident optical and radar for recognition (UNICORN) 2008 dataset. https://github.com/AFRL-RY/data-unicorn-2008. (accessed 26 Sep 2025).

46. Tan, W. R.; Chan, C. S.; Aguirre, H. E.; Tanaka, K. Improved ArtGAN for conditional synthesis of natural image and artwork. IEEE. Trans. Image. Process. 2019, 28, 394-409.

47. Plummer, B. A.; Wang, L.; Cervantes, C. M.; Caicedo, J. C.; Hockenmaier, J.; Lazebnik, S. Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile. December 07-13, 2015. IEEE; 2015. pp. 2641-9.

48. Lin, T. Y.; Maire, M.; Belongie, S.; et al. Microsoft COCO: common objects in context. In Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T.; editors. Computer Vision - ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8693. Springer; 2014. pp. 740-55.

49. Zhou, B.; Lapedriza, A.; Khosla, A.; Oliva, A.; Torralba, A. Places: a 10 million image database for scene recognition. IEEE. Trans. Pattern. Anal. Mach. Intell. 2017, 40, 1452-64.

50. Gemmeke, J. F.; Ellis, D. P. W.; Freedman, D.; et al. Audio set: an ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA. March 05-09, 2017. IEEE; 2017. pp. 776-80.

51. Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L. P. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv 2016, arXiv:1606.06259. Available online: https://doi.org/10.48550/arXiv.1606.06259. (accessed 26 Sep 2025).

52. Zadeh, A. B.; Liang, P. P.; Poria, S.; Cambria, E.; Morency, L. P. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia. Association for Computational Linguistics; 2018. pp. 2236-46.

53. Busso, C.; Bulut, M.; Lee, C. C.; et al. IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335-59.

54. Tsai, Y. H. H.; Bai, S.; Liang, P. P.; Kolter, J. Z.; Morency, L. P.; Salakhutdinov, R. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. Association for Computational Linguistics; 2019. pp. 6558-69.

55. Sun, Z.; Sarma, P.; Sethares, W.; Liang, Y. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. arXiv 2019, arXiv:1911.05544. Available online: https://doi.org/10.48550/arXiv.1911.05544. (accessed on 26 Sep 2025).

56. Yu, J.; Chen, K.; Xia, R. Hierarchical interactive multimodal transformer for aspect-based multimodal sentiment analysis. IEEE. Trans. Affective. Comput. 2023, 14, 1966-78.

57. Jiang, K.; Wang, Q.; An, Z.; Wang, Z.; Zhang, C.; Lin, C. W. Mutual retinex: combining transformer and CNN for image enhancement. IEEE. Trans. Emerg. Top. Comput. Intell. 2024, 8, 2240-52.

58. Xiao, Y.; Yuan, Q.; Jiang, K.; He, J.; Lin, C. W.; Zhang, L. TTST: a top-k token selective transformer for remote sensing image super-resolution. IEEE. Trans. Image. Process. 2024, 33, 738-52.

59. Jiang, K.; Wang, Z.; Chen, C.; Wang, Z.; Cui, L.; Lin, C. W. Magic ELF: image deraining meets association learning and transformer. In Proceedings of the 30th ACM International Conference on Multimedia. Association for Computing Machinery; 2022. pp. 827-36.

60. Wang, Y.; He, J.; Wang, D.; Wang, Q.; Wan, B.; Luo, X. Multimodal transformer with adaptive modality weighting for multimodal sentiment analysis. Neurocomputing 2024, 572, 127181.

61. Wang, D.; Liu, S.; Wang, Q.; Tian, Y.; He, L.; Gao, X. Cross-modal enhancement network for multimodal sentiment analysis. IEEE. Trans. Multimedia. 2023, 25, 4909-21.

62. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S. W.; Khan, F. S.; Shah, M. Transformers in vision: a survey. ACM. Comput. Surv. 2022, 54, 1-41.

63. Song, Y.; Sohl-Dickstein, J.; Kingma, D. P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456. Available online: https://doi.org/10.48550/arXiv.2011.13456. (accessed on 26 Sep 2025).

64. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. arXiv 2020, arXiv:2006.11239. Available online: https://doi.org/10.48550/arXiv.2006.11239. (accessed on 26 Sep 2025).

65. Song, Y.; Ermon, S. Generative modeling by estimating gradients of the data distribution. arXiv 2020, arXiv:1907.05600. Available online: https://doi.org/10.48550/arXiv.1907.05600. (accessed on 26 Sep 2025).

66. Xiao, Y.; Yuan, Q.; Jiang, K.; He, J.; Jin, X.; Zhang, L. EDiffSR: an efficient diffusion probabilistic model for remote sensing image super-resolution. IEEE. Trans. Geosci. Remote. Sens. 2024, 62, 1-14.

67. Shi, Y.; De Bortoli, V.; Campbell, A.; Doucet, A. Diffusion Schrödinger Bridge matching. arXiv 2023, arXiv:2303.16852. Available online: https://doi.org/10.48550/arXiv.2303.16852. (accessed on 26 Sep 2025).

68. Liu, G. H.; Vahdat, A.; Huang, D. A.; Theodorou, E. A.; Nie, W.; Anandkumar, A. I2SB: Image-to-image Schrödinger Bridge. arXiv 2023, arXiv:2302.05872. Available online: https://doi.org/10.48550/arXiv.2302.05872. (accessed on 26 Sep 2025).

69. Tang, Z.; Hang, T.; Gu, S.; Chen, D.; Guo, B. Simplified diffusion Schrödinger Bridge. arXiv 2024, arXiv:2403.14623. Available online: https://doi.org/10.48550/arXiv.2403.14623. (accessed on 26 Sep 2025).

70. Chen, Z.; He, G.; Zheng, K.; Tan, X.; Zhu, J. Schrödinger bridges beat diffusion models on text-to-speech synthesis. arXiv 2023, arXiv:2312.03491. Available online: https://doi.org/10.48550/arXiv.2312.03491. (accessed on 26 Sep 2025).

71. Özbey, M.; Dalmaz, O.; Dar, S. U. H.; Bedel, H. A.; Özturk, Ş.; Güngör, A. Unsupervised medical image translation with adversarial diffusion models. IEEE. Trans. Med. Imaging. 2023, 42, 3524-39.

72. Dhariwal, P.; Nichol, A. Diffusion models beat GANs on image synthesis. arXiv 2021, arXiv:2105.05233. Available online: https://doi.org/10.48550/arXiv.2105.05233. (accessed on 26 Sep 2025).

73. Hazarika, D.; Zimmermann, R.; Poria, S. MISA: modality-invariant and -specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia. Association for Computing Machinery; 2020. pp. 1122-31.

74. Zhao, T.; Kong, M.; Liang, T.; Zhu, Q.; Kuang, K.; Wu, F. CLAP: contrastive language-audio pre-training model for multi-modal sentiment analysis. In Proceedings of the 2023 ACM International Conference on Multimedia Retrieval. Association for Computing Machinery; 2023. pp. 622-6.

75. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, USA. June 18-24, 2022. IEEE; 2022. pp. 15979-88.

76. Bischke, B.; Helber, P.; König, F.; Borth, D.; Dengel, A. Overcoming missing and incomplete modalities with generative adversarial networks for building footprint segmentation. arXiv 2018, arXiv:1808.03195. Available online: https://doi.org/10.48550/arXiv.1808.03195. (accessed on 26 Sep 2025).

77. Hamghalam, M.; Frangi, A. F.; Lei, B.; Simpson, A. L. Modality completion via Gaussian process prior variational autoencoders for multi-modal glioma segmentation. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2021. Springer, Cham; 2021. pp. 442-52.

78. Zhou, T.; Canu, S.; Vera, P.; Ruan, S. Latent correlation representation learning for brain tumor segmentation with missing MRI modalities. IEEE. Trans. Image. Process. 2021, 30, 4263-74.

79. Sun, J.; Zhang, X.; Han, S.; Ruan, Y. P.; Li, T. RedCore: relative advantage aware cross-modal representation learning for missing modalities with imbalanced missing rates. Proc. AAAI. Conf. Artif. Intell. 2024, 38, 15173-82.

80. Park, K. R.; Lee, H. J.; Kim, J. U. Learning trimodal relation for audio-visual question answering with missing modality. arXiv 2024, arXiv:2407.16171. Available online: https://doi.org/10.48550/arXiv.2407.16171. (accessed on 26 Sep 2025).

81. Kim, D.; Kim, T. Missing modality prediction for unpaired multimodal learning via joint embedding of unimodal models. arXiv 2024, arXiv:2407.12616. Available online: https://doi.org/10.48550/arXiv.2407.12616. (accessed on 26 Sep 2025).

82. Guo, Z.; Jin, T.; Zhao, Z. Multimodal prompt learning with missing modalities for sentiment analysis and emotion recognition. arXiv 2024, arXiv:2407.05374. Available online: https://doi.org/10.48550/arXiv.2407.05374. (accessed on 26 Sep 2025).

83. Lin, X.; Wang, S.; Cai, R.; et al. Suppress and rebalance: towards generalized multi-modal face anti-spoofing. arXiv 2024, arXiv:2402.19298. Available online: https://doi.org/10.48550/arXiv.2402.19298. (accessed on 26 Sep 2025).

84. Lu, Z.; Ewing, R.; Blasch, E.; Li, J. Explainable diffusion model via Schrödinger Bridge in multimodal image translation. In Dynamic Data Driven Applications Systems. Springer, Cham; 2026. pp. 391-402.

85. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. Available online: https://doi.org/10.48550/arXiv.1409.1556. (accessed on 26 Sep 2025).

86. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA. June 27-30, 2016. IEEE; 2016. pp. 2818-26.

87. Baldi, P. Autoencoders, unsupervised learning, and deep architectures. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, Bellevue, USA, July 02, 2012. PMLR; 2012. pp. 37-49. https://proceedings.mlr.press/v27/baldi12a.html. (accessed on 26 Sep 2025).

88. Hafner, S.; Ban, Y. Multi-modal deep learning for multi-temporal urban mapping with a partly missing optical modality. In IGARSS 2023 - 2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, USA. July 16-21, 2023. IEEE; 2023. pp. 6843-6.
