REFERENCES

1. Malaviya AN, Sawhney S, Mehra NK, Kanga U. Seronegative arthritis in South Asia: an up-to-date review. Curr Rheumatol Rep. 2014;16:413.

2. Taurog JD, Chhabra A, Colbert RA. Ankylosing spondylitis and axial spondyloarthritis. N Engl J Med. 2016;374:2563-74.

3. Bittar M, Deodhar A. Axial spondyloarthritis: a review. JAMA. 2025;333:408-20.

4. Singhal K, Tu T, Gottweis J, et al. Toward expert-level medical question answering with large language models. Nat Med. 2025;31:943-50.

5. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29:1930-40.

6. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620:172-80.

7. Kraljevic Z, Bean D, Shek A, et al. Foresight - a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study. Lancet Digit Health. 2024;6:e281-90.

8. Shao Y, Cheng Y, Nelson SJ, et al. Hybrid value-aware transformer architecture for joint learning from longitudinal and non-longitudinal clinical data. J Pers Med. 2023;13:1070.

9. Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, editors. Advances in Neural Information Processing Systems 33 (NeurIPS 2020); 2020 Dec 6-12; Virtual. Red Hook, New York, USA: Curran Associates, Inc.; 2020. pp. 1877-901. Available from https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html [accessed 9 February 2026].

10. Touvron H, Martin L, Stone K, et al. Llama 2: open foundation and fine-tuned chat models. arXiv 2023;arXiv:2307.09288. Available from https://arxiv.org/abs/2307.09288 [accessed 9 February 2026].

11. OpenAI; Achiam J, Adler S, et al. GPT-4 technical report. arXiv 2023;arXiv:2303.08774. Available from https://arxiv.org/abs/2303.08774 [accessed 9 February 2026].

12. Xie Q, Chen Q, Chen A, et al. Me-LLaMA: foundation large language models for medical applications. Res Sq 2024;rs.3.rs-4240043. Available from https://doi.org/10.21203/rs.3.rs-4240043/v1 [accessed 9 February 2026].

13. Chen Z, Hernández Cano A, Romanou A, et al. MEDITRON-70B: scaling medical pretraining for large language models. arXiv 2023;arXiv:2311.16079. Available from https://arxiv.org/abs/2311.16079 [accessed 9 February 2026].

14. Chen J, et al. HuatuoGPT-II, one-stage training for medical adaption of LLMs. arXiv 2023;arXiv:2311.09774. Available from https://arxiv.org/abs/2311.09774 [accessed 9 February 2026].

15. Uz C, Umay E. “Dr ChatGPT”: is it a reliable and useful source for common rheumatic diseases? Int J Rheum Dis. 2023;26:1343-9.

16. Papineni K, Roukos S, Ward T, Zhu WJ. BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL); 2002 Jul 6-12; Philadelphia, Pennsylvania, USA. Cambridge, Massachusetts, USA: Association for Computational Linguistics; 2002. pp. 311-8.

17. Wang X, Chen G, Song D, et al. CMB: a comprehensive medical benchmark in Chinese. In: Duh K, Gomez H, Bethard S, editors. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2024 Jun 16-21; Mexico City, Mexico. Cambridge, Massachusetts, USA: Association for Computational Linguistics; 2024. pp. 6184-205.

18. Zhang N, Chen M, Bi Z, et al. CBLUE: a Chinese biomedical language understanding evaluation benchmark. In: Muresan S, Nakov P, Villavicencio A, editors. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics; 2022 May 22-27; Dublin, Ireland. Cambridge, Massachusetts, USA: Association for Computational Linguistics; 2022. pp. 7888-915.

19. He Z, Wang Y, Yan A, et al. MedEval: a multi-level, multi-task, and multi-domain medical benchmark for language model evaluation. In: Bouamor H, Pino J, Bali K, editors. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; 2023 Dec 6-10; Singapore. Cambridge, Massachusetts, USA: Association for Computational Linguistics; 2023. pp. 8725-44.

20. Xu J, Lu L, Peng X, et al. Data set and benchmark (MedGPTEval) to evaluate responses from large language models in medicine: evaluation development and validation. JMIR Med Inform. 2024;12:e57674.

21. Chen W, Li Z, Fang H, et al. A benchmark for automatic medical consultation system: frameworks, tasks and datasets. Bioinformatics. 2023;39:btac817.

22. Li M, Cai W, Liu R, et al. FFA-IR: towards an explainable and reliable medical report generation benchmark. In: Vanschoren J, Yeung S, editors. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks; 2021. Available from https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/35f4a8d465e6e1edc05f3d8ab658c551-Abstract.html [accessed 9 February 2026].

23. Christophe C, Kanithi P, Munjal P, et al. Med42-evaluating fine-tuning strategies for medical LLMs: full-parameter vs. parameter-efficient approaches. arXiv 2024;arXiv:2404.14779. Available from https://arxiv.org/abs/2404.14779 [accessed 9 February 2026].

24. Nori H, King N, McKinney SM, et al. Capabilities of GPT-4 on medical challenge problems. arXiv 2023;arXiv:2303.13375. Available from https://arxiv.org/abs/2303.13375 [accessed 9 February 2026].

25. Liu F, Li Z, Zhou H, et al. Large language models in the clinic: a comprehensive benchmark. arXiv 2024;arXiv:2405.00716. Available from https://arxiv.org/abs/2405.00716 [accessed 9 February 2026].

26. Huang F, Zhu J, Wang YH, et al. Recommendations for diagnosis and treatment of ankylosing spondylitis. Zhonghua Nei Ke Za Zhi. 2022;61:893-900.

27. van der Heijde D, Ramiro S, Landewé R, et al. 2016 update of the ASAS-EULAR management recommendations for axial spondyloarthritis. Ann Rheum Dis. 2017;76:978-91.

28. Sieper J, Rudwaleit M, Baraliakos X, et al. The Assessment of SpondyloArthritis international Society (ASAS) handbook: a guide to assess spondyloarthritis. Ann Rheum Dis. 2009;68 Suppl 2:ii1-44.

29. Madrid-García A, Rosales-Rosado Z, Freites-Nuñez D, et al. Harnessing ChatGPT and GPT-4 for evaluating the rheumatology questions of the Spanish access exam to specialized medical training. Sci Rep. 2023;13:22129.

30. Wang A, Wu Y, Ji X, et al. Assessing and optimizing large language models on spondyloarthritis multi-choice question answering: protocol for enhancement and assessment. JMIR Res Protoc. 2024;13:e57001.

31. Mitra A, Del Corro L, Mahajan S, et al. Orca 2: teaching small language models how to reason. arXiv 2023;arXiv:2311.11045. Available from https://arxiv.org/abs/2311.11045 [accessed 9 February 2026].

32. Li J, Wang X, Wu X, et al. Huatuo-26M, a large-scale Chinese medical QA dataset. arXiv 2023;arXiv:2305.01526. Available from https://arxiv.org/abs/2305.01526 [accessed 9 February 2026].

33. Liu J, Zhou P, Hua Y, et al. Benchmarking large language models on CMExam - a comprehensive Chinese medical exam dataset. In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S, editors. Advances in Neural Information Processing Systems 36; 2023 Dec 10-16; New Orleans, Louisiana, USA. Red Hook, New York, USA: Curran Associates, Inc.; 2023. pp. 52430-52. Available from https://proceedings.neurips.cc/paper_files/paper/2023/hash/a48ad12d588c597f4725a8b84af647b5-Abstract-Datasets_and_Benchmarks.html [accessed 9 February 2026].

34. Li J, Zhong S, Chen K. MLEC-QA: a Chinese multi-choice biomedical question answering dataset. In: Moens MF, Huang X, Specia L, Yih SWT, editors. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; 2021 Nov 7-11; Online and Punta Cana, Dominican Republic. Cambridge, Massachusetts, USA: Association for Computational Linguistics; 2021. pp. 8862-74.

35. Leaderboard - C-Eval. Updated on 2025 Jul 26. Available from https://cevalbenchmark.com/static/leaderboard.html [accessed 9 February 2026].

36. 01.AI; Young A, Chen B, et al. Yi: open foundation models by 01.AI. arXiv 2024;arXiv:2403.04652. Available from https://arxiv.org/abs/2403.04652 [accessed 9 February 2026].

37. Hu EJ, Shen Y, Wallis P, et al. LoRA: low-rank adaptation of large language models. arXiv 2021;arXiv:2106.09685. Available from https://arxiv.org/abs/2106.09685 [accessed 9 February 2026].

Artificial Intelligence Surgery
ISSN 2771-0408 (Online)

Portico

All published articles will be preserved here permanently:

https://www.portico.org/publishers/oae/
