
ABS 039: Comparative Analysis of Retrieval-Augmented Generation and Large Language Models in Clinical Guidelines for Degenerative Spinal Conditions
Audrey Y. Su ¹ , Ashley Knebel BA ¹ , Andrew Y. Xu BS ¹ , Marco Kaper BS ¹ , Phillip Schmitt BS ¹ , Joseph E. Nassar BS ¹ , Manjot Singh BS ¹ , Michael J. Farias BS ¹ , Jinho Kim BS ¹ , Bassel G. Diebo MD ² , Alan H. Daniels MD ²
¹ Warren Alpert Medical School, Brown University, Providence, RI
² Department of Orthopaedic Surgery, Brown University, Providence, RI
Van Wickle (2025) Volume 1, ABS 039
Introduction: Degenerative spinal diseases often require complex, patient-specific treatment, presenting a compelling challenge for artificial intelligence (AI) integration into clinical practice. While existing literature has focused on ChatGPT-4o performance in individual spine conditions, this study compares ChatGPT-4o, a traditional large language model (LLM), against NotebookLM, a novel retrieval-augmented model (RAG-LLM) supplemented with North American Spine Society (NASS) guidelines, for concordance with all five published NASS guidelines for degenerative spinal diseases.
Methods: A total of 118 questions regarding five degenerative spinal conditions were copied directly from NASS guidelines and presented to ChatGPT-4o and NotebookLM without modification. Chat history was cleared between questions. Because NotebookLM requires user-provided sources, it was supplemented with the corresponding NASS guideline; ChatGPT-4o was not, as GPTs are pre-trained. AI responses were evaluated for concordance with NASS guidelines on four criteria: accuracy, evidence-based conclusions, supplementary information, and completeness.
Results: Overall, NotebookLM provided significantly more accurate responses (98.3% vs 40.7%, p<0.05), more evidence-based conclusions (99.1% vs 40.7%, p<0.05), and more complete information (94.1% vs 79.7%, p<0.05), while ChatGPT-4o provided more supplementary information (98.3% vs 67.8%, p<0.05). These discrepancies were most prominent in categories consisting of complex and multipart questions, such as nonsurgical and surgical interventions, wherein ChatGPT-4o often produced recommendations with unsubstantiated certainty.
Discussion: This work reflects the growing potential for AI integration into clinical practice. Overall, the novel RAG-LLM infrastructure of NotebookLM produced significantly more accurate and evidence-based recommendations than the traditional LLM, ChatGPT-4o, which instead provided more supplementary information. These differences were most evident in complex medical scenarios, such as surgical interventions, where a lack of evidence-based recommendations presents a significant risk to patient safety. Both models, however, were susceptible to producing incomplete responses. As such, physician expertise remains of utmost importance in clinical decision-making.
Computational
April 12th, 2025