Zbornik 27. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2024, Zvezek K
Proceedings of the 27th International Multiconference INFORMATION SOCIETY – IS 2024, Volume K

ChatGPT v zdravstvu / ChatGPT in Medicine

Uredniki / Editors: Matjaž Gams, Monika Simjanoska Misheva, Stevo Lukić, Franz Wotawa

http://is.ijs.si
9. oktober 2024 / 9 October 2024
Ljubljana, Slovenia

Uredniki:
Matjaž Gams, Odsek za inteligentne sisteme, Institut »Jožef Stefan«, Ljubljana, Slovenija
Monika Simjanoska Misheva, Fakulteta za računalništvo in elektrotehniko, Skopje, Makedonija
Stevo Lukić, Nevrološka klinika Niš, Srbija
Franz Wotawa, Inštitut za tehnologijo programske opreme, Gradec, Avstrija

Založnik: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak
Oblikovanje naslovnice: Vesna Lasič
Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2024

Informacijska družba (ISSN 2630-371X)
Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID 214534915
ISBN 978-961-264-309-6 (PDF)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2024

Leto 2024 je hkrati udarno in tradicionalno. Že sedaj, še bolj pa v prihodnosti bosta računalništvo, informatika (RI) in umetna inteligenca (UI) igrali ključno vlogo pri oblikovanju napredne in trajnostne družbe. Smo na pragu nove dobe, v kateri generativna umetna inteligenca, kot je ChatGPT, in drugi inovativni pristopi utirajo pot k superinteligenci in singularnosti, ključnim elementom, ki bodo definirali razcvet človeške civilizacije. Naša konferenca je zato hkrati tradicionalna znanstvena, pa tudi povsem akademsko odprta za nove pogumne ideje, inkubator novih pogledov in idej.

Letošnja konferenca ne le da analizira področja RI, temveč prinaša tudi osrednje razprave o perečih temah današnjega časa – ohranjanje okolja, demografski izzivi, zdravstvo in preobrazba družbenih struktur. Razvoj UI ponuja rešitve za skoraj vse izzive, s katerimi se soočamo, kar poudarja pomen sodelovanja med strokovnjaki, raziskovalci in odločevalci, da bi skupaj oblikovali strategije za prihodnost. Zavedamo se, da živimo v času velikih sprememb, kjer je ključno, da s poglobljenim znanjem in inovativnimi pristopi oblikujemo informacijsko družbo, ki bo varna, vključujoča in trajnostna.

Letos smo ponosni, da smo v okviru multikonference združili dvanajst izjemnih konferenc, ki odražajo širino in globino informacijskih ved:
• CHATMED v zdravstvu
• Demografske in družinske analize
• Digitalna preobrazba zdravstvene nege
• Digitalna vključenost v informacijski družbi – DIGIN 2024
• Kognitivna znanost
• Konferenca o zdravi dolgoživosti
• Legende računalništva in informatike
• Mednarodna konferenca o prenosu tehnologij
• Miti in resnice o varovanju okolja
• Odkrivanje znanja in podatkovna skladišča – SIKDD 2024
• Slovenska konferenca o umetni inteligenci
• Vzgoja in izobraževanje v RI

Poleg referatov bodo razprave na okroglih mizah in delavnicah omogočile poglobljeno izmenjavo mnenj, ki bo oblikovala prihodnjo informacijsko družbo. »Legende računalništva in informatike« predstavljajo slovenski »Hall of Fame« za odlične posameznike s tega področja. Razširjeni referati, objavljeni v reviji Informatica z 48-letno tradicijo odličnosti, in sodelovanje s številnimi akademskimi institucijami in združenji, kot so ACM Slovenija, SLAIS in Inženirska akademija Slovenije, bodo še naprej spodbujali razvoj informacijske družbe.
Skupaj bomo gradili temelje za prihodnost, ki bo oblikovana s tehnologijami, osredotočena na človeka in njegove potrebe.

S podelitvijo nagrad, še posebej z nagrado Michie-Turing, se avtonomna RI stroka vsakoletno opredeli do najbolj izstopajočih dosežkov. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe je prejel prof. dr. Borut Žalik. Priznanje za dosežek leta pripada prof. dr. Sašu Džeroskemu za izjemne raziskovalne dosežke. »Informacijsko limono« za najmanj primerno informacijsko tematiko je prejela nabava in razdeljevanje osebnih računalnikov ministrstva, »informacijsko jagodo« kot najboljšo potezo pa so prejeli organizatorji tekmovanja ACM Slovenija. Čestitke nagrajencem!

Naša vizija je jasna: prepoznati, izkoristiti in oblikovati priložnosti, ki jih prinaša digitalna preobrazba, ter ustvariti informacijsko družbo, ki bo koristila vsem njenim članom. Vsem sodelujočim se zahvaljujemo za njihov prispevek k tej viziji in se veselimo prihodnjih dosežkov, ki jih bo oblikovala ta konferenca.

Mojca Ciglarič, predsednica programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

PREFACE TO THE MULTICONFERENCE INFORMATION SOCIETY 2024

The year 2024 is both ground-breaking and traditional. Now, and even more so in the future, computer science, informatics (CS/I), and artificial intelligence (AI) will play a crucial role in shaping an advanced and sustainable society. We are on the brink of a new era where generative artificial intelligence, such as ChatGPT, and other innovative approaches are paving the way for superintelligence and singularity—key elements that will define the flourishing of human civilization. Our conference is therefore both a traditional scientific gathering and an academically open incubator for bold new ideas and perspectives.

This year's conference analyzes key CS/I areas and brings forward central discussions on pressing contemporary issues—environmental preservation, demographic challenges, healthcare, and the transformation of social structures. AI development offers solutions to nearly all challenges we face, emphasizing the importance of collaboration between experts, researchers, and policymakers to shape future strategies collectively. We recognize that we live in times of significant change, where it is crucial to build an information society that is safe, inclusive, and sustainable, through deep knowledge and innovative approaches.

This year, we are proud to have brought together twelve exceptional conferences within the multiconference framework, reflecting the breadth and depth of information sciences:
• CHATMED in Healthcare
• Demographic and Family Analyses
• Digital Transformation of Healthcare Nursing
• Digital Inclusion in the Information Society – DIGIN 2024
• Cognitive Science
• Conference on Healthy Longevity
• Legends of Computer Science and Informatics
• International Conference on Technology Transfer
• Myths and Facts on Environmental Protection
• Data Mining and Data Warehouses – SIKDD 2024
• Slovenian Conference on Artificial Intelligence
• Education and Training in CS/IS

In addition to papers, roundtable discussions and workshops will facilitate in-depth exchanges that will help shape the future information society. The "Legends of Computer Science and Informatics" represents Slovenia's "Hall of Fame" for outstanding individuals in this field.
At the same time, extended papers published in the Informatica journal, with over 48 years of excellence, and collaboration with numerous academic institutions and associations, such as ACM Slovenia, SLAIS, and the Slovenian Academy of Engineering, will continue to foster the development of the information society. Together, we will build the foundation for a future shaped by technology, yet focused on human needs.

The autonomous CS/IS community annually recognizes the most outstanding achievements through the awards ceremony. The Michie-Turing Award for an exceptional lifetime contribution to the development and promotion of the information society was awarded to Prof. Dr. Borut Žalik. The Achievement of the Year Award goes to Prof. Dr. Sašo Džeroski. The "Information Lemon" for the least appropriate information topic was given to the ministry's procurement and distribution of personal computers, while the "Information Strawberry" for the best initiative was awarded to the organizers of the ACM Slovenia competition. Congratulations to all the award winners!

Our vision is clear: to recognize, seize, and shape the opportunities brought by digital transformation and create an information society that benefits all its members. We thank all participants for their contributions and look forward to this conference's future achievements.

Mojca Ciglarič, Chair of the Program Committee
Matjaž Gams, Chair of the Organizing Committee

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee:
Vladimir Bajic, South Africa; Heiner Benking, Germany; Se Woo Cheon, South Korea; Howie Firth, UK; Olga Fomichova, Russia; Vladimir Fomichov, Russia; Vesna Hljuz Dobric, Croatia; Alfred Inselberg, Israel; Jay Liebowitz, USA; Huan Liu, Singapore; Henz Martin, Germany; Marcin Paprzycki, USA; Claude Sammut, Australia; Jiri Wiedermann, Czech Republic; Xindong Wu, USA; Yiming Ye, USA; Ning Zhong, USA; Wray Buntine, Australia; Bezalel Gavish, USA; Gal A. Kaminka, Israel; Mike Bain, Australia; Michela Milano, Italy; Derong Liu, Chicago, USA; Toby Walsh, Australia; Sergio Campos-Cordobes, Spain; Shabnam Farahmand, Finland; Sergio Crovella, Italy

Organizing Committee:
Matjaž Gams, chair; Mitja Luštrek; Lana Zemljak; Vesna Koricki; Mitja Lasič; Blaž Mahnič

Programme Committee:
Mojca Ciglarič, chair; Marjan Heričko; Baldomir Zajc; Bojan Orel; Borka Jerman Blažič Džonova; Blaž Zupan; Franc Solina; Gorazd Kandus; Boris Žemva; Viljan Mahnič; Urban Kordeš; Leon Žlajpah; Cene Bavec; Marjan Krisper; Niko Zimic; Tomaž Kalin; Andrej Kuščer; Rok Piltaver; Jozsef Györkös; Jadran Lenarčič; Toma Strle; Tadej Bajd; Borut Likar; Tine Kolenik; Jaroslav Berce; Janez Malačič; Franci Pivec; Mojca Bernik; Olga Markič; Uroš Rajkovič; Marko Bohanec; Dunja Mladenič; Borut Batagelj; Ivan Bratko; Franc Novak; Tomaž Ogrin; Andrej Brodnik; Vladislav Rajkovič; Aleš Ude; Dušan Caf; Grega Repovš; Bojan Blažica; Saša Divjak; Ivan Rozman; Matjaž Kljun; Tomaž Erjavec; Niko Schlamberger; Robert Blatnik; Bogdan Filipič; Stanko Strmčnik; Erik Dovgan; Andrej Gams; Jurij Šilc; Špela Stres; Matjaž Gams; Jurij Tasič; Anton Gradišek; Mitja Luštrek; Denis Trček; Marko Grobelnik; Andrej Ule; Nikola Guid; Boštjan Vilfan

KAZALO / TABLE OF CONTENTS

ChatGPT v zdravstvu / ChatMED in Medicine
PREDGOVOR / FOREWORD
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES
Automatic Reviewing of Conference Papers in Healthcare and Other Sciences Using ChatGPT / Gams Matjaž, Bolješić Luka, Korlat Alex, Kocuvan Primož
Leveraging Federated Learning for Secure Transfer and Deployment of ML Models in Healthcare / Dodevski Zlate, Pavleska Tanja, Trajkovik Vladimir
Use and Limitations of ChatGPT in Mental Health Disorders / Petrović Jovana, Binić Iva, Vacev Maša, Lukić Stevo
Testing ChatGPT's Performance on Medical Diagnostic Tasks / Perko Alexander, Wotawa Franz
HomeDOCtor App: Integrating Medical Knowledge into GPT for Personal Health Counseling / Zadobovšek Matic, Kocuvan Primož, Gams Matjaž
Standards for Use of LLM in Medical Diagnosis / Svetozarević Mihailo, Janković Isidora, Janković Sonja, Lukić Stevo
Using Combinatorial Testing for Prompt Engineering of LLMs in Medicine / Perko Alexander, Nica Iulia, Wotawa Franz
Meeting Cultural and Linguistic Demands to Accommodate Fine-Tuned LLMs to Local Medical Customs and Patient Communication / Petrovska Dojchinovska Gordana, Simjanoska Misheva Monika, Mishev Kostadin
Indeks avtorjev / Author index

ChatGPT v zdravstvu / ChatGPT in Medicine

PREDGOVOR

Umetna inteligenca, zlasti generativna umetna inteligenca, kot je ChatGPT, je spremenila pravila igre v številnih panogah. Vendar je njen vpliv v medicini še posebej pomemben, saj gre v zdravstvu za več kot le informacije – gre za človeška življenja. Vključitev umetne inteligence lahko bistveno izboljša izide zdravljenja bolnikov, racionalizira delovne procese in izboljša procese odločanja na celotnem področju medicine.

Vloga umetne inteligence presega pomoč strokovnjakom; neposredno vpliva na oskrbo bolnikov. Z virtualnimi posvetovanji, izobraževanjem bolnikov in preverjanjem simptomov lahko ChatGPT izboljša dostop do zdravstvenega varstva za skupine prebivalstva, ki bi se sicer soočale z ovirami zaradi lokacije ali časovnih omejitev.
Poleg tega umetna inteligenca z avtomatizacijo rutinskih opravil zmanjšuje upravno breme zdravnikov, tako da lahko več časa posvetijo oskrbi bolnikov in manj papirologiji. To je lahko ključni dejavnik v boju proti izgorelosti zdravnikov, ki je v sodobnem zdravstvu vse večji problem.

Čeprav je potencial ChatGPT v zdravstvu vznemirljiv, moramo obravnavati etične in varnostne izzive, ki jih prinaša. Ključna vprašanja so zagotavljanje zasebnosti pacientov, zmanjšanje pristranskosti algoritmov umetne inteligence in ohranjanje natančnosti zdravniških nasvetov. UI mora podpirati – ne pa nadomestiti – človeško presojo, zlasti pri kritičnih medicinskih odločitvah. Zagotavljanje preglednosti, odgovornosti in zasnove teh sistemov s pristopom, pri katerem je v ospredju varnost, je bistvenega pomena za krepitev zaupanja v te tehnologije.

V prihodnosti se bo vloga ChatGPT in podobnih sistemov umetne inteligence verjetno še povečala. Pravkar začenjamo raziskovati njihove aplikacije v personalizirani medicini, kjer bi umetna inteligenca lahko pomagala prilagoditi zdravljenje posameznim pacientom na podlagi genetskih podatkov, podatkov o okolju in življenjskem slogu. Poleg tega lahko umetna inteligenca z analizo trendov v populacijah prispeva k zgodnjemu odkrivanju bolezni, odkrivanju zdravil in globalnim zdravstvenim pobudam.

Ta konferenca je priložnost, da se poglobimo v najsodobnejše raziskave, nastajajoče aplikacije in etične vidike uporabe ChatGPT v medicinski praksi. Skupaj bomo raziskali sedanje zmogljivosti in prihodnje možnosti, hkrati pa se bomo posvetili izzivom, ki jih prinaša vključevanje umetne inteligence na tako občutljivo in tvegano področje.

Matjaž Gams, Monika Simjanoska Misheva, Stevo Lukić, Franz Wotawa
predsedniki konference ChatGPT v zdravstvu

FOREWORD

AI, and particularly conversational AI like ChatGPT, has been a game-changer in many industries. However, its impact in medicine is particularly significant because healthcare is about more than just information—it's about human lives. The integration of AI has the potential to dramatically improve patient outcomes, streamline workflows, and enhance decision-making processes across the medical field.

The role of AI extends beyond assisting professionals; it directly impacts patient care. Through virtual consultations, patient education, and symptom checks, ChatGPT can enhance access to healthcare for populations who might otherwise face barriers due to location or time constraints. Additionally, by automating routine tasks, AI reduces the administrative burden on clinicians, allowing them to spend more time on patient care and less on paperwork. This can be a crucial factor in combating physician burnout, a growing issue in modern healthcare.

While the potential of ChatGPT in healthcare is exciting, we must address the ethical and safety challenges that come with it. Ensuring patient privacy, minimizing bias in AI algorithms, and maintaining the accuracy of medical advice are key concerns. AI should support—not replace—human judgment, particularly in critical medical decisions. Ensuring that these systems are transparent, accountable, and designed with a safety-first approach is essential to building trust in these technologies.

Looking forward, the role of ChatGPT and similar AI systems will likely expand. We are just beginning to explore its applications in personalized medicine, where AI could help tailor treatments to individual patients based on genetic, environmental, and lifestyle data.
Additionally, AI can contribute to early detection of diseases, drug discovery, and global health initiatives by analyzing trends across populations.

This conference is an opportunity for us to delve into the cutting-edge research, emerging applications, and ethical considerations surrounding the use of ChatGPT in medical practice. Together, we will explore both the current capabilities and the future possibilities, while also addressing the challenges that come with integrating AI into such a sensitive and high-stakes field.

Matjaž Gams, Monika Simjanoska Misheva, Stevo Lukić, Franz Wotawa
ChatGPT in Medicine chairs

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE

Matjaž Gams; Monika Simjanoska Misheva; Stevo Lukić; Franz Wotawa; Žiga Kolar

Automatic Reviewing of Conference Papers in Healthcare and Other Sciences Using ChatGPT

Matjaž Gams, Department of Intelligent Systems, Jozef Stefan Institute, Ljubljana, Slovenia, matjaz.gams@ijs.si
Luka Bolješić, Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia, luka.boljesic0@gmail.com
Alex Korlat, Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia, alex.korlat.ak@gmail.com
Primož Kocuvan, Department of Intelligent Systems, Jozef Stefan Institute, Ljubljana, Slovenia, primoz.kocuvan@ijs.si

Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia. © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.chtm.3

ABSTRACT

The emergence of artificial intelligence (AI) has significantly impacted various fields, including the peer review process in scientific research. ChatGPT, a large language model developed by OpenAI, has shown potential in automating and enhancing the review of conference healthcare papers. Our HealthReview AI tool can process and analyze large volumes of text rapidly, providing feedback and insights that streamline the peer review process, reduce human workload, and increase efficiency. This paper presents a web application developed using the Flask framework that enables users to upload PDF files containing research papers and utilizes ChatGPT to generate reviews for each paper. The methodology, results, and potential implications of this application are discussed, highlighting both the advantages and the challenges of integrating AI into the academic review process.

KEYWORDS

Artificial Intelligence, ChatGPT, Peer Review, Healthcare, Scientific Research, Flask Framework, PDF Processing, Academic Writing, Conference Papers

POVZETEK

Pojav umetne inteligence (UI) je pomembno vplival na različna področja, vključno s postopkom strokovnega pregleda v akademskih in znanstvenih raziskavah. ChatGPT, velik jezikovni model, ki ga je razvil OpenAI, je pokazal potencial za avtomatizacijo in izboljšanje pregleda medicinskih konferenčnih prispevkov. To orodje UI lahko hitro obdela in analizira velike količine besedil ter zagotovi povratne informacije in vpoglede, ki poenostavijo postopek strokovnega pregleda, zmanjšajo delovno obremenitev in povečajo učinkovitost. Ta članek predstavlja spletno aplikacijo HealthReview, razvito s pomočjo ogrodja Flask, ki uporabnikom omogoča nalaganje datotek PDF, ki vsebujejo raziskovalne prispevke, in uporablja ChatGPT za generiranje pregledov za vsak prispevek. Obravnavane so metodologija, rezultati in možni vplivi te aplikacije, ki poudarjajo tako prednosti kot izzive integracije UI v akademski pregledni postopek.

KLJUČNE BESEDE

Umetna inteligenca, ChatGPT, strokovni pregled, zdravstvo, znanstvene raziskave, Flask ogrodje, obdelava PDF, akademsko pisanje, konferenčni prispevki
1 Introduction

The integration of artificial intelligence (AI) across various domains is significantly transforming complex processes, including the peer review of conference papers in healthcare and other sciences. Among AI tools, ChatGPT, developed by OpenAI, stands out for its potential to automate and enhance the review process. Its ability to quickly analyze large volumes of text and provide insightful feedback could streamline peer reviews, reduce human workload, and enhance overall efficiency.

Recent studies highlight the diverse applications of ChatGPT in healthcare education, research, and practice. For example, it has been shown to improve scientific writing, analyze datasets, and aid drug discovery [1, 2, 3]. Additionally, its role in generating paraphrased content and literature reviews indicates the potential to expedite academic tasks, although concerns about originality and accuracy persist [4, 5]. ChatGPT's integration into medical literature reviews has been explored, demonstrating its ability to synthesize medical knowledge, though ethical and accuracy issues require further research [6, 7]. Beyond healthcare, ChatGPT enhances research efficiency across various scientific fields. It effectively generates Boolean queries for systematic reviews and supports rapid literature searches [8]. The AI's potential to streamline peer reviews and address biases is also evident, though managing issues like bias, plagiarism, and inaccuracies remains crucial to maintaining academic integrity [9, 10, 11].

In our opinion, tools like ChatGPT offer significant opportunities to enhance the peer review process. However, careful deployment is necessary to ensure ethical considerations, accuracy, and the preservation of academic integrity. This paper explores these aspects and presents insights into effectively integrating ChatGPT into the peer review process designed with our system HealthReview. The paper consists of Section 1 (Introduction) and Section 2 (Methodology), where the system is presented; results are demonstrated in Section 3, and the paper concludes with a discussion.
2 Methodology

When tested, GPT-4o was already able to reply to all basic questions, e.g. "Is grammar in that text correct?". Therefore, the first task was to create a list of commands to perform the sequence of the review, chosen for the Information Society conference (is.ijs.si). The second task was to fine-tune the process, since several output issues were not as desired. The third task was to include additional knowledge, which was executed by including the Insieme knowledge base as the core medical information. In this way, the HealthReview performance achieved reasonable levels.

The Insieme platform was selected for integration with GPT [12]. This platform features a user-friendly interface that enables users to efficiently access valuable healthcare information from a centralized website, either via manual search or through the Insieme search function. Insieme serves as the successor to the national Electronic and Mobile Health (eHealth) initiative, a project that was characterized by collaboration among 15 partners. Furthermore, the platform's development has been significantly shaped by insights derived from the analysis of various EU healthcare platforms, particularly those that specialize in elderly care. Insieme not only builds upon the foundation laid by these prior initiatives but also aims to set a new standard in the accessibility and usability of healthcare information, thereby providing a more integrated and cohesive experience for its users. Additional medical knowledge therefore comes from the Insieme database, which was created from a national electronic and mobile project for smart cities. Including the Insieme platform into the GPT is described elsewhere and is not part of this paper.

The HealthReview web application is developed in Python using the Flask framework. Flask is a lightweight web framework for the Python programming language, used for building fast and stable web applications [13]. This application allows users to upload one or several PDF files containing multiple research papers, processes these files to extract the individual papers, and then employs ChatGPT to generate reviews for each paper.

The application is initialized and configured to use an upload folder named 'uploads'. This folder is created if it does not exist, to ensure that uploaded files have a designated storage location. To maintain file security, the application restricts uploads to PDF files by specifying an allowed file extension set. The user interface of the application consists of an HTML form rendered by the index route. This form allows users to upload their PDF files through a file input and a submit button. Client-side JavaScript enhances the user experience by displaying the selected file name and showing a loading indicator upon form submission.

Upon file upload via the /submit route, the PDF is securely saved in the upload folder. The file function ensures that only files with a .pdf extension are accepted. Once the file is stored, the papers_strArr function extracts the text of individual papers from the PDF. This function searches for specific markers like "ABSTRACT" and "Author index" to identify the boundaries of each paper. Text is extracted from each page and grouped into separate papers based on these markers. If no papers are detected, an error message is returned. Otherwise, the program proceeds to each extracted paper.

The API call function facilitates the interaction with the ChatGPT model. It sends the text of each paper to the ChatGPT API, along with a predefined prompt, and retrieves the model's generated review. These reviews are stored in a global dictionary RESULTS with unique identifiers for each paper. The /paper/ route enables users to view the generated reviews. When a specific paper ID is requested, the corresponding review content is fetched from the RESULTS dictionary and displayed using the paper.html template. Navigation links allow users to browse through the list of papers and view their respective reviews.

HTML templates, including form.html and paper.html, provide the user interface for uploading files, displaying instructions, and viewing individual paper reviews. The form.html template includes the file upload form and instructions for use, while the paper.html template displays individual paper reviews and includes navigation for switching between different papers. This program integrates Flask for web functionality, PDFMiner for PDF text extraction, and OpenAI's ChatGPT-4o for generating paper reviews. This combination offers a streamlined and efficient method for reviewing conference papers, potentially enhancing the speed and consistency of the review process, which is crucial in managing the increasing volume of research submissions. HealthReview accepts input in any language and outputs the replies in any language supported by ChatGPT-4o.
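For illustration, the following is a minimal sketch of the pipeline just described, reusing the /submit and /paper/ routes, the papers_strArr splitting markers, and the global RESULTS dictionary named in the text. The review prompt, the api_call helper, and the ID scheme are illustrative assumptions, not the authors' actual code.

```python
import os
import uuid

from flask import Flask, request, render_template, abort
from werkzeug.utils import secure_filename
from pdfminer.high_level import extract_text
from openai import OpenAI

app = Flask(__name__)
app.config["UPLOAD_FOLDER"] = "uploads"
os.makedirs(app.config["UPLOAD_FOLDER"], exist_ok=True)  # created if missing

client = OpenAI()  # expects OPENAI_API_KEY in the environment
RESULTS = {}       # global dictionary: paper id -> generated review

# Illustrative prompt; the actual review commands are conference-specific.
REVIEW_PROMPT = ("Review the following conference paper. Provide: overall "
                 "evaluation, overall quality (1-10), presentation score, "
                 "language remarks, and proposed modifications.\n\n")

def papers_strArr(pdf_path):
    """Split the extracted text into individual papers using the
    'ABSTRACT' and 'Author index' markers described in the text."""
    text = extract_text(pdf_path).split("Author index")[0]
    return ["ABSTRACT" + part for part in text.split("ABSTRACT")[1:]]

def api_call(paper_text):
    """Send one paper plus the predefined prompt to the ChatGPT API."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": REVIEW_PROMPT + paper_text}],
    )
    return response.choices[0].message.content

@app.route("/")
def index():
    return render_template("form.html")

@app.route("/submit", methods=["POST"])
def submit():
    upload = request.files["file"]
    if not upload.filename.lower().endswith(".pdf"):
        abort(400, "Only PDF files are accepted")
    path = os.path.join(app.config["UPLOAD_FOLDER"],
                        secure_filename(upload.filename))
    upload.save(path)
    papers = papers_strArr(path)
    if not papers:
        abort(400, "No papers detected in the uploaded PDF")
    for paper_text in papers:
        RESULTS[str(uuid.uuid4())] = api_call(paper_text)  # one review per paper
    return render_template("form.html", paper_ids=list(RESULTS))

@app.route("/paper/<paper_id>")
def paper(paper_id):
    if paper_id not in RESULTS:
        abort(404)
    return render_template("paper.html", review=RESULTS[paper_id])
```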
3 Results

The developed web application HealthReview was tested using a dataset of conference papers to evaluate its performance and efficiency in generating reviews. An example of the application's output is shown below.

Title: Anonymous in this paper
Authors: Anonymous
Abstract: The article focuses on physical inactivity, which increases the risk of chronic diseases, especially cardiovascular diseases, among the adult population as well as among children and youth. Based on data from the Institute for Health Metrics and Evaluation, the article reports relative risks of cardiovascular disease by age group, gender, and country. In doing so, we focus on the comparison between Slovenia and the world average. Based on the analysis, we conclude that cardiovascular diseases impact people in different age segments; as such, they become one of the important risk factors through aging. Also, physical activity importantly influences behavioral, nutrition, and other factors, which contribute to cardiovascular diseases. For that reason, it is very important to systematically analyze and monitor these factors to contribute to better guidelines and measures of policymakers.

Overall evaluation: 1 – Accept with minor changes
(All options: 2 – Accept as is | 1 – Accept with minor changes | 0 – Accept with major changes | -2 – Reject)
The paper provides valuable insights into the relationship between physical inactivity and cardiovascular diseases, particularly in different age segments. However, some minor improvements could elevate the presentation and clarity of the findings.

Overall quality: 7/10
The scientific quality is solid, but minor methodological limitations and the need for more comprehensive data analysis slightly reduce its impact.

Presentation: 3 – Routine
(All options: 5 – Crystal-clear and particularly attractive | 4 – Clear and attractive | 3 – Routine | 2 – Not very clear and/or attractive | 1 – Unclear and/or particularly unattractive)
The presentation is well-structured but lacks visual aids and a more engaging writing style.
Language modifications: Minor grammatical mistakes. The paper includes some minor grammatical mistakes that should be fixed before publication.

Proposed modifications:
- Correct "prerana" to "prehrana" in the "REZULTATI" section.
- Ensure consistency in the use of terms such as "risk factors" and "prevalence" across sections.
- Improve the translation and readability of some sentences for better clarity.
- Consider adding graphical representations of data to enhance interpretability.

Suggestion: Use tools like Grammarly or GPTs.

Reviewer's confidence: 4 – High
(All options: 5 – Expert | 4 – High | 3 – Medium | 2 – Low | 1 – None)

Excellence: 3 – Definite but modest methodological novelty and/or good application
(All options: 5 – Important methodological novelty | 4 – Definite methodological novelty | 3 – Definite but modest methodological novelty and/or good application | 2 – Debatable methodological novelty and/or acceptable application | 1 – No methodological novelty and poor application)

Impact: 3 – Used/useful and moderately mature
(All options: 5 – Highly impactful application and/or very widely used, at least moderately mature | 4 – Important application and/or widely used, at least moderately mature | 3 – Used/useful and moderately mature | 2 – Used/useful in limited cases and/or immature | 1 – No practical use)

The HealthReview web application successfully generated reviews for multiple conference papers. Table 1 presents a summary of the evaluations for the conference section. The paper number is a serial number assigned to each paper in the conference dataset. Examples of the meanings of the other fields are listed in this section.

4 Discussion

The results of our study demonstrate the promising potential of integrating AI, specifically ChatGPT, into the peer review process for healthcare and medical conference papers. The application that was developed effectively generated insightful reviews, which were evaluated against traditional human-generated reviews for quality and consistency. While the overall performance of the AI-based review system was favorable, several considerations and implications warrant further discussion. First, the ability of ChatGPT to process and analyze large volumes of text rapidly offers a significant advantage in terms of efficiency. This is particularly beneficial in the context of increasing research output and the growing burden on peer reviewers. However, while the AI-generated reviews were generally accurate and aligned with human assessments, there were instances where the feedback provided by ChatGPT lacked depth, particularly in areas requiring domain-specific expertise. This highlights a limitation of current AI technologies, where the ingenuity, flexibility, and expertise of human reviewers are still crucial.

It is not clear to what extent the automatic reviewing system applies to all domains, not only healthcare. In any case, when adopting a specific form of review, the instructions that form part of the prompt programming should be modified. In practical terms, it should also be noted that GPT API calls are not free of charge.

Additionally, the application of AI in the peer review process raises ethical concerns, particularly around the potential for bias, the risk of plagiarism, and the integrity of the review process. Although ChatGPT can streamline the review process, these tools must be used as supplements rather than replacements for human reviewers. Maintaining a balance between AI efficiency and human oversight is critical to preserving the integrity and quality of academic peer reviews. Moreover, the reliance on AI for academic tasks necessitates continuous monitoring and updates to the AI models to ensure accuracy, relevance, and fairness. Future developments should focus on enhancing the contextual understanding of AI tools like ChatGPT to better mimic the critical thinking and analytical capabilities of human reviewers.
Table 1: Summary of Reviews

Paper number | Overall evaluation | Overall quality | Presentation | Language (grammatical mistakes)
1  | 2/4 | 6/10 | 2/5 | Quite some
2  | 3/4 | 7/10 | 3/5 | Minor
3  | 3/4 | 7/10 | 3/5 | Some
4  | 3/4 | 7/10 | 3/5 | Quite some
5  | 3/4 | 7/10 | 3/5 | Some
6  | 3/4 | 7/10 | 3/5 | Minor
7  | 3/4 | 8/10 | 3/5 | Minor
8  | 3/4 | 8/10 | 3/5 | Some
9  | 3/4 | 7/10 | 3/5 | Minor
10 | 3/4 | 7/10 | 3/5 | Minor
11 | 3/4 | 8/10 | 4/5 | Some
12 | 3/4 | 7/10 | 3/5 | Minor
13 | 4/4 | 9/10 | 4/5 | Minor

In conclusion, the inclusion of HealthReview, i.e., an additional automated review layer, introduces several advantages, such as increased objectivity and the potential to generate supplementary suggestions, further enriching the review process. AI tools like ChatGPT offer substantial potential to enhance the peer review process. However, their successful integration requires careful implementation and continuous evaluation to effectively address inherent challenges and ensure that these tools make a meaningful contribution to academic research. In any case, the automatic review by HealthReview or any other review tool should be marked in a way that explicitly denotes the source and type of the reviewing tool.

5 Acknowledgments

This project is funded by the European Union under Horizon Europe (project ChatMED, grant agreement ID: 101159214).

REFERENCES

[1] R. Garg, V. L. Urs, A. A. Agrawal, S. K. Chaudhary, V. Paliwal, and S. K. Kar. 2023. Exploring the role of ChatGPT in patient care (diagnosis and treatment) and medical research: a systematic review. Health Promotion Perspectives, 13, 183–191. https://hpp.tbzmed.ac.ir/Article/HPP_19952_20230501034121
[2] P. Y. Lee, H. Salim, A. Abdullah, and C. H. Teo. 2023. Use of ChatGPT in medical research and scientific writing. Malaysian Family Physician, 18, 58. https://www.e-mfp.org/ojs3/index.php/MFP/article/view/1213
[3] F. Tustumi, N. Andreollo, and J. E. Aguilar-Nascimento. 2023. Future of language models in healthcare: the role of ChatGPT. Arquivos Brasileiros de Cirurgia Digestiva: ABCD, 36. https://doi.org/10.1590/0102-672020230002e1800
[4] Ö. Aydın and E. Karaarslan. 2022. OpenAI ChatGPT generated literature review: digital twin in healthcare. In Emerging Computer Technologies 2, Ö. Aydın, editor. İzmir Akademi Dernegi, 22–31. https://doi.org/10.5281/zenodo.7094305
[5] S. Ruksakulpiwat, A. Kumar, and A. Ajibade. 2023. Using ChatGPT in medical research: current status and future directions. Journal of Multidisciplinary Healthcare, 16, 1513–1520. https://doi.org/10.2147/JMDH.S419830
[6] M. Sallam. 2023. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare, 11. doi: 10.3390/healthcare11060887
[7] O. Temsah et al. 2023. Overview of early ChatGPT's presence in medical literature: insights from a hybrid literature review by ChatGPT and human experts. Cureus, 15. doi: 10.7759/cureus.37281
[8] S. Wang, H. Scells, B. Koopman, and G. Zuccon. 2023. Can ChatGPT write a good Boolean query for systematic review literature search? In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. doi: 10.1145/3539618.3591703
[9] Biswas, D. Dobaria, and H. L. Cohen. 2023. ChatGPT and the future of journal reviews: a feasibility study. The Yale Journal of Biology and Medicine, 96, 415–420. https://pubmed.ncbi.nlm.nih.gov/37407690/
[10] D. Solomon, K. Allen, P. Katz, A. Sawalha, and E. Yelin. 2023. ChatGPT, et al. Artificial intelligence, authorship, and medical publishing. Arthritis & Rheumatology, 75. doi: 10.1002/art.42497
[11] J. Verharen. 2023. ChatGPT identifies gender disparities in scientific peer review. eLife, 12. https://doi.org/10.7554/eLife.88173
[12] The Insieme platform presentation: https://www.youtube.com/watch?v=ozAporFOw64. Accessed 1.9.2024.
[13] Flask framework documentation: https://flask.palletsprojects.com/en/latest/. Accessed 1.9.2024.
Leveraging Federated Learning for Secure Transfer and Deployment of ML Models in Healthcare

Zlate Dodevski, Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, Republic of N. Macedonia, zlate.dodevski@students.finki.ukim.mk
Tanja Pavleska, Laboratory for Open Systems and Networks, Jozef Stefan Institute, Ljubljana, Slovenia, atanja@e5.ijs.si
Vladimir Trajkovik, Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, Republic of N. Macedonia, vladimir.trajkovik@finki.ukim.mk

Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia. © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.chtm.5

Abstract

Federated learning (FL) represents a pivotal advancement in applying Machine Learning (ML) in healthcare. It addresses the challenges of data privacy and security by facilitating model transferability across institutions. This paper explores the effective employment of FL to enhance the deployment of large language models (LLMs) in healthcare settings while maintaining stringent privacy standards. Through a detailed examination of the challenges in applying LLMs to the healthcare domain, including privacy, security, regulatory constraints, and training data quality, we present a federated learning architecture tailored for LLMs in healthcare. This architecture outlines the roles and responsibilities of participating entities, providing a framework for secure collaboration. We further analyze privacy-preserving techniques such as differential privacy and secure aggregation in the context of federated LLMs for healthcare, offering insights into their practical implementation. Our findings suggest that federated learning can significantly enhance the capabilities of LLMs in healthcare while preserving patient privacy. In addition, we also identify persistent challenges in areas such as computational and communicational efficiency, the lack of benchmarks and tailored FL aggregation algorithms applied to LLMs, model performance, and ethical concerns in participant selection. By critically evaluating the proposed approach and highlighting its potential benefits and limitations in real-world healthcare settings, this work provides a foundation for future research in secure and privacy-preserving ML deployment in healthcare.

Keywords

Federated Learning, Large Language Models, Data Privacy, Healthcare ML, Privacy Preservation, Model Transferability
1 Introduction

The advancements in hardware and software technologies, hyper-connectivity, and the fourth industrial revolution lead to the creation of massive amounts of health-related data. Machine learning and AI, in general, are the biggest winners from this endless pool of structured and unstructured data, as these technologies thrive on large datasets to identify patterns and make predictions. The novel adaptable, predictive approach to generating insights, decision support, and assistance in tasks that have long been considered solely reserved for human expertise is based on this paradigm's capability to recognize patterns from the data without being explicitly programmed. Conventional machine learning implies that the data owner communicates with a specific central server with significant computational power. The central server consumes data from different sources and applies training techniques and algorithms to devise an effective model. ML requires large amounts of data to satisfy the expectations for the model's performance.

Large language models (LLMs), as representatives of ML advancements, have been a particular point of interest in recent years. They have already proven their applicability and massive potential in multiple fields [30]. LLMs are designed to understand, generate, and interact with human-like text and can understand context, making them suitable for performing a wide range of complex language-related tasks. They are trained in two main phases. First, the model learns general knowledge about language patterns in the pre-training phase. Then, it can be fine-tuned to execute downstream tasks to specialize its expertise in a specific domain [28]. However, like other machine learning models researched and implemented, this paradigm is data-hungry, meaning that it inherently requires massive training data to achieve the expected performance [10]. Thus, LLMs are designed to perform better with an increase in training data volume and computational power.

Various unfortunate scenarios related to the misuse of private and personal data cast a shadow on AI's capabilities, underscoring the growing concerns about data privacy, specifically in the phases when the models are trained [13]. The year 2016 is particularly significant for two key developments aiming to overcome these challenges. The first is the attempt to regulate personal data collection, processing, and storage by introducing the General Data Protection Regulation (GDPR) in Europe [12]. The second key development was the introduction of Federated Learning (FL) by Google researchers, which provided a groundbreaking scientific approach to addressing data privacy and security concerns in ML [16].

This paper aims to discuss the possibility of satisfying the needs of both data owners and ML experts by leveraging the concept of federated learning. On the one hand, data owners can be supported to participate in collaborative training in a privacy-preserving manner when their data is insufficient to craft a high-performance model, such as an LLM. On the other hand, ML experts can develop and advance their approaches by utilizing large volumes of real-life institutional data and access to diverse scenarios, which are essential for building a robust model. Being aimed at investigating the FL potential for application in the health domain, the insights presented in this work offer support in finding a more robust, secure, and effective use of AI that does not require technical proficiency of the medical experts, ultimately contributing to improved patient care and data protection.

The remainder of this paper is organized as follows: Sections 2 and 3 introduce FL and LLMs and their relevance to healthcare; Section 4 presents our proposed FL architecture for LLMs in healthcare, detailing the system components and their roles; Section 5 discusses challenges in implementing LLMs in healthcare using FL; and Section 6 explores privacy-preserving techniques for integration with our FL architecture.
Workflows, in the health domain, the insights presented in this work offer dictionaries, specific characteristics of the population it serves, support in finding a more robust, secure, and effective use of AI or particular domain attributes can result in difficulties for off-that does not require technical proficiency of the medical experts, the-shelf LLMs in providing the correct output to the instruction ultimately contributing to improved patient care and data given. This calls for the institutions to further tailor and tune the protection. capabilities of the LLM. Although state-of-the-art LLMs allow The remainder of this paper is organized as follows: Sections for such modifications and fine-tunings and making this process 2 and 3 introduce FL and LLMs, and their relevance to feasible, this comes with a heavy involvement and effort by the healthcare; section 4 presents our proposed FL architecture for institution representatives and with extensive computational LLMs in healthcare, detailing the system components and their resources. Finally, even if one institution is capable of making roles; section 5 discusses challenges in implementing LLMs in efforts to adapt a generalized LLM for its use, it faces the healthcare using FL; and section 6 explores privacy-preserving inevitable obstacle of data insufficiency. In general, a single techniques for integration with our FL architecture. institution either cannot provide enough data to receive proper, correct output for the downstream task or is incapable of solving 2 LLM-based Healthcare Applications instruction of so-called new events or conditions. LLMs are usually trained on high-quality public data, but A healthcare institution would need support in multiple areas their performance is often limited when tasked with specialized to make the process easier to follow and adopt. As a result, or narrower-spectrum tasks. With specific expertise in mind, processes related to finding a suitable LLM model, maintaining such as healthcare, different approaches should be considered to it, and keeping it up to date should be outsourced to a separate build that corpus. Healthcare institutions can use their local body owning the expertise. To effectively adapt LLMs in the datasets, leading to less efficient LLM, or they can join other healthcare domain, collaboration among institutions in collaborative efforts to make high-quality training data facing the compliance with the industry regulations should be established inevitable challenges of privacy and regulations. to build a rich training corpus. LLMs are attractive in the healthcare area because of their capability to simplify the interaction with an intelligent system 3 FL Principles Relevant to Healthcare Data without needing technical expertise. Lack of technological Privacy and Security proficiency of medical experts can decrease the adoption of a In healthcare, data is often distributed across multiple specific software solution and can be marked as overly complex. institutions, each possessing unique and valuable patient The core principles of the LLMs allow users to interact with their information. Traditional approaches to AI model training require domain rules, persistent knowledge, and past experiences centralizing this data, which poses significant privacy and without the need to rely on their computer literacy. One of the security risks. 
3 FL Principles Relevant to Healthcare Data Privacy and Security

In healthcare, data is often distributed across multiple institutions, each possessing unique and valuable patient information. Traditional approaches to AI model training require centralizing this data, which poses significant privacy and security risks. Federated learning provides a solution by enabling collaborative model training without exchanging raw data. Instead, each institution trains the model locally and shares only aggregated updates with a central server. This method ensures that sensitive patient data remains within the institution, facilitating the transfer and deployment of AI models across different settings without compromising data security. FL is an iterative process, and each communication round aims to improve the model's performance. A typical FL scenario consists of two main phases in each round: local training of the model, done on the participant side, and aggregation of updates, which aims to create the most accurate consensus model.

There are three main types of FL based on how the data is distributed across participants. In horizontal federated learning, the datasets share the same feature space but differ in the samples they contain. Vertical federated learning, on the other hand, involves datasets with the same samples but different features. Lastly, federated transfer learning encompasses datasets that vary in both their feature and sample spaces [8].

FL in healthcare is predominantly covered in theoretical research, with many studies exploring its potential, such as for improved personalized treatment and public health monitoring. However, there are real-life applications, such as in radiology, where FL enables collaborative training on medical images like MRIs and X-rays without sharing patient data [23].
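As a toy illustration of the two phases of a communication round described above, the sketch below simulates horizontal FL across three hospitals: raw data never leaves an institution, only locally updated weights are returned, and the server forms the consensus model with a data-size-weighted average (FedAvg, revisited in Section 4.3). All names and numbers are hypothetical, and the local training step is a stand-in for real optimization.

```python
import numpy as np

def local_training(global_weights, private_data):
    """Local phase (placeholder): in practice this runs e.g. SGD on the
    institution's private data; only the updated weights are returned."""
    update = np.random.randn(*global_weights.shape) * 0.01  # stand-in for a gradient step
    return global_weights - update, len(private_data)

def aggregate(updates):
    """Server phase: FedAvg-style consensus, weighting each institution's
    update by its number of local training samples."""
    total = sum(n for _, n in updates)
    return sum((n / total) * weights for weights, n in updates)

global_model = np.zeros((8, 8))  # toy parameter matrix
hospitals = [np.empty(1200), np.empty(300), np.empty(550)]  # raw data never leaves a site
for communication_round in range(5):  # iterative rounds refine the consensus model
    updates = [local_training(global_model, data) for data in hospitals]
    global_model = aggregate(updates)
```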
In the context of machine learning (ML) applications involving healthcare data, there are three critical vulnerability points that require attention: the data itself, the training of ML models, and the communication and transfer of data. Each area carries specific challenges and risks that must be mitigated to ensure the privacy, security, and efficacy of ML systems in healthcare. Health-related data is inherently complex, with characteristics such as high dimensionality, variance over time, heterogeneity, difficult interoperability, sparsity, and isolation [4]. Protecting the privacy of patients' personal and sensitive health information is crucial. Due to the sensitive nature of healthcare data, security breaches can lead to severe consequences, including identity theft, fraud, and violation of patient confidentiality. Healthcare data often comes from various sources, such as hospitals, clinics, wearable devices, and electronic health records (EHRs). This data is typically non-independent and identically distributed (non-IID), unbalanced, and fragmented across different systems. Additionally, data may be sparse or isolated, making it challenging to build comprehensive patient profiles or conduct large-scale analyses. Federated learning offers a promising approach to overcoming these challenges by allowing ML models to be trained across multiple decentralised data sources while keeping data local. This technique improves data privacy and security by not requiring raw data to be transferred to a central location. In an FL environment, each data controller defines its governance processes and privacy policies. This includes setting conditions for data access, training, and validation phases [3, 7, 19].

Communication between institutions, especially in healthcare, must adhere to strict regulatory requirements, such as the Health Insurance Portability and Accountability Act (HIPAA) in the U.S. and the GDPR in Europe. Thus, privacy-preserving mechanisms should be implemented "by design" and "by default" to ensure that sensitive patient information is processed securely. A compliant ML system requires secure data transfer mechanisms, consent management, and audit trails. The FL setting can offer advancement in this area by letting institutions keep sensitive information, preventing unnecessary data transfers and processing that could violate regulatory requirements, and minimizing the risk of data breaches [27].

Training ML models with healthcare data presents unique challenges, including addressing data bias, limited sample sizes, and ensuring model performance. Healthcare data may be biased due to demographic imbalances, socio-economic factors, or varying levels of care access across populations. Training ML models with diverse datasets enhances their generalizability and robustness. By incorporating data from various sources and populations, models can better adapt to new and unforeseen health events, improving their predictive power and reliability. Federated learning, in particular, enables the use of diverse datasets while maintaining privacy, thus improving overall model performance [21, 17].

4 FL Architecture for LLM-Based Healthcare Applications

Figure 1 depicts the three major components of a typical FL architecture. The participants involved in our cross-silo FL setting are the healthcare institutions, the manager (e.g., aggregation server or global server), and the communication-computation layer, which aggregates local updates and orchestrates communication phases in the ecosystem. Each component has its own responsibilities, which are essential for the model to satisfy the preset expectations.

[Figure 1: Typical FL architecture that can be deployed for LLM-based applications]

Leveraging FL in utilizing LLMs adds a layer of complexity and implies different approaches based on the level of decentralization that needs to be achieved [29]. FL can help in both the pre-training and fine-tuning phases of an LLM, and it is up to the specific characteristics of the requirements and the computational power of the parties involved to choose the right strategy [2, 11]. We will cover the different approaches while examining the three major architecture components.

4.1 Global Server

The global server plays a central role, and due to the lack of a properly established taxonomy and standards, this component is also considered in the literature as the manager or creator of the whole ecosystem. It is often referred to as the "manager" or "aggregation server" tasked with overseeing the entire collaboration and ensuring its smooth functioning. In healthcare, the manager can be a single healthcare institution that holds a lot of data and wants to leverage the FL setting to collaborate with other institutions, either to complete the missing domains and dimensions (by utilizing vertical FL) or to enrich and expand the feature set in the same dimension (horizontal FL). The global server's responsibilities can be broken down into several distinct areas: the strategy for choosing a foundation model, the strategy for exploiting the data distribution, and client selection.

The foundation model, also called the base model, forms the initial point of the LLM training in the FL setting. It represents a starting point for institutions to leverage the pre-training process with their own data or fine-tune it to perform specific downstream tasks, since foundation models are usually trained on publicly available datasets. The pre-training process is computationally expensive and time-consuming, meaning that the global server must find the most suitable scenario for satisfying the requirements.
There are two main kinds of pre-training models: one based on the BERT model and the other on the GPT model. Both perform differently for different tasks and scenarios [5, 22]. Many attempts are made to use publicly available literature specific to the medical domain and create ready-made LLM models for usage, such as BioBERT, which shows superior performance compared to general pre-trained models [15].

One approach is to select a suitable foundation model based on the options examined before, or to initialize FL pre-training, where each party will contribute to the pre-training of the foundation LLM model. The latter approach requires the institutions to have expertise and sufficient computational power in their infrastructure to complete the assignment. Another consideration is the model's size and complexity, which will influence the following steps if not chosen according to the participants' IT infrastructure.

Federated learning can work differently depending on how data is spread and distributed across entities. In horizontal FL, each institution has data with similar features (for example, multiple hospitals with similar patient data). In vertical FL, institutions have different features for the same set of patients (for example, one entity has clinical data, and another has genetic data). Transfer learning can also be used when the model needs to generalize across different datasets [20]. The global server is responsible for choosing the appropriate strategy based on the data distribution and the desired outcomes.

Client selection in FL refers to choosing which institutions participate in each training round. The global server must ensure that diverse institutions contribute to model updates without overloading the communication system. Institutions with more data or better computational resources might participate more frequently. Still, the system should be flexible enough to rotate clients or dynamically adjust client participation based on resource availability.

4.2 Participants

The entities participating in this collaboration technique are also tasked with significant responsibilities. In the medical domain, these institutions own huge structured or unstructured datasets and are willing to participate in a distributed training process. Their responsibilities can be broken down into the pre-training process, fine-tuning, value alignment, and the strategy for local updates.

If the FL-specific training approach is adopted, as discussed previously in the strategy for choosing the foundation model, then each institution may pre-train the selected model on its data and ensure that the initial model updates sent to the global server are more relevant and valuable. This capability in an FL setting can depend significantly on the institution's computational power, and even though possible and theoretically feasible, it can require a lot of additional expertise for the healthcare institution to engage in this kind of activity.

Fine-tuning is crucial in adopting an LLM in the healthcare domain. The approach of LLM fine-tuning is to instruct the downstream tasks required by the domain with human feedback [18]. Each institution should provide input-output pairs where instructions are explicitly offered to solve some already defined downstream tasks. These datasets are designed to give the model an idea of what kind of output is expected. The expectation is that the LLM will learn to generalize and can handle novel instructions even though they were not part of the fine-tuning instruction dataset.

The variety of downstream tasks that LLMs can perform in the healthcare domain is often the critical reason institutions engage with this concept. Based on a benchmark for generalist biomedical AI, some of the most frequently performed medically relevant tasks suitable for the LLM domain are question answering, visual question answering (for example, based on radiology or pathology images), report summarization and generation, and medical image and medical document classification [26]. Additionally, relation extraction in combination with named entity recognition can be added to the list of medically relevant tasks. This is helpful in the medical domain to extract medical terms such as diseases, conditions, procedures, and symptoms from unstructured data and find suitable interpretations and connections in the unstructured data pool.

Thanks to LLaMA, each institution can make significant attempts to build its domain-specific instruction set and contribute to global instruction tuning in the FL setting. With the FL paradigm, each downstream task can be trained on multiple datasets instead of a single dataset, giving more suitable responses and outputs [25].

In the FL setting, the value-alignment step occurs on the participant's side during local training. Its purpose is to ensure that the model's objectives are aligned with each institution's values and goals. This step is particularly crucial in the medical field, where ethical guidelines and patient care standards are of the utmost importance. Technically, value alignment is solved similarly to instruction tuning, with each participant's preference dataset containing combinations of an instruction, a preferred response, and a dispreferred response.
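To make the two kinds of local datasets concrete, the records below show the typical shape of an instruction-tuning pair and a value-alignment preference triple. The content is entirely hypothetical and the field names are illustrative, not a prescribed schema.

```python
# Instruction tuning: explicit input-output pairs for a defined downstream task.
instruction_record = {
    "instruction": "Summarize this discharge note for the referring physician.",
    "input": "Patient admitted with chest pain; troponin negative; stress test normal.",
    "output": "Admission for chest pain; acute coronary syndrome was excluded.",
}

# Value alignment: the same instruction format, paired with a preferred and
# a dispreferred response reflecting the institution's care standards.
preference_record = {
    "instruction": "A patient asks whether they may stop their anticoagulant.",
    "preferred": "Advise against stopping without consulting the treating physician.",
    "dispreferred": "Tell the patient it is fine to stop once symptoms disappear.",
}
```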
Federated learning can work differently depending on how data is spread and distributed across entities. In horizontal FL, each institution has data with similar features (for example, multiple hospitals with similar patient data). In vertical FL, institutions have different features for the same set of patients (for example, one entity has clinical data and another has genetic data). Transfer learning can also be used when the model needs to generalize across different datasets [20]. The global server is responsible for choosing the appropriate strategy based on the data distribution and the desired outcomes.

Client selection in FL refers to choosing which institutions participate in each training round. The global server must ensure that diverse institutions contribute to model updates without overloading the communication system. Institutions with more data or better computational resources might participate more frequently. Still, the system should be flexible enough to rotate clients or dynamically adjust client participation based on resource availability, as sketched below.
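A minimal sketch of such resource-aware but rotating client selection follows; the names, the weighting rule, and the 0.5 damping factor are illustrative assumptions, not a prescribed algorithm.

import random

def select_clients(clients: dict[str, float], k: int, seen: set[str]) -> list[str]:
    """Pick k clients per round, weighting by available resources but
    damping clients that participated last round, so smaller sites
    still rotate in."""
    weights = {c: r * (0.5 if c in seen else 1.0) for c, r in clients.items()}
    pool = list(weights)
    chosen = []
    for _ in range(min(k, len(pool))):
        pick = random.choices(pool, weights=[weights[c] for c in pool])[0]
        chosen.append(pick)
        pool.remove(pick)
    return chosen

# Hypothetical institutions with normalized resource scores.
clients = {"hospital_a": 1.0, "clinic_b": 0.4, "lab_c": 0.7}
round_1 = select_clients(clients, k=2, seen=set())
round_2 = select_clients(clients, k=2, seen=set(round_1))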
4.2 Participants

The entities participating in this collaboration technique are also tasked with significant responsibilities. In the medical domain, these institutions own huge structured or unstructured datasets and are willing to participate in a distributed training process. Their responsibilities can be broken down into the pre-training process, fine-tuning, value alignment, and the strategy for local updates.

If the FL-specific training approach is adopted, as discussed previously in the strategy for choosing the foundation model, then each institution may pre-train the selected model on its data and ensure that the initial model updates sent to the global server are more relevant and valuable. This capability in an FL setting can depend significantly on the institution's computational power, and even though it is possible and theoretically feasible, it can require a lot of additional expertise for the healthcare institution to engage in this kind of activity.

Fine-tuning is crucial in adopting an LLM in the healthcare domain. The approach of LLM fine-tuning is to instruct the downstream tasks required by the domain with human feedback [18]. Each institution should provide input-output pairs where instructions are explicitly offered to solve some already defined downstream tasks. These datasets are designed to give the model an idea of what kind of output is expected. The expectation is that the LLM will learn to generalize and can handle novel instructions even though they were not part of the fine-tuning instruction dataset.

The variety of downstream tasks that LLMs can perform in the healthcare domain is often the critical reason institutions engage with this concept. Based on a benchmark for generalist biomedical AI, some of the most frequently performed medically relevant tasks suitable for the LLM domain are question answering, visual question answering (for example, based on radiology or pathology images), report summarization and generation, and medical image and medical document classification [26]. Additionally, relation extraction in combination with named entity recognition can be added to the list of medically relevant tasks. This is helpful in the medical domain to extract medical terms such as diseases, conditions, procedures, and symptoms from unstructured data and to find suitable interpretations and connections in the unstructured data pool.

Thanks to LLaMA, each institution can make significant attempts to build its domain-specific instruction set and contribute to global instruction tuning in the FL setting. With the FL paradigm, each downstream task can be trained on multiple datasets instead of a single dataset, giving more suitable responses and outputs [25].

In the FL setting, the value-alignment step occurs on the participants' side during local training. Its purpose is to ensure that the model's objectives are aligned with each institution's values and goals. This step is particularly crucial in the medical field, where ethical guidelines and patient care standards are of the utmost importance. Technically, value alignment is solved similarly to instruction tuning, with each participant's preference dataset containing combinations of an instruction, a preferred response, and a dispreferred response.

In FL, participants typically have far fewer computational resources than centralized cloud servers, and fine-tuning all parameters of an LLM can be an obstacle. Parameter-efficient tuning techniques, such as LoRA, are used to address this limitation [6]. Instead of updating the entire pre-trained model weights to obtain local updates, participants modify only specific parameters and send them back to the global server for aggregation, as the sketch below illustrates.
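A minimal PyTorch sketch of the LoRA idea [6]: the pre-trained weights stay frozen, and only a low-rank delta is trained locally and shipped to the server. The class, dimensions, and hyperparameters here are illustrative, not the authors' implementation.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank delta, so
    only A and B (a tiny fraction of the weights) leave the site."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # frozen pre-trained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # base(x) + scaled low-rank correction x @ A^T @ B^T
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
# Only the adapter parameters are shipped to the aggregation server:
local_update = {"A": layer.A.detach(), "B": layer.B.detach()}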
4.3 Communication-Computation Layer

As presented above, the global server is responsible for managing the whole ecosystem, and one of its most complex tasks is related to the communication-computation layer. The global server should manage the aggregation process of local model updates and ensure that the global LLM and its updates are securely transmitted across the system.

Choosing the suitable FL algorithm for combining all findings and improvements made by each participant, in the form of parameter weights, is a step that has attracted many researchers and experts. One of the first and most used algorithms is Federated Averaging (FedAvg), but more sophisticated approaches may be necessary in different scenarios [16]; a minimal aggregation sketch is given at the end of this subsection. The model's performance relies significantly on how model updates are aggregated.

Even though the only data transmitted through the network in an FL setting are the model and its updates, the communication layer is responsible for ensuring that the transfer is done securely and continuously. This component must develop a robust strategy to create an efficient pipeline, from managing real-time data connections and facilitating model utilization to transmitting updated model parameters securely to the central aggregation server. A key consideration for the communication layer is the size and complexity of the used model: large models with huge parameter lists can introduce significant bottlenecks during transmission, especially when dealing with limited bandwidth or less powerful devices. As such, the communication layer must be adept at handling these challenges, ensuring that updates are transferred efficiently without compromising the speed or security of the system.
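A minimal sketch of FedAvg-style aggregation [16] on the server side, weighting each participant's update by its local dataset size (the function name and toy tensors are illustrative):

import torch

def fedavg(updates: list[dict[str, torch.Tensor]],
           sizes: list[int]) -> dict[str, torch.Tensor]:
    """Federated Averaging: weight each client's parameters by its
    local dataset size and sum, yielding the new global parameters."""
    total = sum(sizes)
    return {
        k: sum(w[k] * (n / total) for w, n in zip(updates, sizes))
        for k in updates[0].keys()
    }

# Example with two clients sharing a single-tensor "model":
u1 = {"w": torch.tensor([1.0, 1.0])}
u2 = {"w": torch.tensor([3.0, 3.0])}
global_update = fedavg([u1, u2], sizes=[100, 300])  # -> tensor([2.5, 2.5])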
5 Hype, Vision and Challenges

Implementing LLMs in healthcare using FL presents a set of intertwined challenges when viewed through the lenses of privacy and security. There is a foundational tension between the need for diverse and high-quality data generated by institutions in the specific domain and the importance of protecting sensitive information. FL enables availability of and access to a broader spectrum of data sources while maintaining privacy. Still, the inability to directly act upon raw data can impact the convergence and performance of the model. Data transfer needs in FL, even though minimized to just model updates, still introduce a risk of security attacks. This risk increases with the communication overhead caused by distributing complex and large LLMs.

By introducing a central figure in the architecture, in the form of the global aggregation server, the FL setting for LLMs opens up a single point of failure in the ecosystem. Adversarial attacks can be performed, compromising model integrity, which could lead to data breaches and incorrect outputs.

FL is still a young and immature topic in the context of LLMs. One of the biggest challenges is the lack of benchmarks and comprehensive reviews that can examine a solution's success based on different tasks, architectures, the number of clients, network bandwidth, computational resources, etc. These reviews and benchmarks can further expose security and privacy-preserving issues and initiate proper risk mitigation strategies. Multiple algorithms exist in the literature for aggregating local updates, but no specific algorithm has been proposed or adapted for LLMs.

The analysis of the three major components in the previous section pointed out the responsibilities, approaches, and strategies that need to be considered in order to collaboratively design, implement, train, and utilize an LLM properly. The analysis emphasized that training LLMs in a federated learning setting requires a thoughtful, tailored approach to address the unique challenges. Additionally, there are various approaches to take, depending on factors such as participant resources, data distribution, model size and complexity, and data transfer. Fine-tuning LLMs in FL is a time-consuming and computationally expensive task [6]. The client selection process, in which the ecosystem manager decides which participants should be involved, can raise ethical concerns such as fairness. The purpose of the collaboration is to make the LLM more robust; still, some participants' data volume and computational power can squeeze out institutions that are not on that level but can still add to the diversity and offer unique cultural, ethical, and contextual values. While FL addresses many privacy concerns by design, it also introduces new security considerations that must be carefully managed. Successfully navigating these challenges requires a detailed approach that balances privacy protection, security enhancement, and the pursuit of a practical and robust LLM in healthcare.

6 Privacy-Preserving Techniques

The deployment of LLMs in the healthcare field through FL promises advancements in preparing models to react to given domain-specific downstream tasks. FL can enhance LLMs' effectiveness and proper application while safeguarding patient confidentiality and ensuring regulatory compliance, providing medical professionals greater confidence in adopting these tools.

However, while FL enables collaborative learning without direct data sharing, it is not immune to privacy threats. With this approach, raw data remains local, but the model updates shared during training can still leak information. In addition, LLMs trained with healthcare data could memorize and potentially regenerate sensitive patient information. A privacy breach in this context can cause severe consequences, including exposure of medical history, compromised patient confidentiality, and misuse of sensitive health information [1].

During this collaborative process, the model or its updates could become targets for various attacks. For instance, model inversion attacks performed on the global model might allow the reconstruction of individual patient records. Similarly, membership inference attacks could reveal the presence of specific institutions or patient data in the training, potentially exposing an entire medical history. Malicious participants in the process could poison the model by introducing biases or backdoors, potentially leading to improper results generated by the LLMs [9, 24].

To counter these risks and threats, researchers and practitioners evaluate the effects of several privacy-preserving techniques, such as secure aggregation and differential privacy. Secure aggregation, a cryptographic protocol, allows the central server to observe aggregated results without accessing individual model updates. This approach maintains accuracy but adds significant communication costs. Differential privacy, on the other hand, adds calibrated noise to data or model parameters, offering statistical privacy guarantees; a sketch of the noise-addition step follows at the end of this section. While effective against inference attacks, it may reduce model accuracy and require additional workload in the parameter-tuning process [14].

The choice of privacy-preserving techniques must be made with a thorough understanding of the specific use case, the sensitivity of the data involved, and the potential impacts of privacy breaches. The tailored approach should calibrate the trade-off between model performance and data protection. More robust privacy protection might require limiting the model's access to much-needed data for LLMs to offer a proper answer to a specific task, degrading the model performance and increasing the computational and communication overhead. As research in this field progresses, finding the right balance between privacy, system performance, and efficiency will be crucial for deploying LLMs in healthcare using FL.
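To make the accuracy/privacy trade-off concrete, here is a minimal sketch of differentially private update release in the DP-SGD style: clip each local update to a fixed L2 norm, then add Gaussian noise scaled to that norm. This is an illustrative sketch, not a calibrated mechanism; a real deployment would track the privacy budget (epsilon, delta) explicitly.

import torch

def privatize_update(update: dict[str, torch.Tensor],
                     clip_norm: float = 1.0,
                     noise_multiplier: float = 1.1) -> dict[str, torch.Tensor]:
    """Clip the whole update to clip_norm in L2, then add Gaussian
    noise proportional to that norm. Larger noise_multiplier means a
    stronger privacy guarantee but lower model accuracy."""
    flat = torch.cat([t.flatten() for t in update.values()])
    scale = min(1.0, clip_norm / (float(flat.norm()) + 1e-12))
    return {
        k: t * scale + torch.randn_like(t) * noise_multiplier * clip_norm
        for k, t in update.items()
    }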
7 Conclusion

This paper has explored the potential of FL in enhancing the deployment of LLMs in healthcare settings. By enabling privacy-preserving collaboration, FL allows healthcare institutions to collectively train and improve LLMs without compromising sensitive patient data. This approach not only addresses fundamental privacy concerns but also enhances model performance by leveraging diverse datasets across institutions, potentially improving the generalizability and robustness of LLMs in healthcare applications. To facilitate the implementation of healthcare LLMs with FL, we examined a tailored architectural framework that outlines the roles and responsibilities of participating entities. In addition, challenges and considerations of risks and threats were reviewed, especially in combination with privacy-preserving techniques.

Looking ahead, several areas require further research and development. Optimization of computational and communication efficiency for LLMs, development of standardized benchmarks, establishment of ethical frameworks for participant selection, and exploration of advanced privacy-preserving techniques are crucial for future work.

8 Acknowledgments

This project is funded by the European Union under Horizon Europe (project ChatMED, grant agreement ID: 101159214).

REFERENCES

[1] Aouedi, O., Sacco, A., Piamrat, K. and Marchetto, G. 2023. Handling Privacy-Sensitive Medical Data With Federated Learning: Challenges and Future Directions. IEEE Journal of Biomedical and Health Informatics. 27, 2 (Feb. 2023), 790-803. DOI:https://doi.org/10.1109/JBHI.2022.3185673.
[2] Chen, C., Feng, X., Zhou, J., Yin, J. and Zheng, X. 2023. Federated Large Language Model: A Position Paper. arXiv.
[3] Dasaradharami Reddy, K. and Gadekallu, T.R. 2023. A Comprehensive Survey on Federated Learning Techniques for Healthcare Informatics. Computational Intelligence and Neuroscience. 2023, 1 (Jan. 2023), 8393990. DOI:https://doi.org/10.1155/2023/8393990.
[4] Deep learning for healthcare: review, opportunities and challenges - PMC: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6455466/
[5] Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[6] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. and Chen, W. 2021. LoRA: Low-Rank Adaptation of Large Language Models.
[7] Joshi, M., Pal, A. and Sankarasubbu, M. 2022. Federated Learning for Healthcare Domain - Pipeline, Applications and Challenges. ACM Transactions on Computing for Healthcare. 3, 4 (Oct. 2022), 1-36. DOI:https://doi.org/10.1145/3533708.
[8] Kairouz, P. et al. 2021. Advances and Open Problems in Federated Learning. Foundations and Trends in Machine Learning. 14, 1-2 (Jun. 2021), 1-210. DOI:https://doi.org/10.1561/2200000083.
[9] Kaissis, G.A., Makowski, M.R., Rückert, D. and Braren, R.F. 2020. Secure, privacy-preserving and federated machine learning in medical imaging. Nature Machine Intelligence. 2, 6 (Jun. 2020), 305-311. DOI:https://doi.org/10.1038/s42256-020-0186-1.
[10] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J. and Amodei, D. 2020. Scaling Laws for Neural Language Models. arXiv.
[11] Kuang, W., Qian, B., Li, Z., Chen, D., Gao, D., Pan, X., Xie, Y., Li, Y., Ding, B. and Zhou, J. 2024. FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (New York, NY, USA, Aug. 2024), 5260-5271.
[12] Kuner, C., Bygrave, L.A., Docksey, C. and Drechsler, L. eds. 2020. The EU General Data Protection Regulation (GDPR): A Commentary. (Feb. 2020).
[13] Laurie, G., Jones, K., Stevens, L. and Dobbs, C. 2015. A Review of Evidence Relating to Harm Resulting from Uses of Health and Biomedical Data.
[14] Li, Q., Wen, Z., Wu, Z., Hu, S., Wang, N., Li, Y., Liu, X. and He, B. 2023. A Survey on Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection. IEEE Transactions on Knowledge and Data Engineering. 35, 4 (Apr. 2023), 3347-3366. DOI:https://doi.org/10.1109/TKDE.2021.3124599.
[15] Luo, R., Sun, L., Xia, Y., Qin, T., Zhang, S., Poon, H. and Liu, T.-Y. 2022. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics. 23, 6 (Nov. 2022), bbac409. DOI:https://doi.org/10.1093/bib/bbac409.
[16] McMahan, B., Moore, E., Ramage, D., Hampson, S. and Arcas, B.A. y. 2017. Communication-Efficient Learning of Deep Networks from Decentralized Data. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (Apr. 2017), 1273-1282.
[17] Nguyen, D.C. et al. 2022. Federated Learning for Smart Healthcare: A Survey. ACM Comput. Surv. 55, 3 (Feb. 2022), 60:1-60:37. DOI:https://doi.org/10.1145/3501296.
[18] Ouyang, L. et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems. 35, (Dec. 2022), 27730-27744.
[19] Pfitzner, B., Steckhan, N. and Arnrich, B. 2021. Federated Learning in a Medical Context: A Systematic Literature Review. ACM Transactions on Internet Technology. 21, 2 (Jun. 2021), 1-31. DOI:https://doi.org/10.1145/3412357.
[20] Prayitno, Shyu, C.-R., Putra, K.T., Chen, H.-C., Tsai, Y.-Y., Hossain, K.S., Jiang, W. and Shae, Z.-Y. 2021. A Systematic Review of Federated Learning in the Healthcare Area: From the Perspective of Data Properties and Applications. Applied Sciences. 11, 23 (Dec. 2021). DOI:https://doi.org/10.3390/app112311191.
[21] Qayyum, A., Qadir, J., Bilal, M. and Al-Fuqaha, A. 2020. Secure and Robust Machine Learning for Healthcare: A Survey. arXiv.
[22] Sallam, M. 2023. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare. 11, 6 (Jan. 2023), 887. DOI:https://doi.org/10.3390/healthcare11060887.
[23] Teo, Z.L. et al. 2024. Federated machine learning in healthcare: A systematic review on clinical applications and technical architecture. Cell Reports Medicine. 5, 2 (Feb. 2024), 101419. DOI:https://doi.org/10.1016/j.xcrm.2024.101419.
[24] Thapa, C. and Camtepe, S. 2021. Precision health data: Requirements, challenges and existing techniques for data security and privacy. Computers in Biology and Medicine. 129, (Feb. 2021), 104130. DOI:https://doi.org/10.1016/j.compbiomed.2020.104130.
[25] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E. and Lample, G. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv.
[26] Tu, T. et al. 2024. Towards Generalist Biomedical AI. NEJM AI. 1, 3 (Feb. 2024), AIoa2300138. DOI:https://doi.org/10.1056/AIoa2300138.
[27] Xu, J., Glicksberg, B.S., Su, C., Walker, P., Bian, J. and Wang, F. 2021. Federated Learning for Healthcare Informatics. Journal of Healthcare Informatics Research. 5, 1 (Mar. 2021), 1-19.
[28] Yang, J., Jin, H., Tang, R., Han, X., Feng, Q., Jiang, H., Zhong, S., Yin, B. and Hu, X. 2024. Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond. ACM Trans. Knowl. Discov. Data. 18, 6 (Apr. 2024), 160:1-160:32. DOI:https://doi.org/10.1145/3649506.
[29] Ye, R., Wang, W., Chai, J., Li, D., Li, Z., Xu, Y., Du, Y., Wang, Y. and Chen, S. 2024. OpenFedLLM: Training Large Language Models on Decentralized Private Data via Federated Learning. DOI:https://doi.org/10.48550/ARXIV.2402.06954.
[30] Zhou, C. et al. 2023. A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT. arXiv.

Use and Limitations of ChatGPT in Mental Health Disorders

Jovana Petrović†, Iva Binić, Maša Vacev (Clinic of Psychiatry, University Clinical Center Niš, Serbia), Stevo Lukić† (Clinic of Neurology, University Clinical Center Niš, Serbia)
zdravkovicj91@gmail.com, ivabinic@gmail.com, vacevmasa@gmail.com, slukic@medfak.ni.ac.rs
∗ All authors contributed equally to this research. † Corresponding authors.
Information Society 2024, 7-11 October 2024, Ljubljana, Slovenia. © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.chtm.6

Abstract / Povzetek

ChatGPT is one of the most advanced and rapidly evolving large language model-based chatbots.
It excels in everything from handling simple questions to performing complex medical examinations. While current technology cannot replace the expertise and judgment of skilled psychiatrists, it can assist in the early detection of mental problems, patient evaluations, differential diagnoses, psychotherapy, and the planning and conduct of medical research. Ensuring privacy and adhering to professional, ethical, and legal standards is crucial when processing training data. This is especially important in mental health settings, where disclosing sensitive personal information increases the risk of data misuse and the potential for harmful advice. Current uses of ChatGPT in mental health care are constrained by its design as a general chatbot, rather than a specialized psychiatric tool. Despite this, the model proves useful for handling routine psychiatric and administrative tasks. As GPT technology evolves, it holds significant promise for psychiatry, including integration into diagnostics, psychotherapy, and early detection of mental health issues. To deploy these advancements responsibly and effectively, it is crucial to develop and refine professional ethical standards and practice guidelines.

Keywords / Ključne besede

mental health disorders, large language models, deep learning, ChatGPT

Introduction

ChatGPT has emerged as one of the most advanced and rapidly evolving large language model-based chatbot systems. Its extensive capabilities, ranging from responding to basic inquiries to performing well in complex medical examinations, have garnered significant attention from the global scientific and research communities, prompting ongoing discourse regarding its potential applications across diverse domains [1].

The discourse surrounding the potential applications of ChatGPT in mental health disorders remains relatively underexplored. This work seeks to offer a quick overview of the current state of ChatGPT implementations within the mental health domain, while also projecting future advancements in digital mental health care through the integration and development of ChatGPT technology.
Current use of ChatGPT for mental health disorders

In managing mental health disorders, human contact is especially crucial compared to other medical fields, as it is accompanied by understanding and empathy. This is why, in the most vital aspects of psychiatric work, human relationships remain central [2]. While ChatGPT's practical applications in the field of mental health are limited because it specializes in language generation, it can still support certain routine tasks within the field. Although evaluations, diagnoses, psychotherapy, and patient assessments are mainly conducted by human therapists, ongoing trials are exploring how ChatGPT might be utilized in mental health services. Several platforms, like ChatBeacon or Koko, are already available on the market, claiming to offer mental health assistance powered by ChatGPT [3, 4]. For instance, Koko is a peer-support platform that provides crowdsourced cognitive therapy. It is experimenting with using GPT-3 to generate bot-written responses to users while they wait for peer responses. Koko is an online mental health intervention that has reached nearly two million people, mostly adolescents. The platform started as a clinical trial at MIT and is based on the concept of crowdsourced cognitive therapy: users are taught to help each other think more hopefully about the world. Unlike traditional peer support platforms, all interactions on the service are supported and augmented by AI.

Applied to the classification of psychiatric disorders

Recent advancements in deep learning, the foundational algorithm of GPT, have significantly impacted the field of mental health disorders. This technology has been applied to classify psychiatric disorders using neuroimaging data [5], develop models based on electroencephalograms [6], and utilize a range of patient characteristics for diagnosing and predicting mental disorders [7]. These deep learning models have shown good diagnostic accuracy (AUC 0.74-0.81), suggesting the possibility of combining genetics and registry data to predict both mental disorder diagnosis and disorder progression in a clinically relevant, cross-diagnostic setting prior to clinical assessment. The main limitation of such models is that they are restricted to learning from historical data; they should be continuously assessed and evaluated by trained clinicians and never stand alone in decision-making.

Utilized to alleviate the burdens associated with clinical documentation, communication, and research tasks

New technologies can assist clinicians by allowing them to focus more on direct patient care and alleviating the high clinical workload and bureaucratic tasks - such as handling admissions and managing paperwork - that have been linked to burnout in earlier research [8]. ChatGPT can assist in processing clinical case transcripts, generating summaries, completing medical record documentation, and efficiently preparing discharge summaries. It can also help facilitate communication between clinicians of different specialties when consultations are needed, aiding an integrative approach [9]. While current technology cannot replace the expertise and judgment of skilled psychiatrists, it can assist in generating differential diagnoses based on relevant signs and symptoms. ChatGPT is recognized for its substantial potential to assist experts with clinical and laboratory diagnoses, as well as in planning and conducting medical research [10].
Applied in psychotherapy to enhance therapeutic processes

Chatbots can also be beneficial in psychotherapy. The therapist's emotions and the emotional alignment between therapist and client are crucial factors influencing the process and outcomes of therapy [11]. A study conducted during the COVID-19 pandemic showed that technology can offer an effective method, providing at least a first-level counseling support structure [12]. This suggests that GPT models may potentially develop cognitive empathy over time, making it possible for ChatGPT to achieve a notable level of accuracy in identifying users' emotions [13]; nevertheless, systematic testing is important to ensure a non-superficial comparison between human and artificial intelligences [14]. Notably, ChatGPT currently lacks the capability to accurately assess personality traits, a task that demands extensive training and expertise from psychiatrists. AI researchers are diligently pursuing technical advancements to improve the precision of personality detection [15]. A study designed to assess the accuracy and appropriateness of psychodynamic formulations generated by ChatGPT found that the model produced suitable results even without additional psychoanalytic information. It also demonstrated a strong ability to generate psychodynamic formulations consistent with various psychoanalytic theories when given appropriate instructions [16]. The research underscores that ChatGPT is not intended to substitute psychiatrists or psychologists but rather to function as an initial resource and a first line of support for those dealing with mental distress. When used wisely and within appropriate limits, ChatGPT can be an effective tool in supporting mental health services [17]. However, it is essential to process training data in a manner that ensures privacy protection and adheres to all professional, ethical, and legal standards, given that individuals may be at increased risk of data misuse when disclosing sensitive personal and family information during mental health treatment [18].

Employed for the prevention and early detection of mental health issues

The role of AI in the prevention and early detection of mental problems can also be very significant. Patients frequently turn to ChatGPT to seek information about their symptoms, possible diagnoses, and treatment options. ChatGPT performs better than Google Search in delivering general medical knowledge but scores lower when it comes to providing medical recommendations [19]. A recent study highlighted early success for an AI model that can detect cognitive distortions in text messages with accuracy comparable to that of clinically trained human raters [20]. GPT's ability to recognize mental health warning signs in routine conversations or text-based telemedicine interactions has the potential to improve early and effective intervention strategies when necessary.

Risks and limitations of ChatGPT use for mental health disorders

Despite its strengths and potential, the use of AI technologies in psychiatric clinical practice carries several risks. A significant concern is the phenomenon of "artificial hallucinations", where the conversational model may confidently produce text that is factually incorrect, nonsensical, or misleading [18]. A recent systematic review that included 118 articles identified limitations regarding the potential of ChatGPT in patient care and medical research, noting that the solutions provided by ChatGPT are often insufficient and contradictory, raising concerns about their originality, privacy, accuracy, and legality [10]. It is well established that ChatGPT may generate inaccurate facts and references when summarizing previous research, and the quality of its responses often hinges on how the questions are phrased [21]. Even with the advanced GPT-4 model, there is still a risk of providing harmful advice. The absence of clinical reasoning and experience in ChatGPT can lead to the omission of important clinical details in patient summaries and medical records. Thus, the most prudent approach is to employ AI systems as supplementary tools for mental health professionals, ensuring they are used under close supervision to uphold the safety and quality of patient care [22].
Conclusion

The recent introduction of GPT-4 has significantly enhanced the capabilities of the GPT system. Current implementations of ChatGPT within mental health care are limited by its inherent design as a chatbot, rather than as a specialized AI tool specifically tailored for psychiatric use. Nonetheless, this sophisticated language model demonstrates significant utility in addressing various routine psychiatric and administrative functions. As this technology evolves and advances, we anticipate substantial potential for future applications of GPT technology in psychiatry, including its integration into diagnostic processes, the provision of psychotherapy within clinical environments, and the rapid identification of early warning signs for mental health disorders. Crucially, the development and refinement of professional ethical standards and practice guidelines are imperative for the responsible and effective deployment of these transformative GPT technologies in the mental health sector.

Acknowledgments / Zahvala

The work presented in this paper was partially funded by the European Union under Grant 101159214 - ChatMED. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

References / Literatura

[1] Franco D'Souza R, Amanullah S, Mathew M, Surapaneni KM. (2023) Appraising the performance of ChatGPT in psychiatry using 100 clinical case vignettes. Asian J Psychiatr. 89:103770.
[2] Spencer L, Broome M. (2023) The epistemic harms of empathy in phenomenological psychopathology. Phenom Cogn Sci. doi: https://doi.org/10.1007/s11097-023-09930-1.
[3] ChatBeacon. Mental health assistant powered by ChatGPT. (2024) URL: https://www.chatbeacon.io/industrychatgpt/mentalhealth (Accessed 24-Aug-2024).
[4] Koko. GPT-3 mental health intervention. (2024) URL: https://gpt3demo.com/apps/koko-ai (Accessed 24-Aug-2024).
[5] Quaak M, van de Mortel L, Thomas RM, van Wingen G. (2021) Deep learning applications for the classification of psychiatric disorders using neuroimaging data: Systematic review and meta-analysis. Neuroimage Clin. 30:102584.
[6] de Bardeci M, Ip CT, Olbrich S. (2021) Deep learning applied to electroencephalogram data in mental disorders: A systematic review. Biol. Psychol. 162:108117.
[7] Allesoe RL, Thompson WK, Bybjerg-Grauholm J et al. (2023) Deep learning for cross-diagnostic prediction of mental disorder diagnosis and prognosis using Danish Nationwide register and genetic data. JAMA Psychiatry 80:146-155.
[8] Kumar S. (2007) Burnout in psychiatrists. World Psychiatry 6:186-189.
[9] Ali SR, Dobbs TD, Hutchings HA, Whitaker IS. (2023) Using ChatGPT to write patient clinic letters. Lancet Digit. Health 5:e179-e181.
[10] Garg RK, Urs VL, Agarwal AA, Chaudhary SK, Paliwal V, Kar SK. (2023) Exploring the role of ChatGPT in patient care (diagnosis and treatment) and medical research: A systematic review. Health Promot Perspect. 13(3):183-191.
[11] Chui H, Li X, Luk S. (2022) Therapist emotion and emotional change with clients: Effects on perceived empathy and session quality. Psychotherapy (Chic). 59(4):594-605.
[12] Mawani A, Nderu L. (2020) Towards an online empathy assisted counselling web application. EAI Endorsed Transactions on Context-aware Systems and Applications 7(22):167792.
[13] Kosinski M. (2023) Theory of mind may have spontaneously emerged in large language models. ArXiv abs/2302.02083.
[14] Strachan J, Albergo D, Borghini G et al. (2024) Testing theory of mind in large language models and humans. Nat Hum Behav 8:1285-1295.
[15] El-Demerdash K, El-Khoribi RA, Ismail Shoman MA, Abdou S. (2022) Deep learning based fusion strategies for personality prediction. Egypt Inform J 23:47-53.
[16] Hwang G, Lee DY, Seol S et al. (2024) Assessing the potential of ChatGPT for psychodynamic formulations in psychiatry: An exploratory study. Psychiatry Res. 331:115655.
[17] Arjanto P, Senduk FFW, Nahdiyah U, Utami MS. (2024) AI and ethics in mental health: exploring the controversy over the use of ChatGPT. J Public Health (Oxf). 46(2):e340-e341.
[18] Wei Y, Guo L, Lian C, Chen J. (2023) ChatGPT: Opportunities, risks and priorities for psychiatry. Asian J Psychiatr. 90:103808.
[19] Ayoub NF, Lee YJ, Grimm D, Divi V. (2024) Head-to-Head Comparison of ChatGPT Versus Google Search for Medical Knowledge Acquisition. Otolaryngol Head Neck Surg. 170(6):1484-1491.
[20] Tauscher JS, Lybarger K, Ding X et al. (2023) Automated detection of cognitive distortions in text exchanges between clinicians and people with serious mental illness. Psychiatr. Serv. 74:407-410.
[21] The Lancet Digital Health. (2023) ChatGPT: friend or foe? Lancet Digit Health. 5(3):e102.
[22] Jo E, Song S, Kim JH, et al. (2024) Assessing GPT-4's Performance in delivering medical advice: Comparative analysis with human experts. JMIR Med Educ. 10:e51282.

Testing ChatGPT's Performance on Medical Diagnostic Tasks

Alexander Perko∗, Franz Wotawa∗
alexander.perko@ist.tugraz.at, wotawa@ist.tugraz.at
Graz University of Technology, Institute of Software Technology, Graz, Austria
∗ Both authors contributed equally to this research.
Information Society 2024, 7-11 October 2024, Ljubljana, Slovenia. © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.chtm.7

Figure 1: Semi-Automatic Evaluation of an LLM on Medical Diagnostics Using a Medical Expert System as a Golden Model.

Abstract

Large Language Models and chat interfaces like ChatGPT have become increasingly important recently, receiving a lot of attention even from the general public. People use these tools not only to summarize or translate text but also to answer questions, including medical ones. For the latter, giving reliable feedback is of utmost importance, which is hard to assess. Therefore, we focus on validating the feedback of ChatGPT and propose a testing procedure utilizing other medical sources to determine the quality of feedback for more straightforward medical diagnostic tasks. This paper outlines the problem, discusses available sources, and introduces the validation method. Moreover, we present the first results obtained when applying the testing framework to ChatGPT.

Keywords

Large Language Models, ChatGPT, NetDoktor, Testing, Validation

1 Introduction

Large Language Models (LLMs) are omnipresent in today's society, as they are used by a wide audience for a growing number of tasks. This study sheds light on one area of application in particular: asking for medical diagnoses. Assessing one's health and medical diagnostics are complex tasks that fall into the domain of medical experts. However, since the dawn of search engines and medical websites, like NetDoktor [13], people have turned to the internet for health advice. Previously, users searching for answers had to consult multiple online resources, compare page contents, and evaluate whether their set of symptoms matched what they found. Nowadays it is seemingly easy to find answers in one spot, as LLM-powered chatbots, like ChatGPT [8], are happy to respond with a diagnosis. This, of course, implies much risk of harm or misinterpretation. After all, the very reason many users - being non-experts - turn to chatbots is that they cannot assess symptoms themselves. Clusmann et al. [5] further point out that there is a lack of mechanisms to guarantee that the LLM's output is correct. All of this makes it important to test such systems on a practical level, close to the use cases of non-experts. Owing to its popularity, our evaluation focuses on ChatGPT [8], which is powered by OpenAI's most recent model, GPT-4o [9, 10]. The task of medical diagnostics shares many traits with the natural language processing (NLP) task of question answering (QA). Namely, this task tests for medical knowledge as well as basic reasoning facing medical language.
MedQA [6] is a popular benchmark in the literature, which is tailored to the medical domain. In recent years, open-domain LLMs such as GPT-3.5 [3], GPT-4 [9], and LLaMA-2 [16], as well as domain-specific LLMs like Med-PaLM 2 [15], Meditron [4], and Med-Gemini [14], have been evaluated on medical QA. The United States Medical Licensing Examination (USMLE) part of MedQA is used particularly often as a performance indicator in this domain. Table 1 shows reported scores of the mentioned LLMs and demonstrates Med-Gemini's and GPT-4's superiority, with GPT-4 performing only marginally worse despite being an open-domain model.

Table 1: LLMs Evaluated on Medical Question Answering. Accuracy Results on the United States Medical Licensing Examination (USMLE) Part of MedQA [6], as Reported in [7, 14, 4, 15].

Model       | Domain-Specific | MedQA USMLE
Med-Gemini  | Yes             | 91.1
GPT-4       | No              | 90.2
Med-PaLM 2  | Yes             | 86.5
Meditron    | Yes             | 75.8
LLaMA-2     | No              | 63.8
GPT-3.5     | No              | 60.2

Alongside ChatGPT's popularity, these results are a major reason why this paper focuses on GPT-4o in particular. This work contributes by introducing a semi-automated validation procedure for medical diagnostics performed with LLMs, using an expert system as a golden model (compare Figure 1). Specifically, we evaluate the performance of ChatGPT powered by GPT-4o with a focus on symptom descriptions in German and compare it to NetDoktor's Symptom-Checker [13], which is curated by medical professionals.
Our setup is guided by the following questions regarding prompting ChatGPT:

• Does ChatGPT provide equivalent diagnoses when presented with the same symptoms as NetDoktor?
• Does the output quality - as measured by the overlap - change when asked for a specific number of "most likely" diagnoses?
• Does the output increase in quality when ChatGPT is queried in English instead of German?

2 Validation Methodology

For the purpose of introducing our methodology, we use myocardial infarction (i.e. heart attack) as a guiding example. According to Statistik Austria's annual report, cardiovascular diseases, which include heart attacks, are the most common cause of death in Austria. The symptoms of a myocardial infarction include:

• Feeling of tightness or constriction
• Feeling of anxiety/panic attacks
• Sudden severe shortness of breath, unconsciousness, or severe dizziness
• Nausea and vomiting
• Blood pressure and pulse drop

These symptoms are now linked to an imaginary person's sex and age to form a persona for whom we want to retrieve diagnoses. Our exemplary set of symptoms shall be linked to an adult man and can be identified by ID 1 in all tables and plots. Besides this exemplary persona, where we first fixed a disease, all other sets of symptoms are picked at random. This can be done due to our assumption of a golden model, which we use as our baseline.
2.1 Golden Model

We use NetDoktor's "Symptom-Checker" [13] as a baseline for our evaluation. Symptom-Checker is a freely accessible medical expert system for retrieving likely diagnoses corresponding to a person's symptoms. The system can be interacted with via a questionnaire but is only available in German. Parts of the questionnaire are static, such as questions regarding sex, age, and selecting the general area of one's body where symptoms occur most prominently, while others adapt to the previously asked questions. The dynamically changing questions always expect an answer from the set: "Yes", "No", and "Skip". According to NetDoktor, the system is continuously validated by medical professionals, is based on the medical database AMBOSS [1], and follows the medical guidelines of professional societies [2]. We assume this expert system to be our golden model, as it comprises curated knowledge of high quality and is fully deterministic. The latter makes it possible to generate a decision tree from a person's (or persona's) interaction with the system that is reproducible across multiple calls¹.

Figure 2 shows the tree generated from the interaction of our exemplary persona having a heart attack. The tree is to be read from top to bottom, starting with the first question as the root node. It should be noted that the very first question "Um wen geht es?" (i.e. "Who is it about?") was always answered with "jemand anderen" (i.e. "somebody else") for this study. Rectangles represent questions and ellipses represent the respective possible answers to choose from. The node at the second-to-last level, denoted "Mögliche Erkrankungen" (i.e. "possible diseases"), symbolizes the retrieval of diagnoses from the database, while the leaf nodes on the bottom level signify the results of the query. In this exemplary case, the questions were answered to correspond to the symptoms of a heart attack for demonstration. However, we can also use Symptom-Checker to automatically and randomly traverse the questionnaire's tree-like structure to retrieve sets of symptoms and corresponding diagnoses. This allows for a scalable framework for comparing other methods against a strong and valid baseline. Sets of symptoms and corresponding "golden" diagnoses are extracted from such a tree as follows: Firstly, for each path from the root node to the bottom-level nodes (i.e. the diagnoses), question-answer pairs are stored in a JSON data structure. Each full path represents one set of symptoms. Secondly, each set of symptoms is summarized in a textual representation in German, taking special care not to lose or add information. This is then translated from German to English. The first rows of Tables 3 and 4 contain the textual descriptions of our example in German and English, respectively. Lastly, the diagnoses provided by the golden model are extracted from the bottom layer (i.e. the leaf nodes) of the tree, which is always a set of three diagnoses. These sets of diagnoses are referred to as NetDoktor diagnoses for the remainder of this paper.

¹ As long as the underlying knowledge base does not change.

Figure 2: Golden Model: Exemplary Decision Tree Based on NetDoktor's Symptom-Checker Questionnaire [13] Filled Out for a Persona Having a Heart Attack. Blue boxes are screenshots from Symptom-Checker corresponding to nodes in the tree. We set a persona and automatically extract a) a set of symptoms and b) a set of diagnoses for each path from the root node to the leaf nodes on the bottom-most level. For this persona, the leaf level lists Herzinfarkt, Cor pulmonale (Pulmonale Hypertonie), and Koronare Herzkrankheit.

2.2 Evaluation Metric

The main evaluation metric used in this work is the overlap of diagnoses as compared to NetDoktor. A set of diagnoses is considered good if it contains a large overlap with the golden model diagnoses of NetDoktor. Since the NetDoktor baseline always yields three diagnoses, the highest overlap any other system can achieve is 3/3. Thus, the score ranges from 0/3 to 3/3. We explicitly do not normalize, although we want to compare sets of diagnoses with varying cardinalities. The reason for this is that yielding more diagnoses should not be penalized (as they might be worth considering, as well), and yielding fewer should not automatically lead to a better score.
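The scoring rule can be stated in a few lines of Python. The golden diagnoses below are those of our heart-attack example (Figure 2); the candidate ChatGPT answers are hypothetical and chosen to show why canonicalization (Section 2.3) matters.

def overlap_score(candidate: set[str], golden: set[str]) -> str:
    """Unnormalized overlap with the three NetDoktor diagnoses.
    Larger candidate sets are not penalized and smaller ones are not
    rewarded, matching the scoring rule above."""
    assert len(golden) == 3, "NetDoktor always yields three diagnoses"
    return f"{len(candidate & golden)}/3"

golden = {"Herzinfarkt", "Cor pulmonale", "Koronare Herzkrankheit"}
# Hypothetical ChatGPT output for the same symptoms:
chatgpt = {"Myokardinfarkt", "Koronare Herzkrankheit", "Angina pectoris"}
print(overlap_score(chatgpt, golden))  # 1/3 without canonicalization;
# after mapping "Myokardinfarkt" -> "Herzinfarkt" it would be 2/3.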
2.3 Equivalence of Diagnoses

This study compares systems designed for direct interaction with humans. These systems' output is presented to the users in natural language. A key feature of medical language is its interchangeable use of semantically equivalent terminology originating from different languages such as Latin, German, or English. Additionally, when talking to patients, medical personnel often have to use simplified terminology, which includes the use of colloquial synonyms, hypernyms, and hyponyms. Hence, the semantic equivalence of diagnoses must be considered to ensure the comparability of different systems.

• Synonyms are terms which can be used interchangeably with one another.
• Hypernyms are superordinate or umbrella terms of a term.
• Hyponyms are subordinate (i.e. more specific) terms of another term.

Table 2 lists diagnoses that are treated as equivalents for this study.

Table 2: Equivalent Diagnoses: Synonyms, Hypernyms, Hyponyms & Translations

Diagnosis               | Equivalents (as occurring in ChatGPT output)
Herzinfarkt             | Myokardinfarkt; Akutes Koronarsyndrom; Myocardial Infarction; Heart Attack
Reiter-Syndrom          | Reaktive Arthritis; Morbus Reiter; Reactive Arthritis; Reiter's Syndrome
Kawasaki-Syndrom        | Kawasaki Disease; Kawasaki Syndrome
Blinddarmentzündung     | Appendizitis
Vorhofflimmern          | Herzrhythmusstörungen
Glutenunverträglichkeit | Zöliakie
Bakterielle Pharyngitis | Mild Bacterial Conjunctivitis with Pharyngitis
Krätze                  | Scabies
Erkältung               | Virale Infekte
Pfeiffer-Drüsenfieber   | Pfeiffersches Drüsenfieber; Mononukleose
Blasenentzündung        | Zystitis; Harnwegsinfektion; Urinary tract infection
Gürtelrose              | Herpes Zoster
Mastopathie             | Fibrozystische Mastopathie
Lipom                   | Lipoma
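Operationally, the equivalences of Table 2 amount to a lookup that maps every surface form to one canonical term before sets are intersected. A sketch using a few of the table's entries (the dictionary shown is a subset; a full implementation would cover the whole table):

# Map every surface form to one canonical diagnosis before comparing sets.
EQUIVALENTS = {
    "myokardinfarkt": "herzinfarkt",
    "akutes koronarsyndrom": "herzinfarkt",
    "myocardial infarction": "herzinfarkt",
    "heart attack": "herzinfarkt",
    "appendizitis": "blinddarmentzündung",
    "zystitis": "blasenentzündung",
    "harnwegsinfektion": "blasenentzündung",
    "urinary tract infection": "blasenentzündung",
}

def canonical(diagnosis: str) -> str:
    key = diagnosis.strip().lower()
    return EQUIVALENTS.get(key, key)

def canonical_set(diagnoses: list[str]) -> set[str]:
    return {canonical(d) for d in diagnoses}

assert canonical("Heart Attack") == canonical("Myokardinfarkt")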
2.4 Sets of Symptoms & Personas

For this evaluation, we retrieved 12 sets of symptoms from NetDoktor - 6 for females and 6 for males - and for each sex, we used all of NetDoktor's 3 age categories (baby/child, adolescent, adult) twice. In addition, we used the exemplary set of symptoms for an adult man having a heart attack, as discussed in the previous section. This yields the 13 sets of symptoms listed in Tables 3 and 4. In Figure 2, the parts of the questionnaire are marked which correspond to the persona and the set of symptoms, respectively. In the following, both terms are used interchangeably.

Table 3: Sets of Symptoms per ID [DE]

1: Ein erwachsener Mann verspürt ein Engegefühl im Brustkorb. Er hat Schmerzen, auch wenn er sich nicht bewegt oder belastet. Außerdem leidet er unter Atemnot, wenn er sich anstrengt.
2: Ein Bub hat gerötete Augen und Fieber. Außerdem hat er Schmerzen beim Wasserlassen.
3: Ein kleiner Junge hat starke Bauchschmerzen, die bei Druck schlimmer werden. Es wurde auch festgestellt, dass er allgemein druckempfindlich ist.
4: Ein jugendlicher Junge verspürt ein Engegefühl oder Druck im Brustkorb. Er bemerkt, dass sein Herz sehr schnell schlägt und unregelmäßig arbeitet. Er fühlt sich oft müde und weniger leistungsfähig.
5: Ein jugendlicher Junge hat an Gewicht verloren und leidet unter anhaltender Müdigkeit. Er hat regelmäßig Durchfall, der besonders voluminös und übelriechend ist.
6: Ein erwachsener Mann hat eine Rachenentzündung und bemerkt gerötete Augen. Es gibt jedoch kein Fieber oder geschwollene Lymphknoten.
7: Ein erwachsener Mann hat entzündliche Hautveränderungen am Unterschenkel, die stark jucken, insbesondere nachts. Es wurde kein Zusammenhang mit Allergien festgestellt.
8: Ein kleines Mädchen hat seit einiger Zeit ihren Appetit verloren, fühlt sich ungewöhnlich müde und hat ungewollt an Gewicht verloren. Es wird auch über verminderten Urinfluss berichtet.
9: Ein kleines Mädchen hat Fieber, eine Rachenentzündung und geschwollene Lymphknoten. Sie fühlt sich abgeschlagen und schwitzt besonders nachts stark.
10: Ein jugendliches Mädchen hat Blut im Urin und Schmerzen beim Wasserlassen. Der Harndrang ist häufig, aber es wird nur eine geringe Urinmenge ausgeschieden. Zudem verspürt sie ein Brennen beim Wasserlassen.
11: Ein jugendliches Mädchen klagt über ausstrahlende Schmerzen im Nackenbereich und hat einen Hautausschlag mit kleinen Bläschen.
12: Eine erwachsene Frau hat Spannungsgefühle in der Brust und tastet schmerzlose Knoten. Die Haut ist nicht gerötet.
13: Eine erwachsene Frau verspürt Druckempfindlichkeit im Oberbauch, die Haut wölbt sich vor und die Region ist geschwollen.

Table 4: Sets of Symptoms per ID [EN]

1: An adult man feels a tightness in his chest. He experiences pain even when he is not moving or exerting himself. Additionally, he suffers from shortness of breath when he exerts himself.
2: A boy has red eyes and a fever. He also has pain when urinating.
3: A little boy has severe abdominal pain, which worsens with pressure. It was also found that he is generally sensitive to pressure.
4: A teenage boy feels a tightness or pressure in his chest. He notices that his heart beats very fast and irregularly. He often feels tired and less capable.
5: A teenage boy has lost weight and suffers from persistent fatigue. He has regular diarrhea that is particularly voluminous and foul-smelling.
6: An adult man has a throat infection and notices red eyes. However, there is no fever or swollen lymph nodes.
7: An adult man has inflammatory skin changes on his lower leg that itch intensely, especially at night. No connection with allergies was found.
8: A little girl has lost her appetite for some time, feels unusually tired, and has unintentionally lost weight. Reduced urine output is also reported.
9: A little girl has a fever, a throat infection, and swollen lymph nodes. She feels weak and sweats heavily, especially at night.
10: A teenage girl has blood in her urine and pain when urinating. The urge to urinate is frequent, but only a small amount of urine is passed. She also feels a burning sensation when urinating.
11: A teenage girl complains of radiating pain in the neck area and has a rash with small blisters.
12: An adult woman has a feeling of tension in her breast and can feel painless lumps. The skin is not reddened.
13: An adult woman feels tenderness in the upper abdomen, the skin bulges, and the area is swollen.

2.5 Model, Prompts & Diagnosis Retrieval

For all of our experiments, we used GPT-4o [9, 10] through ChatGPT [8]. More specifically, we used version GPT-4o-2024-08-06, which was released in August 2024. We evaluate the same model in German and English and denote this with a trailing "[DE]" for German and "[EN]" for English in the respective results. We extended this convention to our golden model NetDoktor as well. The full list of prompts used can be found in the next section, Section 3. All LLM results were retrieved in a zero-shot methodology, without samples or additional context besides the prompt itself. Every symptom description is sent within a new chat to isolate individual queries. However, we cannot guarantee that we are indeed interacting with a "blank slate", as ChatGPT and GPT-4o are both black boxes and our user profile might interfere with the output.
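For illustration, a programmatic equivalent of this querying setup, sketched with the OpenAI Python SDK: this is an assumption on our part, since the study itself used the ChatGPT web interface, and the German prompt wording for the "most likely" variants is paraphrased rather than verbatim.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def diagnose(symptoms: str, n: int | None = None) -> str:
    # Hypothetical wording for the "n most likely" prompts (C/D);
    # passing n=None mirrors the ad-hoc prompts (B/E).
    prompt = symptoms if n is None else (
        f"{symptoms}\nNenne die {n} wahrscheinlichsten Diagnosen."
    )
    # A fresh request per symptom description keeps queries isolated,
    # mirroring the "new chat" rule above.
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",  # version used in this evaluation
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

answer = diagnose("Ein erwachsener Mann verspürt ein Engegefühl im Brustkorb.")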
3 Experimental Evaluation

Figure 1 depicts our experimental setup: NetDoktor is used as a golden model to automatically derive sets of symptoms and corresponding diagnoses, as exemplified in Figure 2. The extracted symptoms are then used as input to the LLM GPT-4o via ChatGPT. ChatGPT diagnoses are then compared to NetDoktor diagnoses to compute an overlap score.

Figure 4 gives an overview of our evaluation results. For each set of symptoms, NetDoktor results are shown, followed by four diagnosis strategies utilizing ChatGPT. The grey bars denote the cardinality of every resulting set of diagnoses. Blue overlays are used to show the overlap between NetDoktor diagnoses and ChatGPT diagnoses. These overlays correspond to the values in Table 5, which comprises the occurrences of overlaps in each category from 0/3 to 3/3. In addition to the 13 sets of symptoms, Figure 4 and Table 5 include averages computed over all sets for easier comparison of the prompt/diagnosis-retrieval methods.
B ChatGPT [DE]: Ad-hoc query sent to ChatGPT using the symptom descriptions in German from Table 3 as is.
C ChatGPT [DE] "3 Most Likely": More elaborate query sent to ChatGPT using the symptom descriptions in German from Table 3, additionally requesting the "3 most likely" diagnoses.
D ChatGPT [DE] "10 Most Likely": More elaborate query sent to ChatGPT using the symptom descriptions in German from Table 3, additionally requesting the "10 most likely" diagnoses.
E ChatGPT [EN]: Ad-hoc query sent to ChatGPT using the symptom descriptions in English from Table 4 as is.

Out of convenience, the letters introduced in this list are used when referring to a specific prompt in the following paragraphs.
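The verbatim prompt texts are part of the replication package; the sketch below merely illustrates, under assumed wording, how the "n most likely" variants (C and D) and the overlap score can be expressed. The prompt phrasing and the example diagnosis sets are hypothetical:

def n_most_likely_prompt(symptoms: str, n: int) -> str:
    # Illustrative paraphrase of prompts C (n=3) and D (n=10), not the
    # exact wording used in the experiments.
    return f"{symptoms}\nNenne die {n} wahrscheinlichsten Diagnosen."

def overlap(netdoktor: set, chatgpt: set) -> int:
    # NetDoktor yields three diagnoses per set of symptoms, so the score
    # per persona ranges from 0/3 to 3/3; in practice, judging semantic
    # equivalence of diagnosis names still requires human assessment.
    return len(netdoktor & chatgpt)

# Hypothetical example for one persona:
nd = {"Blasenentzündung", "Harnwegsinfektion", "Urethritis"}
gpt = {"Harnwegsinfektion", "Nierensteine", "Blasenkrebs"}
print(overlap(nd, gpt))  # -> 1, i.e., a 1/3 overlap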
Table 5: Overlaps of Diagnoses with NetDoktor per Prompt

Score      A      B       C       D       E
3/3        13     0       0       0       0
2/3        0      2       0       3       1
1/3        0      8       9       5       3
0/3        0      3       4       5       9
Avg.       3/3    0.92/3  0.69/3  0.85/3  0.38/3
Avg. [%]   100%   31%     23%     28%     13%

The main takeaway from this evaluation is that none of the used prompts achieves a complete overlap of 3/3 with NetDoktor for any of the personas. Prompt B achieves the highest score, with 0.92/3, i.e., 31%. For our small test set of 13 sets of symptoms, this result comprises an overlap of 2/3 for two and 1/3 for eight sets of symptoms, so the average follows as (2*2 + 8*1)/13 ≈ 0.92. The same prompting strategy in English, denoted by E, yields worse results, with overlaps of 2/3 for one and 1/3 for three sets of symptoms. The two prompts giving ChatGPT the task of answering with the "n most likely" diagnoses likewise do not perform as well as the simple prompt in German: C achieves 1/3 for nine sets of symptoms, and D yields 2/3 for three and 1/3 for five sets of symptoms. This means that prompt D, asking for the "10 most likely" diagnoses, is the runner-up with an average of 0.85/3, i.e., 28%. Surprisingly, the simple prompt in English, E, performs poorest, which contradicts our hypothesis of English prompts performing better.

Apart from the overlaps, other interesting observations can be made on closer inspection of the results: ChatGPT seems to rigorously follow the instruction to generate n diagnoses and, as such, consistently yields 3 diagnoses for prompt C and 10 diagnoses for prompt D. However, it can be doubted that "most likely" is interpreted in a scientifically backed manner, as ChatGPT often does not include even one of the NetDoktor diagnoses and not once includes all of them. Equally interesting is the inclusion of a recommendation to consult a doctor, in one form or another, at the end of every result we received, which is likely "hard-coded" for legal reasons on the part of OpenAI. This can also be seen in Figure 3. Although ChatGPT and GPT-4o are black boxes and LLMs are non-deterministic, we try to document our reported results as well as possible for replication. You can find all of our experimental results as a replication package at https://zenodo.org/doi/10.5281/zenodo.13765345.

Figure 4: Comparison of diagnoses for the symptoms seen in Table 3. For each of the 13 sets of symptoms, one bar per diagnosis retrieval method A–E (A: NetDoktor [DE]; B: ChatGPT [DE]; C: ChatGPT [DE] "3 Most Likely"; D: ChatGPT [DE] "10 Most Likely"; E: ChatGPT [EN]) shows the number of diagnoses (1–10), with blue overlays marking the overlap with NetDoktor.

4 Conclusions
In summary, ChatGPT diagnoses only partially match the diagnoses retrieved from our golden model NetDoktor. ChatGPT diagnoses are mostly well-structured and seemingly valid, but fail to include all NetDoktor diagnoses in any of the tested cases. This holds for all tested prompts and across all tested symptoms. The highest overlap, on average, could be achieved with the simplest prompt in German, giving only a description of the symptoms for a persona. When asked for a specific number of "most likely" diagnoses, ChatGPT always delivered exactly the asked-for number of diagnoses. However, this does not benefit the quality of the output as measured by the overlap metric. Neither does an interaction in English change the output quality for the better. In our tests, ChatGPT always includes a notice to consult a doctor. Human assessment of the diagnoses cannot be fully bypassed by the proposed evaluation methodology. This is due to the immanent presence of semantic equivalence and the medical knowledge necessary to find those equivalences. Although such a task is automatable via LLMs as well, the authors of this paper underline the potential implications for undermining the quality of an evaluation when it is fully automated. While our evaluation reports results achieved using ChatGPT and GPT-4o, the proposed methodologies carry over to other LLMs as well. As part of future work, we want to repeat our experiments at a larger scale to achieve representative results. Additionally, we want to consider stability metrics, as seen in [11]. Another interesting direction is further analysis of the relationship between the prompt (engineering) and the retrieval of matching diagnoses, as well as their stability. Finally, it would be interesting to compile a corpus of medical symptoms corresponding to diagnoses, including named entities and logical abstractions, to perform evaluations as seen in [12] on the medical domain.

Acknowledgements
The work presented in this paper was partially funded by the European Union under Grant 101159214 – ChatMED. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

References
[1] AMBOSS GmbH. 2024. AMBOSS. https://www.amboss.com. Accessed: 2024-09-03.
[2] AWMF. 2024. Arbeitsgemeinschaft der Wissenschaftlichen Medizinischen Fachgesellschaften (AWMF) – Leitlinien. https://www.awmf.org/leitlinien. Accessed: 2024-09-03.
[3] Tom B. Brown et al. 2020. Language models are few-shot learners. arXiv: 2005.14165 [cs.CL].
[4] Zeming Chen et al. 2023. Meditron-70B: Scaling medical pretraining for large language models. arXiv: 2311.16079 [cs.CL]. https://arxiv.org/abs/2311.16079.
[5] J. Clusmann et al. 2023. The future landscape of large language models in medicine. Communications Medicine, 3, 141. doi: 10.1038/s43856-023-00370-1.
[6] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2020. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. arXiv preprint arXiv:2009.13081.
[7] Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. 2023. Capabilities of GPT-4 on medical challenge problems. arXiv: 2303.13375 [cs.CL]. https://arxiv.org/abs/2303.13375.
[8] OpenAI. 2023. ChatGPT. chat.openai.com/chat.
[9] OpenAI. 2023. GPT-4 technical report. arXiv: 2303.08774 [cs.CL].
[10] OpenAI. 2024. Introducing GPT-4o and more tools to ChatGPT free users. (May 2024). https://openai.com/index/gpt-4o-and-more-tools-to-chatgpt-free/.
[11] Alexander Perko and Franz Wotawa. 2024. Evaluating OpenAI large language models for generating logical abstractions of technical requirements documents. In Proceedings of the 24th International Conference on Software Quality, Reliability, and Security (QRS). IEEE.
[12] Alexander Perko, Haoran Zhao, and Franz Wotawa. 2023. Optimizing named entity recognition for improving logical formulae abstraction from technical requirements documents. In Proceedings of the 10th International Conference on Dependable Systems and Their Applications.
[13] Jens Richter, Hans-Richard Demel, Florian Tiefenböck, Luise Heine, and Martina Feichter. 2024. Symptom-Checker. https://www.netdoktor.at/symptom-checker/. Accessed: 2024-09-03.
[14] Khaled Saab et al. 2024. Capabilities of Gemini models in medicine. arXiv: 2404.18416 [cs.AI]. https://arxiv.org/abs/2404.18416.
[15] Karan Singhal et al. 2023. Towards expert-level medical question answering with large language models. arXiv: 2305.09617 [cs.CL]. https://arxiv.org/abs/2305.09617.
[16] Hugo Touvron et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv.
HomeDOCtor App: Integrating Medical Knowledge into GPT for Personal Health Counseling

Matic Zadobovšek (matic.zadobovsek@gmail.com), Faculty of Computer and Information Science, Ljubljana, Slovenia
Primož Kocuvan (primoz.kocuvan@ijs.si), Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia
Matjaž Gams (matjaz.gams@ijs.si), Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia

© 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.chtm.8

Abstract
The escalating workload demands on healthcare professionals are leading to systemic overload, resulting in a decline in the efficiency of public health services. This situation necessitates the development of solutions that can alleviate the burden on physicians while ensuring comprehensive patient care. Recent advancements in generative artificial intelligence, particularly in the field of medicine, have demonstrated that large language models (LLMs) can outperform doctors in specific tasks, highlighting their potential as valuable tools for reducing the strain on healthcare providers. This study focuses on the development of the HomeDOCtor application, which integrates additional validated medical knowledge into the GPT-4o LLM. The objective of this application and the enhanced LLM is to offer users reliable access to a medical chatbot capable of providing accurate and timely responses to health-related inquiries. The chatbot's behavior has been meticulously tested and refined in collaboration with a team of physicians. The findings of this research offer insights into the development of such systems and explore their potential application within the Slovenian healthcare system.

Keywords
large language models, GPT-4o, vector embeddings, vector databases, natural language processing, prompt engineering

1 Introduction
There are worldwide growing problems in the health sector due to an ageing population and a shortage of health experts [13, 17]. The field of Natural Language Processing (NLP) has recently seen an increase in the number of LLMs being customised for different domains [2]. Increasingly, we are also seeing their integration into the field of medicine, which is one of the core domains of today's society [14, 22, 11, 20]. Currently, OpenAI's GPT-4o is considered to be the most powerful LLM, and it also performs best in the area of health questionnaires and other related tasks [16, 12, 1, 5].

Because of their advanced ability to understand natural-language text, LLMs have many potential applications [8, 3]. Patients can talk to the GPT, describe their symptoms in detail, include documents about past treatments, and then get a friendly response in seconds. This is particularly useful when the patient is unable to access their personal doctor or simply wants to get feedback before taking further action. LLMs are trained on a huge amount of data, but there is the possibility of hallucination, especially in less researched areas and on specific issues [19]. As a result, it makes sense to think about augmenting the existing LLM with verifiable resources that can help to improve it.

One of the most important issues regarding the use of LLMs relates to the secure use of health data. When using commercial LLMs, the data is passed to them, and the user should be informed with appropriate disclaimers about the use and processing of the data they have entered. In an ideal situation, we would use our own LLM, running on a local server, and have full control over the implementation, but this raises different issues. Creating your own LLM is a time-consuming and costly process, so our research focused on extending the best LLM currently on the market, GPT-4o.

In our case, we focused on the Slovenian healthcare sector and the adaptation of the LLM to the needs of the average user. The aim was to develop an application (HomeDOCtor) that would allow all Slovenian citizens to get medical help 24/7 without the need to contact a personal doctor. As a result, performance testing is crucial, as even the smallest errors can lead to serious consequences [8]. The core research hypothesis posits that the system, enhanced with additional modules integrated into a GPT, will provide more effective medical advice to the general Slovenian population compared to existing GPTs.

In Section 2 we present the datasets used to extend the LLM. Section 3 systematically describes the approach and technologies used to develop the software solution. Section 4 shows the results of the developed solution and an example of a user conversation with the improved LLM. Section 5 describes how performance was tested and how doctors helped to guide the LLM towards the desired outcome. Section 6 presents concluding thoughts and possible improvements.
2 Datasets
To improve the knowledge of the existing GPT-4o, data from verified sources was obtained. The addition of new data allows the GPT to answer questions using its existing knowledge, as well as to address a broader range of questions that require specialized knowledge. In this case, it is about integrating information on Slovenian healthcare [4].

2.1 Insieme Platform
The Insieme platform is the core building block of the dataset used [9]. The platform integrates hand-crafted, expert-based knowledge that is accessible to users on all devices and contains basic information on Slovenian healthcare. The information is organised hierarchically by medical specialty. By clicking on (or visiting) one of the branches, the user is redirected to a sub-page that provides an overview of the diseases and other services that belong to the selected field of medicine. Each of the diseases is accompanied by a set of key data that allows the user to get everything they need in one place. This includes:
• basic information about the diagnosis,
• professional help (a list of clinics in Slovenia),
• apps (where the user can upload their pictures, ask for an opinion, etc.),
• associations in Slovenia,
• articles,
• forums,
• video content, and
• image content.
Figure 1 shows an example of one of the pages, illustrating the display of the information for skin cancer. All data have been compiled in collaboration with doctors and are therefore suitable for use as a primary source for the extension of the LLM. The aim is to enable users of the platform to quickly obtain information on Slovenian healthcare that they can understand.

Figure 1: Image of the skin cancer disease subpage on the Insieme web platform.
2.2 Other resources
The Med.Over.Net health forum was used as one of the two remaining data sources. The forum is divided into different branches of medicine, where users can create new conversations and ask questions within them. Each of the sub-forums has an assigned moderator (a doctor) who carefully reviews and answers the questions. All other forum users can also join the conversation. All data from the online forum was extracted using web scraping, whereby all extracted data was stored in a single PDF file.

The second source is the Home Doctor encyclopaedia (The Home Doctor - Practical Medicine for Every Household), which covers the field of practical home medicine. It contains useful tips on how people can help themselves by applying helpful advice without the need to visit a doctor.

In the future, consideration will be given to adding new resources, in particular a better medical book suitable for Slovenian healthcare. For now, we have focused on the three data sources mentioned above.

3 Methodology
This section presents the design of a system that extends the GPT-4o LLM with the acquired knowledge resources and outlines the architecture of the developed application.

3.1 Fine-tuning
One solution is fine-tuning, which allows external data sources to be added to a pre-trained LLM that already has its own existing knowledge [10]. The idea is that instead of creating an LLM from scratch, an existing LLM is used and further adapted to the user's operational needs. Fine-tuning typically works on the principle of supervised learning, where the LLM needs to be given examples of the input and the expected output. With this additional learning, the LLM is able to use the new data; the process directly changes the parameters (weights) of the model and thus its performance. The problem is that, due to the large number of parameters in modern LLMs, the whole learning process may be impractical because of the cost of training an outsourced LLM. Additionally, any incorporated data is static, meaning that the learning process must be repeated whenever new data is added. As a result, fine-tuning is not the best solution in cases where we know that the data will change and we will constantly be adding new data.
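Purely for illustration (fine-tuning was ultimately not adopted here), a single supervised training record in the JSONL chat format expected by OpenAI's fine-tuning API could look like the following sketch; the question and answer strings are hypothetical:

import json

record = {
    "messages": [
        {"role": "user", "content": "What are typical warning signs of skin cancer?"},
        {"role": "assistant", "content": "A verified answer, e.g., curated from the Insieme platform."},
    ]
}
# A fine-tuning file contains one such JSON object per line (JSONL).
print(json.dumps(record))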
3.2 Storing data
The data obtained by web scraping was stored in a vector database, which allows efficient management of vector data [21]. The raw data obtained (e.g. PDF and Markdown files) is unsuitable for direct use with LLMs, as it cannot be directly managed by the LLM. While documents can be added as attachments in the web interfaces of platforms such as ChatGPT (OpenAI's web LLM access), this is only useful in the context of a single conversation. In our case, we would like the LLM to have permanent access to information sources without the need for manual addition of documents by the user.

3.2.1 Vector embeddings. Each of the documents is converted into vector embeddings, which allow the text to be represented in a numerical notation that can be processed by a computer [6]. From this notation, the computer can capture the meaning of the text and the relationships between the words in the input. This is useful in many NLP tasks, e.g. search engines, sentiment analysis, recommendation systems and so on. We use vector embeddings to be able to extract information relevant to the corresponding question based on the user input.

Modern LLMs have a limitation in the form of a context window, which determines how much text they can process within a single user input. Because of this limitation, we need to be careful about what additional information is sent to the LLM, as we could quickly exceed the context window by sending all data sources. For example, if the user's question is about skin cancer, we only want to retrieve information about that specific disease. As a consequence, we have divided the text into chunks, where each chunk contains data for only one of the diseases. Similar care has been taken with the encyclopaedia and the Med.Over.Net online forum: meaningful paragraphs have been grouped together to form one chunk of text. Thus, only the key chunks that are most likely to contain the answer to the question asked will be provided to the LLM. For each of the chunks, a vector embedding has to be created using an appropriate model (e.g. text-embedding-ada-002 from OpenAI), and a vector database (e.g. Redis) has to be used to store the resulting vector embeddings properly.

3.2.2 Vector databases. The vector database allows the storage of unstructured data and fast retrieval due to efficient indexing of the data. In our case, we use it to store the individual chunks and their associated vector embeddings [7]. Over all stored chunks, vector search is enabled, which means that the chunks whose vector embeddings most closely match the query vector are returned. The query vector represents the user input, converted into vector format. Cosine similarity, Euclidean distance, inner product and other metrics can be used to measure similarity between vectors. The configuration used to retrieve the chunks can be modified, e.g. by specifying the required similarity threshold and the number of chunks retrieved.
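A minimal sketch of this embed-and-retrieve step, using OpenAI's embedding endpoint and an in-memory list of chunks as a stand-in for the Redis index described later; the embed and top_k names are our own, not part of the HomeDOCtor codebase:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    result = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(result.data[0].embedding)

def top_k(question: str, chunks: list[str], k: int = 3) -> list[str]:
    # Cosine similarity between the query vector and each chunk vector;
    # a vector database performs the same search with proper indexing.
    q = embed(question)
    scored = []
    for chunk in chunks:
        v = embed(chunk)
        score = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((score, chunk))
    scored.sort(reverse=True)
    return [chunk for _, chunk in scored[:k]]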
3.3 RAG
An alternative solution is Retrieval-Augmented Generation (RAG), where the LLM consults an external database containing all our data whenever the user asks a question [15]. Relevant data is retrieved from there and passed directly to the LLM, which uses this data in the generation of the answer. In this case, the basic architecture of the LLM remains completely unchanged, as it accesses a separate building block to retrieve the data. It is a cheaper solution that allows dynamic data extraction: data can be deleted, modified and newly added to the database at any time.

The retriever is responsible for retrieving relevant pieces of text from the vector database. The user question is converted into a vector embedding, and then the most similar chunks are retrieved to help guide the LLM towards the correct answer. A merged query is then created, containing the original user question, the extracted chunks of text and any other system instructions given. In the final step, the LLM generates an answer based on this query, which is passed to the user.

Compared to fine-tuning, RAG allows the system to change continuously (new knowledge sources can be added), makes the operation more understandable (we can check which pieces of text have been passed to the LLM) and reduces the possibility of hallucination (verified pieces of text from the selected domain are added). The style of writing the answers cannot be changed directly, as the parameters of the original model remain completely unchanged, but we can help by writing system instructions. System instructions further guide the operation of the chatbot and play a very important role in achieving user-friendly behaviour of the system. All of these features make RAG a suitable choice for enriching the work of an existing LLM with validated information in the chosen field. The schematic design of the RAG system can be seen in Figure 2 below.

Figure 2: Demonstration of how the RAG system works.
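A condensed sketch of this retrieve-merge-generate loop, reusing the hypothetical client and top_k helpers from the embedding sketch above; the system instruction shown is an illustrative placeholder, not HomeDOCtor's actual instruction set:

SYSTEM_INSTRUCTIONS = (
    "You are a cautious medical assistant. Answer using the provided "
    "context and recommend a doctor's visit only in urgent cases."
)  # illustrative placeholder

def rag_answer(question: str, chunks: list[str]) -> str:
    context = "\n\n".join(top_k(question, chunks))           # retrieval
    merged = f"Context:\n{context}\n\nQuestion: {question}"  # merged query
    response = client.chat.completions.create(               # generation
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_INSTRUCTIONS},
            {"role": "user", "content": merged},
        ],
    )
    return response.choices[0].message.content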
3.4 Application architecture and used technologies
A brief overview of the technologies used to develop the software solution is presented below.

3.4.1 Flutter. The Flutter framework was used to develop the front-end part of the system. The reason for using it is that it allows us to develop applications that work on all platforms by writing unified code in the Dart programming language, which then gets compiled into code for the dedicated platform. This way, the application works in web browsers as well as in the form of a mobile application. Otherwise, we would have to use multiple programming languages to ensure support on multiple devices, which would slow down development.

3.4.2 Redis. The Redis database was used to store all additional knowledge resources. It is a NoSQL database that stores data in key-value format and also works as a vector database. The data is stored in memory, which helps significantly with data retrieval and overall application responsiveness.

3.4.3 LangChain. The LangChain framework offers a number of functionalities that facilitate the development of applications that involve working with LLMs [18]. It allows us to easily connect to a vector database, retrieve documents, call arbitrary LLMs, modify system instructions and so on.

3.4.4 FastAPI. The FastAPI framework allows the creation of RESTful APIs that can be accessed by the HomeDOCtor application (a minimal endpoint sketch is given at the end of this subsection). It uses Uvicorn for the server implementation and allows processing of requests received from users. These requests relate to the processing of the received questions and the execution of all the tasks necessary to obtain the appropriate response from the LLM.

3.4.5 Docker. The entire application was packaged in a Docker container to facilitate development and deployment to the server. All the necessary configuration is included in the container, which runs on the system without the need to install the rest of the programming libraries and tools.

3.4.6 Firebase. The Firebase platform offers a number of useful functionalities such as user authentication, application usage analytics, cloud storage and a real-time database (a NoSQL cloud database, which in our case is used to store the necessary user data).
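To indicate how these pieces might fit together, the following minimal FastAPI endpoint wraps the hypothetical rag_answer helper from the RAG sketch; the route name and request model are our assumptions, not HomeDOCtor's actual API:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Question(BaseModel):
    text: str

@app.post("/chat")
def chat(question: Question) -> dict:
    # Delegates to the retrieval-augmented pipeline sketched in Section 3.3;
    # rag_answer and CHUNKS are hypothetical stand-ins.
    return {"answer": rag_answer(question.text, CHUNKS)}

# Run with, e.g.: uvicorn main:app --reload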
4 Results
The HomeDOCtor app allows users to access a chatbot that is constantly available to answer their questions. Users can access the app in their browsers or via a mobile app. The user interface allows users to view the history of the conversation, ask questions, include their own documents as an attachment (e.g. PDF documents of previous exams), get a PDF summary of the conversation, and see additional information in three side windows. By clicking on the "End session" button at the bottom of the screen, all data is deleted, and the user is redirected to the main page, where they can start a new conversation with the chatbot.

Initially, HomeDOCtor determines what kind of conversation is being had. In the case of a simple chat, no additional information is displayed. However, if the system detects that the conversation is about the user's health, additional information is displayed in three side windows. This additional information is dynamically updated according to the progress of the user's conversation with the chatbot. The first window gives a list of possible medical conditions, the second window gives advice on how users could help themselves, and the third window tells whether a visit to the doctor would be necessary. This way, the user receives a comprehensive overview of the conversation so far and gets an opinion that adapts to the course of the conversation.

Figure 3 below shows the user interface of the HomeDOCtor application. We can see the user's conversation with the chatbot in Slovenian and the display of additional information in the three side windows along the right edge. When answering questions, GPT-4o uses its existing knowledge, incorporating additional sources of knowledge (especially from the Insieme platform) as often as possible.

Using the app, users get instant help from a chatbot. The chatbot first tries to collect sufficient data before proceeding to further advice. In the course of the treatment, the user is advised on different forms of help (websites, questionnaires, forums, apps, visits to pharmacies, etc.). A visit to the doctor is only advised in urgent cases, when HomeDOCtor considers that the situation is too complicated or further medical treatment is needed.

Figure 3: An example of a user talking to the HomeDOCtor chatbot. Additional information is visible on the side.

An example of a conversation from a user who consulted the developed chatbot is shown below. We can see the user's questions and all the corresponding answers.

• User: I have a slight headache and a bit of a sore throat today.
• HomeDOCtor: I understand. Let's gather some more information to better assist you. 1. How long have you been experiencing these symptoms? 2. Do you have any other symptoms, such as fever, cough, or body aches?
• User: I've been feeling this way for about two days now. It started with just a mild sore throat, and the headache came yesterday.
• HomeDOCtor: Thanks for the information. 1. How severe are your headache and sore throat on a scale from 1 to 10? 2. Have you noticed any other symptoms, like nasal congestion, fatigue, or body aches?
• User: I'd say my headache is about a 4 out of 10. It's more of a dull ache, not too intense, but definitely noticeable. The sore throat is a bit worse, probably a 6 out of 10.
• HomeDOCtor: Thank you for the details. 1. Have you taken any over-the-counter medications or home remedies for relief? If so, have they helped? 2. Is there anything that seems to make your symptoms better or worse, such as certain foods, drinks, or activities?
• User: I tried drinking some warm tea with honey, which soothed my throat for a while. My headache seems to get a bit worse when I'm looking at screens for too long.
• HomeDOCtor: Thank you for the additional information. It sounds like you might be experiencing symptoms of a common cold or another mild infection. Here are a few recommendations: 1. Rest and Hydration: Ensure you get plenty of rest and stay well-hydrated. 2. Warm Teas: Continue drinking warm teas with honey to soothe your throat. 3. Limit Screen Time: Try to reduce your screen time to help with your headache. 4. Steam Inhalation: Consider steam inhalation to relieve any nasal congestion you might have. For additional relief, you can visit a pharmacy for over-the-counter options. You can find more information about medications at https://mediately.co/si and http://www.cbz.si. If your symptoms worsen or persist for more than a week, it's advisable to consult a healthcare professional.
• User: Thank you for the advice. Where could I get more information?
• HomeDOCtor: You're welcome! For more information on common conditions and treatments, you can check these resources: 1. Med.Over.net: https://med.over.net 2. Ars Pharme - Zdravstveni nasveti: https://www.arspharmae.com 3. Triglav Zdravstveni nasveti: http://www.triglavzdravje.si If you need more personalized advice, consider visiting a pharmacist or consulting with a healthcare professional.
5 Evaluation
The full operation of the HomeDOCtor app and chatbot was tested by the authors and medical experts. After consulting doctors and identifying the key performance points, iterative and collaborative development followed, gradually introducing changes and improving the performance of the chatbot.

Doctors have contributed significantly to improving the app through rigorous testing, ensuring that the user can get the best possible answers to their questions. We have been careful to address situations where a doctor's visit is not necessary: in many cases, non-urgent situations only add to the workload of doctors. In our system, we have tried to avoid this as much as possible by referring the user to other resources that would be good and useful enough for them, and only as a last resort recommending a visit to the doctor. Of course, a visit to the doctor should not always be avoided, so it is crucial that the system is able to correctly identify scenarios where this is really necessary. These are mainly cases where additional diagnostics are needed, which are difficult for the LLM to perform on its own. Doctors have been particularly helpful in the project by further adapting the system instructions, which detail how to deal with all scenarios.

We compared HomeDOCtor with other large language models on the market, but focused most on comparing it with the regular GPT-4o model, which our system uses for its basic operation. The main differences are in the flow of the conversation, as HomeDOCtor is customised by the system instructions to follow specific scenarios and ask questions that are key for providing further advice. Thus, by simulating a conversation of a user who starts by listing symptoms, a comparison between the two systems can be made. According to the clinicians' evaluations, HomeDOCtor appeared to be better at asking sub-questions and guiding the user through the treatment. A further important difference can be seen in the display of videos and images, as HomeDOCtor can display them right within the app, whereas GPT-4o creates fictional images or often provides links to non-existent videos. HomeDOCtor also often gives practical advice on how users can help themselves at home and carefully evaluates when it is really necessary to redirect the user to professional help.

Testing of the system has demonstrated the added value of the included knowledge sources, though formal validation tests are still pending. Nevertheless, the scientific hypothesis that superior performance can be achieved compared to the original GPT is increasingly supported by the evidence. Through iterative testing by all participants, the system has now reached a level of maturity suitable for deployment in a production environment. Test users will be able to provide feedback and suggest corrections via a form, which will further refine the chatbot to better meet the needs and preferences of the average user in Slovenia. The form will also display the data sources used, so that test users can indicate whether they believe the relevant documents have been extracted from the vector database.
6 Conclusion
In this study, we have shown how verified information can be incorporated into one of the existing LLMs. It was an extension of GPT-4o, which currently achieves the best results, further adapted to the needs of Slovenian public health in the context of the development of the HomeDOCtor application. Comprehensive testing has been carried out with a team of doctors, which has helped to ensure that users are provided with relevant and understandable information. HomeDOCtor initially gathers information by asking questions, and only then advises on further action. External resources and self-help methods are prioritised, as the intention is that referral to a doctor should only be made in urgent situations.

Possible improvements lie in the use of one of the open-source LLMs, which would remove the need for external access to OpenAI's LLMs. Currently, in order to run the HomeDOCtor application, an API call is sent for each question asked, for which a fee is paid to OpenAI. This is also problematic from a data protection point of view, as it involves working with confidential user data which should not be passed on (although data sent via the OpenAI API is excluded from model training by default). Currently, this is taken care of with disclaimers and warnings that the user has to agree to before using the application. At the same time, the use of open-source models would bring additional problems, as the appropriate infrastructure would have to be established. Open-source models also typically perform worse than, e.g., GPT-4o. As a result, we have chosen to use GPT-4o, as we cannot afford worse outcomes given the criticality of the medical domain.

HomeDOCtor could also be adapted to meet the healthcare needs of other countries, but this would require ensuring that the relevant data is obtained in accordance with their national guidelines and security laws. The behaviour of the chatbot could remain largely the same, as GPT-4o can by itself convert between many languages. The key component would thus be to obtain all the necessary country-specific information that we want to make available to users (information on clinics, apps, articles, and video content in that language).

In the future, work will be carried out on improving the chatbot, taking into account the opinions of external users, who in this case are also our target group. This research has shown that generative artificial intelligence has great potential for application in the field of medicine and could make a significant contribution to relieving the burden on the healthcare system.

Acknowledgements
We thank the members of the Department of Intelligent Systems and the medical experts for providing information and testing the system. This project is funded by the European Union under Horizon Europe (project ChatMED, grant agreement ID: 101159214).

References
[1] Josh Achiam et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
[2] Jan Clusmann et al. 2023. The future landscape of large language models in medicine. Communications Medicine, 3, 1, 141.
[3] Matjaž Gams and Tine Kolenik. 2021. Relations between electronics, artificial intelligence and information society through information society rules. Electronics, 10, 4, 514.
[4] Matjaž Gams, Maj Smerkol, Primož Kocuvan, and Matic Zadobovšek. 2024. Developing a medical chatbot: Integrating medical knowledge into GPT for healthcare applications. In Intelligent Environments 2024: Combined Proceedings of Workshops and Demos & Videos Session. IOS Press, 88-97.
[5] Aidan Gilson, Conrad W. Safranek, Thomas Huang, Vimig Socrates, Ling Chi, Richard Andrew Taylor, David Chartash, et al. 2023. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Medical Education, 9, 1, e45312.
[6] Martin Grohe. 2020. word2vec, node2vec, graph2vec, x2vec: Towards a theory of vector embeddings of structured data. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 1-16.
[7] Yikun Han, Chunjiang Liu, and Pengfei Wang. 2023. A comprehensive survey on vector database: Storage and retrieval technique, challenge. arXiv preprint arXiv:2310.11703.
[8] Yining Huang, Keke Tang, and Meilian Chen. 2024. A comprehensive survey on evaluating large language model applications in the medical industry. arXiv preprint arXiv:2404.15777.
[9] Jožef Stefan Institute. 2024. Insieme platform. Retrieved August 26, 2024 from https://ise-emh.eu.
[10] Jieh-Sheng Lee and Jieh Hsiang. 2020. Patent claim generation by fine-tuning OpenAI GPT-2. World Patent Information, 62, 101983.
[11] Peter Lee, Carey Goldberg, and Isaac Kohane. 2023. The AI Revolution in Medicine: GPT-4 and Beyond. Pearson.
[12] Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. 2023. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375.
[13] World Health Organization et al. 2022. Health and care workforce in Europe: Time to act.
[14] Carl Preiksaitis and Christian Rose. 2023. Opportunities, challenges, and future directions of generative artificial intelligence in medical education: A scoping review. JMIR Medical Education, 9, e48785.
[15] Mujahid Ali Quidwai and Alessandro Lagana. 2024. A RAG chatbot for precision medicine of multiple myeloma. medRxiv, 2024-03.
[16] Nicholas R. Rydzewski, Deepak Dinakaran, Shuang G. Zhao, Eytan Ruppin, Baris Turkbey, Deborah E. Citrin, and Krishnan R. Patel. 2024. Comparative evaluation of LLMs in clinical oncology. NEJM AI, 1, 5, AIoa2300151.
[17] Tony Smith. 1991. Medicine in Europe: European health challenges. BMJ: British Medical Journal, 303, 6814, 1395.
[18] Oguzhan Topsakal and Tahir Cetin Akinci. 2023. Creating large language model applications utilizing LangChain: A primer on developing LLM apps fast. In International Conference on Applied Engineering and Natural Sciences, number 1, Vol. 1, 1050-1056.
[19] Ashish Vaswani. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.
[20] Ethan Waisberg, Joshua Ong, Mouayad Masalkhi, Sharif Amit Kamran, Nasif Zaman, Prithul Sarker, Andrew G. Lee, and Alireza Tavakkoli. 2023. GPT-4: A new era of artificial intelligence in medicine. Irish Journal of Medical Science (1971-), 192, 6, 3197-3200.
[21] Matic Zadobovšek, Primož Kocuvan, and Matjaž Gams. 2023. Vpeljava virtualnega asistenta ChatGPT v medicinsko platformo = Implementation of a virtual assistant ChatGPT into the medical platform. Institut "Jožef Stefan", 28-31. http://library.ijs.si/Stacks/Proceedings/InformationSociety/2023/IS2023-Volume_A%20-%20SKUI.pdf.
[22] Peng Zhang and Maged N. Kamel Boulos. 2023. Generative AI in medicine and healthcare: Promises, opportunities and challenges. Future Internet, 15, 9, 286.
Standards for Use of LLM in Medical Diagnosis

Mihailo Svetozarević, Clinic for Neurology, University Clinical Center Niš, Niš, Serbia, mihailo.svetozarevic@gmail.com
Isidora Janković, Center for Radiology, University Clinical Center Niš, Niš, Serbia, isidora_jankovic@yahoo.com
Sonja Janković, Center for Radiology, University Clinical Center Niš, Niš, Serbia, sonjasgirl@gmail.com
Stevo Lukić, Clinic for Neurology, University Clinical Center Niš, Niš, Serbia, srlukic@gmail.com

© 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.chtm.9

Abstract
Artificial intelligence, particularly large language models (LLMs), is increasingly being recognized for its potential to revolutionize medical diagnosis by mimicking human cognitive functions in clinical decision-making. Despite promising developments, such as the ability to pass medical exams and assist in complex diagnostic processes, LLMs still face significant hurdles, including issues with accuracy, bias, and safety. This paper critically considers the evaluation of LLM performance across various criteria to ensure these models meet the required standards for clinical use. Several dimensions of evaluation, such as accuracy, calibration, and robustness, are used. While LLMs, and generative AI more broadly, show real potential for healthcare, these tools are not ready yet. The medical community and developers need to develop more rigorous evaluations, analyze performance across specialties, train on real-world data, and explore more useful types of GenAI beyond current models. Ultimately, however, we believe these tools can help improve both physician workload and patient outcomes. We urgently need to set up evaluation loops for LLMs, where models are built, implemented, and then continuously evaluated via user feedback.

Keywords
large language models, artificial intelligence, clinical AI implementation, AI in clinical practice, AI safety in healthcare

1 Introduction
Artificial intelligence (AI), by its definition and in the broadest of terms, represents intelligence exhibited by computer systems. The main goal of AI is to enable computers and machines to mimic human cognitive function; in other words, it aims to simulate human learning, comprehension, problem solving and critical decision making. AI approaches human cognition in two distinct ways: the symbolic and the connectionist approach [1]. The symbolic approach aims to replicate human intelligence by analyzing cognition independently of the biological structure of the central nervous system, while the connectionist approach aims to create neural networks that imitate the brain's structure. To realize the potential of AI in healthcare, we believe that a systematic approach to evaluation and benchmarking can get us to a place where AI can be a net positive for health systems.
2 LLMs in Medicine
The rapid advancements in AI, particularly in the realm of large language models (LLMs), have transformed various sectors, including healthcare [2, 3]. LLMs, and ChatGPT in particular, have earned much attention in recent years due to their ability to complete tasks previously considered completable by humans alone, such as passing the United States Medical Licensing Examination [4]. The ability of LLMs to accurately answer questions, provide advice and even triage patients based on clinical input exceeds that of the everyday person. However, the accuracy of these systems in resolving real-world medical issues is yet to exceed that of a fully trained physician. Also, a finite percentage of LLM answers to patients had safety errors, and in one instance the advice given to a patient could have been fatal [5]. In order to avoid such errors in the future, it is essential to assess these models through rigorous comparative benchmarks. One of the most critical aspects of benchmarking medical LLMs is comparing their performance with existing clinical decision support systems (CDSS) and other AI models. Traditional CDSS, often rule-based or statistical, have been used in healthcare for decades to assist clinicians in making evidence-based decisions. By comparing LLMs to these systems, researchers can determine whether the new models offer significant improvements in accuracy, speed, and comprehensiveness [6]. For example, a comparative benchmark might involve evaluating the diagnostic accuracy of an LLM against a well-established CDSS in predicting outcomes for specific conditions, such as sepsis or diabetes. The LLM's ability to incorporate a broader range of data, including unstructured text from electronic health records (EHRs), could be a key factor in outperforming traditional systems [7].
However, it is also crucial to consider scenarios where traditional systems might still have an edge, particularly in specialized tasks where they have been finely tuned over many years of clinical use [8]. Outside of primary care, radiology is perhaps the medical branch that has been the most upfront and welcoming towards new technology [2, 3]. The concept of computer-assisted diagnosis (CAD) is well known. AI provides substantial aid by labeling abnormal or, most often, borderline exams, or simply by quickly excluding negative exams in computed tomography, X-ray and magnetic resonance imaging, especially in high-volume settings like the emergency room, where human resources might be less available. AI-driven diagnostic tests have the potential to overcome several current limitations in the clinical approach to patient care [9], namely clinical review, time to diagnosis, and diagnostic accuracy and consistency. In tandem with AI, diagnosticians of all medical branches are capable of improving measures of diagnostic accuracy (mainly sensitivity and specificity) as well as minimizing observer variability in patient interpretation. This proves most useful in settings where the clinical diagnosis is in question, such as with complex patient presentation or in patients with long histories and various comorbidities. Currently, not many prospective studies and randomized trials exist for medical AI applications; most are not prospective, are at high risk of bias, and deviate from existing reporting standards. Data availability is lacking, and human comparator groups are more often small and inadequate. LLMs, in particular GPT-3, have shown promise in various clinical applications, ranging from the creation of patient notes to helping healthcare providers diagnose rare conditions. However, it is important to recognize the inherent limitations of these systems.

3 Standardized Evaluation Framework for Assessing LLMs' Clinical Utility for Future Clinical Practice
Medical diagnosis involves a complex process in which a practitioner uses objective data from a clinical exam, as well as data collected from medical tests, along with self-described subjective symptoms, to conclude the most likely health problem. This kind of approach relies heavily on the synthesis and individual interpretation of a vast amount of information from various sources, most often available patient histories and clinical exam data correlated with the current medical literature. In this setting, LLMs open up new opportunities for enhancing the diagnostic process. In order to better evaluate an LLM's clinical utility, a direct comparison must be made between LLMs and human clinicians. This approach is essential to gauge how well AI models can replicate or even enhance the decision-making process of experienced healthcare professionals. Studies often involve presenting both clinicians and LLMs with the same clinical cases and comparing their diagnoses, treatment recommendations, and reasoning [10]. Human clinician benchmarking can reveal important insights into the strengths and limitations of LLMs. For instance, while LLMs might excel at processing and synthesizing vast amounts of data quickly, they may struggle with nuanced cases that require deep contextual understanding or ethical considerations that a human clinician might naturally account for [11]. Furthermore, these benchmarks can highlight areas where LLMs might support clinicians, such as providing second opinions or identifying potential errors in human judgment, rather than replacing them [12].
Randomized controlled trials (RCTs) are considered the gold standard in clinical research for evaluating the efficacy of innovations. Comparative benchmarking of LLMs can also involve assessing how well these models predict or align with outcomes from RCTs. For example, an LLM could be tested on its ability to recommend treatments for stroke prevention based on patient data, and its recommendations could be compared with those validated by RCTs [7]. However, this approach presents a set of challenges, as RCTs often involve highly controlled environments that might not fully capture the complexities of real-world clinical settings. Currently, LLMs are most often tested on small datasets acquired for a specific research study or on large public benchmark datasets, both of which are usually collected at a limited number of very similar sites with consistent diagnostic techniques. This does not reflect the substantial differences in manufacturers, quality and clinical practices often found in real-world hospitals. As an example, the UK Biobank, a widely employed public imaging benchmark dataset, includes brain magnetic resonance images (MRI) for a total of more than 100,000 patients. It restricts image acquisition to four sites, each of which has identical equipment in terms of hardware and software and performs regular quality checks to ensure the harmonization of data. In contrast, most medical centers, including our own in Serbia, extract data from clinical archives over a period of 20 years, which reflects the much more diverse array of data available in everyday settings. Another point of interest is the lack of consensus on which dimensions of evaluation to consider and prioritize for various healthcare tasks. While accuracy is the most often examined dimension when evaluating LLM performance, other dimensions such as fairness, bias and toxicity, robustness, and deployment considerations need to be considered as well [13]; a list of possible aspects is presented in Table 1. Therefore, while alignment with RCT outcomes is a strong indicator of an LLM's clinical relevance, it is also important to test these models in more varied and less controlled environments to ensure their robustness [11]. Unlike traditional systems or statistical models that remain relatively static once developed, LLMs can be continuously updated and refined. This raises the question of how to implement models that are constantly evolving. The development of standardized benchmarks that can be applied across different versions of a model is essential to address this challenge [14]. These benchmarks help identify areas where LLMs can enhance clinical practice and highlight the potential risks or limitations that need to be addressed [6]. By rigorously comparing LLMs against existing systems, human clinicians, and traditional models, we can ensure that these advanced AI systems are integrated into healthcare in a way that maximizes their benefits while minimizing potential harms [10].

Table 1: Comparative benchmarks for evaluation of LLM performance in healthcare (adapted and modified from Bedi et al. 2024 [13])

Dimension of Evaluation | Definition | Metric Examples
Accuracy | How close the LLM output is to the true or expected answer. | Human-evaluated correctness; ROUGE; MEDCON
Calibration and Uncertainty | How uncertain or underconfident an LLM is about its output for a specific task. | Human-evaluated uncertainty; calibration error; Platt-scaled calibration slope
Robustness | The LLM's resilience against adversarial attacks and input perturbations such as typos. | Human-evaluated robustness; exact match on LLM input with intentional typos; F1 on LLM input with intentional use of word synonyms
Factuality | Whether the LLM's output for a specific task originates from a verifiable and citable source. Note that a response can be accurate but factually incorrect if it originates from a hallucinated citation. | Human-evaluated factual consistency; citation recall; citation precision
Comprehensiveness | How well the LLM's output fluently, coherently and concisely addresses all aspects of the task and the reference provided. | Human-evaluated comprehensiveness; UniEval fluency, coherence and relevance
Fairness, Bias and Toxicity | Whether the LLM's output is equitable, impartial, and free from harmful stereotypes or biases, ensuring it does not perpetuate injustice or toxicity across diverse groups. | Human-evaluated toxicity; counterfactual fairness; performance disparities across race
Deployment Considerations | The technical and parametric details required for an LLM to generate a desired output. | Cost; latency; inference runtime
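As a small, hypothetical illustration of the robustness row of Table 1, exact-match robustness under typo perturbation can be computed by comparing a model's answers on clean and intentionally corrupted inputs; all names in this sketch are our own:

def exact_match_robustness(model, inputs, perturb) -> float:
    # Fraction of inputs for which the model's answer is unchanged when
    # the input is perturbed (e.g., intentional typos or word synonyms).
    unchanged = sum(1 for x in inputs if model(x) == model(perturb(x)))
    return unchanged / len(inputs)

def typo(text: str) -> str:
    # Example perturbation: swap two adjacent characters to simulate a typo.
    return text[:3] + text[4] + text[3] + text[5:] if len(text) > 5 else text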
Comparative benchmarks can also guide the development of future AI models. Insights gained from these evaluations can inform model improvements, such as enhancing interpretability, reducing bias, or improving performance on specific tasks. As the field of AI in healthcare continues to evolve, comparative benchmarking will remain a crucial tool for ensuring that new models are both safe and effective for clinical use [8].

4 Conclusion
Comparative benchmarking is a critical process in the development and deployment of medical large language models. By comparing LLMs to existing clinical decision support systems, human clinicians, traditional statistical models, and outcomes from randomized controlled trials, we can gain a comprehensive understanding of their strengths, limitations, and potential impact on healthcare. As AI continues to play an increasingly prominent role in medicine, rigorous comparative benchmarks will be essential for ensuring that these models deliver on their promise of improving patient care while adhering to the highest standards of safety and effectiveness.

Acknowledgments
Views and opinions expressed in this paper are those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor any other authority can be held responsible for them. All authors contributed equally to the final version of this paper. This project is funded by the European Union under Horizon Europe (project ChatMED, grant agreement ID: 101159214).

References
[1] Esteva, A., Robicquet, A., Ramsundar, B., Kuleshov, V., DePristo, M., Chou, K., et al., 2019. A guide to deep learning in healthcare. Nature Medicine, 25(1), pp. 24-29.
[2] Thirunavukarasu, A. J., et al., 2023. Large language models in medicine. Nature Medicine, 29, 1930-1940.
[3] Thirunavukarasu, A. J., et al., 2023. Trialling a large language model (ChatGPT) in general practice with the applied knowledge test: Observational study demonstrating opportunities and limitations in primary care. JMIR Medical Education, 9, p. 46599.
[4] Gilson, A., Safranek, C. W., Huang, T., Socrates, V., Chi, L., Taylor, R. A., et al., 2023. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Medical Education, 9, p. e45312.
[5] Chen, S., Guevara, M., Moningi, S., Hoebers, F., Elhalawani, H., Kann, B. H., Chipidza, F. E., Leeman, J., Aerts, H. J. W. L., Miller, T., Savova, G. K., Gallifant, J., Celi, L. A., Mak, R. H., Lustberg, M., Afshar, M., & Bitterman, D. S., 2024. The effect of using a large language model to respond to patient messages. The Lancet Digital Health, 6(6), pp. e379-e381.
[6] Topol, E. J., 2019. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1), 44-56.
[7] Rajkomar, A., Dean, J., Kohane, I., 2019. Machine learning in medicine. New England Journal of Medicine, 380(14), 1347-1358.
[8] Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S., 2019. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447-453.
[9] Fletcher, E., Burns, A., Wiering, B., Lavu, D., Shephard, E., Hamilton, W., et al., 2023. Workload and workflow implications associated with the use of electronic clinical decision support tools used by health professionals in general practice: A scoping review. BMC Primary Care, 24(1), p. 23.
[10] Amann, J., Blasimme, A., Vayena, E., Frey, D., & Madai, V. I., 2020. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective. BMC Medical Informatics and Decision Making, 20(1), 1-9.
[11] Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., Webster, D. R., 2016. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316(22), 2402-2410.
[12] Ribeiro, M. T., Singh, S., Guestrin, C., 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135-1144.
[13] Bedi, S., Liu, Y., Orr-Ewing, L., et al., 2024. A systematic review of testing and evaluation of healthcare applications of large language models (LLMs). medRxiv, August 16, 2024.
[14] Goodfellow, I., Shlens, J., & Szegedy, C., 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
Using Combinatorial Testing for Prompt Engineering of LLMs in Medicine

Alexander Perko (alexander.perko@ist.tugraz.at), Iulia Nica (inica@ist.tugraz.at), Franz Wotawa (wotawa@ist.tugraz.at)
Graz University of Technology, Institute of Software Technology, Graz, Austria

© 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.chtm.10

Abstract
Large Language Models (LLMs) like GPT-4o are of growing interest. Interfaces such as ChatGPT invite an ever-growing number of people to ask questions, including for health advice, which brings additional risks of harm. It is well known that tools based on LLMs tend to hallucinate or deliver different answers for the same or similar questions. In both cases, the outcome might be wrong or incomplete, possibly leading to safety issues. In this paper, we investigate the outcome of ChatGPT when we ask similar questions in the medical domain. In particular, we suggest using combinatorial testing to generate variants of questions aimed at identifying wrong or misleading answers. In detail, we discuss the general framework and its parts and present a proof of concept utilizing a medical query and ChatGPT.

Keywords
Large Language Models, ChatGPT, Prompt Engineering, Combinatorial Testing, Validation

1 Introduction
The use of LLMs in medicine has been of growing interest. In a recent publication [9], the authors discuss the future of LLMs in medical applications. Although using such a model may lead to improved communication and other advantages, some drawbacks prevent using such models and tools. It is well known that LLMs like ChatGPT [33] have shortcomings like hallucinations [46]. Hallucinations are answers with incorrect claims that do not depend on the training data. Such answers, unfortunately, cannot necessarily be identified as wrong and, therefore, might be harmful, especially when dealing with medical questions. There are methods for detecting hallucinations, e.g., see [11]. Hence, verifying and validating tools based on LLMs to ensure harmless use is of utmost importance.

When using LLMs for queries, the form of the query, i.e., the prompt, is of great importance. Although there has been much work on how to improve writing prompts in various setups, e.g., [28], there is only little scientific work, e.g., [24], providing statistical evidence. However, it is generally agreed that the query's structure has a significant impact on the output of an LLM. Therefore, we need to consider different prompts in any verification and validation procedure.

In this paper, we contribute to verifying and validating LLMs, focusing on the impact of prompts. In particular, we introduce and discuss a methodology based on combinatorial testing [21] for generating various versions of prompts for medical queries. We rely on testing because it is a methodology geared toward finding interactions with a system under test (SUT) that lead to unexpected behavior. Hence, in testing, we want to provide interactions that make a system fail. Combinatorial testing is a test methodology that has proven effective at finding test cases, i.e., inputs given to a system, that provoke failing behavior. In particular, combinatorial testing focuses on interacting parameter values that reveal faults. In previous work, Kuhn and colleagues [22] showed that strictly fewer than 7 interacting parameters must be considered for many applications. Besides its effectiveness, combinatorial testing is a good testing methodology for LLMs that rely on prompts: we need different combinations of textual fragments to show differences in the outcome, and combinatorial testing provides such combinations while avoiding a combinatorial explosion of potential prompts by restricting the number of considered fragment interactions.

We organize the paper as follows: We first introduce the foundations. For this purpose, we discuss related research on testing LLMs and introduce the basic concepts behind combinatorial testing. Afterward, we introduce the general testing methodology for generating different prompts, focusing on the medical domain. In addition, we illustrate the use of the methodology considering one particular medical query. Finally, we conclude the paper.
1 Introduction
The use of LLMs in medicine has been of growing interest. In a recent publication [9], the authors discuss the future of LLMs in medical applications. Although using such a model may lead to improved communication and other advantages, some drawbacks prevent using such models and tools. It is well known that LLMs like ChatGPT [33] have shortcomings like hallucinations [46]. Hallucinations are answers with incorrect claims that do not depend on training data. Such answers, unfortunately, cannot necessarily be identified as wrong and, therefore, might be harmful, especially when dealing with medical questions. There are methods for detecting hallucinations, e.g., see [11]. Hence, verifying and validating tools based on LLMs to ensure a harmless use is of utmost importance.

When using LLMs for queries, the form of the query, i.e., the prompt, is of great importance. Although there has been much work on how to improve writing prompts in various setups, e.g., [28], there is only little scientific work, e.g., [24], providing statistical evidence. However, it is generally agreed that the query's structure has a significant impact on the output of an LLM. Therefore, we need to consider different prompts in any verification and validation procedure.

In this paper, we contribute to verifying and validating LLMs focusing on the impact of prompts. In particular, we introduce and discuss a methodology based on combinatorial testing [21] for generating various versions of prompts for medical queries. We rely on testing because it is a methodology aimed at finding interactions with a system under test (SUT) that lead to unexpected behavior. Hence, in testing, we want to provide interactions that make a system fail. Combinatorial testing is a test methodology that has proven to be effective in finding test cases, i.e., inputs given to a system, that provoke failing behavior. In particular, combinatorial testing focuses on interacting parameter values that reveal faults. In previous work, Kuhn and colleagues [22] showed that strictly fewer than 7 interacting parameters must be considered for many applications. Besides its effectiveness, combinatorial testing is a good testing methodology for LLMs considering prompts. For the latter, we need different combinations of textual fragments to show differences in the outcome. Combinatorial testing provides such combinations and also avoids a combinatorial explosion of potential prompts by restricting the number of considered fragment interactions.

We organize the paper as follows: We first introduce the foundations. For this purpose, we discuss related research on testing LLMs and introduce the basic concepts behind combinatorial testing. Afterward, we introduce the general testing methodology for generating different prompts focusing on the medical domain. In addition, we illustrate the use of the methodology considering one particular medical query. Finally, we conclude the paper.
2 Related Research
In the past several years, considerable efforts have been made to evaluate LLMs. The first indicator is the wide variety of benchmarks which have emerged in order to test and compare their performance on various tasks. In [6], the authors compile a selection of 46 popular benchmarks. Among them, we can differentiate between benchmarks used for general language tasks, like Chatbot Arena [7], MT-Bench [49], HELM [25], or MMLU [13], and domain-specific benchmarks, like MATH [14], concentrated on assessing the reasoning and problem-solving capabilities of AI models in mathematics, APPS [15] for evaluating code generation, or MultiMedQA [41] with a focus on medical examinations, medical research, and consumer healthcare questions.

Further on, depending on the human involvement in the evaluation process, there are two common methods: human evaluation and automatic evaluation. Human evaluation becomes a natural choice in many non-standard cases, where the automated evaluation metrics are either not suitable or insufficient. For example, in [25], the evaluators analyze summarization and disinformation scenarios, while in [2] analogical reasoning tasks. Also, Ziems et al. [50] used the annotations from researchers for generation. Although it can provide more comprehensive and accurate feedback than the automatic evaluation, the human evaluation may suffer from high variance and instability due to cultural and individual biases. On the other hand, the automatic variant benefits from higher standardization. LLM-EVAL [26], for instance, is a unified multidimensional automatic evaluation method for open-domain conversations with LLMs. Jain et al. [18] proposed a self-supervised evaluation framework, and also PandaLM [45] obtained reproducible and automated language model assessment by training an LLM that acts as the 'judge' to assess different models. For more details on the specific key metrics and factors for both evaluation types, we refer the interested reader to [6].
In addition, domain-specific evaluation is also critical, as LLMs are often used in specific areas, such as healthcare or finance, that have specific requirements for the models. In the beginning, most evaluation research was focused on natural language tasks. A popular direction here is, for instance, the sentiment analysis task, which analyzes and interprets text to identify its emotional inclination. Further on, in the medical field, the application of LLMs has recently gained significant attention. According to the literature ([6], [5]), most LLM evaluations in the medical field focus on medical queries. ChatGPT, in particular, generated relatively accurate information for various medical queries from genetics [10], biomedicine [17], and radiation oncology physics [16]. Furthermore, several studies have evaluated the performance and feasibility of ChatGPT in the medical education field. In [31], GPT-3.5 [4] and GPT-4 [34] models were evaluated in terms of their understanding of surgical clinical information and their potential impact on surgical education and training. These findings demonstrate that LLMs can be successfully used in clinical education; still, further efforts may be needed to overcome limitations. For more details concerning the evaluation of other applications like social science, natural science, engineering, agent applications, education, search and recommendation, and personality testing, we refer the reader to [6].

Another interesting taxonomy from [6] groups the encountered testing approaches into three directions: (1) from objective calculation (benchmarking) to human-in-the-loop testing, (2) from static to crowd-sourced test sets, and (3) from unified to challenging test sets. While unified settings involve test sets with no preference for any specific task, challenging settings create test sets for specific tasks. Tools like DeepTest [43] use seeds to generate input modifications for testing, CheckList [38] builds test sets based on templates, whereas AdaFilter [36] constructs tests adversarially. Furthermore, despite the growing number of academic projects designed for prompting LLMs [19, 27, 3], just a few of them support systematic evaluation of textual responses [48, 1]. ChainForge [1] is a visual toolkit that offers on-demand hypothesis testing of the behavior of text-generating LLMs on open-domain tasks.

To our knowledge, the use of combinatorial testing (CT), in particular for the testing of LLMs, is reported in a single paper [12]. Based on a given original sentence, the authors derive new sentences by replacing words with synonyms according to a combinatorial test set. Assuming that the semantics of the original sentence are preserved in the derived sentences, a test oracle is created based on existing annotations. In the experimental evaluation from [12], the authors apply generated pairwise sentence test sets from the BoolQ benchmark set [8] against two LLMs (T5 [37] and LLaMA [44]). The results indicate that the accuracy of the responses remains roughly equivalent to those provided for the original test set.

3 Combinatorial Testing
Combinatorial testing aims to generate test cases by considering a system's input model. The input model comprises a set of parameters (or variables) {x_1, ..., x_n} and a (not necessarily different) domain d_i for each parameter x_i. The domain itself is a finite set of values a parameter can take. A test case is an n-tuple specifying a value v_i ∈ d_i for every parameter x_i ∈ {x_1, ..., x_n}. A test suite is a set of test cases. Usually, we write a test suite as a table where the columns are the parameters and the rows hold their corresponding values.

Given an input model, a complete test suite comprises a row for each possible value-parameter combination. Obviously, the upper bound on the number of rows is of order O(D^n), where D is the maximum size of all domains d_i, i = 1, ..., n, i.e., D = max_{i=1,...,n} |d_i|. Hence, computing a complete test suite is not feasible for software or systems comprising a larger number of input parameters. Moreover, applying all test cases is not feasible because the system's behavior must also be evaluated. In combinatorial testing, we do not have a test oracle; the focus is only on input generation. Hence, such a test oracle must be added to classify a test case as passing or failing, i.e., indicating whether a test case leads to a correct or wrong output, respectively. It is worth noting that such a test oracle can be automated, and we will discuss this when showing our application for validating LLMs considering medical queries.

Combinatorial testing avoids computing all possible test cases. The idea behind this is to consider not all parameter combinations but only those combinations of values for a fixed number k (smaller than n) of parameters. Hence, a combinatorial test suite covers all combinations of values for any subset of parameters of size k, and such a suite is usually substantially smaller. Such a test suite is said to be of strength k or to be a k-wise test suite. If k is 2, then the test suite is a pairwise test suite, and we speak about pairwise testing. Note that in practice, pairwise testing is often not good enough (see [22, 23]). For more information on combinatorial testing and its foundations, we refer the interested reader to [30, 21]. There are many algorithms and tools available, including ACTS [47], for computing combinatorial test suites for arbitrary input models and strengths. It is also worth mentioning that combinatorial testing has been successfully used in many application domains, including autonomous driving [20] and security testing [40].
In the following, we illustrate combinatorial testing using a small example. In this example, we assume four parameters a, b, c, d, all of them only taking values from the Boolean domain {T, F} standing for true and false. A pairwise combinatorial test suite for this input model comprises 6 test cases:

ID | a | b | c | d
1 | T | T | F | F
2 | T | F | T | T
3 | F | T | T | F
4 | F | F | F | T
5 | F | T | F | T
6 | T | F | F | F

For any combination of two parameters, e.g., a and c, this table comprises all possible combinations of values. Rows 1, 2, 3, and 4 already cover all four combinations for these two parameters. For parameters b and d, rows 1, 2, 5, and 6 are required to cover all value combinations. It can be easily checked that this holds also for any other pair of parameters. Note that pairwise testing in this case only requires 6 test cases; considering all combinations, we would have 2^4 = 16 test cases. For the remainder of this paper, as we introduce domains extending beyond Boolean values, we will use indices when referring to parameter values. For {T, F}, the indices would be 0 and 1, respectively, and the first row of the table above would be represented as [0, 0, 1, 1].
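To make the coverage criterion concrete, the following minimal Python sketch (ours, not part of the paper's tooling) verifies that the six-row suite above is indeed pairwise: for every pair of parameters, every combination of their values must occur in at least one test case.

```python
from itertools import combinations, product

def is_pairwise(suite, domains):
    """True iff every value pair of every parameter pair occurs in some test case."""
    for (i, di), (j, dj) in combinations(enumerate(domains), 2):
        covered = {(row[i], row[j]) for row in suite}
        if covered != set(product(di, dj)):
            return False
    return True

# The six test cases from the example above (parameters a, b, c, d over {T, F}).
suite = [
    ("T", "T", "F", "F"),
    ("T", "F", "T", "T"),
    ("F", "T", "T", "F"),
    ("F", "F", "F", "T"),
    ("F", "T", "F", "T"),
    ("T", "F", "F", "F"),
]
print(is_pairwise(suite, [("T", "F")] * 4))  # True, with 6 instead of 2**4 = 16 cases
```

The same check generalizes to strength k by iterating over k-subsets of parameters instead of pairs.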
4 Validation Methodology
Figure 1 gives a high-level overview of our proposed validation methodology. The remainder of this section follows the numbers shown in Figure 1 and discusses the individual elements of our validation pipeline.

(Figure 1: Basic Architecture of Our Validation Methodology. Pipeline steps: (1) input of prompt components and symptoms, (2) combinatorial testing algorithm producing the test suite, (3) system under test: LLM (ChatGPT), (4) golden model: NetDoktor, (5) test evaluation/score via result comparison, i.e., overlap of golden-model and LLM diagnoses.)

The domain of our combinatorial prompt generation pipeline can be seen in Table 1, where parameters are components of a prompt and values are (sub-)phrases. Our prototypical set of parameters comprises a) symptom presentation, which is an introductory sub-phrase to the prompt, b) diagnostic focus, which sets the horizon for which kind of diagnoses are expected, c) an additional hint to consider context information such as age, and d) constraints on how the output should be formulated. Each parameter can assume an indexed value from the given set, and every set of values includes an empty string, which is denoted by "-".

Table 1: Domain: Prompt Components and Values by Index

Parameter (i.e. Prompt Component) | IDX | Value
Symptom Presentation ("Given the following...") | 0 | -
Symptom Presentation | 1 | list of symptoms
Symptom Presentation | 2 | symptoms
Symptom Presentation | 3 | high-level overview of symptoms
Diagnostic Focus | 0 | -
Diagnostic Focus | 1 | a probable diagnosis
Diagnostic Focus | 2 | a differential diagnosis
Diagnostic Focus | 3 | an emergency diagnosis
Diagnostic Focus | 4 | the three most likely diagnoses
Diagnostic Focus | 5 | the ten most likely diagnoses
Contextual Information | 0 | -
Contextual Information | 1 | based on patient's age and gender
Constraints ("the diagnosis should...") | 0 | -
Constraints | 1 | be concise
Constraints | 2 | be detailed including explanations
Constraints | 3 | have less than 100 words

In our prototype implementation, we use pairwise testing for all parameters seen in Table 1. Note that we treat the sets of symptoms separately from the rest of the input parameters: symptoms are excluded from the pairwise combination, as we rather want to combine each set of symptoms with all pairwise combinations of the other parameters. This results in the listed 24 pairwise combinations per set of symptoms. Finally, the output of our combinatorial testing pipeline is textual prompts in natural language that act as test cases for evaluating an LLM. The resulting test suite from pairwise combinations for our exemplary set of symptoms can be seen in the first two columns of Table 3.

For our preliminary evaluation, we used GPT-4o [34, 35] exclusively. The model can be accessed either via ChatGPT [33] or the OpenAI API [32]. Having both a chatbot interface and a programmable API enables rapid prompt prototyping as well as executing larger test suites with the same underlying model. That said, it would be easy to swap out the SUT and test another LLM using the same validation methodology.

We use the expert system "Symptom-Checker" [39], which is curated by medical professionals and freely accessible via NetDoktor [29]. Given a set of symptoms, we can automatically retrieve diagnoses by traversing a decision tree and answering yes-no questions. In addition to "yes" and "no", there is also the option to skip a question when the inquired information is not available. Further questions ask for age, sex, and main symptoms, as well as the body part that is affected the most. We compute a score by comparing results (i.e., diagnoses) from our SUT ChatGPT with our golden model NetDoktor. This is done semi-automatically by first retrieving the diagnoses in the same format and then assessing their overlap. As NetDoktor always yields three diagnoses, our score ranges from 0/3 (no overlap) to 3/3 (complete overlap). In cases where the SUT yields more or fewer than three diagnoses, we do not normalize the score. This means that a result comprising only one diagnosis cannot achieve a complete overlap with NetDoktor and thus cannot achieve a better score than 1/3. In turn, a result comprising 10 diagnoses cannot achieve a higher score than 3/3. It must be noted that duplicate diagnoses are only counted once, and semantic equivalence is considered when comparing diagnoses.
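A minimal Python sketch of these two pipeline steps follows. The phrase templates and their wiring into a sentence are our assumptions, inferred from Table 1 and the fully written-out prompt shown in Subsection 5.1; likewise, the paper judges semantic equivalence of diagnoses manually, whereas this sketch falls back to normalized exact matching.

```python
# Component phrasings taken from Table 1; index 0 is the empty string ("-").
SYMPTOM_PRESENTATION = ["", "list of symptoms", "symptoms", "high-level overview of symptoms"]
DIAGNOSTIC_FOCUS = ["", "a probable diagnosis", "a differential diagnosis",
                    "an emergency diagnosis", "the three most likely diagnoses",
                    "the ten most likely diagnoses"]
CONTEXT = ["", "based on the patient's age and gender"]
CONSTRAINTS = ["", "The diagnosis should be concise.",
               "The diagnosis should be detailed including explanations.",
               "The diagnosis should have less than 100 words."]

def compose_prompt(combination, symptoms):
    """Assemble a natural-language prompt from indexed components (hypothetical wiring)."""
    p, f, c, r = combination
    head = f"Given the following {SYMPTOM_PRESENTATION[p]}," if SYMPTOM_PRESENTATION[p] else ""
    ask = f"provide {DIAGNOSTIC_FOCUS[f]}" if DIAGNOSTIC_FOCUS[f] else "provide diagnoses"
    prompt = " ".join(x for x in (head, ask, CONTEXT[c]) if x) + ". " + symptoms
    return (prompt + " " + CONSTRAINTS[r]).strip() if CONSTRAINTS[r] else prompt

def overlap_score(llm_diagnoses, golden):
    """Matched golden diagnoses out of 3; duplicates count once, no normalization."""
    found = {d.strip().lower() for d in llm_diagnoses}
    return sum(1 for g in golden if g.lower() in found)

golden = ["Breast cancer", "Cyst in the breast", "Mastopathy"]
symptoms = ("An adult woman is experiencing symptoms in the breast gland area. "
            "Her most troubling symptom is fluid discharge, and she can feel a firm, painless lump.")
print(compose_prompt([3, 5, 1, 0], symptoms))  # reproduces prompt 13 from Subsection 5.1
print(overlap_score(["Mastopathy", "Breast cancer", "Fibroadenoma"], golden), "/ 3")  # 2 / 3
```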
5 Medical Use Case
For our use case, we assume the downstream task of retrieving diagnoses from an LLM based on a given set of symptoms. As mentioned earlier in this paper, hallucination is a severe problem when using LLMs. Especially in the medical domain, it is of utmost importance that systems are tested and validated in a structured way, as faulty output might have dire consequences ranging from misinformation to taking the wrong medication. However, LLMs cannot be tested exhaustively due to their non-deterministic nature and other factors, such as computational and monetary costs. Even testing a restricted domain, such as prompt formulation, given a limited set of input parameters, may lead to a combinatorial explosion when testing all possible combinations. For instance, compare the 24 pairwise combinations, as described above, to all 192 possible combinations from the values in Table 1 when used with a reasonably sized corpus of pathologies, such as DDXPlus [42]: this corpus comprises 134,530 samples (i.e., sets of symptoms) from the real world, which yields 25,829,760 possible test cases as compared to 3,228,720 for pairwise testing. Taking cost into account, we look at 206,134.40$ versus 25,766.80$.

Table 2 gives a breakdown of the number of tokens for prompts as well as their cost, assuming the current pricing of GPT-4o, which is 5.00$ and 15.00$ per million input and output tokens, respectively. It has to be mentioned that the input length can be fully controlled, whereas the length of the output can only be guided by the prompt. In this case, we always assumed 475 output tokens, which was the average in our preliminary experiments. The average number of input tokens was 171.1, consisting of 19.1 tokens derived from combinations of values seen in Table 1 and 152 tokens coming from our exemplary set of symptoms, discussed in the following Subsection 5.1.

Table 2: Cost per Size of Medical Corpus

All Combinations:
Corpus Size | 1 | 10 | 100 | ... | DDXPlus
Combinations | 192 | 1,920 | 19,200 | ... | 25,829,760
Input Tokens | 0.03M | 0.33M | 3.29M | ... | 4,419M
Input Cost [$] | 0.16 | 1.64 | 16.43 | ... | 22,097.36
Output Tokens | 0.09M | 0.91M | 9.12M | ... | 12,269M
Output Cost [$] | 1.37 | 13.68 | 136.80 | ... | 184,037.04
Total Cost [$] | 1.53 | 15.32 | 153.23 | ... | 206,134.40

Pairwise Combinations:
Corpus Size | 1 | 10 | 100 | ... | DDXPlus
Combinations | 24 | 240 | 2,400 | ... | 3,228,720
Input Tokens | 0.004M | 0.04M | 0.41M | ... | 552M
Input Cost [$] | 0.02 | 0.21 | 2.05 | ... | 2,762.17
Output Tokens | 0.01M | 0.11M | 1.14M | ... | 1,533M
Output Cost [$] | 0.17 | 1.71 | 17.10 | ... | 23,004.63
Total Cost [$] | 0.19 | 1.92 | 19.15 | ... | 25,766.80
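The totals in Table 2 can be reproduced with a few lines of Python, under the stated assumptions of 171.1 input tokens and 475 output tokens per prompt on average, and GPT-4o pricing of 5.00$ / 15.00$ per million input/output tokens:

```python
def campaign_cost(corpus_size, prompts_per_sample,
                  in_tokens=171.1, out_tokens=475,
                  in_price=5.00, out_price=15.00):
    """Total USD cost of one test campaign; prices are per million tokens."""
    prompts = corpus_size * prompts_per_sample
    return prompts * (in_tokens * in_price + out_tokens * out_price) / 1e6

# DDXPlus: 134,530 symptom sets; 192 exhaustive vs. 24 pairwise prompts per set.
print(f"{campaign_cost(134_530, 192):,.2f} $")  # 206,134.40 $ (all combinations)
print(f"{campaign_cost(134_530, 24):,.2f} $")   # 25,766.80 $ (pairwise)
```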
5.1 Example
To further explain our proposed validation methodology, we use the following exemplary description of symptoms:

"An adult woman is experiencing symptoms in the breast gland area. Her most troubling symptom is fluid discharge, and she can feel a firm, painless lump."

This set of symptoms shall be seen as a sample from a corpus of medical pathologies (i.e., sets of symptoms). We use it to test different prompting strategies by combining the values from our domain, seen in Table 1. From this singular sample, we can generate 24 test cases as per our methodology. These test cases are then used to evaluate our SUT based on the output of our golden model. For this set of symptoms, our golden model diagnoses are:

• Breast cancer
• Cyst in the breast
• Mastopathy

Table 3 shows the test results of all 24 test cases. It can be easily spotted that test case 13 was the only prompt achieving a complete overlap with the NetDoktor diagnoses. When fully written out, prompt 13, corresponding to the combination [3, 5, 1, 0], was:

"Given the following high-level overview of symptoms, provide the ten most likely diagnoses based on the patient's age and gender. An adult woman is experiencing symptoms in the breast gland area. Her most troubling symptom is fluid discharge, and she can feel a firm, painless lump."

Other than the overlap score, there are severe differences in the output depending on the used prompt. Figure 2 illustrates those differences underlined by textual metrics, such as the number of words, while Figure 3 highlights the conciseness of the result as measured by the ratio between the number of words and the number of diagnoses. Most notably, the constraints asking the LLM to provide concise diagnoses or limiting the number of words to 100 reduce the length drastically. As can be seen in Figure 2, the prompts 3, 4, 7, 8, 11, 12, 16, 17, 20, 22, 23, and 24 all yielded results with less than 200 words. However, prompt 7 exceeds the posed 100-word limit. Furthermore, none of these prompts fully overlapped. When comparing the results for conciseness in particular, Figure 3 shows that the ratio between the number of words and the number of diagnoses is less than 50 for all prompts querying the LLM to provide concise responses (i.e., 3, 4, 16, 17, 23, 24), whereas it is above 60 for all, and above 100 for all but one, of the prompts asking for a detailed response (i.e., 5, 6, 9, 10, 19).

(Figure 2: Textual Metrics of LLM Responses: Number of Sentences & Number of Words per Prompt.)
(Figure 3: Conciseness of LLM Responses: Number of Diagnoses & Ratio between Number of Words and Number of Diagnoses per Prompt.)

In an effort to make our work as transparent and reproducible as possible, we provide all prompts and responses of our preliminary study as a replication package: https://zenodo.org/doi/10.5281/zenodo.13765131.
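The textual metrics underlying Figures 2 and 3 are simple to recompute from the released responses; a minimal sketch follows (the naive sentence splitting is our simplification, as the paper does not specify its tokenization):

```python
import re

def textual_metrics(response, diagnoses):
    """Word count, naive sentence count, and the words-per-diagnosis conciseness ratio."""
    words = len(response.split())
    sentences = len([s for s in re.split(r"[.!?]+", response) if s.strip()])
    ratio = words / len(diagnoses) if diagnoses else float("inf")
    return {"words": words, "sentences": sentences, "words_per_diagnosis": ratio}

print(textual_metrics("Mastopathy: benign breast changes. Breast cyst: a fluid-filled sac.",
                      ["Mastopathy", "Breast cyst"]))
# {'words': 9, 'sentences': 2, 'words_per_diagnosis': 4.5}
```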
Table 3: Overlaps of Diagnoses with Golden Model per Combination for One Exemplary Set of Symptoms. Each combination corresponds to one prompt and is denoted by a code representing the indices of the assumed values per prompt component, as seen in Table 1.

ID | Combination | Overlap Score
1 | [0, 0, 0, 0] | 2/3
2 | [1, 1, 1, 0] | 1/3
3 | [2, 2, 1, 1] | 2/3
4 | [3, 3, 0, 1] | 1/3
5 | [3, 4, 1, 2] | 1/3
6 | [2, 5, 0, 2] | 2/3
7 | [1, 5, 0, 3] | 2/3
8 | [0, 4, 1, 3] | 1/3
9 | [0, 3, 1, 2] | 2/3
10 | [1, 2, 0, 2] | 2/3
11 | [2, 0, 1, 3] | 2/3
12 | [3, 1, 0, 3] | 2/3
13 | [3, 5, 1, 0] | 3/3
14 | [2, 4, 0, 0] | 1/3
15 | [1, 3, 0, 0] | 1/3
16 | [0, 1, 0, 1] | 2/3
17 | [1, 0, 0, 1] | 2/3
18 | [0, 2, 0, 0] | 2/3
19 | [2, 1, 0, 2] | 1/3
20 | [3, 2, 0, 3] | 2/3
21 | [3, 0, 0, 2] | 2/3
22 | [2, 3, 0, 3] | 2/3
23 | [0, 5, 0, 1] | 2/3
24 | [1, 4, 0, 1] | 1/3

6 Conclusion
This paper highlights the importance of a structured and rigorous validation methodology for LLMs in the medical domain, particularly focusing on prompt engineering. The proposed validation pipeline makes use of pairwise combinatorial testing to systematically evaluate the responses of LLMs like ChatGPT to medical queries. The methodology generates test cases given sets of symptoms and combinations of prompt components. Combinatorial testing ensures that a wide range of prompt variations is tested per set of symptoms without causing a combinatorial explosion. Doing so is more efficient and reduces costs as compared to testing all possible combinations of prompt parameters, which is especially important when evaluating LLMs on large medical corpora. The proposed validation pipeline implements semi-automated scoring based on a "golden model", which provides diagnoses curated by medical professionals. In a preliminary study, we demonstrate severe differences in output for prompt variations given the same set of symptoms. Out of 24 test cases, only one achieved a full overlap with our golden model when using GPT-4o. Once more, this highlights the dependence on well-formulated prompts and the need for thorough testing strategies, especially in critical domains like medicine.

Acknowledgements
The work presented in this paper was partially funded by the European Union under Grant 101159214 – ChatMED. Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

References
[1] Ian Arawjo, Chelse Swoopes, Priyan Vaithilingam, Martin Wattenberg, and Elena L. Glassman. 2024. ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24). ACM, May 2024, 1-18. doi: 10.1145/3613904.3642016.
[2] Yejin Bang et al. 2023. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. arXiv: 2302.04023 [cs.CL]. https://arxiv.org/abs/2302.04023.
[3] Stephen Brade, Bryan Wang, Mauricio Sousa, Sageev Oore, and Tovi Grossman. 2023. Promptify: Text-to-image generation through interactive prompt exploration with large language models. arXiv: 2304.09337 [cs.HC]. https://arxiv.org/abs/2304.09337.
[4] Tom B. Brown et al. 2020. Language models are few-shot learners. arXiv: 2005.14165 [cs.CL].
[5] David Cecchini, Arshaan Nazir, Kalyan Chakravarthy, and Veysel Kocaman. 2024. Holistic evaluation of large language models: Assessing robustness, accuracy, and toxicity for real-world applications. In Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024). Association for Computational Linguistics, Mexico City, Mexico, June 2024, 109-117. doi: 10.18653/v1/2024.trustnlp-1.11.
[6] Yupeng Chang et al. 2023. A survey on evaluation of large language models. arXiv: 2307.03109 [cs.CL]. https://arxiv.org/abs/2307.03109.
[7] Wei-Lin Chiang et al. 2024. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv: 2403.04132 [cs.AI].
[8] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv: 1905.10044 [cs.CL]. https://arxiv.org/abs/1905.10044.
[9] J. Clusmann et al. 2023. The future landscape of large language models in medicine. Communications Medicine, 3, 141. doi: 10.1038/s43856-023-00370-1.
[10] Dat Duong and Benjamin D. Solomon. 2023. Analysis of large-language model versus human performance for genetics questions. medRxiv. doi: 10.1101/2023.01.27.23285115.
[11] Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. 2024. Detecting hallucinations in large language models using semantic entropy. Nature, 630. doi: 10.1038/s41586-024-07421-0.
[12] 2023. Applying pairwise combinatorial testing to large language model testing. Springer, Sept. 2023, 247-256. isbn: 978-3-031-43239-2. doi: 10.1007/978-3-031-43240-8_16.
[13] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. arXiv: 2009.03300 [cs.CY]. https://arxiv.org/abs/2009.03300.
[14] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. arXiv: 2103.03874 [cs.LG]. https://arxiv.org/abs/2103.03874.
[15] Dan Hendrycks et al. 2021. Measuring coding challenge competence with APPS. arXiv: 2105.09938 [cs.SE]. https://arxiv.org/abs/2105.09938.
[16] Jason Holmes et al. 2023. Evaluating large language models on a highly-specialized topic, radiation oncology physics. Frontiers in Oncology, 13, July 2023. doi: 10.3389/fonc.2023.1219326.
[17] Israt Jahan, Md Tahmid Rahman Laskar, Chun Peng, and Jimmy Huang. 2023. Evaluation of ChatGPT on biomedical tasks: A zero-shot comparison with fine-tuned generative transformers. arXiv: 2306.04504 [cs.CL]. https://arxiv.org/abs/2306.04504.
[18] Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. 2023. Bring your own data! Self-supervised evaluation for large language models. arXiv: 2306.13651 [cs.CL]. https://arxiv.org/abs/2306.13651.
[19] Ellen Jiang, Kristen Olson, Edwin Toh, Alejandra Molina, Aaron Donsbach, Michael Terry, and Carrie J Cai. 2022. PromptMaker: Prompt-based prototyping with large language models. In Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (CHI EA '22), Article 35. Association for Computing Machinery, New Orleans, LA, USA, 8 pages. isbn: 9781450391566. doi: 10.1145/3491101.3503564.
[20] Florian Klück, Yihao Li, Jianbo Tao, and Franz Wotawa. 2023. An empirical comparison of combinatorial testing and search-based testing in the context of automated and autonomous driving systems. Information and Software Technology, 160, 107225. doi: 10.1016/j.infsof.2023.107225.
[21] D.R. Kuhn, R.N. Kacker, and Y. Lei. 2013. Introduction to Combinatorial Testing. Chapman & Hall/CRC Innovations in Software Engineering and Software Development Series. Taylor & Francis.
[22] D.R. Kuhn, R.N. Kacker, Y. Lei, and J. Hunter. 2009. Combinatorial software testing. Computer, Aug. 2009, 94-96.
[23] Rick Kuhn, Yu Lei, and Raghu Kacker. 2008. Practical combinatorial testing: Beyond pairwise. IT Professional, 10, 3, 19-23.
[24] Boniphace Kutela, Kelvin Msechu, Norris Novat, Emmanuel Kidando, and Angela Kitali. 2023. Uncovering the influence of ChatGPT's prompts on scientific writings using machine learning-based text mining approaches. SSRN Electronic Journal. doi: 10.2139/ssrn.4385895.
[25] Percy Liang et al. 2023. Holistic evaluation of language models. arXiv: 2211.09110 [cs.CL]. https://arxiv.org/abs/2211.09110.
[26] Yen-Ting Lin and Yun-Nung Chen. 2023. LLM-Eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models. arXiv: 2305.13711 [cs.CL]. https://arxiv.org/abs/2305.13711.
[27] Aditi Mishra, Utkarsh Soni, Anjana Arunkumar, Jinbin Huang, Bum Chul Kwon, and Chris Bryan. 2023. PromptAid: Prompt exploration, perturbation, testing and iteration using visual analytics for large language models. arXiv: 2304.01964 [cs.HC]. https://arxiv.org/abs/2304.01964.
[28] M. Nazari and G. Saadi. 2024. Developing effective prompts to improve communication with ChatGPT: A formula for higher education stakeholders. Discover Education, 3. doi: 10.1007/s44217-024-00122-w.
[29] 2024. NetDoktor. https://www.netdoktor.at. Accessed: 2024-09-03.
[30] Changhai Nie and Hareton Leung. 2011. A survey of combinatorial testing. ACM Computing Surveys, 43, 2, Article 11, Feb. 2011, 11:1-11:29.
[31] Namkee Oh, Gyu-Seong Choi, and Woo Yong Lee. 2023. ChatGPT goes to the operating room: Evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models. Annals of Surgical Treatment and Research, 104, Apr. 2023, 269. doi: 10.4174/astr.2023.104.5.269.
[32] OpenAI. 2023. API Reference. https://platform.openai.com/docs/api-reference.
[33] OpenAI. 2023. ChatGPT. https://chat.openai.com/chat.
[34] OpenAI. 2023. GPT-4 technical report. arXiv: 2303.08774 [cs.CL].
[35] OpenAI. 2024. Introducing GPT-4o and more tools to ChatGPT free users. May 2024. https://openai.com/index/gpt-4o-and-more-tools-to-chatgpt-free/.
[36] Jason Phang, Angelica Chen, William Huang, and Samuel R. Bowman. 2021. Adversarially constructed evaluation sets are more challenging, but may not be fair. arXiv: 2111.08181 [cs.CL]. https://arxiv.org/abs/2111.08181.
[37] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21, 140, 1-67. http://jmlr.org/papers/v21/20-074.html.
[38] Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. arXiv: 2005.04118 [cs.CL]. https://arxiv.org/abs/2005.04118.
[39] Jens Richter, Hans-Richard Demel, Florian Tiefenböck, Luise Heine, and Martina Feichter. 2024. Symptom-Checker. https://www.netdoktor.at/symptom-checker/. Accessed: 2024-09-03.
[40] Dimitris E. Simos, Josip Bozic, Bernhard Garn, Manuel Leithner, Feng Duan, Kristoffer Kleine, Yu Lei, and Franz Wotawa. 2019. Testing TLS using planning-based combinatorial methods and execution framework. Software Quality Journal, 27. doi: 10.1007/s11219-018-9412-z.
[41] Karan Singhal et al. 2022. Large language models encode clinical knowledge. arXiv: 2212.13138 [cs.CL]. https://arxiv.org/abs/2212.13138.
[42] Arsene Fansi Tchango, Rishab Goel, Zhi Wen, Julien Martel, and Joumana Ghosn. 2022. DDXPlus: A New Dataset for Automatic Medical Diagnosis. arXiv: 2205.09148 [cs.CL]. https://arxiv.org/abs/2205.09148.
[43] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars. arXiv: 1708.08559 [cs.SE]. https://arxiv.org/abs/1708.08559.
[44] Hugo Touvron et al. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv: 2302.13971 [cs.CL]. https://arxiv.org/abs/2302.13971.
[45] Yidong Wang et al. 2024. PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. arXiv: 2306.05087 [cs.CL]. https://arxiv.org/abs/2306.05087.
[46] Yijun Xiao and William Yang Wang. 2021. On hallucination and predictive uncertainty in conditional language generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. doi: 10.18653/v1/2021.eacl-main.236.
[47] Linbin Yu, Yu Lei, R.N. Kacker, and D.R. Kuhn. 2013. ACTS: A Combinatorial Test Generation Tool. In 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation (ICST), 370-375.
[48] J.D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, and Qian Yang. 2023. Why Johnny Can't Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23), Article 437. Association for Computing Machinery, Hamburg, Germany, 21 pages. isbn: 9781450394215. doi: 10.1145/3544548.3581388.
[49] Lianmin Zheng et al. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/forum?id=uccHPGDlao.
[50] Caleb Ziems, William Held, Omar Shaikh, Jiaao Chen, Zhehao Zhang, and Diyi Yang. 2024. Can large language models transform computational social science? arXiv: 2305.03514 [cs.CL]. https://arxiv.org/abs/2305.03514.

Meeting Cultural and Linguistic Demands to Accommodate Fine-Tuned LLMs to Local Medical Customs and Patient Communication

Gordana Petrovska Dojchinovska¹, Monika Simjanovska Misheva¹, Kostadin Mishev¹
¹Ss. Cyril and Methodius University, Faculty of Computer Science and Engineering, Skopje, N. Macedonia
gordana.petrovska.dojchinovska@students.finki.ukim.mk, {monika.simjanoska, kostadin.mishev}@finki.ukim.mk
https://doi.org/10.70314/is.2024.chtm.11
ABSTRACT
Integrating advanced open-source large language models (LLMs), such as LLaMA and GatorTron, into healthcare offers a novel approach to enhancing communication between physicians and patients. This paper provides a comprehensive review of the potential of these models to improve patient-provider interactions, focusing on their ability to process and generate human-like language in real-time clinical settings. The review outlines the methodology used to evaluate LLMs, which includes a detailed comparison based on qualitative factors such as linguistic adaptability, cultural sensitivity, and context-awareness, alongside quantitative metrics such as accuracy rates, error margins, and patient satisfaction scores from clinical studies. Key ethical considerations are explored, particularly concerning data privacy, patient consent, and accountability. The paper delves into how adopting specific ethical frameworks or guidelines can help mitigate risks associated with bias, misinformation, and patient autonomy. Additionally, the potential for LLMs to perpetuate biases or cultural misunderstandings is discussed, emphasizing the need for fine-tuning these models to align with clinical guidelines and patient needs across different geographies and medical customs. While the paper acknowledges the gaps in current research, such as the emotional sensitivity of LLMs and their ability to understand nuanced patient concerns, it also proposes structured future research directions. This includes the development of LLMs that are more contextually aware, emotionally intelligent, and capable of operating in diverse healthcare settings. By synthesizing current studies and real-world applications, this paper aims to provide a transparent, reproducible framework for evaluating the effectiveness of fine-tuned LLMs in transforming healthcare communication, ultimately improving patient outcomes and satisfaction.

KEYWORDS
large language models, healthcare, patient-doctor communication, ethics in AI, cultural sensitivity

1 INTRODUCTION
Effective communication has always been at the forefront of successful interaction between physicians and patients. It has already been purported that integrating cultural and linguistic competency into healthcare policies, provider training, and patient care strategies will improve the quality of care for diverse populations[5]. Conversely, language barriers, cultural misunderstandings, and a lack of cultural awareness among healthcare providers can lead to miscommunication, misdiagnosis, medication errors, and other safety risks[11]. Large Language Models (LLMs) have shown immense potential in various fields due to their proficiency in generating and understanding natural language. In healthcare, models such as LLaMA and GatorTron present an exciting frontier for improving communication between physicians and patients. These models can process and generate human-like language, which could address significant challenges in clinical communication, such as linguistic, cultural, or emotional barriers. However, significant challenges remain, particularly regarding ethical implications, technical hurdles, local customs, linguistic demands, and the critical aspect of patient-doctor communication. This review examines current applications of LLMs in healthcare, the associated challenges, ethical concerns, and potential gaps, incorporating a range of recent research studies in the field from 2023 and early 2024.

1.1 Purpose and Scope
This paper aims to critically examine the potential of integrating LLMs into healthcare settings, emphasizing their role in enhancing communication. The scope of this review extends to a comparative evaluation of LLaMA, GPT-4, L2M3, and GatorTron, focusing on their potential impact, ethical considerations, and future research needs. The goal is to provide healthcare practitioners, researchers, and developers with a transparent and reproducible framework for adopting LLMs.
2 METHODOLOGY
This review systematically analyzes the current applications of LLMs in healthcare by conducting a structured comparison of models such as GPT-4, LLaMA, GatorTron, and L2M3. The evaluation criteria include both qualitative and quantitative metrics, focusing on model accuracy, error rates, patient satisfaction scores, and the ability to generate culturally sensitive and clinically appropriate responses. The sources for this review consist of peer-reviewed studies published between 2023 and 2024, covering applications of LLMs in real-time clinical settings, mental health, and patient-doctor communication. Each model's performance is assessed based on specific clinical tasks, adaptability to local medical customs, and ethical considerations. This methodology ensures transparency and reproducibility, allowing future researchers to replicate and validate the findings.

3 CURRENT APPLICATION OF LLMS IN HEALTHCARE
LLMs are being adopted for various tasks within healthcare, with their most prominent use in medical communication, clinical decision support, and facilitating patient-doctor interactions. Conversational agents like ChatGPT assist patients in understanding medical terminology and provide immediate answers to healthcare-related queries, often enhancing patient engagement with care providers. LLMs also bridge communication gaps between patients and healthcare providers, improving clarity and comprehension in patient education[25]. In patient-doctor communication, LLMs serve as intermediaries to help patients articulate their symptoms and concerns more clearly. This enhances clinical encounters, as doctors can quickly grasp patient issues without the noise of miscommunication. They can improve the precision of patient-doctor dialogue, particularly when addressing complex conditions and explaining treatment plans[21]. Similarly, mental health applications have seen LLMs being used to summarize counseling sessions, allowing healthcare providers to focus on therapeutic interventions rather than administrative tasks[1]. Beyond patient communication, LLMs assist clinicians with diagnoses and medical research. GPT-4 and LLaMA are increasingly used as clinical assistants, offering support in diagnosis and treatment planning[29]. Large language models enhance decision-making in surgical care by answering patients' questions, thereby improving pre-surgical communication and reducing patient anxiety[18]. Figure 1 shows the areas of the medical domain where LLMs are currently being applied.

(Figure 1: Applications of large language models in healthcare)
4 IMPROVING PATIENT-DOCTOR COMMUNICATION
Effective communication between patients and doctors is a cornerstone of healthcare. Miscommunication can lead to misunderstanding, reduced adherence to treatment plans, and dissatisfaction with care. LLMs offer a promising solution to enhance communication by simplifying medical information into layman's terms. The role of LLMs in improving patient-doctor interactions is significant, as they ensure the translation of medical jargon into easily understandable language during consultations, ultimately leading to more informed decision-making and patient compliance[19]. Recent research explores how generative AI tools, including LLMs, have been applied to reduce misunderstandings in clinical services. They show that patients often feel overwhelmed by medical language, and LLMs can mitigate this issue by acting as interpreters, promoting clearer communication between patients and their healthcare providers[28]. A pictorial demonstration of the ways large language models can improve the communication between patients and their providers is given in Figure 2.

(Figure 2: Patient-doctor communication facilitated by the use of LLMs)

There is also research that addresses how LLMs can cater to multilingual settings, facilitating better communication in regions with diverse linguistic backgrounds. This ensures that patients from low-resource, multilingual regions receive equitable healthcare without language barriers compromising the patient-doctor relationship[3]. However, the risks of over-reliance on LLMs in direct patient communication must also be acknowledged. Some research argues that cultural sensitivity and local medical customs play a crucial role in healthcare communication. LLMs that fail to adapt to these factors may risk undermining trust between patients and healthcare providers. For instance, culturally inappropriate language or failure to understand local medical norms could weaken the patient-doctor bond and compromise care[11]. The diagram in Figure 3 shows a basic workflow of the integration of LLMs in healthcare.

(Figure 3: Workflow of LLM integration in healthcare)

5 COMPARATIVE ANALYSIS OF THE MOST PROMINENT LLMS IN HEALTHCARE
Our review would be incomplete if we failed to mention the most prominent large language models that have been fine-tuned for applications in the medical domain. Therefore, this section will highlight the strong points and then perform a comparative analysis between GPT-4, LlamaCare, GatorTron, and L2M3. A graphic display of these models and their most prominent features in the medical domain can be found in Figure 4, while Table 1 below summarizes the models' strong points, current areas that can be improved, and areas where research can be focused in the future.

(Figure 4: Overview of the key LLMs that are applied in healthcare)
Table 1: Comparison of LLMs in Healthcare Applications

Model Name | Primary Applications | Cultural and Linguistic Adaptations | Current Limitations | Future Research Needs
GPT-4 | Diagnostic support, patient communication | Multilingual capabilities, cultural sensitivity | Bias, accuracy issues | Domain-specific adaptations, ethical frameworks
L2M3 | Health equity, multilingual support | Designed for multiple languages, cultural context integration | Limited domain-specific fine-tuning | Research on impact in low-resource regions
LlamaCare | Healthcare knowledge sharing | Tailored for healthcare terminology, multilingual support | May lack emotional sensitivity | Enhanced cultural adaptation, emotional sensitivity
GatorTron | Clinical decision-making, patient interaction | Adapted for diverse clinical contexts | Potential bias, limited contextual awareness | Improving contextual understanding and bias reduction
Additional Models (e.g., Me-LLaMA) | Specific clinical domains, patient interaction | Customizable for local languages and cultures | Challenges with generalizability | Further development for specialized clinical needs

5.1 GPT-4
General Application in Medicine. As one of the most versatile LLMs, GPT-4 has been widely studied for its role in clinical decision-making, medical education, and patient engagement. GPT-4's broad applications range from summarizing medical records to assisting with diagnosis and patient interaction. However, GPT-4's lack of specialized medical training means it faces challenges when compared to models like LlamaCare and GatorTron[23].

Patient-Doctor Communication. GPT-4 has also demonstrated potential in improving patient communication by generating empathetic, human-like responses. The model can generate emotionally intelligent text, potentially enhancing patient trust in AI-generated advice[17]. Despite this, GPT-4 struggles with more complex medical queries where detailed clinical context is needed[24].
However, GPT-4’s lack of special-perform models like GPT-4 when it comes to specialized clinical ized medical training means it faces challenges when compared to decision-making[26]. models like LlamaCare and GatorTron[23]. Patient Communication. Although LlamaCare’s primary role Patient-Doctor Communication. GPT-4 has also demonstrated is to assist healthcare providers, it can also be used to improve potential in improving patient communication by generating empa-patient communication by offering detailed and reliable medical thetic, human-like responses. The model can generate emotionally information. However, unlike GPT-4, which is more conversational, intelligent text, potentially enhancing patient trust in AI-generated LlamaCare’s focus remains on delivering precise medical knowledge advice[17]. Despite this, GPT-4 struggles with more complex medi-rather than generating empathetic dialogue[4]. cal queries where detailed clinical context is needed[24]. 5.2 L2M3 5.4 GatorTron Multilingual Healthcare Accessibility. One of L2M3’s greatest Focus on Medical Records. GatorTron is being specifically de- strengths is its support for multilingual healthcare environments. signed for the U.S. healthcare system, where it excels in processing L2M3 is a model designed specifically for low-resource healthcare vast amounts of electronic health records (EHRs). Its ability to settings where language barriers and cultural diversity pose signifi-rapidly synthesize and analyze patient data has made it an essential cant challenges. By offering support for multiple languages, L2M3 tool for improving diagnostic accuracy and reducing administrative 44 IS2024 - CHATGPT in MEDICINE, Meeting Cultural and Linguistic Demands to Accommodate Fine-Tuned LLMs to Local Medical Customs and Patient Communication burden[2]. GatorTron outperforms models like GPT-4 in this regard, 6.2 Ethical Implications of LLMs in Healthcare owing to its fine-tuning on clinical records[7]. The ethical implications of deploying LLMs in healthcare extend Enhancing Clinical Workflows. One of GatorTron’s key strengths beyond technical challenges, touching on critical issues such as is its ability to streamline clinical workflows by accurately sum-data privacy, patient consent, and accountability. are substantial, marizing medical histories, diagnoses, and treatment plans.[14]. particularly in terms of biases, transparency, and patient autonomy. GatorTron enhances clinical efficiency by reducing the amount of While current studies briefly mention these concerns, this review time doctors spend on administrative tasks like reviewing patient delves deeper into specific ethical frameworks that should guide charts, allowing them to focus more on patient care.[16]. the development and deployment of LLMs. One such framework Adaptation to U.S. Healthcare. GatorTron’s design is opti- is the "Data Protection by Design" principle, which emphasizes mized for the U.S. healthcare system, making it particularly effective the need to integrate privacy safeguards into AI systems from the in this context. However, this focus on English-language records outset. In addition, patient consent must be a core element when means it may not be as adaptable in multilingual or international implementing LLMs in clinical settings, ensuring that patients are healthcare environments. GatorTron’s success illustrates the need fully aware of how their data is used and stored. 
5.5 Comparative Analysis
Local Medical Customs and Linguistic Demands. When comparing L2M3 and GatorTron, L2M3's strength in multilingual settings stands out, which makes it ideal for global healthcare applications. GatorTron, by contrast, is more specialized for English-speaking environments. GPT-4 offers broader utility but lacks the cultural and linguistic specificity of L2M3 and the clinical precision of GatorTron and LlamaCare[23].

Patient Communication. GPT-4 and LlamaCare both demonstrate potential in patient-doctor communication, but they serve different purposes. GPT-4 excels in generating empathetic responses, while LlamaCare offers more medically precise information. L2M3's focus on multilingual communication gives it an edge in culturally diverse settings, but GatorTron's role remains more data-focused[17][9][26].

Customization and Accuracy. Both GatorTron and LlamaCare excel in accuracy due to their fine-tuning on medical data[30]. GPT-4, while highly versatile, does not have the same level of specialization[23]. L2M3, on the other hand, is a standout for global healthcare, particularly in low-resource regions[3].

6 CHALLENGES
6.1 Technical Challenges of Implementing Medical LLMs
Despite promising applications, adapting LLMs to specific medical tasks presents technical challenges. A major issue is the need for domain-specific training data. The models need considerable fine-tuning for clinical natural language processing (NLP) tasks, making zero-shot learning an emerging solution[20]. When it comes to the application of zero-shot learning in preventive healthcare, it has the ability to provide accurate decision support even in niche medical contexts[13]. In addition, fine-tuning LLMs to accommodate local medical customs and linguistic variations is critical for effective patient communication across diverse healthcare settings. That is why multilingual models like L2M3, which cater to low-resource regions where local dialects and cultural practices significantly influence healthcare delivery, have such importance[3]. Without incorporating local linguistic demands and medical traditions, LLMs risk misinterpretation, leading to misdiagnosis or poor patient outcomes.
6.2 Ethical Implications of LLMs in Healthcare

The ethical implications of deploying LLMs in healthcare extend beyond technical challenges, touching on critical issues such as data privacy, patient consent, and accountability. These concerns are substantial, particularly in terms of biases, transparency, and patient autonomy. While current studies mention them only briefly, this review delves deeper into specific ethical frameworks that should guide the development and deployment of LLMs. One such framework is the "Data Protection by Design" principle, which emphasizes the need to integrate privacy safeguards into AI systems from the outset. In addition, patient consent must be a core element when implementing LLMs in clinical settings, ensuring that patients are fully aware of how their data is used and stored. Accountability in AI decision-making also needs to be addressed, particularly in high-stakes scenarios like diagnostics and treatment planning, where errors can have life-altering consequences. Finally, the risk of bias and of perpetuating healthcare inequities through LLMs necessitates stricter ethical oversight, with transparent mechanisms for identifying and mitigating bias. Racial and ethnic biases in GPT-4 have been explored for medical diagnosis and triage, uncovering disparities in the model’s responses; these findings underscore the need for ethical oversight when using LLMs in culturally diverse settings[10]. Furthermore, LLMs could reinforce healthcare inequities if not properly adapted to the specific needs and practices of various cultures[6]. Additionally, LLMs must account for linguistic diversity in global healthcare contexts. For example, healthcare systems in multilingual nations, such as India, require models that can function across multiple languages while understanding the nuances of local medical customs. There is an innate link between culture, language, and patient safety, making linguistic competency critical for patient-centered care[11]. The LlamaCare model provides a framework for sharing healthcare knowledge across diverse linguistic groups, showing how LLMs can facilitate cross-cultural knowledge sharing in healthcare[22]. Figure 5 focuses on the challenges, as opposed to the benefits, of applying LLMs in the healthcare domain.

Figure 5: The benefits and challenges of applying large language models in the medical domain
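A minimal sketch of what "Data Protection by Design" can mean in practice is shown below: obvious identifiers are pseudonymized before any text leaves the clinical system for an LLM. The regular expressions and the sample note are illustrative only; real de-identification requires validated tooling and HIPAA/GDPR-compliant review, not regexes alone.

# Naive pseudonymization sketch: strip obvious identifiers BEFORE an LLM
# call. Patterns are illustrative, not a complete de-identification scheme.
import re

PATTERNS = {
    "NAME":  re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "MRN":   re.compile(r"\bMRN[:\s]*\d+\b"),
}

def pseudonymize(text: str) -> str:
    """Replace matched identifiers with typed placeholders."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

note = "Mr. Novak (MRN: 481273, 031-555-0192) reports chest pain."
print(pseudonymize(note))
# -> "[NAME] ([MRN], [PHONE]) reports chest pain."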
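Likewise, a transparent bias-identification mechanism can start as a simple counterfactual probe in the spirit of [10]: hold the clinical vignette fixed, vary only the stated demographic, and compare the model’s outputs. The query_llm function below is a hypothetical stand-in for whichever model is being audited.

# Counterfactual bias probe: identical vignette, varied demographics.
# Divergent answers across demographics flag potential bias for review.
from collections import Counter

VIGNETTE = "A {demo} patient presents with acute chest pain and dyspnea. Triage level?"
DEMOGRAPHICS = ["white male", "Black female", "Hispanic male", "Asian female"]

def query_llm(prompt: str) -> str:
    # Placeholder: replace with a real model call in an actual audit.
    return "urgent"

def audit() -> Counter:
    """Count each (demographic, answer) pair returned by the model."""
    answers = Counter()
    for demo in DEMOGRAPHICS:
        answers[(demo, query_llm(VIGNETTE.format(demo=demo)))] += 1
    return answers

print(audit())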
7 EVALUATING THE EFFECTIVENESS OF LLMS IN HEALTHCARE

The evaluation of LLMs in healthcare requires a combination of qualitative and quantitative metrics to ensure a balanced assessment of their performance. In addition to the qualitative analysis of their language generation capabilities, this review incorporates quantitative metrics such as accuracy rates, error rates on specific tasks (e.g., diagnosis or patient communication), and patient satisfaction scores derived from recent studies. These metrics provide a robust framework[9] for comparing LLMs across various clinical applications. For example, models like GPT-4 and LlamaCare have been evaluated for their diagnostic accuracy, while GatorTron has shown promise in reducing administrative burdens through improved EHR management. Quantitative evidence, such as the accuracy rates of LLMs in diagnosing rare conditions or the error rates in patient-facing applications, strengthens the argument for their continued refinement and deployment in healthcare.

Evaluating LLMs in medical applications with a focus on adaptability to local linguistic demands, accuracy, and transparency requires a detailed framework[9]. Checklists such as METRICS serve as tools for standardizing LLM evaluations, ensuring they are tailored to local healthcare systems and relevant cultural needs[19]. In multilingual and multicultural settings, the evaluation of LLMs should prioritize their ability to maintain cultural and linguistic appropriateness. The L2M3 model, designed for regions with limited healthcare resources, demonstrates how multilingual and culturally aware LLMs can improve healthcare outcomes in underserved communities[3].
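As a concrete illustration of the quantitative side of this framework, the sketch below computes an accuracy rate, an error rate, and a mean patient-satisfaction score. The data is a toy example, not results from the cited studies.

# Toy computation of the quantitative metrics named above.
predicted = ["pneumonia", "migraine", "angina", "pneumonia"]
reference = ["pneumonia", "tension headache", "angina", "bronchitis"]
satisfaction = [4, 5, 3, 4]  # e.g., 1-5 Likert ratings of model replies

correct = sum(p == r for p, r in zip(predicted, reference))
accuracy = correct / len(reference)
error_rate = 1.0 - accuracy
mean_satisfaction = sum(satisfaction) / len(satisfaction)

print(f"accuracy={accuracy:.2f}, error rate={error_rate:.2f}, "
      f"mean satisfaction={mean_satisfaction:.1f}/5")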
8 GAPS IN THE EXISTING RESEARCH

As LLMs become more prevalent in clinical environments, the focus on improving patient-doctor communication and addressing cultural and linguistic considerations will be crucial. Over-reliance on LLMs without adequate human oversight, particularly in sensitive clinical areas like diagnostics and treatment planning, is dangerous[29]. Additionally, stringent ethical guidelines are needed to ensure that LLMs do not perpetuate healthcare inequities[6]. The most notable gap identified in the papers evaluated for this review is the limited focus on multicultural and linguistic diversity: more comprehensive studies and models specifically designed for non-Western and linguistically diverse healthcare settings are needed, and LLMs must be fine-tuned to local medical practices and languages to truly serve global healthcare needs. The second gap is the bias present in LLM healthcare recommendations[10][6]; there is a lack of clear, actionable frameworks for systematically identifying and reducing such bias, and addressing it will require a multidisciplinary approach combining AI ethics, clinical expertise, and sociocultural considerations. The third gap is the lack of emotional sensitivity in patient-doctor communication[1][21][18]. Finally, there is the danger of over-reliance on LLMs for decision support[6][29]. Still, the field is promising, and exploring future directions for research is worthwhile.

9 FUTURE DIRECTIONS FOR RESEARCH

The potential of LLMs in healthcare remains largely untapped, with significant gaps in current research that need addressing. One critical area for future investigation is the development of LLMs that are emotionally sensitive and capable of handling high-stakes, emotionally charged patient interactions. For example, models could be designed to recognize emotional cues in patient language and adjust their responses to provide more empathetic care. Another vital research direction is improving the contextual awareness of LLMs, particularly in culturally diverse settings where understanding local customs, languages, and medical practices is essential. Furthermore, future research should explore ways to integrate LLMs with other healthcare technologies, such as electronic health records (EHRs), to streamline clinical workflows while ensuring data privacy. Finally, the creation of robust, standardized evaluation frameworks[19][9] will be crucial in assessing the long-term effectiveness of LLMs in healthcare, focusing on patient outcomes, model accuracy, and ethical and regulatory compliance[6][15].

10 CONCLUSION

LLMs hold immense potential to revolutionize healthcare by enhancing diagnostic support, improving patient-doctor communication, and facilitating equitable care. Their success, however, hinges on addressing challenges such as local medical customs, linguistic diversity, and ethical concerns. Proper adaptation of LLMs can significantly enhance patient-centered care by making communication between patients and healthcare providers more culturally sensitive and linguistically appropriate. Prioritizing inclusivity and transparency is essential for improving healthcare outcomes and equity. Future research must focus on addressing cultural and linguistic diversity, improving emotional sensitivity in patient-doctor interactions, mitigating biases, and establishing ethical and legal frameworks for AI in healthcare. Specialized research is also needed to tailor LLMs to specific clinical domains, such as mental health and surgery, to ensure these tools are safe, reliable, and contextually aware.

11 ACKNOWLEDGEMENT

This project is funded by the European Union under Horizon Europe (project ChatMED, grant agreement ID: 101159214).

REFERENCES

[1] Prottay Kumar Adhikary, Aseem Srivastava, Shivani Kumar, Salam Michael Singh, Puneet Manuja, Jini K Gopinath, Vijay Krishnan, Swati Kedia Gupta, Koushik Sinha Deb, and Tanmoy Chakraborty. Exploring the efficacy of large language models in summarizing mental health counseling sessions: Benchmark study. JMIR Mental Health, 11:e57306, 2024.
[2] Xi Chen. Large language models in the medical field: Principles and applications. International Journal of Computer Science and Information Technology, 2(3):219–224, 2024.
[3] Agasthya Gangavarapu. Introducing L2M3, a multilingual medical large language model to advance health equity in low-resource regions, 2024.
[4] Marius Geantă, Daniel Bădescu, Narcis Chirca, Ovidiu Cătălin Nechita, Cosmin George Radu, Stefan Rascu, Daniel Rădăvoi, Cristian Sima, Cristian Toma, and Viorel Jinga. The potential impact of large language models on doctor–patient communication: A case study in prostate cancer. In Healthcare, volume 12, page 1548. MDPI, 2024.
[5] Tawara D Goode, M Claire Dunne, and Suzanne Bronheim. The evidence base for cultural and linguistic competency in health care. Commonwealth Fund, New York, NY, 2006.
[6] Joschka Haltaufderheide and Robert Ranisch. The ethics of ChatGPT in medicine and healthcare: A systematic review on large language models (LLMs). NPJ Digital Medicine, 7(1):183, 2024.

Indeks avtorjev / Author index

Binić Iva, 17
Bolješić Luka, 7
Dodevski Zlate, 11
Gams Matjaž, 7, 26
Janković Isidora, 32
Janković Sonja, 32
Kocuvan Primož, 7, 26
Korlat Alex, 7
Lukić Stevo, 17, 32
Mishev Kostadin, 42
Nica Iulia, 36
Pavleska Tanja, 11
Perko Alexander, 20, 36
Petrović Jovana, 17
Petrovska Dojchinovska Gordana, 42
Simjanoska Misheva Monika, 42
Svetozarević Mihailo, 32
Trajkovik Vladimir, 11
Vacev Maša, 17
Wotawa Franz, 20, 36
Zadobovšek Matic, 26