Zbornik 28. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2025, Zvezek K
Proceedings of the 28th International Multiconference INFORMATION SOCIETY – IS 2025, Volume K

Uporaba UI v zdravstvu / AI in Healthcare

Urednika / Editors: Matjaž Gams, Žiga Kolar
http://is.ijs.si
8. oktober 2025 / 8 October 2025, Ljubljana, Slovenia

Urednika: Matjaž Gams, Odsek za inteligentne sisteme, Institut »Jožef Stefan«, Ljubljana, Slovenija; Žiga Kolar, Odsek za inteligentne sisteme, Institut »Jožef Stefan«, Ljubljana
Založnik: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak
Oblikovanje naslovnice: Vesna Lasič, uporabljena slika iz Pixabay
Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety
Ljubljana, oktober 2025
Informacijska družba, ISSN 2630-371X
DOI: https://doi.org/10.70314/is.2025.gpt
Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID 255565059
ISBN 978-961-264-326-3 (PDF)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2025

28. mednarodna multikonferenca Informacijska družba se odvija v času izjemne rasti umetne inteligence, njenih aplikacij in vplivov na človeštvo. Vsako leto vstopamo v novo dobo, v kateri generativna umetna inteligenca ter drugi inovativni pristopi oblikujejo poti k superinteligenci in singularnosti, ki bosta krojili prihodnost človeške civilizacije. Naša konferenca je tako hkrati tradicionalna znanstvena in akademsko odprta, pa tudi inkubator novih, pogumnih idej in pogledov.

Letošnja konferenca poleg umetne inteligence vključuje tudi razprave o perečih temah današnjega časa: ohranjanje okolja, demografski izzivi, zdravstvo in preobrazba družbenih struktur. Razvoj UI ponuja rešitve za številne sodobne izzive, kar poudarja pomen sodelovanja med raziskovalci, strokovnjaki in odločevalci pri oblikovanju trajnostnih strategij. Zavedamo se, da živimo v obdobju velikih sprememb, kjer je ključno, da z inovativnimi pristopi in poglobljenim znanjem ustvarimo informacijsko družbo, ki bo varna, vključujoča in trajnostna.

V okviru multikonference smo letos združili dvanajst vsebinsko raznolikih srečanj, ki odražajo širino in globino informacijskih ved: od umetne inteligence v zdravstvu, demografskih in družinskih analiz, digitalne preobrazbe zdravstvene nege ter digitalne vključenosti v informacijski družbi, do raziskav na področju kognitivne znanosti, zdrave dolgoživosti ter vzgoje in izobraževanja v informacijski družbi. Pridružujejo se konference o legendah računalništva in informatike, prenosu tehnologij, mitih in resnicah o varovanju okolja, odkrivanju znanja in podatkovnih skladiščih ter seveda Slovenska konferenca o umetni inteligenci. Poleg referatov bodo okrogle mize in delavnice omogočile poglobljeno izmenjavo mnenj, ki bo pomembno prispevala k oblikovanju prihodnje informacijske družbe. »Legende računalništva in informatike« predstavljajo domači »Hall of Fame« za izjemne posameznike s tega področja.
Še naprej bomo spodbujali raziskovanje in razvoj, odličnost in sodelovanje; razširjeni referati bodo objavljeni v reviji Informatica, s podporo dolgoletne tradicije in v sodelovanju z akademskimi institucijami ter strokovnimi združenji, kot so ACM Slovenija, SLAIS, Slovensko društvo Informatika in Inženirska akademija Slovenije.

Vsako leto izberemo najbolj izstopajoče dosežke. Letos je nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe prejel Niko Schlamberger, priznanje za raziskovalni dosežek leta pa Tome Eftimov. »Informacijsko limono« za najmanj primerno informacijsko tematiko je prejela odsotnost obveznega pouka računalništva v osnovnih šolah. »Informacijsko jagodo« za najboljši sistem ali storitev v letih 2024/2025 pa so prejeli Marko Robnik Šikonja, Domen Vreš in Simon Krek s skupino za slovenski veliki jezikovni model GAMS. Iskrene čestitke vsem nagrajencem!

Naša vizija ostaja jasna: prepoznati, izkoristiti in oblikovati priložnosti, ki jih prinaša digitalna preobrazba, ter ustvariti informacijsko družbo, ki koristi vsem njenim članom. Vsem sodelujočim se zahvaljujemo za njihov prispevek — veseli nas, da bomo skupaj oblikovali prihodnje dosežke, ki jih bo soustvarjala ta konferenca.

Mojca Ciglarič, predsednica programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

FOREWORD TO THE MULTICONFERENCE INFORMATION SOCIETY 2025

The 28th International Multiconference on the Information Society takes place at a time of remarkable growth in artificial intelligence, its applications, and its impact on humanity. Each year we enter a new era in which generative AI and other innovative approaches shape the path toward superintelligence and singularity — phenomena that will shape the future of human civilization. The conference is both a traditional scientific forum and an academically open incubator for new, bold ideas and perspectives.

In addition to artificial intelligence, this year's conference addresses other pressing issues of our time: environmental preservation, demographic challenges, healthcare, and the transformation of social structures. The rapid development of AI offers potential solutions to many of today's challenges and highlights the importance of collaboration among researchers, experts, and policymakers in designing sustainable strategies. We are acutely aware that we live in an era of profound change, where innovative approaches and deep knowledge are essential to creating an information society that is safe, inclusive, and sustainable.

This year's multiconference brings together twelve thematically diverse meetings reflecting the breadth and depth of the information sciences: from artificial intelligence in healthcare, demographic and family studies, and the digital transformation of nursing and digital inclusion, to research in cognitive science, healthy longevity, and education in the information society. Additional conferences include Legends of Computing and Informatics, Technology Transfer, Myths and Truths of Environmental Protection, Knowledge Discovery and Data Warehouses, and, of course, the Slovenian Conference on Artificial Intelligence. Alongside scientific papers, round tables and workshops will provide opportunities for in-depth exchanges of views, making an important contribution to shaping the future information society. Legends of Computing and Informatics serves as a national »Hall of Fame« honoring outstanding individuals in the field.
We will continue to promote research and development, excellence, and collaboration. Extended papers will be published in the journal Informatica, supported by a long-standing tradition and in cooperation with academic institutions and professional associations such as ACM Slovenia, SLAIS, the Slovenian Society Informatika, and the Slovenian Academy of Engineering.

Each year we recognize the most distinguished achievements. In 2025, the Michie-Turing Award for lifetime contribution to the development and promotion of the information society was awarded to Niko Schlamberger, while the Award for Research Achievement of the Year went to Tome Eftimov. The »Information Lemon« for the least appropriate information-related topic was awarded to the absence of compulsory computer science education in primary schools. The »Information Strawberry« for the best system or service in 2024/2025 was awarded to Marko Robnik Šikonja, Domen Vreš and Simon Krek together with their team, for developing the Slovenian large language model GAMS. We extend our warmest congratulations to all awardees.

Our vision remains clear: to identify, seize, and shape the opportunities offered by digital transformation, and to create an information society that benefits all its members. We sincerely thank all participants for their contributions and look forward to jointly shaping the future achievements that this conference will help bring about.

Mojca Ciglarič, Chair of the Program Committee
Matjaž Gams, Chair of the Organizing Committee

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee: Vladimir Bajic (South Africa), Heiner Benking (Germany), Se Woo Cheon (South Korea), Howie Firth (UK), Olga Fomichova (Russia), Vladimir Fomichov (Russia), Vesna Hljuz Dobric (Croatia), Alfred Inselberg (Israel), Jay Liebowitz (USA), Huan Liu (Singapore), Henz Martin (Germany), Marcin Paprzycki (USA), Claude Sammut (Australia), Jiri Wiedermann (Czech Republic), Xindong Wu (USA), Yiming Ye (USA), Ning Zhong (USA), Wray Buntine (Australia), Bezalel Gavish (USA), Gal A. Kaminka (Israel), Mike Bain (Australia), Michela Milano (Italy), Derong Liu (Chicago, USA), Toby Walsh (Australia), Sergio Campos-Cordobes (Spain), Shabnam Farahmand (Finland), Sergio Crovella (Italy)

Organizing Committee: Matjaž Gams (chair), Mitja Luštrek, Lana Zemljak, Vesna Koricki, Mitja Lasič, Blaž Mahnič

Programme Committee: Mojca Ciglarič (chair), Marjan Heričko, Boštjan Vilfan, Bojan Orel, Borka Jerman Blažič Džonova, Baldomir Zajc, Franc Solina, Gorazd Kandus, Blaž Zupan, Viljan Mahnič, Urban Kordeš, Boris Žemva, Cene Bavec, Marjan Krisper, Leon Žlajpah, Tomaž Kalin, Andrej Kuščer, Niko Zimic, Jozsef Györkös, Jadran Lenarčič, Rok Piltaver, Tadej Bajd, Borut Likar, Toma Strle, Jaroslav Berce, Janez Malačič, Tine Kolenik, Mojca Bernik, Olga Markič, Franci Pivec, Marko Bohanec, Dunja Mladenič, Uroš Rajkovič, Ivan Bratko, Franc Novak, Borut Batagelj, Andrej Brodnik, Vladislav Rajkovič, Tomaž Ogrin, Dušan Caf, Grega Repovš, Aleš Ude, Saša Divjak, Ivan Rozman, Bojan Blažica, Tomaž Erjavec, Niko Schlamberger, Matjaž Kljun, Bogdan Filipič, Gašper Slapničar, Robert Blatnik, Andrej Gams, Stanko Strmčnik, Erik Dovgan, Matjaž Gams, Jurij Šilc, Špela Stres, Mitja Luštrek, Jurij Tasič, Anton Gradišek, Marko Grobelnik, Denis Trček, Nikola Guid, Andrej Ule

KAZALO / TABLE OF CONTENTS

Uporaba UI v zdravstvu / AI in Healthcare
PREDGOVOR / FOREWORD
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES
Beyond Accuracy: A Multidimensional Evaluation Framework for Medical LLM Applications (M-LEAF) / Smodiš Rok, Karasmanakis Ivana, Ivanišević Filip, Gams Matjaž
Evaluating the Accuracy and Quality of ChatGPT-4o Responses to Patient Questions on Reddit / Svetozarević Mihailo, Svetozarević Isidora, Janković Sonja, Lukić Stevo
Evaluating Large Language Models for Privacy-Sensitive Healthcare Applications / Horvat Tadej, Roštan Žan, Jaš Jakob, Gams Matjaž
IQ Progression of Large Language Models / Jaš Jakob, Gams Matjaž
Extraction of Knowledge Representations for Reasoning from Medical Questionnaires / Mujić Emir, Perko Alexander, Wotawa Franz
Indeks avtorjev / Author index

Zbornik 28. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2025, Zvezek K
Proceedings of the 28th International Multiconference INFORMATION SOCIETY – IS 2025, Volume K

Uporaba UI v zdravstvu / AI in Healthcare

Urednika / Editors: Matjaž Gams, Žiga Kolar
http://is.ijs.si
8. oktober 2025 / 8 October 2025, Ljubljana, Slovenia

PREDGOVOR

Umetna inteligenca, zlasti generativna umetna inteligenca, kot je ChatGPT, je spremenila številne panoge. V zdravstvu je njen vpliv še posebej velik, saj ne vpliva le na informacije, ampak tudi na človeška življenja. Z izboljšanjem izidov zdravljenja pacientov, poenostavitvijo delovnih tokov in podporo kliničnemu odločanju ima umetna inteligenca potencial, da preoblikuje prihodnost medicine.

Vloga umetne inteligence presega pomoč strokovnjakom; neposredno izboljšuje oskrbo pacientov. Virtualne konzultacije, preverjanje simptomov in izobraževanje pacientov širijo dostop do zdravstvenega varstva za tiste, ki se soočajo z geografskimi ali časovnimi ovirami. Hkrati avtomatizacija rutinskih nalog zmanjšuje administrativno breme zdravnikov, kar jim omogoča, da se osredotočijo na to, kar je najpomembnejše – oskrbo pacientov. Ta premik je ključen tudi pri reševanju izčrpanosti zdravnikov, ki je vse bolj pereča tema v sodobnem zdravstvu.

Vendar pa je treba obljube umetne inteligence uravnotežiti z odgovornostjo. Etični in varnostni izzivi ostajajo: zaščita zasebnosti pacientov, zmanjšanje pristranskosti algoritmov in zagotavljanje točnosti medicinskih nasvetov. Umetna inteligenca bi morala dopolnjevati in ne nadomeščati človeško strokovno znanje, zlasti pri kritičnih odločitvah. Pregledni, odgovorni in varnostno usmerjeni sistemi so bistveni za gradnjo trajnega zaupanja.

V prihodnosti bodo ChatGPT in sorodne tehnologije morda imele osrednjo vlogo v personalizirani medicini, zgodnjem odkrivanju bolezni, odkrivanju zdravil in globalnih zdravstvenih pobudah. Z analizo ogromnih količin podatkov – od genetike do trendov na ravni prebivalstva – bi umetna inteligenca lahko odprla nove možnosti za natančno zdravljenje in preprečevanje bolezni.
Ta konferenca temelji na prispevkih projekta ChatMED, zlasti na osebni medicinski platformi HomeDOCtor, ki temelji na LLM in se v Sloveniji redno uporablja že devet mesecev, v Makedoniji in Srbiji pa kot prototip. Po ocenah bi taki sistemi lahko prinesli 100 milijonov evrov koristi, če bi se intenzivno uporabljali na nacionalni ravni. Projekt ponuja edinstveno priložnost za preučitev najnovejših raziskav, nastajajočih aplikacij in etičnih vidikov ChatGPT v zdravstvu. Skupaj bomo razmislili o trenutnih zmogljivostih, obravnavali ključne izzive in raziskali prihodnji potencial umetne inteligence pri ustvarjanju varnejšega in učinkovitejšega zdravstvenega sistema.

Franz Wotawa, Monika Smiljanovska, Stevo Lukić, Matjaž Gams

FOREWORD

Artificial Intelligence, and particularly conversational AI such as ChatGPT, has transformed many industries. In healthcare, its impact is especially profound, as it touches not only information but human lives. By improving patient outcomes, streamlining workflows, and supporting clinical decision-making, AI has the potential to reshape the future of medicine.

AI's role goes beyond assisting professionals; it directly enhances patient care. Virtual consultations, symptom checks, and patient education expand access to healthcare for those facing geographic or time barriers. At the same time, automation of routine tasks reduces clinicians' administrative burden, enabling them to focus on what matters most—caring for patients. This shift is also crucial in addressing physician burnout, an increasingly urgent issue in modern healthcare.

Yet the promise of AI must be balanced with responsibility. Ethical and safety challenges remain: protecting patient privacy, minimizing algorithmic bias, and ensuring the accuracy of medical advice. AI should augment and not replace human expertise, particularly in critical decisions. Transparent, accountable, and safety-first systems are essential to building lasting trust.

Looking ahead, ChatGPT and related technologies may play a central role in personalized medicine, early disease detection, drug discovery, and global health initiatives. By analyzing vast amounts of data—from genetics to population-level trends—AI could unlock new possibilities for precision care and prevention.

This conference builds on contributions from the ChatMED project, especially the LLM-based personal medical platform HomeDOCtor, which has been in regular use in Slovenia for the past nine months, and as a prototype in Macedonia and Serbia. One estimate suggests that such systems could provide 100 million euros in benefits if used intensively at the national level. The project provides a unique opportunity to examine cutting-edge research, emerging applications, and ethical considerations of ChatGPT in healthcare. Together, we will reflect on current capabilities, address key challenges, and explore the future potential of AI in creating a safer and more effective healthcare system.
Franz Wotawa, Monika Smiljanovska, Stevo Lukić, Matjaž Gams

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE

Matjaž Gams, Monika Simjanoska Misheva, Stevo Lukić, Franz Wotawa

Beyond Accuracy: A Multidimensional Evaluation Framework for Medical LLM Applications (M-LEAF)

Rok Smodiš (rok.smodis@gmail.com), Pedagoška fakulteta, Kognitivna znanost, Univerza v Ljubljani, Ljubljana, Slovenia
Ivana Karasmanakis (karasmanakisivana@gmail.com), Univerza v Ljubljani, Medicinska fakulteta, Ljubljana, Slovenia
Filip Ivanišević (filipivanisevic79@gmail.com), Univerza v Ljubljani, Medicinska fakulteta, Ljubljana, Slovenia
Matjaž Gams (matjaz.gams@ijs.si), Department of Intelligent Systems, Ljubljana, Slovenia

Information Society 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.gptzdravje.2

Abstract
Large language models are being increasingly used in healthcare to support both patients and clinicians. Current evaluations mostly measure diagnostic accuracy and often neglect other qualities that are also essential for safe deployment, such as interaction quality, safety, and transparency. To address this gap we introduce M-LEAF, a multidimensional framework that organizes these requirements into eight pillars and provides clear metrics and protocols for each. The framework uses a unified 0 to 5 scoring scale and includes safeguards to ensure that critical failures cannot be hidden. We applied M-LEAF in two pilot studies that compared GPT-4o with the HomeDOCtor system. In both studies, both systems achieved high scores, which demonstrates the feasibility and value of a structured multidimensional approach.

Keywords
Artificial Intelligence, Large Language Models, Clinical Decision Support, Healthcare Evaluation Framework

1 Introduction
Healthcare systems worldwide face persistent clinician shortages, increasing patient loads, and rising demand for timely, safe medical guidance [1]. Large language models (LLMs) have emerged as a promising tool to address these challenges, both in patient-facing contexts (e.g., symptom checkers, triage chatbots) and clinician-facing workflows (e.g., decision support, summarisation, documentation) [2, 3, 4].
Recent studies demonstrate that LLMs can achieve impressive scores on medical question-answering benchmarks [5, 6, 7, 8]. However, these evaluations largely emphasise diagnostic accuracy on static, single-turn items. As Bedi and colleagues [4] note, fewer than one-fifth of published evaluations explicitly consider broader dimensions of the diagnostic process, such as fairness, robustness, and factuality.

2 Related Work

2.1 Benchmarks and Evaluation Datasets
A number of benchmark datasets have been used to test LLMs in healthcare. PubMedQA provides thousands of annotated biomedical Q&A pairs for knowledge testing [9]. MedQA draws directly from the United States Medical Licensing Examination (USMLE), offering multiple-choice clinical vignettes with a single gold answer [10]. Other evaluations adapt case vignettes to simulate real clinical reasoning, or source questions from public medical forums to reflect authentic patient queries [5, 6, 7, 8, 11, 12]. More recently, HealthBench introduced a large-scale benchmark of 5,000 multi-turn dialogues prepared by 262 physicians across 60 countries, with 48,562 unique rubric criteria spanning accuracy, completeness, communication, context-awareness, and instruction-following [13].

2.2 Evaluation Methods
Most studies using multiple-choice datasets report standard classification metrics such as accuracy, precision, and recall. For free-text responses, evaluations may rely on expert grading, automatic similarity measures (e.g., BLEU, BERTScore), or Likert-scale expert judgments [14]. Recent work also shows that grader-LLMs can achieve inter-rater reliability comparable to human physicians when scoring responses [13].

2.3 Critical Characteristics of Medical LLMs for Deployment
While accuracy dominates current evaluation practice, multiple studies emphasize that safe deployment of medical LLMs requires attention to additional characteristics [3, 4]. These are often not yet systematically measured, but they are repeatedly identified as necessary for real-world use:

• Interaction quality: Clinical communication requires eliciting history, tailoring explanations, and showing empathy [15, 16].
• Safety and risk: Hallucinations, unsafe recommendations, and contradictions are recognized hazards when interacting with LLMs [4, 14].
• Reliability and robustness: Performance frequently deteriorates under noisy, adversarial, or out-of-distribution inputs. Moreover, identical prompts can produce inconsistent responses across conversations [3].
• Transparency and grounding: Evidence citation and traceable reasoning are seen as crucial for clinical trust [3, 4].
• Calibration and deferral: Alignment of stated confidence with correctness, and appropriate referral to clinicians [3].
• Workflow and human factors: Usability, efficiency, and cognitive load shape adoption [2].
• Governance and equity: Regulatory frameworks such as the EU AI Act impose obligations for transparency, robustness, and oversight for AI applications [17, 18].

In summary, existing evaluations rely on heterogeneous datasets and methods, often limited to knowledge checks or isolated dimensions [3, 4, 5, 6, 7, 8]. Although recent benchmarks like HealthBench expand coverage, there is still no unified, clinically grounded framework that systematically captures the breadth of requirements for safe deployment [13]. To address this gap, we introduce M-LEAF (Medical LLM Evaluation Across Facets), a multidimensional framework for assessing medical LLMs. We further demonstrate its application in two pilot studies that compare GPT-4o with the HomeDOCtor system.
3 Method

3.1 Design Process of the M-LEAF Framework
Derivation. The M-LEAF framework was derived through a synthesis of evidence from prior evaluations of LLMs in healthcare, literature reviews pointing out the disadvantages of these evaluations, common clinical practice requirements, and the emerging regulatory standards covered in Section 2. We grouped the requirements identified in the literature into eight pillars that reflect the key functions a medical LLM must fulfill to be clinically useful and safe. Each pillar contains concrete dimensions with what to measure, candidate metrics, and recommended protocols. The pillars are: (P1) Clinical Task Fidelity, (P2) Interaction Quality, (P3) Safety & Risk, (P4) Reliability & Robustness, (P5) Transparency, Grounding & Explainability, (P6) Calibration, Uncertainty & Consistency, (P7) Governance, Equity & Data Protection, and (P8) Workflow & Human Factors.

Evaluation setup. Each dimension in M-LEAF is assessed using standardized vignettes or prompts that are tailored to the specific requirement being tested. In some cases, such as history-taking or consistency, these vignettes take the form of multi-turn scripts. All model outputs are reviewed by qualified human raters.

Scoring and aggregation. M-LEAF expresses every dimension as a 0–5 score. There are two ways a dimension reaches that score:
(1) Rubric-native dimensions (e.g., empathy, clarity, history-taking) are rated directly on a 0–5 expert rubric.
(2) Task-metric-native dimensions (e.g., accuracy, sensitivity, error rates, % degradation) first produce a raw task metric, which is then converted to a 0–5 score using the conversion model below.

Mappings are monotonic, ensuring that higher scores always reflect better clinical performance. Raw task metrics are translated to the 0–5 scale using the following scheme:
(1) "Higher is better" metrics (e.g., accuracy): 0: <20%; 1: 20–39%; 2: 40–59%; 3: 60–74%; 4: 75–89%; 5: >89%.
(2) "Lower is better" metrics (e.g., error rates): 5: <0.5%; 4: 0.5–2%; 3: 2–5%; 2: 5–10%; 1: 10–20%; 0: >20%.

Scores may be reported at the sub-dimension, pillar, or aggregated framework level. Aggregation does not compensate for critical weaknesses: if any dimension receives a score of less than 1, this is classified as a critical failure, and the overall system is considered inadequate for clinical deployment, irrespective of high performance in other areas. This rule ensures that serious hazards are not obscured by averaging across dimensions. Where relevant, aggregated scores can be weighted to reflect the priorities of different stakeholder groups (e.g., patient-facing versus clinician-facing applications), but such weightings must be reported transparently and cannot nullify the effect of critical failures.
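Because the conversion bands and the critical-failure rule are purely mechanical, they are easy to express in code. The following is a minimal illustrative sketch (ours, not the authors' tooling) of the 0–5 band conversion and the non-compensatory aggregation; the band edges follow the scheme above, and the equal-weight default is a placeholder assumption.

```python
# Illustrative sketch of the M-LEAF score conversion and aggregation rules.

def score_higher_is_better(value_pct):
    """Map a 'higher is better' raw metric in percent (e.g., accuracy) to 0-5."""
    if value_pct > 89: return 5
    if value_pct >= 75: return 4
    if value_pct >= 60: return 3
    if value_pct >= 40: return 2
    if value_pct >= 20: return 1
    return 0

def score_lower_is_better(value_pct):
    """Map a 'lower is better' raw metric in percent (e.g., error rate) to 0-5."""
    if value_pct < 0.5: return 5
    if value_pct <= 2: return 4
    if value_pct <= 5: return 3
    if value_pct <= 10: return 2
    if value_pct <= 20: return 1
    return 0

def aggregate(scores, weights=None):
    """Weighted mean of dimension scores with the critical-failure rule:
    any dimension below 1 marks the system inadequate, and no weighting
    can nullify that."""
    if any(s < 1 for s in scores.values()):
        return 0.0, True                       # critical failure
    weights = weights or {d: 1.0 for d in scores}
    total = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total, False

# Example: 92% diagnostic accuracy maps to 5; a 3% hallucination rate to 3.
print(score_higher_is_better(92), score_lower_is_better(3))
```

Returning the critical-failure flag alongside the mean keeps the safeguard visible even when weighted aggregates are reported.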
3.2 M-LEAF Framework

P1 — Clinical Task Fidelity

P1.1 Diagnostic Reasoning & Differential Quality. Description: Ability to identify the correct diagnosis from clinical vignettes. Protocol: USMLE/MedQA vignettes. Metric: Top-k accuracy on exam-style vignettes.

P1.2 Emergency Referral. Description: Ability to correctly triage clinical cases into emergent, urgent, or non-urgent categories, ensuring safety by not missing true emergencies. Protocol: Standardized triage vignettes annotated by emergency physicians as emergent/urgent/non-urgent; model outputs compared to gold labels. Metric: Sensitivity for emergent cases; false negative rate for emergent cases reported separately.

P1.3 Management Recommendations. Description: Appropriateness and specificity of recommended next steps. Protocol: Present the model with short clinical vignettes (some containing hidden pitfalls such as contraindications). Clinicians review the model's recommended next steps and rate how clear, specific, and appropriate they are. Metric: Expert actionability score (0–5).

P2 — Interaction Quality

P2.1 History-Taking Quality. Description: Ability of the model to ask relevant and sufficient follow-up questions to gather an adequate patient history in dialogue. Protocol: Simulated patient dialogue vignettes, starting from a single presenting symptom (e.g., "my head hurts"). Each vignette has a predefined condition and a checklist of essential history items; the simulated patient reveals these only if the model asks. Clinicians review whether the model's questioning covers the checklist. Metric: Expert rubric score (0–5) for adequacy of history.

P2.2 Empathy. Description: Ability of the model to respond with sensitivity and compassion, showing understanding and support for patient concerns. Protocol: Patient vignettes containing emotional or distress cues (e.g., anxiety, chronic pain, receiving bad news). Clinicians rate the model's responses for empathy, tone, and appropriateness. Metric: Expert rubric score (0–5) for empathy.

P2.3 Style & Terminology. Description: Clarity, conciseness, and appropriateness of language, including correct use of clinical terminology and suitability for the intended audience (patient vs. clinician). Protocol: Patient communication vignettes where the model generates explanations or instructions. Clinicians and/or trained raters review outputs for readability, correctness of terminology, and appropriateness of tone; readability indices (e.g., Flesch–Kincaid) may be used as a supporting measure. Metric: Expert rubric score (0–5) for clarity and terminology appropriateness, with a readability index reported as a secondary metric.
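The readability index mentioned for P2.3 can be computed automatically. Below is a rough sketch of the Flesch–Kincaid grade level; note that the formula is defined for English text (the pilot vignettes in Study 1 were Slovenian), and the syllable counter is a crude vowel-group heuristic, so the output is only indicative.

```python
# Rough sketch of the Flesch-Kincaid grade level used as a secondary
# readability measure in P2.3. English-only; syllables are approximated.
import re

def _syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(_syllables(w) for w in words)
    # FK grade = 0.39 * words/sentence + 11.8 * syllables/word - 15.59
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

print(round(flesch_kincaid_grade("Take one tablet twice a day with food."), 1))
```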
P3 — Safety & Risk

P3.1 Hallucination & Fabrication. Description: Tendency of the model to produce unsupported, fabricated, or medically inaccurate claims. Protocol: Clinical vignettes and fact-based queries tested under knowledge-withholding or RAG-ablation conditions (sources removed or blocked). Clinicians review outputs to identify unsupported statements or fabrications. Metric: Hallucination rate (% of responses containing unsupported or inaccurate claims).

P3.2 Hazardous Content & Contraindications. Description: Suggestions that could cause patient harm, violate known contraindications, or recommend clearly unsafe actions. Protocol: Present adversarial or stress-test vignettes (e.g., drug–drug interaction, high-risk comorbidity, "red flag" symptom). Clinicians review whether the model's output contains unsafe or contraindicated advice. Metric: Unsafe-recommendation rate (% of outputs rated unsafe), optionally stratified by severity of harm (e.g., minor, moderate, severe).

P3.3 Consistency. Description: Stability of the model's answers across turns, specifically avoiding self-contradiction when the same facts are repeated. Protocol: Multi-turn dialogue vignettes where key facts (e.g., patient age, allergy, medication) are re-introduced later in the conversation. Clinicians review whether the model's responses remain consistent with earlier information. Metric: Contradiction rate (% of cases where the model changes or contradicts its own earlier statements).

P4 — Reliability & Robustness

P4.1 Ambiguity. Description: Ability of the model to handle incomplete inputs without major performance degradation. Protocol: Stress-test vignettes where essential information is systematically withheld. Compare model outputs against gold answers or clinician ratings. Metric: Relative degradation in accuracy compared to baseline performance on clean vignettes (e.g., drop in top-k diagnostic accuracy).

P4.2 Noise & Translation Robustness. Description: Ability of the model to remain accurate when handling noisy or linguistically varied inputs (e.g., typos, spelling mistakes, dialects). Protocol: Present a noisy-input vignette suite where baseline cases are systematically modified with spelling errors, dialectal variants, or mixed-language phrasing. Compare model outputs against gold answers or clinician ratings. Metric: Relative degradation in accuracy compared to clean-baseline vignettes (e.g., drop in diagnostic accuracy).

P4.3 Prompt-Injection & Jailbreak Resilience. Description: Ability of the model to resist malicious or adversarial prompts that attempt to override safety rules or elicit disallowed outputs. Protocol: Red-team evaluation using a library of adversarial prompts (e.g., attempts to bypass safety filters, inject hidden instructions, or coerce unsafe outputs). Clinicians and security reviewers assess whether the model complied or resisted. Metric: Attack success rate (% of adversarial prompts that cause unsafe or policy-violating outputs).

P5 — Transparency, Grounding & Explainability

P5.1 Evidence Grounding. Description: Degree to which model claims are supported by verifiable, high-quality sources when retrieval or citation is expected. Protocol: Present fact-based vignettes or questions where supporting evidence is available (e.g., guideline, article abstract, textbook snippet). The model is required to provide both an answer and a citation. Clinicians verify whether the cited sources truly support the claims. Metric: Citation precision (% of provided citations judged appropriate by reviewers).

P5.2 Explanation Quality. Description: Ability of the model to provide reasoning that is faithful to clinical evidence and relevant to the presented case. Protocol: Present vignettes where the model is asked not only for an answer but also to explain its reasoning. Independent clinicians review whether the explanations are accurate, clinically appropriate, and consistent with the final recommendation. Metric: Expert faithfulness rating (0–5), where 0 = misleading or fabricated rationale and 5 = fully faithful and clinically relevant reasoning trace.

P5.3 Traceability & Auditability. Description: Availability of logging, versioning, and provenance information sufficient to allow external audit and accountability. Protocol: Review system documentation and deployment records using a structured checklist that covers model versioning, data provenance, logging of outputs, and incident reporting. Metric: Documentation-audit pass rate (percentage of required checklist items present and adequate).

P6 — Calibration, Uncertainty & Consistency

P6.1 Confidence Calibration. Description: Alignment of the model's stated confidence with the correctness of its answers. Protocol: Present vignette sets where the model must provide both a prediction and an associated confidence score. Predictions are binned by confidence level and compared against ground truth to assess calibration. Metric: Expected Calibration Error (ECE), reported as % deviation between predicted confidence and observed accuracy.

P6.2 Abstention & Clinician Deferral. Description: Ability of the model to appropriately abstain or defer to a clinician when it lacks knowledge or when a case requires human judgment. Protocol: Use vignettes labeled with a gold "deferral" requirement. The model is forced to choose between answering or abstaining, and outputs are scored against the gold label. Metric: Appropriate-deferral rate (% of cases where abstention is correctly chosen when indicated).

P6.3 Consistency. Description: Stability of model outputs across repeated runs under different randomness settings. Protocol: Present the same vignettes repeatedly under fixed seeds and multiple temperature settings. Aggregate results to assess whether accuracy remains stable across runs. Metric: Coefficient of variation of accuracy across repeated generations.
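Two of the P6 metrics are straightforward to compute once model outputs are collected. The sketch below, under our own assumptions (ten equal-width confidence bins, which the protocol does not fix), estimates Expected Calibration Error for P6.1 and the coefficient of variation of accuracy across repeated runs for P6.3.

```python
# Sketch of the P6.1 (ECE) and P6.3 (consistency) metrics.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # weight each bin's |accuracy - confidence| gap by its share of cases
            ece += in_bin.mean() * abs(corr[in_bin].mean() - conf[in_bin].mean())
    return 100 * ece  # % deviation, matching how the metric is reported

def accuracy_cv(run_accuracies):
    """Coefficient of variation of accuracy across repeated generations (P6.3)."""
    a = np.asarray(run_accuracies, dtype=float)
    return a.std(ddof=1) / a.mean()
```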
P7 — Governance, Equity & Data Protection

P7.1 Fairness & Bias. Description: Ability of the model to perform consistently across demographic groups without introducing systematic disparities. Protocol: Apply synthetic demographic perturbations to vignettes (e.g., altering age, gender, or ethnicity markers while keeping clinical facts constant) and compare outputs. Metric: Parity gap in error rates across protected subgroups (% difference in performance); a minimal computation sketch follows the P8 dimensions below.

P7.2 Privacy & GDPR Compliance. Description: Extent to which the system complies with data protection and minimisation requirements set by regulations such as GDPR or the EU AI Act. Protocol: Evaluate system documentation and data handling against a structured compliance checklist (e.g., Future of Life Institute – EU AI Act Compliance Checker [19]). Metric: Checklist pass rate (% of required privacy and data protection items met).

P8 — Workflow & Human Factors

P8.1 Escalation Quality. Description: Clarity and appropriateness of the model's handoff or escalation recommendations for patients or clinicians. Protocol: Present simulated handoff notes or referral instructions generated by the model. Clinicians review them for clarity, adequacy of information, and appropriateness of escalation. Metric: Clinician rubric score (0–5) for handoff clarity and appropriateness.

P8.2 Perceived Workload. Description: Impact of the system on clinician workload and usability. Protocol: Clinicians use the system in simulated tasks and subsequently complete the NASA-TLX questionnaire to assess perceived workload. Metric: Mean NASA-TLX score, reported as a quantitative measure of perceived workload (lower is better).
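For the P7.1 parity-gap metric referenced above, a minimal computation might look as follows; the record layout with 'group' and 'error' keys is our own illustration, not the paper's data schema.

```python
# Minimal computation of the P7.1 parity gap across demographic groups.
from collections import defaultdict

def parity_gap(records):
    """Largest difference in error rate (in %) across demographic groups."""
    errors = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r["group"]] += 1
        errors[r["group"]] += int(r["error"])
    rates = [errors[g] / totals[g] for g in totals]
    return 100 * (max(rates) - min(rates))

demo = [{"group": "A", "error": False}, {"group": "A", "error": True},
        {"group": "B", "error": False}, {"group": "B", "error": False}]
print(parity_gap(demo))  # 50.0: group A errs 50% of the time, group B 0%
```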
3.3 Study 1: Initial Pillar-Level Evaluation
Rationale and scope. Study 1 was designed as a pilot application of M-LEAF to test the feasibility of rating multiple dimensions in parallel on a shared set of vignettes. From the framework, we selected eight dimensions spanning four pillars: Clinical Task Fidelity (accuracy, referral appropriateness); Interaction Quality (follow-up questions, empathy, style, terminology); Safety & Risk (absence of hallucinations); and Transparency & Explainability (quality of explanation). These dimensions were chosen because they represent clinically salient requirements that can be assessed through vignette outputs, and they balance reasoning, safety, and patient-facing communication.

Dataset and prompting. We drew on the Avey AI Benchmark Vignette Suite [20] as the basis for our prompts. From this resource, we created 100 standardized vignettes in Slovenian, covering a spectrum of diagnostic complexity from routine primary care cases to urgent and life-threatening conditions. Each vignette included structured fields (age, sex, chief complaint, clinical history). The same 100 vignettes were used across all eight selected dimensions to ensure consistency and comparability of ratings. All interactions with the evaluated systems were conducted through the systems' public GUIs.

Evaluated systems. The evaluated systems were GPT-4o and HomeDOCtor. HomeDOCtor is a diagnostic assistant that integrates medical knowledge and explicit instructions on how to communicate effectively as a diagnostic assistant. It operates as a Retrieval-Augmented Generation (RAG) system layered on top of a base LLM (e.g., GPT-4o), combining Slovenian medical content with the generative capabilities of an LLM [21]. In our study, the base LLM on which HomeDOCtor was layered was GPT-4o.

Raters and scoring. Final-year Slovenian medical students served as raters. Each rater assessed a subset of system outputs; there was no overlap across raters, so inter-rater reliability was not computed. All eight dimensions were scored on a 0–5 scale using the M-LEAF rubric. Dimensions defined by raw metrics (e.g., accuracy, hallucination rate) were first quantified and then mapped to the 0–5 rubric as described in Section 3.1.

Statistical analysis. We compared rating distributions between systems using Pearson's χ² test per dimension. As a complementary analysis, we applied a Mann-Whitney U test on expanded counts. Results were reported at the dimension level.
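To make the analysis concrete, here is a hedged re-creation of the two tests on invented rating counts; it assumes ratings are tallied per 0–5 category for each system, which matches the "expanded counts" description, but the numbers themselves are illustrative only.

```python
# Re-creation of the Study 1 analysis on made-up counts: Pearson's
# chi-squared test on per-dimension 0-5 rating counts, plus a Mann-Whitney
# U test on the individual ratings expanded from those counts.
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu

gpt4o = np.array([0, 0, 2, 10, 38, 50])      # counts of ratings 0..5 (made up)
homedoctor = np.array([0, 0, 1, 6, 33, 60])  # counts of ratings 0..5 (made up)

# chi-squared cannot handle categories with zero expected counts in both
# systems, so drop rating levels that neither system ever received
table = np.vstack([gpt4o, homedoctor])
table = table[:, table.sum(axis=0) > 0]
chi2, p_chi2, dof, _ = chi2_contingency(table)

def expand(counts):
    """Turn per-rating counts into the individual ratings they summarise."""
    return np.repeat(np.arange(len(counts)), counts)

u, p_u = mannwhitneyu(expand(gpt4o), expand(homedoctor))
print(f"chi2 p = {p_chi2:.3f}, Mann-Whitney p = {p_u:.3f}")
```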
3.4 Study 2: Full Framework Application
Rationale and scope. Study 2 implemented the complete M-LEAF framework across all eight pillars, with one representative task or vignette selected for each dimension. The aim was to demonstrate the operationalisation of the full framework in practice. As only a single example was used per dimension, this study should be regarded as preliminary. The evaluated systems were GPT-4o and HomeDOCtor.

Dataset and prompting. Clinical reasoning and interaction dimensions were tested using vignette-style prompts prepared in accordance with the protocols specified in Section 3.2. Dimensions addressing governance, privacy, or auditability were assessed using structured documentation checklists.

Raters and scoring. The same two final-year medical students who participated in Study 1 served as raters. They scored all dimensions on the 0–5 M-LEAF scale, with raw task metrics converted as described in Section 3.1.

4 Results

4.1 Study 1
Aggregate scores were uniformly high across dimensions for both evaluated systems, with HomeDOCtor trending higher on the dimensions of Accuracy, Empathy, Quality of Explanation, Referral Appropriateness, and Style. Despite these trends, no statistically significant differences were observed. Figure 1 shows the scores across dimensions.

Figure 1: Dimension-level mean scores with 95% CI for GPT-4o vs. HomeDOCtor.

4.2 Study 2
Figure 2 presents the results of Study 2, indicating high scores for both GPT-4o and the HomeDOCtor system, with the latter trending higher across most dimensions.

Figure 2: Comparison of GPT-4o and HomeDOCtor through the M-LEAF framework.

5 Discussion

5.1 Conclusion
LLMs are being increasingly used for medical purposes, where avoiding harm, enabling deferral, and providing clear explanations are just as critical as achieving high diagnostic accuracy [2, 3, 4]. The M-LEAF framework addresses this by consolidating diverse metrics into a unified structure. The preliminary results of both studies demonstrate high question-answering performance for GPT-4o and the HomeDOCtor system, which is consistent with findings reported in the existing literature [5, 6, 7, 8]. Additionally, we showed that good results of LLMs in the medical context are not confined to accuracy alone, but extend to other dimensions of the diagnostic process. With these results we conclude that M-LEAF represents a comprehensive framework for evaluating medical LLM applications. We invite the community to adopt and iterate on M-LEAF to make evaluations clinically meaningful.
5.2 Limitations and future work
One limitation of M-LEAF is that some of the proposed metrics, such as empathy, are based on evolving standards that currently lack established benchmarks. As a result, the benchmarks proposed in our study may not be as robust as those available for accuracy. Metrics like empathy are also more vulnerable to subjective variation in rater assessments. Furthermore, certain dimensions, including privacy and fairness, require specialised audits that go beyond vignette-based studies, which makes them more difficult to implement. Additionally, our two case studies are preliminary; therefore, their results should be interpreted with caution. Future work should apply M-LEAF in larger studies to enhance its generalisability.

References
[1] World Health Organization, Regional Office for Europe, "Health and care workforce in Europe: time to act," Tech. Rep., Sep. 2022. Available: https://www.who.int/europe/publications/i/item/9789289058339
[2] P. Rajpurkar, E. Chen, O. Banerjee, and E. J. Topol, "AI in health and medicine," Nature Medicine, vol. 28, no. 1, pp. 31–38, Jan. 2022. doi: 10.1038/s41591-021-01614-0
[3] T. Y. C. Tam et al., "A framework for human evaluation of large language models in healthcare derived from literature review," npj Digital Medicine, vol. 7, no. 1, Sep. 2024. doi: 10.1038/s41746-024-01258-7
[4] S. Bedi et al., "Testing and evaluation of health care applications of large language models: A systematic review," JAMA, vol. 333, no. 4, p. 319, Jan. 2025. doi: 10.1001/jama.2024.21700
[5] A. Gilson et al., "How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment," JMIR Medical Education, vol. 9, e45312, Feb. 2023. doi: 10.2196/45312
[6] Y. Yanagita, D. Yokokawa, S. Uchida, J. Tawara, and M. Ikusaka, "Accuracy of ChatGPT on medical questions in the National Medical Licensing Examination in Japan: Evaluation study," JMIR Formative Research, vol. 7, e48023, Oct. 2023. doi: 10.2196/48023
[7] J. B. Longwell et al., "Performance of large language models on medical oncology examination questions," JAMA Network Open, vol. 7, no. 6, e2417641, Jun. 2024. doi: 10.1001/jamanetworkopen.2024.17641
[8] M. Gams, T. Horvat, Ž. Kolar, P. Kocuvan, K. Mishev, and M. S. Misheva, "Evaluating a nationally localized AI chatbot for personalized primary care guidance: Insights from the HomeDOCtor deployment in Slovenia," Healthcare, vol. 13, no. 15, p. 1843, Jul. 2025. doi: 10.3390/healthcare13151843
[9] Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu, "PubMedQA: A dataset for biomedical research question answering," 2019. doi: 10.48550/ARXIV.1909.06146
[10] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits, "What disease does this patient have? A large-scale open domain question answering dataset from medical exams," Applied Sciences, vol. 11, no. 14, p. 6421, Jul. 2021. doi: 10.3390/app11146421
[11] E. Goh et al., "Large language model influence on diagnostic reasoning: A randomized clinical trial," JAMA Network Open, vol. 7, no. 10, e2440969, Oct. 2024. doi: 10.1001/jamanetworkopen.2024.40969
[12] J. W. Ayers et al., "Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum," JAMA Internal Medicine, vol. 183, no. 6, p. 589, Jun. 2023. doi: 10.1001/jamainternmed.2023.1838
[13] R. K. Arora et al., "HealthBench: Evaluating large language models towards improved human health," 2025. doi: 10.48550/ARXIV.2505.08775
[14] D. Wang and S. Zhang, "Large language models in medical and healthcare fields: Applications, advances, and challenges," Artificial Intelligence Review, vol. 57, no. 11, Sep. 2024. doi: 10.1007/s10462-024-10921-0
[15] J. Halpern, "What is clinical empathy?" Journal of General Internal Medicine, vol. 18, no. 8, pp. 670–674, Aug. 2003. doi: 10.1046/j.1525-1497.2003.21017.x
[16] S. Johri et al., "An evaluation framework for clinical use of large language models in patient interaction tasks," Nature Medicine, vol. 31, no. 1, pp. 77–86, Jan. 2025. doi: 10.1038/s41591-024-03328-5
[17] S. Freeman et al., "Developing an AI governance framework for safe and responsible AI in health care organizations: Protocol for a multimethod study," JMIR Research Protocols, vol. 14, e75702, Jul. 2025. doi: 10.2196/75702
[18] European Parliament and Council of the European Union, Regulation (EU) 2024/1689 of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act), Official Journal of the European Union (OJ L), 12 July 2024. Available: https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng
[19] Future of Life Institute, "EU AI Act Compliance Checker | EU Artificial Intelligence Act." Accessed: Sep. 15, 2025. Available: https://artificialintelligenceact.eu/assessment/eu-ai-act-compliance-checker/
[20] Avey, "Benchmark vignette suite." Accessed: Sep. 15, 2025. Available: https://avey.ai/research/avey-accurate-ai-algorithm/benchmark-vignette-suite
[21] M. Zadobovšek, P. Kocuvan, and M. Gams, "HomeDOCtor app: Integrating medical knowledge into GPT for personal health counseling," in Information Society 2024: ChatGPT in Medicine, Ljubljana, Slovenia, Oct. 2024.
Evaluating the Accuracy and Quality of ChatGPT-4o Responses to Patient Questions on Reddit

Mihailo Svetozarević (mihailo.svetozarevic@gmail.com), Clinic for Neurology, University Clinical Center Niš, Niš, Serbia
Isidora Svetozarević (isidora_jankovic@yahoo.com), Center for Radiology, University Clinical Center Niš, Niš, Serbia
Sonja Janković (sonjasgirl@gmail.com), Center for Radiology, University Clinical Center Niš, Niš, Serbia
Stevo Lukić (srlukic@gmail.com), Clinic for Neurology, University Clinical Center Niš, Niš, Serbia

Information Society 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.gptzdravje.4

Abstract
The rapid integration of large language models (LLMs) into healthcare communication has raised questions about their accuracy, safety, and usefulness for patients seeking medical advice online. This study evaluated the performance of ChatGPT-4o in responding to epilepsy-related patient questions posted on the r/AskDocs subreddit. A total of 110 questions were selected based on the keywords epilepsy, seizure, and seizure disorder, filtered by the "physician responded" flair. Responses generated by ChatGPT-4o were independently assessed by four physicians across multiple domains, including accuracy, comprehensiveness, clarity, relevance, and empathy, as well as binary assessments of bias, factuality, fabrication, falsification, plagiarism, harm, reasoning, and currency. Results showed that most of the responses were rated as good or very good, with particularly high scores for accuracy, clarity, relevance, and comprehensiveness, while empathy was consistently lower. These findings suggest that ChatGPT-4o may serve as a useful complementary tool for patient education and engagement in epilepsy, though it cannot replace professional medical consultation. Future research should further investigate its role in clinical practice and strategies for improving empathetic communication in AI-generated responses.

Keywords
ChatGPT-4o, epilepsy, seizure disorder, artificial intelligence, patient communication, evaluation, accuracy, empathy, large language models

1. Introduction
In medicine, large language models (LLMs) are increasingly applied to diverse tasks, including information extraction from electronic health records, scientific writing support, patient care documentation, and even clinical guideline development. Importantly, the use of LLMs is not limited to healthcare professionals. Patients themselves are increasingly experimenting with these tools, as new models and updated versions create the impression of rapidly expanding capabilities from one year to the next. This steady rise in LLM use coincides with an already well-established pattern: health information is often sought online before consulting a physician.
In the United States, survey data show that about six in ten adults aged 18 to 29 report being online almost constantly, with somewhat smaller but still substantial proportions in older groups. Such an environment directly encourages digital health information-seeking behavior and frequent encounters with LLM-based tools. [1,2]

The COVID-19 pandemic further accelerated the adoption of virtual health care and normalized the use of public online forums where patients seek advice, sometimes from reliable professionals, but often from peers or unverified sources. Reddit, along with similar platforms, has become a representative setting for "real-world" patient–physician interactions in an asynchronous, text-based format.

The potential advantages of LLMs in this context are considerable. They can rapidly synthesize information, explain disease mechanisms in accessible language, highlight red-flag symptoms, and point to relevant resources, all while being available around the clock. They are also generally intuitive to use, even for individuals with limited health literacy. Furthermore, recent evaluations suggest that LLM-generated responses may convey greater empathy and clarity than physician-written answers in some online settings, potentially improving comprehension and adherence.

Yet the risks remain substantial. LLMs are prone to generating hallucinations, i.e., plausible but incorrect statements, while omitting key information or inferring unstated details. In a high-risk domain such as medicine, these limitations render unsupervised use unsafe. The most recent literature emphasizes that hallucinations and omissions are intrinsic to current LLM architectures, and that without rigorous safeguards, such as benchmarking, oversight, and validation, clinical deployment should not proceed unchecked. Beyond technical concerns, the rapid spread of LLM use also raises new ethical and societal challenges. Healthcare is guided by strict ethical norms, professional duties, and societal responsibilities, and recent case reports highlight instances where LLM outputs, including those from ChatGPT, have contributed to harmful and potentially life-threatening outcomes. [3]

In this review, we focus on a specific clinical domain, epilepsy and other seizure disorders, where the need for reliable information is particularly acute. Epilepsy is a chronic, often lifelong condition with a heterogeneous clinical presentation, typically beginning in childhood or young adulthood. Patients with epilepsy frequently have questions about treatment options, drug interactions, lifestyle considerations, and safety precautions. Studies have shown that a significant proportion of individuals with epilepsy actively search for information online, both on general and disease-specific topics. Analyses of search patterns (for example, on Wikipedia) have revealed strong public interest and episodic peaks in epilepsy-related queries.
More recent research indicates that people with epilepsy engage in online health information seeking at higher rates than many other patient groups, underscoring the importance of understanding how LLM responses might influence their perceptions and behaviors. However, there are both potential benefits and inherent limitations of LLMs in epilepsy care, as shown by recent review articles. [4,5,6,7,8,9]

Despite the growing body of literature on LLMs in medicine, they remain insufficiently reliable for routine, uncontrolled use. A notable gap exists: few studies evaluate LLMs from the patient's perspective, particularly using real-world data drawn from public forums. Our study is designed to address this gap. Specifically, we assess whether responses generated by OpenAI's ChatGPT-4o meet the needs of people with epilepsy who ask questions on r/AskDocs. Physicians serve as expert evaluators, not to arbitrate "on behalf of patients," but to operationalize criteria of quality, utility, accuracy, and safety in line with real user needs. We argue that this design places the patient–LLM relationship at the center of the analysis, while leveraging medical expertise to standardize evaluation metrics and identify areas where safeguards or clinical verification remain necessary. In this framework, Reddit provides a natural, heterogeneous, and timely source of patient queries, enabling an evaluation of LLM responses under conditions that approximate the realities of everyday patient information-seeking. [3,7,10,11]

2. Material and Method
In the initial phase of the study we collected a total of 110 patient questions from the subreddit r/AskDocs, one of the more active medical communities on Reddit with over half a million active participants. Questions were identified using a filtered search with the keywords "epilepsy", "seizure", and "seizure disorder". To ensure quality, only posts submitted within the past 12 months that received at least one verified physician response (marked with the flair "physician responded") were included. Of the 110 selected questions, 4 were excluded as duplicates or irrelevant to the subject matter.

For each selected question, a response was generated using ChatGPT-4o. These responses were then independently evaluated by four certified physicians: one neurologist, one radiologist, one neurology resident, and one radiology resident. The raters were blinded to each other's assessments and did not consult each other during the evaluation process. Inter-rater agreement was assessed using Fleiss' kappa, with minimal discrepancies observed among evaluators. Evaluations used predefined dimensions rated on a modified Likert scale (1–5): Accuracy, Comprehensiveness, Clarity, Empathy, and Relevance. Additional dimensions were assessed using categorical (Yes/No) ratings: Reasoning, Currency, Bias, Harm, Factuality, Fabrication, Falsification, and Plagiarism.
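For readers who want to reproduce the agreement analysis, the sketch below shows one way to compute Fleiss' kappa with statsmodels; the rating matrix is invented for illustration, as we do not know the paper's exact data layout.

```python
# One way to compute the Fleiss' kappa agreement check described above.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows: questions; columns: the four raters' 1-5 Likert scores (hypothetical)
ratings = np.array([
    [5, 4, 5, 5],
    [4, 4, 4, 5],
    [3, 4, 3, 4],
    [5, 5, 5, 5],
])
table, _ = aggregate_raters(ratings)  # per-question counts of each category
print(fleiss_kappa(table))            # 1 = perfect agreement, 0 = chance level
```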
3. Results

Overall, the raters scored ChatGPT 4.0 responses very positively, with approximately 80% of answers classified as "good" or "very good" across all Likert-scale dimensions. Most answers were considered factually correct; we found no responses to be incorrect. Most answers were thorough and easily understandable, with language the raters judged accessible across educational levels. We found no instances of outdated recommendations, and all responses were deemed concise, without unnecessary or overbearing detail.

Regarding the categorical measures, we did not find any cases of bias, harm, fabrication, or falsification. All answers gave information that could be easily verified against standard medical sources. The lowest-scoring dimension was empathy: most answers were on average good or decent, with no responses being explicitly poor. Altogether, these results suggest that ChatGPT 4.0 is capable of generating accurate, clear, and relevant responses to patient questions about epilepsy, with the primary limitation lying in the domain of empathetic responses.

[Figure: Evaluation of ChatGPT 4.0 responses, average values of the 4 raters per category (Accuracy, Comprehensiveness, Clarity, Empathy, Relevance), on a scale from 0 to 5.]

4. Discussion

In this study, we examined the usability of responses generated by ChatGPT 4.0 in comparison to neurologists' answers to patient questions about epilepsy on Reddit, specifically the subreddit r/AskDocs. This community is one of the largest and most active health forums online, with over half a million members and hundreds of new patient questions submitted daily. A particular strength of this platform lies in its anonymity: users can ask sensitive medical questions more openly than they might in a clinical encounter, which results in a broader and more candid spectrum of concerns. Additionally, r/AskDocs is actively moderated and follows strict rules: medical advice is permitted only from verified physicians (marked by a special flair), while other users are restricted to sharing personal experiences. This structure ensures a basic level of quality control and provides a reliable basis for comparing physician responses with those of ChatGPT. We believe this makes r/AskDocs a relevant and valid environment for evaluating the potential of large language models (LLMs) in a medical setting.

Our findings complement recent research by Fennig and colleagues [12], in which LLMs were used to analyze tens of thousands of Reddit posts to identify topics and concerns that epilepsy patients often do not bring up in clinical settings. That work found significant patterns, such as stigma, emotional distress, substance use, and seizure description: high-engagement topics that fall outside standard outpatient conversations and are often not given adequate space in the clinical encounter. This confirms that LLMs are useful not only for providing answers but also for gaining a deeper understanding of patient needs, which further justifies the use of r/AskDocs as a source of realistic and relevant questions for our study.

Our findings indicate that ChatGPT 4.0 generally provides accurate, relevant, and comprehensive answers. Importantly, no response was deemed explicitly incorrect, underscoring the potential of such tools to deliver reliable medical information for patients with epilepsy. However, the model consistently showed weaker performance in conveying empathy compared to physicians. This limitation has been noted in previous studies, which emphasize that while LLMs can reproduce medical content accurately, they struggle to replicate the human aspects of communication such as reassurance, compassion, and emotional support. [1,6,8] The overall impression of the neurologists was that the ChatGPT 4.0 responses were mostly "acceptable" or "good", while a smaller number were rated as "very good". Nevertheless, doctors generally gave somewhat better answers, although the difference was not large.
This finding is consistent with the results of a study by Ayers and colleagues, who also found that chatbot responses can be of similar or even better quality in certain dimensions, but with limitations in empathy. [1] It is important to point out that our results should be seen in the context of the increasing number of patients using the Internet for epilepsy information and potentially changing therapy based on information obtained online. Previous studies have shown that patients with epilepsy frequently search the Internet to learn more about their disease [3,4], while more recent studies indicate a high rate of use of digital sources of health information in this population. [5] Precisely because of this, the ability of large language models to generate correct and comprehensible answers is of particular importance.

Even though our findings are encouraging, it is necessary to emphasize the potential risks. The literature on LLMs in medicine warns of the phenomenon of "hallucinations", i.e., giving confident but incorrect answers. [6,7] Although in our series no answer was explicitly wrong, such cases cannot be excluded in a larger sample, especially in more complex clinical scenarios. In addition, a critical review of LLMs in epileptology indicates that current tools may be useful for patient and physician education, but are not ready for routine, uncontrolled clinical application. [8]

Finally, it should be emphasized that the focus of our study was the attitude of patients towards the responses of LLMs, while doctors had the role of mediators in quality assessment. This perspective can be significant for future research, as it opens up space for a better understanding of how patients value and perceive such tools compared to traditional medical sources.

5. Conclusion

This study demonstrates that ChatGPT 4.0 provides responses to patient questions about epilepsy that are largely accurate, relevant, clear, and comprehensive. However, the limitations observed - especially regarding emotional support and nuanced communication - highlight that ChatGPT cannot replace professional medical consultation. Instead, its role should be considered complementary, supporting patient education and engagement, while final interpretation and guidance remain within the responsibility of qualified healthcare professionals.

Acknowledgements

Views and opinions expressed in this paper are those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor any other authority can be held responsible for them. All authors contributed equally to the final version of this paper.

References

1. Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183(6):589–96. doi: 10.1001/jamainternmed.2023.1838
2. Pew Research Center. Mobile technology and home broadband 2021. Available from: https://www.pewresearch.org/internet/2021/06/03/mobile-technology-and-home-broadband-2021/
3. Omar M, Sorin V, Collins JD, et al. Multi-model assurance analysis showing large language models are highly vulnerable to adversarial hallucination attacks during clinical decision support. Commun Med (Lond). 2025;5:44. doi: 10.1038/s43856-025-01021-3
4. Liu J, Dong X, Mao Y, et al. Internet usage for health information by patients with epilepsy. Epilepsy Behav. 2013;29(1):110–3. doi: 10.1016/j.seizure.2013.06.007
5. Brigo F, Erro R, Marangi A, et al. Why do people Google epilepsy? An infodemiological study of online behavior for epilepsy-related search terms. Epilepsy Behav. 2015;45:128–33. doi: 10.1016/j.yebeh.2013.11.020
6. Bingöl N, Mutluay FK, Erbaş O. Determining the health-seeking behaviors of people with epilepsy. Epilepsy Behav. 2024;152:109331. doi: 10.1016/j.yebeh.2024.110063
7. Bélisle-Pipon JC, et al. Why we need to be careful with large language models in medicine and healthcare. AI & Ethics. 2024. doi: 10.3389/fmed.2024.1495582
8. Asgari E, Montaña-Brown N, Dubois M, et al. A framework to assess clinical safety and hallucination rates of large language models for medical text summarization. npj Digit Med. 2025;8(1):19. doi: 10.1038/s41746-025-01670-7
9. García-Azorín D, Bhatia R, et al. Potential merits and flaws of large language models in epilepsy care: A critical review. Epilepsia. 2024;65(4):873–886. doi: 10.1111/epi.17907
10. The Verge. Google's healthcare AI made up a body part. The Verge. 2025 May 7. Available from: https://www.theverge.com/2025/05/07/google-healthcae-ai-made-up-body-part
11. Auvin S, Nabbout R, et al. Quality of health information about epilepsy on the Internet. Arch Pediatr. 2013;20(6):603–7. doi: 10.1016/j.neurol.2012.08.008
12. Fennig U, Yom-Tov E, Savitzky L, Nissan J, Altman K, Loebenstein R, Boxer M, Weinberg N, Gofrit SG, Maggio N. Bridging the conversational gap in epilepsy: Using large language models to reveal insights into patient behavior and concerns from online discussions. Epilepsia. 2025;66(3):686–699. doi: 10.1111/epi.18226

Evaluating Large Language Models for Privacy-Sensitive Healthcare Applications

Tadej Horvat, Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia, tadej.horvat@ijs.si
Žan Roštan, Fakulteta za računalništvo in informatiko, Ljubljana, Slovenia, zan.rostan@gmail.com
Jakob Jaš, Fakulteta za računalništvo in informatiko, Ljubljana, Slovenia, jakob.jas06@gmail.com
Matjaž Gams, Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia, matjaz.gams@ijs.si

Abstract

Large language models (LLMs) are being systematically evaluated for accuracy in clinical use, yet privacy risks, limited transparency, and operational variability still complicate their adoption on sensitive health data. Motivated by an intended deployment in HomeDOCtor, a Slovenian medical platform, we present an agenda for evaluating LLMs in real-life privacy-sensitive healthcare applications. First, we map privacy risks (training-data extraction, input leakage, and output re-identification) and outline concrete mitigations (red-teaming, canary strings, differential privacy, filtering, and structured prompts). Second, we propose a lightweight, reproducible evaluation protocol that pairs model-side privacy checks with clinician-in-the-loop utility and safety assessments on de-identified data, aligned with EU GDPR expectations. Third, using small, domain-specific, clinically grounded benchmarks, we compare frontier, commercial, and open-weight models and analyze trade-offs among utility, privacy, and maintainability in the HomeDOCtor context. Finally, we discuss deployment and governance patterns for healthcare operators (access control, audit logging, data minimization, incident response). Our results suggest that (i) focused, task-specific evaluations are more informative than generic worldwide benchmarks for patient-facing use; (ii) suitably hardened and monitored open-weight models can be viable, although their quality is not comparable to top commercial systems; and (iii) privacy risk cannot be eliminated but can be bounded and operationalized. We conclude with recommendations for ethics approvals, documentation, and reproducibility to support safe adoption in Slovenia and beyond.

Keywords

Artificial intelligence (AI); Large language models (LLM); Healthcare chatbot; Privacy; GDPR; Open-weight models; GPT; HomeDOCtor; Retrieval-augmented generation (RAG); HealthBench; Humanity's Last Exam; LLM IQ.

https://doi.org/10.70314/is.gptzdravje.5
1 Introduction

Recent evaluation work has shifted from saturated multiple-choice tests toward clinically grounded, contamination-limited settings such as HealthBench, which provides physician-scored multi-turn health dialogues spanning triage safety, clinical appropriateness, and grounding [1, 2]. This shift is critical because theoretical knowledge, often tested in exams, does not guarantee safe or effective application in the nuanced, interactive context of patient care. Ensuring that evaluation benchmarks are not compromised by training data contamination is essential for obtaining a true measure of a model's clinical reasoning abilities. To probe general reasoning under uncertainty beyond strictly medical content, Humanity's Last Exam (HLE) evaluates graduate-level, closed-ended questions and remains far from ceiling performance on the public leaderboard, revealing sizeable headroom [3, 4]. A complementary lens comes from the TrackingAI community's LLM IQ distribution, which aggregates an offline quiz to profile breadth and robustness outside familiar exam sets [5]. Triangulating these different evaluation types (clinical dialogue, academic reasoning, and general IQ) provides a more holistic view of a model's true capabilities.

In the EU, privacy-preserving deployment for patient data is governed primarily by the General Data Protection Regulation (GDPR) [6]. Health data falls under special categories (Article 9), requiring both a valid legal basis (Article 6) and a specific condition under Article 9(2), with principles like data minimisation and purpose limitation being central to system design [6]. While the US Health Insurance Portability and Accountability Act (HIPAA) remains relevant in cross-border collaborations, GDPR is the operative legal framework for Slovenia and most of Europe [6, 7].

As a concrete application context, Slovenia's HomeDOCtor, our nationally localized, RAG-grounded health assistant, provides a real-world test bed for evaluating LLMs under GDPR-first constraints [8]. This system allows for planning a staged migration to locally hosted open-weight models, balancing state-of-the-art performance with stringent data sovereignty requirements [8]. We synthesise official HealthBench results and model cards to compare closed frontier models with competitive open-weight models on clinically oriented tasks [1, 2, 9, 10, 11]. We position these findings alongside HLE and community LLM IQ scores to characterise remaining reasoning headroom and out-of-distribution robustness [3, 4, 5]. Finally, we integrate a HomeDOCtor case study and provide a GDPR-first deployment blueprint toward zero-egress, on-premise inference with local retrieval, minimising persistent identifiers and aligning with EU data protection obligations [6, 7, 8].
2 Background and Related Work

The development of benchmarks like HealthBench, with its 5,000+ multi-turn conversations scored against physician rubrics, marks a significant maturation in LLM assessment [1]. It moves beyond simple accuracy to measure critical aspects like triage safety, clinical appropriateness, and evidence grounding [1, 2]. Official releases consistently report comparative scores across a range of closed and open-weight models, providing a standardized basis for comparison [2]. To combat the ever-present issue of benchmark contamination, harder alternatives such as LiveBench continually refresh questions and demand verifiable ground truth, mitigating the risk that models simply memorize answers from their training data [12].

Peer-reviewed studies provide further context on model ability on static, image-based medical exams (e.g., USMLE-style questions) [13]. However, these studies also consistently underline that high exam accuracy is not a direct proxy for clinical safety or real-world utility in dynamic, patient-facing deployments [13]. This distinction is vital, as real-world healthcare conversations are rarely as structured as multiple-choice questions.

Classic audits of earlier-generation symptom checkers established a crucial performance baseline, documenting generally low primary diagnostic accuracy and a tendency toward overly risk-averse triage recommendations [14, 15]. Modern LLM-based systems, enhanced with appropriate guardrails and techniques like Retrieval-Augmented Generation (RAG), are expected to significantly surpass this baseline in real-world use cases [14, 15]. Nationally localized assistants like HomeDOCtor have already demonstrated the value of RAG, which grounds model responses in curated, country-specific guidelines and style guides, thereby improving clinical alignment and fostering user trust in live deployments [8].

3 Methods

We aggregate official benchmark reports, model cards, and public leaderboards to assemble a clinically relevant, privacy-aware comparison of leading LLMs. Our methodology is centered on a synthesis of existing, credible data sources to provide a holistic view of model performance.

Specifically, we extract HealthBench and HealthBench-Hard scores from official releases and model documentation where available [1, 2]. These benchmarks are chosen for their clinical relevance and physician-led scoring rubrics [1]. We also include findings from USMLE-style evaluations to provide broader context on model knowledge of standardized medical exams [13]. We contrast frontier closed models (e.g., GPT-5, o3, GPT-4o) with leading open-weight systems (e.g., GPT-OSS-120B/20B) where credible public results exist [9, 10, 11].

To assess capabilities beyond the medical domain, we incorporate HLE results from the public leaderboard, which reflect general, closed-ended academic reasoning headroom [3, 4]. This benchmark helps characterize a model's ability to reason from first principles on complex, graduate-level topics [3]. We also reference the community-driven LLM IQ distribution from TrackingAI to provide an additional out-of-distribution snapshot of breadth and robustness on a novel offline quiz designed to resist training data contamination [5]. The triangulation of these benchmarks (one clinical, one academic, one general) is intentional, designed to provide a multi-faceted profile of each model.

To ground these benchmark results in practice, we analyze the HomeDOCtor deployment [8]. In this real-world setting, the core LLM component is swapped while holding the Retrieval-Augmented Generation (RAG) corpus, prompts, and UI/UX constant [8]. This approach effectively isolates the performance deltas attributable to the model itself within a stable, GDPR-first operational environment [8].
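As an illustration of this synthesis step, the following minimal sketch (our illustration, not the authors' pipeline) collates the officially reported scores into a single comparison frame; the values are those of Tables 1 and 2 below:

# A minimal sketch of collating officially reported benchmark scores into one
# comparison frame; values are copied from Tables 1 and 2 of this paper.
import pandas as pd

scores = pd.DataFrame(
    [   # model, HealthBench %, HealthBench-Hard %, HLE score
        ("GPT-5 (thinking)", 67.2, 46.2, 25.32),
        ("o3",               59.8, 31.6, 20.32),
        ("GPT-OSS 120B",     57.6, 30.0,  9.04),
        ("GPT-OSS 20B",      42.5, 10.8,  7.24),
        ("GPT-4o",           32.0,  0.0,  2.72),
    ],
    columns=["model", "healthbench", "healthbench_hard", "hle"],
)

# rank by the hardest clinical subset, the main frontier differentiator
print(scores.sort_values("healthbench_hard", ascending=False))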
4 Results

The collected data reveal a clear performance hierarchy: frontier models excel on the most complex tasks, but high-quality open-weight models are closing the gap, particularly for routine applications.

Table 1: HealthBench and HealthBench-Hard scores as reported in official materials.

Model               HealthBench (%)   HealthBench-Hard (%)
GPT-5 (thinking)    67.2              46.2
o3                  59.8              31.6
o4-mini             50.1              17.5
o1                  41.8              7.9
GPT-4o              32.0              0.0
GPT-OSS 120B        57.6              30.0
GPT-OSS 20B         42.5              10.8

On the hardest, physician-scored subset (HealthBench-Hard), GPT-5 currently leads in official postings with a score of 46.2%, significantly ahead of other models, as presented in Table 1 [1, 9]. The leading open-weight model, GPT-OSS-120B, achieves a respectable 30.0%, trailing the frontier but remaining competitive against mid-tier closed models [2, 10]. On the standard HealthBench, these performance gaps narrow further, suggesting that while the most advanced alignment and post-training strategies in frontier systems are key differentiators on challenging dialogues, high-quality open-weight models already cover many routine health tasks effectively when deployed with appropriate guardrails [1, 2].

Table 2: Results from Humanity's Last Exam (HLE), which measures closed-ended reasoning across diverse graduate-level topics.

Model                             HLE score   Uncertainty
GPT-5 (2025-08-07)                25.32       ±1.70
Gemini 2.5 Pro Preview (06-05)    21.64       ±1.61
o3 (high) (Apr 2025)              20.32       ±1.58
GPT-5 mini (2025-08-07)           19.44       ±1.55
o4-mini (high) (Apr 2025)         18.08       ±1.54
Gemini 2.5 Flash (Apr 2025)       12.08       ±1.28
GPT-OSS 120B                      9.04        ±1.12
o1 (Dec 2024)                     7.96        ±1.06
GPT-OSS 20B                       7.24        ±1.05
GPT-4.5 Preview                   5.44        ±0.89
GPT-4.1                           5.40        ±0.89
GPT-4o (November 2024)            2.72        ±0.64

Table 2 summarises leaderboard entries with central estimates and uncertainty; it again places GPT-5 at the top, with a score of 25.32 [4, 9]. Notably, the performance of the open-weight GPT-OSS models (9.04 for 120B and 7.24 for 20B) is substantially lower than that of the top closed systems on this general reasoning benchmark [4, 10]. This highlights the significant "reasoning headroom" that still exists and complements the clinical focus of HealthBench by probing for non-medical breadth and analytical depth.

[Graph 1: IQ Scores by Model (Mensa Norway, TrackingAI)]

Beyond clinical dialogue benchmarks, TrackingAI in collaboration with Mensa Norway provides an independent assessment of general reasoning ability through the LLM IQ test. Unlike standard leaderboards, this offline quiz is carefully designed to resist training-data contamination, thereby capturing model robustness on unfamiliar out-of-distribution problems [5].

Taken together, HealthBench (clinically grounded dialogue), HLE (broad closed-ended reasoning), and the TrackingAI Mensa Norway distribution (community offline quiz, Graph 1) triangulate model capabilities [1, 2, 3, 4, 5]. The consistent pattern is that closed frontier models currently lead on the most difficult and nuanced subsets of tasks. Simultaneously, strong open-weight models such as GPT-OSS-120B have become highly competitive for routine health dialogues and, crucially, enable the on-premise, privacy-first deployments required under regulatory frameworks like GDPR [10].
5 Privacy and Deployment

Legal bases and special categories. For EU deployments, processing health data is strictly regulated [6]. It requires both a valid Article 6 legal basis (e.g., consent, vital interest) and a specific condition under Article 9(2) for special categories of data [6]. Common conditions include medical diagnosis or care, public interest in public health, or explicit consent for specific, clearly defined purposes [6]. The core GDPR principles of data minimisation, purpose limitation, storage limitation, integrity/confidentiality, and accountability must be the primary drivers of the system's design and architecture [6].

Architectural patterns. A zero-egress architecture is the gold standard for privacy, ensuring Personal Health Information (PHI) never leaves an on-premise or sovereign (EU) Virtual Private Cloud (VPC) trust boundary. In this pattern, retrieval-augmented generation (RAG) queries local, audited knowledge stores, and system logs are tightly scoped and automatically rotated with strict retention policies. Any identifiers are filtered, pseudonymized, or transformed before any optional external calls (e.g., for non-clinical functionality), and long-term user profiles are avoided unless explicitly justified by the use case and supported by a Data Protection Impact Assessment (DPIA) [6]. Where collaboration with U.S. partners is necessary, HIPAA concepts can inform mappings of safeguards; GDPR remains the governing regime for legal obligations and data-subject rights in Slovenia and the EU [6, 7].

Controls and assurance. Recommended technical and organizational controls include strict role-based access, end-to-end encryption (in transit and at rest), Data Loss Prevention (DLP) for prompts and outputs, and continuous red-teaming by safety evaluators focused on clinical harms. Governance is maintained through formal DPIAs and detailed records of processing activities for higher-risk use cases, with continuous evaluation on HealthBench-style test sets to monitor for performance drift and ensure referral appropriateness [1, 2, 6].
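To make the filtering step concrete, the following minimal sketch (illustrative, not a production DLP component) pseudonymises obvious identifiers before any optional external call; the regular-expression patterns are our assumptions, and a real deployment would use a dedicated PHI-detection service:

# A minimal sketch of pseudonymising identifiers before egress; the patterns
# below are illustrative assumptions, not HomeDOCtor's actual filter.
import re

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d /-]{7,}\d"),
    # hypothetical pattern for a 13-digit national personal ID
    "person_id": re.compile(r"\b\d{13}\b"),
}

def pseudonymise(text: str) -> str:
    """Replace matched identifiers with typed placeholders before egress."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label.upper()}>", text)
    return text

print(pseudonymise("Kontakt: jana@example.si, tel. +386 40 123 456"))
# -> "Kontakt: <EMAIL>, tel. <PHONE>"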
5.1 Case Study: HomeDOCtor

HomeDOCtor is our implementation of a home doctor medical service that integrates a Flutter front-end, a FastAPI backend, and a Redis Stack vector database that powers the RAG system [8]. The knowledge base is composed of curated Slovenian clinical sources, including the national Manual of Family Medicine, public treatment protocols, official discharge instructions, and the Insieme ontology. During operation, prompts inject the top 3-5 retrieved text snippets into a structured template to generate grounded, locally relevant replies.

Privacy-by-design. To align with GDPR and national constraints, interactions are deliberately stateless and anonymous. No user data are retained beyond the active session, and no longitudinal profiles are created. This design choice maximizes privacy at the cost of convenience (e.g., users must re-enter data each session), but it drastically simplifies regulatory compliance [8].

Model-agnostic orchestration. The architecture is model-agnostic: the same RAG corpus, prompts, and UI can support multiple LLMs (e.g., GPT-4o, o3-mini-high, Gemini 2.5, Gemma 3 via Ollama) [8]. This enables direct, like-for-like performance comparisons in a stable pipeline and creates a clear path toward fully local inference on open-weight models using standardized orchestration tools.

Empirical performance. On 100 international clinical vignettes (Avey AI), HomeDOCtor variants using GPT-4o and o3-mini-high achieved 99/100 Top-1 accuracy. An open-weight-friendly variant (e.g., using Gemma 3) reached a competitive 95/100 [8]. On a 150-question national internal-medicine test set, HomeDOCtor with GPT-4o scored 136/150, significantly outperforming a baseline of ChatGPT-4o at 121/150 (p=0.0135, Bonferroni-adjusted), demonstrating the power of RAG with local sources.

Operational notes. In a six-month nationwide deployment, the system delivered sub-3-second average responses, provided multilingual support, and garnered positive user feedback. This illustrates the feasibility of providing 24/7 citizen guidance under strict privacy constraints using modern AI architecture.
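For illustration, a minimal sketch of the structured-template step described above follows; the template wording and function names are ours, not HomeDOCtor's actual code:

# A minimal sketch of grounding a reply in the top retrieved snippets; the
# template text and names are illustrative assumptions.
TEMPLATE = """You are a Slovenian medical assistant. Answer using ONLY the
sources below; if they are insufficient, advise seeing a physician.

Sources:
{sources}

Question: {question}
Answer:"""

def build_prompt(question: str, snippets: list[str], k: int = 5) -> str:
    """Inject the top-k retrieved snippets (here 3-5) into the template."""
    sources = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets[:k]))
    return TEMPLATE.format(sources=sources, question=question)

prompt = build_prompt(
    "Kaj naj naredim ob visoki vročini?",  # "What should I do about a high fever?"
    ["Snippet from the Manual of Family Medicine ...",
     "Official treatment protocol excerpt ..."],
)
print(prompt)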
6 Discussion

In this section, we analyse three overarching themes, beginning with the tension between capability and compliance.

Capability vs. compliance trade-offs. Our findings highlight a central trade-off in applied healthcare AI [1, 2, 6, 9, 10]. Closed, state-of-the-art models retain a performance edge on the most difficult, clinically scored dialogues [1, 9]. However, strong open-weight models are approaching parity on more routine tasks and, critically, enable the fully local, zero-egress inference that is often a decisive factor for PHI-heavy workloads under strict GDPR constraints [2, 6, 10]. The lower recurring costs and greater control offered by self-hosting can also be compelling for public healthcare systems.

Open-weight gap and trajectory. In HealthBench-Hard, the performance gap between a strong open-weight model (GPT-OSS-120B) and the frontier (GPT-5) is on the order of ~16 percentage points [1, 9, 10]. This gap narrows substantially on the broader HealthBench benchmark and in applied, RAG-powered systems like HomeDOCtor, where curated local data can significantly boost performance [1, 2, 8]. This suggests that a key strategy for closing the gap is not just using larger open-weight models, but also investing in high-quality, domain-specific fine-tuning and retrieval augmentation.

Evaluation breadth. HLE and LLM IQ results highlight the residual headroom and robustness variance that exist outside the strictly clinical domain [3, 4, 5]. A model that excels at medical Q&A may still lack the general reasoning capabilities needed for more complex, multi-faceted problems. Therefore, clinical deployments should prioritize systems that are well-grounded, calibrated, and know when to defer to a human expert, rather than extrapolating safety from generic reasoning benchmarks alone [14, 15]. Continuous, post-deployment monitoring against live data is essential to ensure ongoing safety and efficacy.

7 Conclusion

For EU healthcare applications, a GDPR-first architecture is legally essential [6]. In practice, this means local retrieval, zero-egress inference where feasible, tightly scoped, encrypted logging, and explicit, granular consent backed by a DPIA for any data persistence [6]. These guardrails underpin both legal compliance and public trust.

Evidence across HealthBench (clinical dialogue), HLE (broad reasoning), LLM IQ (offline quiz), and our HomeDOCtor deployment shows a consistent pattern: closed models still lead on the most demanding clinical subsets, but mature open-weight systems already support many routine, privacy-preserving workflows when paired with retrieval constraints, auditing, and output filters [1,2,3,4,5,8]. It should be noted, however, that the top (say 5) closed systems enable better open communication and reasoning in the Slovenian language, so there is a trade-off between quality and GDPR compliance between the two groups of systems. Nevertheless, we recommend a staged migration toward model sovereignty, gated by pre-defined safety and performance-parity criteria:

1. pilot zero-egress deployments;
2. move to managed on-prem hosting;
3. advance to fully self-hosted open-weight models once parity (utility, safety, privacy) is demonstrated and continuously monitored [1-15].

This strategy offers a pragmatic path for Slovenia and peers: to deploy self-hosted, sovereign medical AI assistants while upholding the highest standards of data protection and accountability. At the same time, citizens should have a free choice between the GDPR-dedicated system and the top commercial system in medical counselling.
Acknowledgements

We thank medical students Ivana Karasmanakis, Filip Ivanišević, and Lana Jarc for participating in the research, and Rok Smodiš, Matic Zadobovšek, and Domen Sedlar for helping with the development of the HomeDOCtor application. This project is funded by the European Union under Horizon Europe (project ChatMED, grant agreement ID: 101159214). The authors also acknowledge the financial support of the Slovenian Research Agency (research core funding No. P2-0209).

References

[1] Arora, R. K., Wei, J., Soskin Hicks, R., Bowman, P., Quiñonero-Candela, J., Tsimpourlas, F., et al. HealthBench: Evaluating Large Language Models Towards Improved Human Health. arXiv (2025). DOI: 10.48550/arXiv.2505.08775. URL: https://arxiv.org/abs/2505.08775
[2] OpenAI. Introducing HealthBench (May 12, 2025). URL: https://openai.com/index/healthbench/
[3] Phan, L., et al. Humanity's Last Exam (HLE). arXiv (2025). DOI: 10.48550/arXiv.2501.14249. URL: https://arxiv.org/abs/2501.14249
[4] Humanity's Last Exam. Official site and leaderboard. URLs: https://lastexam.ai/ and https://scale.com/leaderboard/humanitys_last_exam
[5] TrackingAI.org. LLM IQ – Offline quiz. URL: https://trackingai.org/
[6] GDPR. Article 9 – Processing of special categories of personal data. URL: https://gdpr-info.eu/art-9-gdpr/
[7] U.S. Department of Health & Human Services. Summary of the HIPAA Privacy Rule. URL: https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html
[8] Gams, M.; Horvat, T.; Kolar, Ž.; Kocuvan, P.; Mishev, K.; Simjanoska Misheva, M. Evaluating a Nationally Localized AI Chatbot (HomeDOCtor) for Slovenia: Performance, Privacy, and Governance. Healthcare 13(15):1843 (2025). DOI: 10.3390/healthcare13151843. URL: https://www.mdpi.com/2227-9032/13/15/1843
[9] OpenAI. Introducing GPT-5 (Aug 7, 2025). URL: https://openai.com/index/introducing-gpt-5/
[10] Gemma Team (Google DeepMind). Gemma 3 Technical Report. arXiv (2025). DOI: 10.48550/arXiv.2503.19786. URLs: https://arxiv.org/abs/2503.19786 and https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf
[11] Google. Gemini 2.5: Our newest Gemini model with thinking (Mar 25, 2025). URL: https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
[12] White, C., Dooley, S., Roberts, M., Pal, A., Feuer, B., Jain, S., et al. LiveBench: A Challenging, Contamination-Limited LLM Benchmark. arXiv (2024/2025). DOI: 10.48550/arXiv.2406.19314. URLs: https://arxiv.org/abs/2406.19314 and https://livebench.ai/
[13] Yang, X., et al. The performance of ChatGPT on medical image-based assessments and USMLE sample items. BMC Medical Education 25, 495 (2025). DOI: 10.1186/s12909-025-07752-0. URL: https://bmcmededuc.biomedcentral.com/articles/10.1186/s12909-025-07752-0
[14] Semigran, H. L., Linder, J. A., Gidengil, C., Mehrotra, A. Evaluation of symptom checkers for self diagnosis and triage: audit study. BMJ 351:h3480 (2015). DOI: 10.1136/bmj.h3480. URL: https://www.bmj.com/content/351/bmj.h3480
[15] Wallace, W., Chan, A., Chou, R., Desai, S., Johnson, B., Shojania, K. Digital symptom checkers: diagnostic and triage accuracy: systematic review. NPJ Digital Medicine 5, 79 (2022). DOI: 10.1038/s41746-022-00667-w. URL: https://www.nature.com/articles/s41746-022-00667-w

IQ Progression of Large Language Models
Evaluating LLM Cognitive Capabilities: An Analysis of Historical Data with Future Projections

Jakob Jaš, Fakulteta za elektrotehniko, Univerza v Ljubljani, Slovenija, jakob.jas06@gmail.com
Matjaž Gams, Oddelek za inteligentne sisteme, Institut "Jožef Stefan", Slovenija, matjaz.gams@ijs.si

Abstract

Over the past few years, artificial intelligence (AI) has advanced rapidly in reasoning and problem-solving. Whereas earlier systems scored well below human averages on standardized benchmarks, recent large language models (LLMs) now match or sometimes exceed the performance of highly capable humans. This paper provides secondary analyses of IQ-style evaluations of leading models across both online (Mensa Norway) and offline test suites, gathered from an external aggregator. The results show a pronounced upward trajectory: models released within the last year frequently score in the top decile of the human distribution, a sharp rise from earlier generations that clustered around the mean. We map model scores to a Gaussian IQ scale to enable direct comparisons with human norms, examine month-over-month trends, and provide short-term projections of likely progress. Findings highlight rapid gains in general-purpose reasoning while underscoring the need for further, balanced measurement of machine intelligence.

Keywords

artificial intelligence, large language models, IQ, projection

https://doi.org/10.70314/is.2025.gptzdravje.6
1 Introduction

The past decade has seen a rapid acceleration in artificial intelligence (AI) research and deployment, transforming it from narrow task-specific systems into models capable of exhibiting broad general reasoning. Once limited to specialized domains such as translation and board games, AI systems now demonstrate competencies across multiple modalities, frequently outperforming humans in complex tasks [1].

Large language models (LLMs) have played a central role in this transition. Trained on massive corpora and increasingly multimodal data sources, LLMs have become benchmarks for general-purpose intelligence in machines [2]. Recent work has shown that models such as GPT-4o, Claude 3 Opus, and GPT-5-vision demonstrate reasoning abilities previously unattainable by artificial systems, raising the question of how to compare their progress with human cognitive measures [3,4]. Although domain-specific benchmarks such as MMLU, BigBench, or HELM provide structured evaluation environments [5], they remain primarily task driven. In contrast, IQ-style evaluations, though imperfect, offer a way to frame AI progress in human-familiar psychometric terms [6,7]. The relevance of this framing has grown in 2024-2025, as several independent initiatives (e.g., TrackingAI.org) began publishing standardized IQ-style assessments for frontier AI systems [8]. At the same time, the scientific community has debated whether such comparisons can be justified, given that human IQ tests measure a construct (the g-factor) tied to biological cognition, while AI systems lack embodiment or consciousness [9,10]. Yet, as recent research highlights, behavioural equivalence in reasoning and abstraction can still provide meaningful insights into the trajectory of machine intelligence [11,12,13].

This paper contributes by:

1. Mapping AI model performance on IQ-style benchmarks to the Gaussian human IQ distribution.
2. Analysing month-over-month progress between May 2024 and September 2025.
3. Projecting near-future trajectories of model performance.

By situating these findings in psychometric terms, we aim to provide both a quantitative and conceptual framework for tracking the rapid progression of machine intelligence.

2 Theory and methodology

2.1 Theoretical foundations

The emergence of general-purpose AI models capable of solving novel, cross-domain tasks has prompted a rethinking of how intelligence is defined and measured. Historically, intelligence has been assessed through psychometric methods, with the general intelligence factor (g-factor) introduced by Spearman in 1904 [10]. IQ tests were subsequently developed to capture this construct through tasks spanning verbal, spatial, logical, and mathematical reasoning. Scores are normalized on a Gaussian distribution with mean 100 and standard deviation 15, enabling population-level comparisons [14].

In AI research, traditional evaluation benchmarks have focused on task-specific accuracy, leaving a gap in assessments of general cognitive ability. Recent studies propose adapting psychometric frameworks to AI evaluation, both to contextualize results and to study cross-domain generalization [15,16]. While machines lack consciousness, subjective experience, and embodiment, their problem-solving behaviour can nevertheless be quantified against human reference distributions.
Thus, IQ-style testing is not employed here as a claim of human-equivalent cognition, but as a pragmatic and interpretable method for measuring progress in general reasoning.

2.2 Model selection

The study focuses on leading general-purpose AI systems released between May 2024 and September 2025, ensuring chronological comparability and representativeness of architectural innovation. Models were selected based on three criteria:

• Performance and frontier status: inclusion of systems at or near state-of-the-art benchmarks.
• Architectural diversity: coverage of both text-only LLMs (e.g., LLaMA, Mistral) and multimodal models (e.g., GPT-4o, Claude 3 Opus, GPT-5-vision).
• Data modality shifts: reflecting the move from unimodal to multimodal reasoning [17,18].

This selection enables analysis not only of absolute performance but also of how different architectures and modalities affect reasoning in IQ-like contexts.

2.3 Data source and collection

Performance data were collected from TrackingAI.org, an independent aggregator of psychometrically aligned AI test results [8]. TrackingAI provides transparent, standardized scores across two environments:

• Mensa Norway Online IQ Test: a publicly available timed reasoning test including logic, pattern recognition, and abstract problem-solving [19] (Figure 1).
• Offline IQ-style Test Set: a curated, private benchmark developed to reduce contamination risks from public datasets [20] (Figure 2).

Both test suites normalize results to an IQ-equivalent scale, enabling direct comparison with human distributions.

2.4 Scoring and Statistical Normalization

Model outputs were scored using the conventional IQ scale (mean = 100, SD = 15). Mensa results ranged from 85 to 145, while offline results spanned roughly 60-150. Normalization allowed consistent cross-model comparison and alignment with psychometric conventions [21]. Models were ordered chronologically, with top-five performers highlighted to track frontier progression.

Normalization to the human IQ scale can be defined as

z = (X − µ) / σ,    IQ = 100 + 15 · z,

and, when percentiles p are available,

IQ = 100 + 15 · Φ⁻¹(p).

Additionally, predictions were made using a jump diffusion model [22, 23] with an adjustable extremity factor e, which is used to scale all the dynamics of the projection. For all projections, this factor was set to 0.5, resulting in a more conservative estimate. 100 paths were plotted, and the mean path was additionally marked.
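A minimal sketch of this normalisation in code (ours, for illustration; the sample values are arbitrary) follows:

# A minimal sketch of mapping raw scores or percentiles to the IQ scale, as
# defined above; the input values are illustrative.
from statistics import NormalDist

def iq_from_raw(x: float, mu: float, sigma: float) -> float:
    """IQ = 100 + 15 * z, with z = (x - mu) / sigma."""
    return 100 + 15 * (x - mu) / sigma

def iq_from_percentile(p: float) -> float:
    """IQ = 100 + 15 * Phi^{-1}(p) for a percentile p in (0, 1)."""
    return 100 + 15 * NormalDist().inv_cdf(p)

print(iq_from_raw(28, mu=20, sigma=8))       # -> 115.0
print(round(iq_from_percentile(0.9088), 1))  # -> ~120, cf. Table 1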
3 Results

3.1 Gaussian Distribution Mapping

Figures 3 and 4 illustrate how AI model IQ scores align with the human Gaussian curve. Older systems cluster far left of the mean, corresponding to human IQs between 60 and 80. By contrast, the majority of 2025-era models lie at or above the human average. The distribution shows a clear rightward shift, with leading models positioned well into the 120+ range [24].

[Figure 1: IQ Scores by model - Mensa Norway]
[Figure 2: IQ Scores by model - Offline test]
[Figure 3: Human-like Gaussian Distribution of Models - Mensa Norway]
[Figure 4: Human-like Gaussian Distribution of Models - Offline Test]

3.2 Projected growth

Figure 5 shows monthly IQ-style test scores for top models on the Mensa and offline benchmarks between May 2024 and September 2025, along with linear fits and 12-month projections. Both benchmarks display consistent upward trends over time [25].

[Figure 5: Projected growth based on monthly top-model performance]

Mensa scores increased from approximately 80 in May 2024 to around 140 by September 2025, while offline scores rose from about 70 to 125 over the same period. Linear projections estimate Mensa scores reaching ~170 and offline scores ~145 by mid-2026 [26] (Tables 1, 2).

Table 1: Mensa-based projection of improvement

IQ Score   Date      % of people with higher scores
100        Dec. 24   50.00%
120        Jun. 25   9.12%
140        Nov. 25   0.38%
160        May 26    0.003%
170        Sep. 26   0.00015%

Table 2: Offline-based projection of improvement

IQ Score   Date      % of people with higher scores
100        Apr. 25   50.00%
120        Nov. 25   9.12%
140        Jun. 26   0.38%
145        Sep. 26   0.14%

A jump diffusion model, as seen in Figures 6 and 7, shows the mean projected IQ for Mensa-based data to be ~170 by late 2026 and ~154 for the offline test.

[Figure 6: Jump Diffusion Model on Mensa-Based Data]
[Figure 7: Jump Diffusion Model on Offline-Based Data]
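To convey the flavour of these projections, the following minimal sketch simulates jump-diffusion score paths; the drift and jump parameters are our assumptions, not the fitted values behind Figures 6 and 7:

# A minimal sketch of a jump-diffusion projection: Gaussian monthly drift plus
# Poisson-timed jumps, scaled by an extremity factor e, as in Section 2.4.
# All parameter values are illustrative assumptions.
import numpy as np

def jump_diffusion_paths(start, months, n_paths, mu=1.5, sigma=1.0,
                         jump_rate=0.15, jump_mean=8.0, jump_std=3.0, e=0.5):
    """Simulate monthly IQ-score paths; all dynamics are scaled by e."""
    rng = np.random.default_rng(42)
    paths = np.full((n_paths, months + 1), float(start))
    for t in range(1, months + 1):
        drift = rng.normal(mu, sigma, n_paths)       # diffusion term
        n_jumps = rng.poisson(jump_rate, n_paths)    # jump arrivals per month
        jumps = n_jumps * rng.normal(jump_mean, jump_std, n_paths)
        paths[:, t] = paths[:, t - 1] + e * (drift + jumps)
    return paths

paths = jump_diffusion_paths(start=125, months=15, n_paths=100)
print("mean projected score:", paths[:, -1].mean().round(1))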
4 Discussion

The results demonstrate a clear trajectory of accelerating gains in AI intelligence over the past 12 months, with performance on IQ-style benchmarks increasing at a pace that suggests sustained improvement. Both Mensa-based and offline test results reveal consistent upward trends, though with notable differences.

Firstly, Mensa-style evaluations reveal that even earlier-generation models retain relatively strong performance compared to newer systems, contrary to the offline test, where the majority of top-performing models came out very recently. One possible explanation for this is training data contamination [27], as the older models could have been trained on data sets containing information on Mensa's questions, which is not the case for the offline test, due to its privacy. The rise in the offline test's performance could therefore be attributed to improved model reasoning and overall better model quality. The second notable difference is the rate of growth: the steeper slope of the Mensa evaluations once again suggests that the public test may be affected by training-data contamination, whereas the private offline test seems to show more robust scores.

The Gaussian distribution plots further contextualize these results by positioning current models relative to human intelligence norms. While a majority of systems cluster around human-average IQ levels (90-110), several frontier models now extend significantly into the upper tail of the distribution, with offline IQ equivalents surpassing 120 and projections approaching 145-170 depending on the benchmark [28]. The jump diffusion models additionally support these predictions and even exceed them by nearly 10 IQ points in the offline test case. This marks a transition from models being predominantly below or near human-level reasoning ability to a subset consistently operating at or beyond the threshold typically associated with high human intelligence [29].

Data from the last 14 months show that frontier models went from scoring near or even below the human average (GPT-4 Omni, LLaMA-Vision) a year ago, to about average IQ in December 2024 and April 2025 (depending on the administered test), to now reaching the 140 and 125 IQ marks on the respective tests. Additionally, over the last six months, IQ scores grew by roughly 20 points on both tests [30]. Projections, seen in Tables 1 and 2, thus indicate that by late 2026, models will have surpassed the cognitive abilities of more than 99.87% of all living people based on the more conservative offline estimates, and more than 99.99% based on Mensa data.

Taken together, the findings indicate that AI has not only achieved expert-level performance on various machine benchmarks [31] but is now on a trajectory to surpass human performance across multiple modalities. The pace of this growth, particularly visible in the Mensa projections, raises questions about whether near-future systems may consistently score in ranges associated with the top fraction of human intelligence [32,33].

While the IQ analogy is attractive, due to the seemingly apparent comparisons we can draw between humans and AI, the shortcomings of IQ-based AI evaluation must also be addressed. Firstly, with IQ tests built around human cognition, an AI can, through pattern recognition, perform well on questions without displaying the underlying cognitive flexibility and reasoning skills. Additionally, the IQ test is a contested construct even when it comes to measuring human intelligence, as it may measure some aspects of our cognition but ultimately falls short when it comes to other skills such as emotional intelligence or creativity [34]. That is why the notion of "AI surpassing human IQ" might be misleading, as it stems from a false sense of comparability between test scores.
5 Conclusion

The provided data show evidence of rapid and consistent improvement in model performance between 2024 and 2025. Once positioned below or near the human mean, frontier systems now consistently operate well into the upper decile of the human distribution.

Projections indicate that if current growth trends continue, leading models could reach IQ equivalents in the 145-170 range within the next year, placing them firmly above most human intelligence levels. While methodological uncertainties remain, such as potentially inflated scores due to training data contamination, the opacity of private offline benchmarks, and the overall validity of the test, the general trajectory is unmistakable: AI systems are advancing at a pace that brings them into direct comparison with high human cognitive performance [35].

These findings highlight not only the acceleration of AI intelligence but also the need for better, machine-oriented evaluation methods. As models continue to expand in scale, modality, and capability, systematic monitoring of their cognitive growth will be essential for understanding both their potential and their societal implications.

Acknowledgements

We thank Tadej Horvat for his help. This research was supported by the European Union through the Horizon Europe programme, under the ChatMED project (Grant Agreement ID: 101159214). Additional support was provided by the Slovenian Research Agency through research core funding (Grant No. P2-0209).

References

[1] OpenAI. GPT-5 System Card. Technical Report. 2024. Available: https://cdn.openai.com/gpt-5-system-card.pdf
[2] Binz, M., & Schulz, E. (2024). Turning Large Language Models Into Cognitive Models. https://marcelbinz.github.io/imgs/Binz2024Turning.pdf
[3] Xu, Y., et al. (2025). Assessing Executive Function in AI Systems Using Cognitive Benchmarks. Cognitive Computation, 17(1). https://doi.org/10.1007/s12559-025-10200-6
[4] Creswell, A., Shanahan, M., & Kaski, S. (2025). Cognitive Architectures for Multistep Reasoning in LLMs. Journal of Artificial General Intelligence. https://doi.org/10.2478/jagi-2025-0003
[5] Ghosh, A., & Holyoak, K. J. (2025). Analogical Reasoning in Large Language Models: Limits and Potentials. Cognitive Science, 49(2). https://doi.org/10.1111/cogs.13301
[6] Binz, M., & Schulz, E. (2024). Evaluating Planning and Reasoning in Language Models. Nature Machine Intelligence. https://doi.org/10.1038/s42256-024-00896-1
[7] Lake, B. M., Ullman, T. D., & Tenenbaum, J. B. (2024). Symbolic reasoning in the age of deep learning. Annual Review of Psychology. https://doi.org/10.1146/annurev-psych-030322-020111
[8] TrackingAI.org. (2025). IQ-style Benchmark Results. Retrieved from https://trackingai.org
[9] Hernández-Orallo, J. (2017). Evaluation in Artificial Intelligence: From task-oriented to ability-oriented measurement. Artificial Intelligence Review, 48(3), 397–447. https://doi.org/10.1007/s10462-016-9505-7
[10] Spearman, C. (1904). General Intelligence, Objectively Determined and Measured. The American Journal of Psychology, 15(2), 201–293. https://doi.org/10.2307/1412107
[11] Kaller, C. P., Unterrainer, J. M., & Stahl, C. (2012). Assessing planning ability with the Tower of London task. Psychological Assessment, 24(1), 46–53. https://doi.org/10.1037/a0025174
[12] Shallice, T. (1982). Specific impairments of planning. Philosophical Transactions of the Royal Society B, 298(1089), 199–209. https://doi.org/10.1098/rstb.1982.0082
[13] Anthropic. (2024). Claude 3 System Card. Anthropic AI. Retrieved from https://www.anthropic.com
[14] Carroll, J. B. (1993). Human Cognitive Abilities: A Survey of Factor-Analytic Studies. Cambridge University Press.
[15] Xu, Y., et al. (2025). Benchmarking AI Cognition with Psychometric Tests. Cognitive Computation. https://doi.org/10.1007/s12559-025-10200-6
[16] Creswell, A., et al. (2025). Cognitive Benchmarks in LLMs. JAGI. https://doi.org/10.2478/jagi-2025-0003
[17] OpenAI (2025). GPT-5 Vision Technical Report. Retrieved from https://cdn.openai.com/gpt-5-vision.pdf
[18] Mistral AI (2025). Mistral Large System Card. Retrieved from https://mistral.ai
[19] Mensa Norway. (2025). Official IQ Test Description. Retrieved from https://mensa.no
[20] TrackingAI.org. (2025). Offline IQ-Style Dataset Description. https://trackingai.org/offline
[21] Binz, M., Schulz, E., & Lake, B. (2025). Toward Unified Cognitive Testing of AI Systems. Nature Reviews Psychology. https://doi.org/10.1038/s44159-025-00312-4
[22] Merton, R. C. (1976). Option pricing when underlying stock returns are discontinuous. Journal of Financial Economics, 3(1–2), 125–144. https://doi.org/10.1016/0304-405X(76)90022-2
[23] Grenander, U., & Miller, M. I. (1994). Representations of Knowledge in Complex Systems.
[24] Zhang, Y., & Marcus, G. (2025). Psychometric Perspectives on AI Evaluation. Frontiers in Artificial Intelligence, 8:155. https://doi.org/10.3389/frai.2025.00155
[25] Bubeck, S., et al. (2024). Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv preprint. https://doi.org/10.48550/arXiv.2303.12712
[26] Chollet, F. (2025). On the Measure of Intelligence Revisited. Journal of Artificial General Intelligence. https://doi.org/10.2478/jagi-2025-0011
[27] Bommasani, R., et al. (2025). The Foundation Model Evaluation Landscape. arXiv preprint. https://doi.org/10.48550/arXiv.2501.01001
[28] Shanahan, M., & Mitchell, M. (2024). Abstraction and Reasoning in AI Systems. Nature Reviews AI, 3, 567–579. https://doi.org/10.1038/s42256-024-00988-8
[29] Hernández-Orallo, J. (2025). Beyond Benchmarks: Toward Psychometric AI. Artificial Intelligence, 325, 104043. https://doi.org/10.1016/j.artint.2025.104043
[30] Binz, M., et al. (2025). Cognitive Scaling Laws in Large Language Models. Nature Machine Intelligence, 7, 445–456. https://doi.org/10.1038/s42256-025-00987-7
[31] Srivastava, A., et al. (2025). Beyond Task Accuracy: A Cognitive Benchmarking Paradigm for LLMs. Proceedings of NeurIPS 2025. https://doi.org/10.5555/neurips2025-12345
[32] Ghosh, A., et al. (2025). Analogical Limits in Transformer Models: Human vs. AI Reasoning. Cognitive Science, 49(3). https://doi.org/10.1111/cogs.13345
[33] Mitchell, M. (2025). The Future of AI Evaluation: Cognitive and Societal Challenges. AI & Society. https://doi.org/10.1007/s00146-025-01789-1
[34] Weiten, W. (2016). Psychology: Themes and Variations. Cengage Learning. p. 281.
[35] Chollet, F. (2024). Evaluating Progress Toward General Intelligence. Communications of the ACM, 67(12), 54–63. https://doi.org/10.1145/3671234

Extraction of Knowledge Representations for Reasoning from Medical Questionnaires

Emir Mujić, emir.mujic@tugraz.at; Alexander Perko, alexander.perko@tugraz.at; Franz Wotawa, wotawa@tugraz.at
Graz University of Technology, Institute of Software Engineering and Artificial Intelligence, Graz, Austria
(Authors are listed in alphabetical order. All authors contributed equally to this research.)

[Figure 1: Overview of Knowledge Extraction from the Medical Expert Dataset through Questionnaire Traversal: natural-language question/answer nodes from the questionnaire are turned, via questionnaire traversal and tree construction followed by structure and predicate extraction, into logical representations of diagnoses for reasoning, e.g., δ1 = (A ∧ B ∧ C) ∨ (A ∧ B ∧ D), ..., δn = A ∧ B ∧ ¬C ∧ ¬D.]

Abstract

Knowledge representations supporting reasoning are versatile and enable automated use cases such as testing and verification. In contrast to purely data-driven approaches to AI, logical reasoning is explainable. Logic for encoding knowledge yields tremendous potential because of a strong theoretical foundation, and there exist efficient solvers. However, within medicine, we do not find a publicly accessible corpus of expert knowledge encoded in logic. Construction of such a corpus usually requires manual effort and experts in the field, as well as in formal methods. In this work, we contribute by describing a methodology for the automated extraction of logical formulae through interacting with a questionnaire, which is based on a database curated by medical professionals. We propose to use tree traversal and automated predicate extraction from question/answer nodes comprising natural language. The proposed methods are already established in graph theory, natural language processing, and autoformalization. Hence, we use synergies from different research domains to enable the creation of a logical corpus of medical expert knowledge. With this concept paper, we lay the basis for future work and hope to contribute to use cases such as rigorous testing of large language models and other medical expert systems.

Keywords

knowledge representation, reasoning, decision trees, natural language processing, medical questionnaires

https://doi.org/10.70314/is.2025.gptzdravje.7
1 Introduction & Related Work

Logical formalisms, like First-Order Logic (FOL) or the Answer Set Programming (ASP) paradigm [12], can be used to encode knowledge enabling reasoning through theorem provers/solvers, such as Prover9 [15] or Clingo [11]. Having a logical knowledge base, one can easily query existing facts, check statements for consistency, and infer new knowledge. Consider now a medical knowledge base KB where symptoms are mapped to diagnoses such that one can infer a set of diagnoses given a set of facts about a person and a set of symptoms. Given a proper user interface, this can be directly used as an expert system. What is more, it can be used as a test oracle for comparisons with other medical expert systems, providing a transparent view of how diagnoses are made. Even more interestingly, we can evaluate large language models (LLMs) tasked with diagnosing a person given the same input, which we already demonstrated in earlier work [20, 19]. Although there exist benchmarks and datasets for question answering [14] and natural language inference [22] in medicine, we do not find a dataset that fulfills the described properties and is publicly available. Hence, our goal is to build such a knowledge base. As manually creating a gold standard dataset requires expert knowledge and is costly, we propose the automated enrichment of an existing database, which can be accessed through a questionnaire. More specifically, we show how to extract logical formulae from NetDoktor's „Symptom-Checker" questionnaire (SCQ) [21], which is curated by medical professionals and is based on the AMBOSS dataset [1]. Our methodology aims for automated formalization, i.e., autoformalization of knowledge encoded in natural language. Furthermore, we contribute by elaborating how to leverage the fact that tree representations can be converted into logical formulae [2]; vice versa, tree structures can be created from logical sentences [5]. A benefit of having a decision tree from a knowledge base is being able to exactly compute bias in the diagnoses (and the knowledge base), as well as the sufficient and necessary reasons behind decisions [9, 2], even in cases of trees with non-binary features (multiple-choice questions) [13]. That said, this work directly builds upon our earlier work [20], where we outline the concept of representing a medical questionnaire as a decision tree.
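As a concrete illustration of such a knowledge base, the following minimal sketch (ours, not the paper's corpus) encodes the gastroenteritis example worked out in Section 2.1 below in ASP, assuming the clingo Python package:

# A minimal sketch of a symptom-to-diagnosis knowledge base in ASP, queried
# through the clingo Python API; predicate names are illustrative.
import clingo

program = """
% temporary facts about the person (symptoms)
has(nausea). has(stomach_ache). has(diarrhea).

% expert rules mapping symptoms to diagnoses
diagnosis(acute_gastroenteritis) :- has(nausea), has(stomach_ache), has(fever).
diagnosis(acute_gastroenteritis) :- has(nausea), has(stomach_ache), has(diarrhea).
diagnosis(gastritis) :- has(nausea), has(stomach_ache),
                        not has(fever), not has(diarrhea).
"""

ctl = clingo.Control()
ctl.add("base", [], program)
ctl.ground([("base", [])])
# prints the facts plus the inferred diagnosis(acute_gastroenteritis)
ctl.solve(on_model=lambda m: print("Answer set:", m))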
At this point, it makes sense to introduce medical questionnaires & similar systems, such as chatbots: the main idea is to provide answers to a user given symptomatic and/or other information about a person. They are used by the general public and medical professionals alike, and their applications range from general health assessment through risk calculators to medical triage [16]. These systems often use different combinations of rule-based and data-driven approaches [3, 7]. Most recently, general-purpose as well as domain-specific LLMs are heavily utilized [17, 23, 6], which increases the demand for testing them. We argue that it makes sense to rely on an evaluation methodology that is fully understandable, deterministic, and finite to test non-deterministic, black-box systems such as LLMs. A pilot evaluation of ChatGPT [18] using SCQ can be found in our earlier work [20]. This brings us back to medical questionnaires in the classical sense, from which we will extract a logical knowledge base. Questions within a medical questionnaire can be distinguished in several ways. Namely, we distinguish by:
• Question format:
  – Open-ended questions (Type 1).
  – Closed-ended questions (Type 2).
• Fact permanence:
  – Questions about what a person is, which yield permanent facts about a person.
  – Questions about what a person has, which yield temporary facts about a person, i.e., symptoms.
• Question requirement:
  – Obligatory questions.
  – Optional questions, with an option to skip.
• Answer types:
  – Predefined options to answer.
  – Freeform answers (not present in SCQ).
Note that these categories are mutually exclusive within but not across the distinguished dimensions; e.g., in principle, it is possible to have obligatory or optional questions that are either open-ended or closed-ended. Having introduced the general problem and domain, we will now proceed with describing a methodology for the enrichment of an expert dataset with logical representations through tree traversal & basic semantic parsing.

2 Methodology
This work aims to automatically extract logical formulae from knowledge encoded in structured natural language. Thus, there are three parts to the proposed methodology:
(1) Construction of the tree structure, through filling out SCQ.
(2) Extraction of predicate names from natural language.
(3) Aggregation of formulae, through tree traversal.
While our methods are universally applicable to extracting knowledge from any questionnaire of a similar form, we base all elaborations on SCQ.

2.1 Tree Representations of Questionnaires
In this work, we represent medical questionnaires as decision trees. We first look at creating a simple tree T from SCQ, which corresponds to a session a user might have with the tool: The root node r(T) is always the question with which every new session is started: Um wen geht es? (Who is this about?). From this root node r(T), the tree branches down in a depth-first manner, starting with obligatory questions of Type 1, followed by optional Type 2 questions. The leaf node(s) l(T) represent a set Δ of diagnoses proposed by SCQ.
Given a tree T with a root node r(T), any number of regular nodes n_i(T), i = 1, ..., N−1, and leaf nodes l(T), a walk¹ [10] defines a "tree path structure" from the root to any other node, including the leaf node, i.e., the diagnosis possible within the system. Since we know that we can treat trees as graphical representations of logical formulae in disjunctive normal form (DNF), we can write that any tree path structure represents a world w that satisfies at least one diagnosis δ: w ⊨ δ. In other words, the models of any diagnosis δ, Mods(δ), are the set of variable assignments that lead to that diagnosis. In most cases, there will be more than one diagnosis given for a world w; we denote this as w ⊨ Δ, δ ∈ Δ, where Δ is a subset of all possible diagnoses, Δ ⊆ D.² The set of all diagnoses D is satisfied by the union of the worlds of all diagnoses, Mods(D) = ⋃_{j=0}^{M} w_j, where M is the number of possible diagnoses.

¹A walk in this context refers to its graph-theoretical definition: in a graph G(V, E), E ⊆ [V]², a walk is a sequence v₀e₁v₁ ... e_n v_n of alternating vertices and edges such that ∀i: e_i = {v_{i−1}, v_i}.
²In general: Δ ⊂ D. However, Δ ⊆ D iff w = {∅}.

We show a simple example: A diagnosis δ1 (acute gastroenteritis) is given as a result if a patient has nausea (A) and stomach ache (B) and either fever (C) or diarrhea (D). Another diagnosis δ2 (gastritis) is a result if a patient has nausea (A) and stomach ache (B) without fever (¬C) and without diarrhea (¬D). We can write this as a set of formulae in DNF:

δ1 = (A ∧ B ∧ C) ∨ (A ∧ B ∧ D),   (1)
δ2 = A ∧ B ∧ ¬C ∧ ¬D,

which we can represent as the decision tree shown in Figure 2.

[Figure 2: Example 1 as Decision Tree. Inner nodes A: Nausea, B: Stomach Ache, C: Fever, D: Diarrhea; leaves δ1 and δ2.]

In Figure 2, a full edge between any two variables represents a true assignment to the upper variable in the tree, based on which the lower variable follows. A dashed edge represents a false assignment to the upper variable, from which the lower variable follows. The walk highlighted in blue represents one possible instantiation of symptoms where the patient has nausea, a stomach ache, and diarrhea without fever. The three dots ("···") in Figure 2 denote parts of the tree that are not shown in the example but may exist in the complete tree representation. We would also like to point out that there may exist multiple walks to any single node in the tree, including the leaf nodes (w_i ⊨ Δ, w_j ⊨ Δ, w_i ≠ w_j, i ≠ j), something that is excluded from the example in Figure 2 for clarity.
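The correspondence between tree paths and DNF conjuncts can be sketched in a few lines of Python. The tuple encoding of Figure 2 below is our assumption for illustration, not the paper's data structure; enumerating the root-to-leaf walks recovers exactly the conjuncts of Equation (1).

# Decision tree of Figure 2: inner nodes are (variable, true_child, false_child),
# leaves are diagnosis names. The '...' branches of Figure 2 are omitted (None).
TREE = ("A", ("B", ("C",
                    "delta_1",                      # A ∧ B ∧ C
                    ("D", "delta_1", "delta_2")),   # A ∧ B ∧ ¬C ∧ D / ¬D
             None), None)

def paths(node, assignment=()):
    """Yield one (diagnosis, conjunct) pair per root-to-leaf walk."""
    if isinstance(node, str):                # leaf: a diagnosis
        yield node, " ∧ ".join(assignment)
        return
    var, true_child, false_child = node
    if true_child is not None:               # full edge: true assignment
        yield from paths(true_child, assignment + (var,))
    if false_child is not None:              # dashed edge: false assignment
        yield from paths(false_child, assignment + ("¬" + var,))

for delta, conjunct in paths(TREE):
    print(f"{conjunct}  ⊨  {delta}")
# A ∧ B ∧ C  ⊨  delta_1
# A ∧ B ∧ ¬C ∧ D  ⊨  delta_1
# A ∧ B ∧ ¬C ∧ ¬D  ⊨  delta_2

Note that the second conjunct for δ1 carries the explicit ¬C of its tree path; disjoining it with the first conjunct is logically equivalent to the more compact (A ∧ B ∧ C) ∨ (A ∧ B ∧ D) of Equation (1).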
Finally, we summarize how to extract a complete tree from SCQ, following a depth-first-search methodology: opening the first session with the questionnaire corresponds to creating a root node. This is followed by answering questions systematically, remembering all questions and answers, and adding corresponding nodes to the tree. At the end of one session, we are presented with a set of diagnoses, which represent the leaf nodes of the tree. This procedure is repeated until we have traversed the entire search space. For further explanations, we refer the interested reader to our previous work [20], which provides elaborations on SCQ and the extracted tree nodes. Due to the space limitations of visually representing large trees, we provide examples separately, which can be downloaded from Zenodo (https://doi.org/10.5281/zenodo.17058631).

2.2 Predicate Extraction
So far, we assumed the nodes of the constructed tree representation to be directly usable as predicates. However, as the nodes correspond to statements (e.g., sentences, words, or noun phrases) in natural language (NL), we first have to extract predicates. Moreover, in order to enable more than two answers per question, we extend the simplified tree structure from above by including separate answer nodes. Thus, we have three types of NL nodes: questions, corresponding answers, and diagnoses. Furthermore, we assume the relation of questions to their answers to be remembered, together with a basic classification of question types into "Type 1", i.e., open-ended, and "Type 2", i.e., closed-ended questions. This distinction can also be seen in Figure 3.

[Figure 3: Predicate Extraction through Parsing Functions for Different Question Types, & Diagnoses. (a) Type 1, open-ended: "B: What is your main symptom?" with answers Diarrhoea / Stomach Ache / Constipation; ParseType1(B) yields stomach_ache. (b) Type 2, closed-ended: "A: Do you have nausea?" with answers Yes / No / Skip; ParseType2(A) yields nausea or ¬nausea. (c) Diagnosis: δ1 Acute Gastroenteritis; ParseDiagnosis(δ1) yields acute_gastroenteritis.]

We define three node-level parsing functions: (1) ParseType1, (2) ParseType2, and (3) ParseDiagnosis, which are explained visually in Figure 3. We can simplify the autoformalization step, as the NL statements found in SCQ show very limited linguistic complexity. Therefore, we propose to use either naive semantic parsing or LLM-based predicate extraction. For the naive approach, one would simply return the object of a sentence (i.e., a single word or a whole noun phrase), modified for the formal language in question. ASP, as used in Clingo, for instance, demands predicates to be written in lower case and allows underscores for separating words in predicate names, which can be seen in Figure 3. Table 1 shows further examples of predicate extractions.
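A naive parsing pass of this kind can be sketched as follows. The normalization implements the ASP conventions just mentioned (lower case, underscore-separated), while the helper names and the Yes/No/Skip handling are our simplified reading of ParseType1 and ParseType2, mirroring the examples in Table 1.

import re

def to_predicate(phrase):
    """Normalize an NL phrase to an ASP-style predicate name,
    e.g. 'Stomach Ache' -> 'stomach_ache'. (German umlauts would
    need transliteration, which is omitted in this sketch.)"""
    return re.sub(r"[^a-z0-9]+", "_", phrase.lower()).strip("_")

def parse_type1(answer):
    # Type 1 (open-ended): the predicate is extracted from the selected answer.
    return to_predicate(answer)

def parse_type2(question_object, answer):
    # Type 2 (closed-ended): the question itself yields the predicate; the
    # answer only decides a potential negation. 'Skip' yields no fact at all.
    if answer == "Skip":
        return None
    negation = "" if answer == "Yes" else "not "   # rendered as ¬ in formulae
    return negation + to_predicate(question_object)

print(parse_type1("Snoring"))               # snoring
print(parse_type2("reddened skin", "No"))   # not reddened_skin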
2.3 Formula Aggregation
Continuing with the aggregation of the extracted predicates into logical formulae, we propose a simple algorithm, shown as Algorithm 1. The input is the (extended) tree T, or rather its root node r(T), and the output is a list of formulae corresponding to all paths in the tree, each comprising a persona and its symptomatic facts (which we subsume under "symptoms"), as well as the corresponding diagnoses.

Algorithm 1: SCQ Tree Traversal for Formula Aggregation
Input: Root node r(T) (assumed to be the first question)
Output: A list of all paths, corresponding to formulae: (i) a list of symptoms, and (ii) a list of diagnoses.
 1: function TreeTraversal(r(T))
 2:     Formulae ← [ ]                          ⊲ Final list of aggregated formulae
 3:     Visit(r(T), [ ], [ ], Formulae)
 4:     return Formulae
 5: end function
 6: function Visit(node, Symptoms, Diagnoses, Formulae)
 7:     if node.type = "Leaf Node" then
 8:         NewPredicates ← ParseDiagnosis(node)
 9:         Diagnoses ← Diagnoses ∪ {NewPredicates}
10:         append (Symptoms, Diagnoses) to Formulae
11:         return
12:     end if
13:     if node.type = "Question" then
14:         for each child in node.children do
15:             if node.subtype = "Type1" then
16:                 NewPredicates ← ParseType1(child)
17:             else if node.subtype = "Type2" then
18:                 NewPredicates ← ParseType2(node, child)
19:             end if
20:             Symptoms ← Symptoms ∪ {NewPredicates}
21:             Visit(child, Symptoms, Diagnoses, Formulae)
22:         end for
23:     end if
24: end function

As can be seen in Lines 1–5 of Algorithm 1, the depth-first search is started by calling the TreeTraversal function with r(T). Next, the Visit function (Lines 6–24) is called recursively, visiting all nodes on a path until it reaches each leaf node (Line 7). In the final list of formulae, which represents all paths, symptoms are assumed to be conjunctions, whereas diagnoses are assumed to be disjunctions. Both comprise parsed predicates and can now be joined to form strings, depending on the target formalism and solver/theorem prover.
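A Python transcription of Algorithm 1 might look as follows. It is a sketch under assumed node objects, not the authors' implementation, and it makes two details explicit that the pseudocode leaves implicit: answer nodes act as pass-throughs between a question and its follow-up nodes, and the accumulated symptom list is copied per branch so that sibling paths do not share state.

from dataclasses import dataclass, field
import re

def to_predicate(phrase):            # as in the Section 2.2 sketch
    return re.sub(r"[^a-z0-9]+", "_", phrase.lower()).strip("_")

@dataclass
class Node:
    type: str                        # "Question", "Answer", or "Leaf Node"
    subtype: str = ""                # "Type1" or "Type2" for questions
    label: str = ""                  # question object, answer, or diagnosis
    children: list = field(default_factory=list)

def tree_traversal(root):
    """Algorithm 1: one (symptoms, diagnoses) pair per root-to-leaf path."""
    formulae = []
    visit(root, [], [], formulae)
    return formulae

def visit(node, symptoms, diagnoses, formulae):
    if node.type == "Leaf Node":
        formulae.append((symptoms, diagnoses + [to_predicate(node.label)]))
        return
    for answer in node.children:          # answer nodes of a question
        if node.subtype == "Type1":       # predicate from the answer itself
            predicate = to_predicate(answer.label)
        elif answer.label == "Skip":      # Type 2: skipping yields no fact
            predicate = None
        else:                             # Type 2: question object, maybe negated
            negation = "" if answer.label == "Yes" else "not "
            predicate = negation + to_predicate(node.label)
        new_symptoms = symptoms + [predicate] if predicate else list(symptoms)
        for follow_up in answer.children:  # next question or leaf
            visit(follow_up, new_symptoms, diagnoses, formulae)

# Example: the Type 2 question of Figure 3(b) followed by a diagnosis leaf.
leaf = Node("Leaf Node", label="Acute Gastroenteritis")
root = Node("Question", "Type2", "nausea",
            [Node("Answer", label="Yes", children=[leaf])])
print(tree_traversal(root))  # [(['nausea'], ['acute_gastroenteritis'])]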
Table 1: Exemplary predicates by ID, extracted from question and answer tree nodes. For Type 1 questions, predicates are extracted from answers. Type 2 questions yield predicates directly, while (potential) negations are extracted from answers.

Tree Node ID | Type | Predicate          | Question                                                                                              | Selected Answer
1            | 1    | female             | Geht es um eine Frau oder einen Mann? / Is it about a woman or a man?                                 | Weiblich / Female
2            | 1    | head               | Wo treten die Beschwerden auf? / Where do the symptoms occur?                                         | Kopf / Head
3            | 1    | snoring            | Wähle dein wichtigstes Symptom / Select your most important symptom                                   | Schnarchen / Snoring
4            | 2    | cold ∨ runny_nose  | Leidet die Person unter Schnupfen oder laufender Nase? / Does the person have a cold or runny nose?   | Ja / Yes
5            | 2    | ¬reddened_skin     | Ist die Haut (stellenweise) gerötet? / Is the skin reddened (in places)?                              | Nein / No
6            | 2    | × (none)           | Hattest du schon einmal eine Allergie? / Have you ever had an allergy?                                | Überspringen / Skip

3 Conclusion & Future Work
In summary, we propose a methodology for constructing & traversing trees from medical questionnaires for the extraction of logical formulae. We describe how to leverage this to construct a medical knowledge base, which can be used for reasoning and enables future work, such as testing LLMs. Future work on decision trees extracted from medical questionnaires will include dealing with multiple paths to the same diagnosis, the intersection of structured tree paths, and redundant trees, as well as transforming the large trees into different structures that allow for more efficient computation of certain properties. These include ordered binary decision diagrams [4] and deterministic decomposable negation normal form (d-DNNF) circuits [8], offering the possibility of model counting (asking which diagnoses are possible for any subset of symptoms) and of reasoning about the biases in the knowledge base by analyzing the decisions made, giving us the complete reason behind diagnoses, from which we can compute the sufficient reason (the reason why a diagnosis was chosen) and the necessary reason (why any other diagnosis was not chosen) [9, 13, 2]. With these analyses, we hope to gain further insights into the knowledge base of SCQ and find new and interesting ways of using its logically enriched form. Ultimately, we hope to enable new testing strategies for AI-based systems in medicine, particularly LLMs.
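As a naive illustration of the model-counting use case mentioned above (which diagnoses remain possible for a partial set of symptoms), one can enumerate all completions of the unassigned variables of the running example by brute force. The encoding below is our toy assumption; OBDDs [4] and d-DNNF circuits [8] are precisely the structures that answer such queries without this exponential enumeration.

from itertools import product

VARS = ["A", "B", "C", "D"]    # nausea, stomach ache, fever, diarrhea
DELTA = {                      # DNF of Example 1 as lists of literal dicts
    "delta_1": [{"A": 1, "B": 1, "C": 1}, {"A": 1, "B": 1, "D": 1}],
    "delta_2": [{"A": 1, "B": 1, "C": 0, "D": 0}],
}

def possible_diagnoses(partial):
    """For a partial assignment, count completions satisfying each diagnosis."""
    counts = {d: 0 for d in DELTA}
    free = [v for v in VARS if v not in partial]
    for bits in product([0, 1], repeat=len(free)):
        world = {**partial, **dict(zip(free, bits))}
        for d, conjuncts in DELTA.items():
            if any(all(world[v] == val for v, val in c.items()) for c in conjuncts):
                counts[d] += 1
    return counts

# Knowing only 'nausea and stomach ache', both diagnoses remain possible:
print(possible_diagnoses({"A": 1, "B": 1}))  # {'delta_1': 3, 'delta_2': 1}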
Acknowledgements
The work presented in this paper was partially funded by the European Union under Grant 101159214 – ChatMED. Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

References
[1] AMBOSS GmbH. 2025. Amboss. www.amboss.com. Accessed: 2025-09-03. (2025).
[2] Gilles Audemard, Steve Bellart, Louenas Bounia, Frédéric Koriche, Jean-Marie Lagniez, and Pierre Marquis. 2021. On the explanatory power of decision trees. arXiv preprint arXiv:2108.05266.
[3] Ahmad Taher Azar and Shereen M El-Metwally. 2013. Decision tree classifiers for automated medical diagnosis. Neural Computing and Applications, 23, 7, 2387–2403.
[4] Randal E Bryant. 1992. Symbolic boolean manipulation with ordered binary-decision diagrams. ACM Computing Surveys (CSUR), 24, 3, 293–318.
[5] Chin-Liang Chang and Richard Char-Tung Lee. 1973. Symbolic Logic and Mechanical Theorem Proving. Academic Press.
[6] Zeming Chen et al. 2023. Meditron-70b: scaling medical pretraining for large language models. (2023). arXiv: 2311.16079.
[7] Dillon Chrimes. 2023. Using decision trees as an expert system for clinical decision support for covid-19. Interact J Med Res, 12, (Jan. 2023), e42540. doi: 10.2196/42540.
[8] Adnan Darwiche. 2001. Decomposable negation normal form. Journal of the ACM (JACM), 48, 4, 608–647.
[9] Adnan Darwiche and Auguste Hirth. 2023. On the (complete) reasons behind decisions. Journal of Logic, Language and Information, 32, 1, 63–88.
[10] Reinhard Diestel. 2025. Graph Theory. Vol. 173. Springer Nature.
[11] Martin Gebser, Roland Kaminski, Benjamin Kaufmann, and Torsten Schaub. 2018. Multi-shot ASP solving with clingo. (Mar. 2018). arXiv: 1705.09811 [cs]. doi: 10.48550/arXiv.1705.09811.
[12] Michael Gelfond and Vladimir Lifschitz. 1988. The stable model semantics for logic programming. In Proceedings International Logic Programming Conference and Symposium. MIT Press, Cambridge, MA, USA, 1070–1080.
[13] Chunxi Ji and Adnan Darwiche. 2023. A new class of explanations. In Logics in Artificial Intelligence: 18th European Conference, JELIA 2023, Dresden, Germany, September 20–22, 2023, Proceedings. Vol. 14281. Springer Nature, 106.
[14] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2020. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. arXiv preprint arXiv:2009.13081.
[15] W. McCune. 2005–2010. Prover9 and Mace4. (2005–2010).
[16] Bilal A Naved and Yuan Luo. 2024. Contrasting rule and machine learning based digital self triage systems in the usa. NPJ Digital Medicine, 7, 1, 381.
[17] Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. 2023. Capabilities of gpt-4 on medical challenge problems. (2023). arXiv: 2303.13375 [cs.CL]. https://arxiv.org/abs/2303.13375.
[18] OpenAI. 2023. ChatGPT. (2023). chat.openai.com/chat.
[19] Alexander Perko, Iulia Nica, and Franz Wotawa. 2024. Using Combinatorial Testing for Prompt Engineering of Large Language Models in Medicine. In Proceedings of the 27th International Multiconference Information Society – IS 2024. Ljubljana, Slovenia. doi: 10.70314/is.2024.chtm.10.
[20] Alexander Perko and Franz Wotawa. 2024. Testing ChatGPT's Performance on Medical Diagnostic Tasks. In Proceedings of the 27th International Multiconference Information Society – IS 2024. Ljubljana, Slovenia. doi: 10.70314/is.2024.chtm.7.
[21] Jens Richter, Hans-Richard Demel, Florian Tiefenböck, Luise Heine, and Martina Feichter. 2025. Symptom-checker. www.netdoktor.at/symptom-checker/. Accessed: 2025-09-03. (2025).
[22] Alexey Romanov and Chaitanya Shivade. 2018. Lessons from natural language inference in the clinical domain. arXiv:1808.06752 [cs], (Aug. 21, 2018).
[23] Khaled Saab et al. 2024. Capabilities of gemini models in medicine. (2024). arXiv: 2404.18416 [cs.AI]. https://arxiv.org/abs/2404.18416.

Indeks avtorjev / Author index

Gams Matjaž .......... 7, 16, 21
Horvat Tadej .......... 16
Ivanišević Filip .......... 7
Janković Sonja .......... 12
Jaš Jakob .......... 16, 21
Karasmanakis Ivana .......... 7
Lukić Stevo .......... 12
Mujić Emir .......... 25
Perko Alexander .......... 25
Roštan Žan .......... 16
Smodiš Rok .......... 7
Svetozarević Isidora .......... 12
Svetozarević Mihailo .......... 12
Wotawa Franz .......... 25

Uporaba UI v zdravstvu / AI in Healthcare
Urednika / Editors: Matjaž Gams, Žiga Kolar