Sodobna pedagogika/Journal of Contemporary Educational Studies

Matej Urbančič

Artificial intelligence in education: comparing the responses of different large language models

Abstract: Currently, there are several advanced large language models (LLMs) freely available for testing, and their use is steadily increasing. This paper aims to compare the results produced by selected models for use in an educational setting. A qualitative research design was employed to identify the structure of the outputs and to analyse the information and key ideas related to questions about the purpose of education. The findings raise concerns about the reliability and relevance of the results, as they are neither equally informative nor consistent across different LLMs, with variations occurring even when the same questions are repeatedly tested. At present, there is no consensus on the optimal approach to integrating AI into education, nor on the potential impact of AI on learning, teaching, work and society. While it appears that the risks associated with AI can be managed, training in the use of LLMs is crucial at present, as these models will significantly impact various educational domains.

Key words: education, large language models, digitalisation, teaching and learning, purpose of education.

UDC: 37.091.64

Scientific article

Matej Urbančič, PhD, docent, Institute of the Republic of Slovenia for Vocational Education and Training, Kajuhova 32U, SI-1000 Ljubljana, Slovenia; e-mail: matej.urbancic@cpi.si

Let./Vol. 75 (141) Issue 4/2024 pp. 156–176 ISSN 0038 0474

Introduction

In 2020, OpenAI (OpenAI 2022, Roumeliotis and Tselikas 2023) introduced a large language model (LLM) capable of generating comprehensible and fluent text after analysing billions of pages of articles, reports, books, websites and other materials the developers deemed useful for testing.
The texts produced were so sophisticated that many individuals had difficulty grasping their origin. This newly developed system was the basis for ChatGPT, a tool designed specifically for conversational tasks. This development demonstrated that a machine could perform natural language processing, generating human-like text with context and coherence. However, this is not the ultimate goal of machine learning. The theory of machine learning (Jordan and Mitchell 2015, Bellomarini et al. 2018) seeks to determine how computers can improve their performance and capabilities through experience. This development relies on sophisticated algorithms and the growing availability of online data.

At present, there are over 37,000 variants of language models (LLM Explorer 2024), each built with specific features. From the user’s perspective, the most important characteristic for text production is the context length¹. Context length significantly affects the performance, accuracy and relevance of the output. Combined with the difficulty of grasping the origin of the text, this creates a strong impression of usability, easily enhancing personal productivity (Ju and Stewart 2024). A longer context length generally enables higher-quality output, while shorter context lengths result in faster performance.

¹ In theory, the prompt (command line input) opens the context, a memory space within the machine that holds data. Once the prompt has been sent to the model, the model analyses the input and predicts the most appropriate next words as a response. This response, known as a completion, is generated by considering the context of all prior prompts and completions held in memory. Completion is therefore the sum of entered prompts and generated responses (A comparison of LLMs 2024).

With longer context lengths, users can gradually build coherent outputs, much like in a conversation².
Artificial intelligence, therefore, learns to respond to users by considering the sequence of questions posed and the relevance of the answers, as assessed by the users. Larger models also enable the scanning of emails, documents, multimedia and personal preferences to enhance efficiency, aligning responses more closely with users’ expectations. As such, conversational technology can be understood as neither neutral nor objective. It can be shaped by specific personal preferences, purposes, values and interests, which influence its development and use (Zheng 2024). Users who cannot grasp the process may base their conversations on unfounded assumptions or rely too heavily on their own reasoning.

Large language models

An LLM must be capable of interpreting human language and forming speech patterns that make computer-generated text appear more genuinely human (Wei et al. 2021, Manning 2022). In addition to statistical models that calculate the probability of word order, neural models use neural networks to perform complex tasks like natural language processing (Khurana 2023), enabling machines to actually comprehend natural human speech. Computers, supported by linguistics, not only derive meaning but also capture context, mood and intent, as humans do.

Considering that language models are used to generate meaningful and appropriate language for the user, learning models could be particularly problematic as virtual assistants (Alsafari et al. 2024, Zhang et al. 2024), because users expect accuracy, relevance and credibility in the outputs. To meet that demand, outputs should be consistent across different assistants or, at least, provide sufficient information for drawing similar conclusions. While this may seem straightforward in mathematics, it is often not (Salpute 2024). It can be significantly more complicated in fields outside science and technology, such as history, politics and education.
Furthermore, written texts produced by LLMs constitute an immense repository of data gathered from a multitude of sources, including objective, evidence-based, ethical articles as well as general texts from written media and the web. These texts may include emotionally charged responses that arise in intense public discussions. Humans might be overwhelmed by the unexpected twists and turns in public discourse, while AI does not share the same perception of unpredictability and bias (Hovy and Prabhumoye 2021).

² The context length refers to the maximum number of elements (tokens) a language model can consider simultaneously while processing a prompt or generating a response. These elements can be words or parts of words, depending on the language model’s tokenization method. A larger context allows the model to maintain coherence over longer passages of text, leading to more contextually relevant and accurate responses. With a longer context, the model can understand and respond to more complex questions or tasks that require knowledge of previous interactions or detailed prompts. As of now, models like GPT-4 have context windows of up to 128K tokens, with some versions capable of handling 1M tokens (LLM Leaderboard 2024, Groq 2024).

The output of LLMs should be moderated³ to ensure it is informative and constructive, as users actively engage with AI (Brandtzaeg et al. 2022). Large-scale language technologies are increasingly used in various forms of communication with humans across different contexts as conversational agents (Kasirzadeh and Gabriel 2023).

While AI systems may appear seamless in conversations, they still exhibit significant flaws. In a study on emergency remote teaching (Tülübaş et al. 2023), researchers conducted an AI-supported research process to evaluate its potential to generate accurate, clear, concise and unbiased information, all essential elements of rigorous scientific work.
The study concluded that while ChatGPT has value, its use should not go unchecked. Similar conclusions have been drawn in various disciplines (Scaringi and Loche 2023). Iskander (2023) used ChatGPT as an interviewee to analyse the impact of AI on higher education and academic publishing. In the interview, the model acknowledged the risk of diminishing critical thinking, particularly when it is over-relied upon as the sole source. It also cannot be a substitute for human creativity and intellect, because it lacks originality in generated outputs. Analysis has shown (Uludag 2023) that there is a need to develop methods to test the creativity of language models and assess their potential for generating novel and valuable responses, as they currently rely heavily on pre-existing content.

Questions of bias also emerge frequently (Liu 2022, Ferrara 2023, Hajikhani and Cole 2024). Do the responses maintain the same standards of adequacy and relevance across different topics, cultures, values and levels of complexity? Or are these standards adjusted to the specific topic for which the agent is being fine-tuned (Parthasarathy et al. 2024)? It is important to recognise that outputs may rely on outdated sources and may not be open to exploring various possibilities, especially when dealing with complex issues, such as those found in human-like debates.

As LLM systems are already active assistants that help personalise and evaluate students’ work (Chen et al. 2024), educators require structured training and guidance to understand how these systems are used and to assess the usability, accuracy and relevance of the provided models. It is vital to understand the capabilities of this technology and to continually refine one’s ability to use it effectively (Jeon and Lee 2023, Albadarin et al. 2024, Liao et al. 2024). There are many LLMs, not all of which are equally accessible.
Free models may have limitations or inherent biases, but their features and capabilities are continually evolving and improving (Schur and Groenjes 2023).

³ Level of moderation refers to the extent to which a language model is regulated to filter out inappropriate, harmful, or sensitive content. This can include profanity, hate speech, misinformation and any content that violates community guidelines. Moderation helps ensure that interactions with the model are safe and appropriate, protecting users from harmful content. Effective moderation also contributes to the overall quality of responses. The level of moderation can vary based on the context of the interaction, with some models allowing for more lenient responses in specific scenarios, such as in educational settings, while maintaining stricter guidelines in public or general contexts.

Purpose of the Research

The purpose of this research was to explore and compare the outputs of several freely available LLMs by posing basic questions about education, with a focus on the usefulness of their responses for students’ informed conclusions (Kumar et al. 2023, Agarwal et al. 2024). The chosen underlying model is ChatGPT 4o, which is often the first choice in academic settings (Meyer et al. 2023). It is frequently recognised as the LLM that provides the most comprehensive responses to questions.

Initially, we examined the AI outputs on the importance of education by asking three similar questions: (1) ‘Why educate?’, (2) ‘What is the purpose of education?’ and (3) ‘What is education for?’ This inquiry aimed to capture the nuances in responses and to reveal various viewpoints of prioritisation. Based on the AI’s responses, follow-up questions were posed to further refine and clarify the answers.
The next three questions addressed critical considerations regarding the direction of education. They asked (4) whether education should prioritise developing specialised expertise or fostering a well-rounded general knowledge base, (5) what the single most promising factor shaping the future of education is and (6) what the single most harmful factor threatening education is. Together, these questions elicit AI outputs on the strategic priorities of educational systems, the challenges they face and the opportunities that can be leveraged to foster a more effective and equitable learning environment.

A question regarding the sources of the answers and whether they could be linked to a specific author or book was raised; however, the responses provided no useful information, merely suggesting methods for finding texts that represent foundational works in educational philosophy. Consequently, this question was excluded from the analysis.

Research questions

– How do the responses of different LLMs vary when asked fundamental questions about the purpose and importance of education?
– What differing perspectives do LLMs offer regarding the direction of education?
– What factors do LLMs identify as the most promising for shaping the future of education or the most harmful to the educational landscape, and how do these factors differ across models?

Description of the analysed LLMs

Currently, several advanced LLMs (LLM Explorer 2024, LLM Leaderboard 2024) have emerged as prominent tools in the field of artificial intelligence. Some models, such as GPT-4o by OpenAI and Gemini by Google, support multimodal capabilities, allowing them to process both text and images. Others, like Claude by Anthropic, Mistral by Mistral AI and Ernie 4.0 by Baidu, rely on enhanced conversational abilities and reasoning, while models like Llama and OPT by Meta focus on research capabilities. This may, however, change in future versions.
Additionally, other models, though less widely known, are readily available without requiring special software or registration. Examples include GPT-4o mini, Claude 3 Haiku, Llama 3.1 70B and Mixtral 8x7B, all of which are accessible through platforms like DuckDuckGo’s Duck.ai. These LLMs serve general-purpose applications with varying levels of built-in moderation:

– ChatGPT 4o, GPT-4o mini, Claude 3 Haiku and Gemini 1.5 Flash feature high levels of built-in moderation, employing robust filtering systems to maintain safe and appropriate interactions.
– Llama 3.1 70B offers a medium level of moderation, balancing flexibility and usability.
– Mixtral 8x7B has low built-in moderation, allowing greater freedom in interactions, making it suitable for contexts with less restrictive content management.
– Copilot free and Copilot Office 365 operate with medium levels of moderation, managing various types of data and documents within a single cloud space.

These models offer a range of moderation capabilities, enabling selection of a suitable supervision level based on particular needs and application contexts.

Methodology

Research Design

The research design for this study is qualitative, focusing on the comparison of textual outputs from a selected set of LLMs. Additionally, contextual and content analysis, along with the length of the outputs, are used to compare the extent of the answers and their consistency, particularly for similar questions.

Data Collection and Data Analysis

Obtaining the output was straightforward in the analysed LLMs, as their user interfaces are organised into conversation-like segments resembling chat interactions. Prompts were sent sequentially to each LLM to enable a structured analysis and provide context. After the third question, the process was restarted with the second question: (2b) ‘What is the purpose of education?’ Outputs were copied into tables and analysed as interview answers.
The outputs produced by the models were categorised and examined according to several criteria, including output structure, length, relevance and context. This comparative analysis sought to highlight the different patterns in how various LLMs form answers to the same questions.

Basic Limitations

When analysing LLMs, various ethical considerations and limitations must be taken into account. Models are trained on vast datasets that may contain inherent biases, which can affect their outputs. The sources of these outputs cannot always be confirmed and may reflect inherent biases. Additionally, the output can vary significantly based on context and conversation history. Responses to the same questions may differ in style and coherence, potentially leading to nonsensical answers.

Since updates to LLMs are not always publicly documented, analyses reflecting a specific version may quickly become outdated. The interaction between users and LLMs can also influence outputs, making it challenging to isolate model behaviour from user input and context.

Results

Word count and depth of the output

The first notable difference among the collected answers was the variation in the lengths of the outputs. Table 1 displays the character counts for outputs from different LLMs in response to questions about education. It reveals significant differences among the evaluated set of LLMs in their responses to the queries.

| Question | ChatGPT 4o | GPT-4o mini | Claude 3 Haiku | Llama 3.1 70B | Mixtral 8x7B | Copilot free | Copilot Office 365 | Gemini 1.5 flash |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (1) Why educate? | 2426 | 1426 | 1510 | 2025 | 1000 | 485 | 650 | 1214 |
| (2a) What is the purpose of education? | 2284 | 1331 | 1390 | 1415 | 1906 | 350 | 889 | 1729 |
| (3) What is education for? | 2212 | 1456 | 1379 | 1415 | 583 | 380 | 866 | 1661 |
| (2b) What is the purpose of education? | 3341 | 1248 | 1390 | 1681 | 2277 | 349 | 1298 | 1532 |
| (4) Should the focus of education be on developing specialised expertise or a well-rounded general knowledge? | 5526 | 2513 | 2210 | 2232 | 2443 | 548 | 1591 | 2606 |
| (5) What is the single most promising factor for the future of education? | 4029 | 1734 | 2445 | 1508 | 2658 | 380 | 883 | 1558 |
| (6) What is the single most harmful factor threatening education? | 3801 | 2052 | 2200 | 1765 | 2718 | 348 | 1030 | 1403 |

Table 1: Character counts (with spaces) for outputs from different LLMs. Bold values indicate the highest character counts, while italicised values represent the lowest counts.

ChatGPT 4o consistently leads in both character count and output depth. Longer answers can provide more detailed insights and greater context, but they also occupy more conversation space (Liu et al. 2024). The variability among the other models suggests that the choice of LLM may heavily depend on whether one requires depth or brevity.

In general, the output is structured into three sections: the Introduction as a brief explanation, Key sections with bullet points setting out the ideas of the question, and the Conclusion. This structure organises the information, making it easier to understand. All but Copilot Free follow this basic structure. The introduction sets the context and outlines expectations for the subsequent text or serves as a brief opening to begin the list. Each bullet point in the list represents a distinct idea or argument. The conclusion summarises the main points discussed. Copilot Office 365 also provides examples of additional questions to help users refine their research ideas. However, this feature is not consistently used across all types of questions.

In some cases, key sections are structured with ordered lists, which also indicate priority⁴. Interestingly, ChatGPT 4o, Gemini and both Copilots begin their introductions differently each time, using varied words and synonyms. In contrast, other models tend to use a consistent format, such as ‘The question of whether …’ and other similar ones. This variation may not affect the content but gives a false impression of diversity and, through word choice, a sense of elevated relevance, linguistic diversity and professionalism. The rich language range, however, enhances the perceived quality of the response (Takase et al. 2024). The wording of the introduction is notable, as some cases include a summary introducing the text, while others simply begin the list. Characteristics are presented in Table 2.

⁴ It is important to understand that questions may follow different structures. The structure of questions can vary based on the complexity of the topic or information requested, user preferences, context and purpose.

| Element | ChatGPT 4o | GPT-4o mini | Claude 3 Haiku | Llama 3.1 70B | Mixtral 8x7B | Copilot free | Copilot Office 365 | Gemini 1.5 flash |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Includes introduction? | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Introduction length (characters) | 217 | 80 | 96 | 136 | 126 | 485 | 144 | 121 |
| Introduction outlines the content of a list? | Yes | No | No | Yes | No | Yes | No | No |
| Has ordered or unordered lists? | Ordered | List | List | List | Ordered | No list | List | Ordered |
| (1) Why educate? (bullets) | 8 | 7 | 7 | 8 | 5 | 0 | 5 | 4 |
| (2a) What is the purpose of education? (bullets) | 6 | 7 | 6 | 6 | 9 | 0 | 6 | 5 |
| (2b) What is the purpose of education? (bullets) | 9 | 7 | 6 | 7 | 9 | 0 | 6 | 5 |
| (3) What is education for? (bullets) | 6 | 7 | 6 | 6 | 0 | 0 | 6 | 7 |
| Has a conclusion? | Yes | Yes | Yes | Yes | Yes | No | No | Yes |
| Conclusion length (characters) | 302 | 126 | 245 | 160 | 152 | 0 | 0 | 222 |
| Outputs additional info? | No | No | No | No | No | No | Questions | No |

Table 2: Characteristics for outputs from different LLMs on question 2b, What is the purpose of education?
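The characteristics reported in Tables 1 and 2 (character count with spaces, list type and number of bullets) can be derived mechanically from a copied answer. The following is a minimal Python sketch of such a tally, not the procedure used in the study; the function name `describe_output`, the list-marker patterns and the sample answer are assumptions introduced for illustration only.

```python
import re

# Hypothetical list-marker patterns; real chat interfaces may format lists differently.
ORDERED_ITEM = re.compile(r"^\d+[.)]\s+")
UNORDERED_ITEM = re.compile(r"^[-*•–]\s+")


def describe_output(text: str) -> dict:
    """Summarise one copied answer: length, list type and bullet count."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    ordered = [ln for ln in lines if ORDERED_ITEM.match(ln)]
    unordered = [ln for ln in lines if UNORDERED_ITEM.match(ln)]
    if ordered:
        list_type = "ordered"
    elif unordered:
        list_type = "unordered"
    else:
        list_type = "none"
    return {
        "characters": len(text),  # spaces and line breaks are counted
        "list_type": list_type,
        "bullets": len(ordered) + len(unordered),
    }


# Invented sample answer, used only to exercise the function.
sample = (
    "Education serves many purposes.\n"
    "1. Knowledge\n"
    "2. Socialisation\n"
    "In short, it matters."
)
summary = describe_output(sample)
print(summary)
```

Counting in this way keeps the comparison reproducible, but it only captures form; judgements such as whether an introduction outlines the content of a list still require reading the text.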
Aims of education

The first three questions (1, 2a, 3) were used to explore the importance of education, capture nuances in the responses and reveal specific viewpoints. Despite their differences, the outlined main ideas define some fundamental characteristics of education.

The purpose of education revolves around fundamental categories, particularly the goals that should guide the process. Various aims have been proposed, including the acquisition of knowledge and skills, personal development and the cultivation of character traits. These traits promote qualities such as curiosity, creativity, rationality, critical thinking and moral tendencies to think, feel and act ethically (Table 3). This framework aligns with ideas presented in Wikipedia’s ‘Aims and ideologies’, 2024.

| | ChatGPT 4o | GPT-4o mini | Claude 3 Haiku | Llama 3.1 70B | Mixtral 8x7B | Copilot Free | Copilot Office 365 | Gemini 1.5 Flash |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Number of key topics | 9 | 7 | 6 | 7 | 9 | 1 | 6 | 5 |
| 1 | Personal Development | Knowledge Acquisition | Knowledge acquisition | Personal growth and development | Acquisition of knowledge | Critical thinking | Knowledge Acquisition | Knowledge and Skill Acquisition |
| 2 | Skill and Knowledge Acquisition | Personal Development | Critical thinking development | Preparation for career and workforce | Intellectual development | Social awareness | Personal Development | Personal Development |
| 3 | Civic Responsibility and Social Awareness | Socialisation | Personal growth and fulfilment | Socialisation and community building | Personal growth | | Socialisation | Socialisation and Citizenship |
| 4 | Economic Empowerment and Workforce Preparation | Economic Opportunity | Societal advancement | Critical thinking and problem solving | Career preparation | | Economic Benefits | Economic Development |
| 5 | Promoting Equity and Reducing Inequality | Civic Engagement | Preparation for the workforce | Preservation and transmission of knowledge | Lifelong learning | | Civic Engagement | Cultural Preservation and Enrichment |
| 6 | Cultural Transmission and Preservation | Cultural Transmission | Fostering lifelong learning | Empowerment and social mobility | Social mobility | | Innovation and Progress | |
| 7 | Innovation and Progress | Innovation and Progress | | Personal fulfilment and enjoyment | Civic engagement | | | |
| 8 | Emotional and Social Development | | | | Cultural preservation and diversity | | | |
| 9 | Inspiring Purpose and Meaning | | | | Health and well-being | | | |

Table 3: Key sections reflecting basic educational aims in response to the question: What is the purpose of education? Categories: ■ Personal, ■ Knowledge, ■ Societal, ■ Economic and ■ Cultural

Scholars differ on whether education should prioritise personal development, questioning authority and dispelling false beliefs and illusory ideas, or the cultivation of individuals into productive members of society (Dewey 1922, Randall 1997, Biesta 2015, Selwyn 2019). Biesta argues that education serves three functions: qualification, which encompasses the knowledge and skills needed for activity in social spheres; socialisation, which defines the culture and traditions that identify an individual as a member of a society; and subjectification, which empowers individuals to think and act independently. However, when compared to the outputs produced by the LLMs, this division appears too complex and intertwines aspects that LLMs treat as separate. Since LLMs do not cite sources for their outputs, Robinson’s widely recognised model of educational aims (Robinson 2022) has been employed as a background framework. Buzzwords like the 8Cs (curiosity, creativity, criticism, communication, collaboration, compassion, composure, citizenship) seem to align more closely with the outputs of LLMs than with Biesta’s proposal.
Within Robinson’s model, a new category, Knowledge, has been identified separately from the Personal category, as it is central to the content of the outputs. These categories form a comprehensive structure of the outputs and are colour coded in Table 3.

In most LLMs, all five categories appear in response to this question (2b). The exceptions are Copilot Free, which names only critical thinking and social awareness owing to the brevity of its response, and Claude 3 Haiku, Llama 3.1 70B and Copilot Office 365, which do not explicitly mention cultural preservation. Despite these significant differences, the outputs can be considered comprehensive overall.

Among more spiritual concepts, the notion of inspiring purpose and meaning supports the development of a sense of direction, guiding individuals toward meaningful goals that align with their values and interests. Overall, these outputs illustrate the many roles of education in shaping well-rounded individuals. No LLM argued against trusting authority.

Table 4 highlights the ideas that significantly diverge from the common elements identified across LLM outputs.

| Diverged points | (Number of the question) Large language model |
| --- | --- |
| Improving Health and Well-being | (1,2) ChatGPT 4o, (1) Llama 3.1 70B, (1) Gemini 1.5 Flash, (3) Mixtral 8x7B |
| Environmental Awareness and Action | (1) ChatGPT 4o, (1) Mixtral 8x7B, (3) Mixtral 8x7B |
| Equity and Inclusion | (2) GPT-4o mini, (1) Mixtral 8x7B, (3) Claude 3 Haiku, (3) Gemini 1.5 Flash |
| Sustainable Development | (1) ChatGPT 4o, (2) Mixtral 8x7B |
| Digital | (2) ChatGPT 4o |
| Entrepreneurship | / |
| Wrong focus | (1) Mixtral 8x7B: Educating users about privacy and security. |

Table 4: Outputs of LLMs that significantly diverge from the common outputs. The number in brackets indicates the questions: (1) Why educate?, (2) What is the purpose of education?, (3) What is education for?
Particularly noteworthy is Mixtral 8x7B’s misplaced focus on education about privacy and security, presented as an essential part of creating a responsible and trustworthy digital service. This output stands out as significantly different from all other responses across all questions, appearing out of place in its context. Given that no prior questions posed in the environment addressed this topic, the content of the response is intriguing. It can be assumed that such questions are more common in contexts related to digital services, which LLMs are designed to address. As a result, the response is directed toward a specific area rather than the broader concept of education.

Digital literacy, entrepreneurship and sustainable development are key competences and essential skills necessary for personal fulfilment, employability, social inclusion and active citizenship, as supported by European institutions (Collective council EC 2018). Sustainable development is mentioned twice, digital literacy once, and entrepreneurship not at all.

Comparing two outputs of the same question

For the purpose of education (questions 2a and 2b), bullet points form the primary structure of the outputs, creating coherent ideas. However, the responses are inconsistent between the two attempts. When the question was repeated, the response differed significantly, altering the core idea of the answer. The most significant change occurred with ChatGPT 4o, while Claude 3 Haiku and Gemini 1.5 Flash produced a perfect match on both occasions. The comparison is presented in Table 4.
| LLM (bullets in first / second output) | Same argument in a bullet point in the first and the second output | Missing in the second output | Extra in the second output |
| --- | --- | --- | --- |
| ChatGPT 4o (6 / 9 bullets) | Personal Development; Skill and Knowledge Acquisition (these are two separate arguments in the second output) | Character and Citizenship; Economic and Social Mobility; Adaptability and Lifelong Learning | Civic Responsibility and Social Awareness; Economic Empowerment and Workforce Preparation; Promoting Equity and Reducing Inequality; Cultural Transmission and Preservation; Innovation and Progress; Emotional and Social Development; Inspiring Purpose and Meaning |
| GPT-4o mini (7 / 7 bullets) | Knowledge Acquisition; Personal Development; Socialisation; Economic Opportunity; Civic Engagement; Innovation and Progress | Equity and Inclusion | Cultural Transmission |
| Llama 3.1 70B (6 / 7 bullets) | Personal growth and development; Preparation for career and professional life; Socialisation and community building; Critical thinking and problem solving; Empowerment and social mobility | Cultural transmission and preservation | Preservation and transmission of knowledge; Personal fulfilment and enjoyment |
| Mixtral 8x7B (9 / 9 bullets) | Acquisition of Knowledge; Personal Development; Career Preparation; Social Mobility; Lifelong Learning; Cultural Preservation and Transmission; Health and Well-being | Citizenship; Sustainable Development | Intellectual development; Civic engagement |
| Copilot Office 365 (6 / 6 bullets) | Personal Growth; Socialisation; Economic Empowerment; Civic Responsibility; Innovation and Progress | Intellectual Development | Knowledge Acquisition |

Table 4: Differences among the answers. Copilot Free is excluded because it does not have a properly structured response. Claude 3 Haiku and Gemini 1.5 Flash produced a perfect match. Synonyms are considered identical in this comparison.
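The same / missing / extra split used in the comparison above amounts to a set-style comparison of bullet headings. The sketch below is a minimal Python illustration under stated assumptions, not the author’s procedure: the function name `compare_outputs` and the optional synonym map are hypothetical, and the two GPT-4o mini lists are assembled from the headings reported in the tables, with the order of bullets in the first output assumed.

```python
def compare_outputs(first, second, synonyms=None):
    """Split bullet headings into same / missing / extra between two outputs."""
    synonyms = synonyms or {}

    def normalise(heading):
        # Lower-case and collapse whitespace, then map hand-picked synonyms
        # (an assumption of this sketch) onto a single label.
        key = " ".join(heading.lower().split())
        return synonyms.get(key, key)

    first_keys = [normalise(h) for h in first]
    second_keys = [normalise(h) for h in second]
    return {
        "same": [k for k in first_keys if k in second_keys],
        "missing_in_second": [k for k in first_keys if k not in second_keys],
        "extra_in_second": [k for k in second_keys if k not in first_keys],
    }


# Headings for GPT-4o mini as read from the tables; bullet order is assumed.
shared = [
    "Knowledge Acquisition", "Personal Development", "Socialisation",
    "Economic Opportunity", "Civic Engagement", "Innovation and Progress",
]
first_output = shared + ["Equity and Inclusion"]
second_output = shared + ["Cultural Transmission"]

result = compare_outputs(first_output, second_output)
print(result["missing_in_second"], result["extra_in_second"])
```

Deciding which headings count as synonyms (for example, ‘Personal growth’ and ‘Personal Development’) remains a manual, interpretive step; here it would be passed in explicitly through the `synonyms` argument rather than inferred by the code.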
The variation in the general content of points across responses denotes incoherence in the LLM outputs. Based on prior use of the tool, where users often save their conversations (Mayer 2023), it appears that the generated list is not fully representative. It may exclude elements that are fundamental to understanding the problem, regardless of the underlying source. Although users can extract individual keywords from the context based on the descriptions, doing so requires critical reading. The differences may reflect a momentary fluctuation in the tool’s processing rather than a deliberate approach. Therefore, answers to important questions cannot be relied upon without broader knowledge of the topic.

Developing specialised expertise or a well-rounded general knowledge is a simple question

The responses to question (4), ‘Should education prioritise developing specialised expertise or foster a well-rounded general knowledge?’, are highly consistent across models. Although the models highlight different points characteristic of specialised or generalised knowledge, all but the Mixtral 8x7B model propose an ideal balanced approach. The emphasis lies in early general education, followed by later specialisation and cross-disciplinary learning in specialist fields. This approach incorporates general skills within specialised knowledge. Striking a balance between the two is crucial, as this approach provides students with a comprehensive education. It fosters flexibility and adaptability, allowing them to explore different disciplines.

Mixtral 8x7B takes a slightly different approach by emphasising individual circumstances and goals. Educational systems should offer a range of options to ensure that students are well-prepared for their chosen paths, starting with a solid general foundation before allowing for specialisation.
The final decision should be based on a careful consideration of individual goals, career aspirations and societal needs. In all cases, the LLMs provided arguments for and against focusing exclusively on one approach. As expected, a contra option was included even though the question did not explicitly seek a comparison.

Most promising and most harmful factors

‘What is the single most promising or most harmful factor?’ (questions 5 and 6) also produces diverse outputs.

| LLM | Single most promising factor | Short overview | Single most harmful factor | Short overview |
| --- | --- | --- | --- | --- |
| ChatGPT 4o | Personalised learning | Personalised learning holds immense potential to transform education by focusing on each student’s unique journey. | Inequality in access to education and resources | Whether due to socioeconomic status, geographic location, or biases, educational inequality undermines the promise of education as a pathway to opportunity. |
| GPT-4o mini | Personalised learning | Personalised learning holds great promise for transforming education by making it more responsive to the unique needs of students. | Standardised testing | Standardised testing can lead to a rigid, one-size-fits-all approach that fails to address the diverse needs of students and stifles innovation in teaching and learning. |
| Claude 3 Haiku | Fostering of critical thinking and problem-solving skills | With the development of critical thinking and problem-solving skills, education can prepare students to thrive in an increasingly complex, rapidly evolving world. | Suppression of objective knowledge and critical thinking | When education becomes overly ideological, politicised, or beholden to narrow interests, it risks undermining the fundamental purpose of learning. |
| Llama 3.1 70B | Critical thinking | Critical thinking is the ability to analyse information, evaluate evidence and form informed opinions. | Dogmatic thinking | By recognising and addressing dogmatic thinking, educators can promote a more open-minded, critical and creative approach to learning. |
| Mixtral 8x7B | The use of technology to support teaching and learning | By leveraging technology in a thoughtful and strategic way, we can help to create more personalised, engaging and effective learning experiences for all students. | Overemphasis on standardised testing | An overemphasis on these measures can lead to a narrow and superficial approach to learning, increased stress and anxiety and inequity and bias. |
| Copilot free | Adaptability | It’s about cultivating a mindset that’s curious, flexible and resilient. Focusing on adaptability, education can continuously meet the needs of society and the individual. | Complacency | When education stops evolving and adapting, it risks becoming irrelevant. Outdated methods and content can stifle creativity and critical thinking. |
| Copilot Office 365 | Personalised learning | Personalised learning tailors educational experiences to meet the individual needs, strengths and interests of each student. | Inequity | Addressing educational inequality is crucial for creating a more just and prosperous society. |
| Gemini 1.5 Flash | Personalised learning | It has the potential to revolutionise education by creating more equitable, effective and engaging learning experiences for all students. | Inequity | By working to eliminate inequality in education, we can create a brighter future for all. |

Table 5: The most promising and the most harmful factors threatening education

The predominant responses are Personalised Learning and Critical Thinking. In this context, problem-solving skills and adaptability are also closely tied to personalised approaches to education. Conversely, the most harmful factors identified are Inequality and Inequity.
An intriguing response is the emphasis on the harm caused by Standardised testing, which is linked to the suppression of objective knowledge and critical thinking, potentially resulting in dogmatic thinking (Llama 3.1 70B). The output is presented in Table 5.

Conclusion and implications

The purpose of education is not a simple concept to grasp. Although this comparative study did not aim to find a definitive answer, but rather to compare the content produced by LLMs on this topic, the results indicate that the answer is not easily obtained. Instead, it requires serious study and deeper analysis.

The results indicate that the answers to the questions share a similar focus. It is important to acquire knowledge and skills for personal empowerment, independence and the cultivation of critical thinking. Education supports personal growth and fulfilment, both of which are essential in the contemporary world. Civic engagement supports social cohesion and promotion by fostering a sense of community and encouraging active participation. Problem solving, innovation and progress are vital for economic opportunities, mobility and workforce preparation. Lifelong learning facilitates these goals while fostering cultural and interpersonal understanding and preservation, providing meaning and a sense of belonging to all individuals. LLMs, on the other hand, encompass diverse perspectives, ranging from humanistic individual growth to global technological and economic competition. In a sense, this raises the question of who determines the value and prioritisation of different educational purposes (Tenam-Zemach and Flynn 2011).

The problem lies in the source of the content. AI-generated content has already found its place in Wikipedia (Brooks et al. 2024, Ashkinaze et al. 2024).
This suggests that AI-generated content may eventually replace Wikipedia as the primary source for students seeking a general overview of a topic (Fessakis and Zoumpatianou 2013). According to some scholars (Thomas 2023), this represents a plausible future. The key difference is that Wikipedia cites its sources and employs moderators to oversee them. In contrast, LLMs cannot conclusively reference their sources, as the underlying texts are too numerous and varied.

It is essential to use LLMs with caution. As demonstrated, various LLMs produce significantly different outputs, which must be carefully assessed to understand their proposed content and account for the models’ limitations. Since AI tools cannot be held accountable for the accuracy and integrity of their content (Stokel-Walker 2022, 2023), the primary responsibility in education falls on educators to prepare students to work with these tools. Because the use of LLMs in education is viewed as a transformative technology, it is also important to recognise its contradictory nature. AI can be highly efficient for users who critically evaluate its responses and refine their questions with subsequent prompts. However, if users settle for the first output they receive, they risk obtaining incorrect, inappropriate, or incomplete answers. Iskender (2023) demonstrated that optimising queries enhances a model’s ability to generate clearer and more concise responses. However, using LLMs to discover genuinely novel solutions remains unreliable. Frequent questions might also result in prioritised answers, which could be further influenced by the fine-tuning process. Since the authorship of LLM outputs is unclear and will likely remain so, the iterative process of using generated answers as new inputs undermines the potential for genuine novelty. While a lack of originality might suggest similar answers, this is not always the case.
Differences can occur with each prompt and may vary based on the user’s perspective, which influences how the tool is used.

The analysis shows that all responses reflect a commonly shared understanding of the purpose of education. However, individual nuances, such as emotional development, joyfulness and seeking purpose, add unique perspectives shaped by the underlying structure of the LLMs. These nuances may also align with user preferences. LLM outputs also place a strong emphasis on economic empowerment, particularly workforce preparation. This highlights how the promotion of education’s purpose evolves with advancements in technology.

This tool is now a reality and will inevitably be used; therefore, it is crucial to ensure its proper and effective use. Without strategies to integrate teachers’ oversight into learning activities involving AI tools (Albadarin et al. 2024), education risks significant shortcomings and unfulfilled expectations. The potential of AI is also evident in its ability to adapt and refine algorithms. This means its usefulness depends on the user’s knowledge and their ability to design, refine and enhance the algorithms used to query sources.

AI will continue to play an increasingly significant role in teaching and learning. It will become ever more sophisticated and produce more accurate information and faster responses to prompts (Toczauer 2024). Advancements in AI will eventually enable common sense reasoning in computers (Chowdhary and Chowdhary 2020). However, this progress will not occur without deliberate effort. Evaluation of usage, queries and responses should be treated as an essential discipline assisting the development of LLMs (Chang et al. 2024). Currently, there is no consensus on the extent to which AI should be integrated into education or its potential effects. Researchers suggest (Kasneci et al.
2023) that despite the challenges, the associated risks are manageable and should be addressed to ensure trustworthy and equitable access to LLMs for education and research (Liao et al. 2024). Towards this goal, the mitigation strategies proposed in this commentary could serve as a starting point. LLMs will inevitably affect learning, teaching and work. Efficiency-oriented modernity makes their use virtually irresistible.

References

Agarwal, V., Thureja, N., Garg, M. K., Dharmavaram, S. and Kumar, D. (2024). “Which LLM should I use?”: Evaluating LLMs for tasks performed by Undergraduate Computer Science Students in India. arXiv preprint arXiv:2402.01687.
Alsafari, B., Atwell, E., Walker, A. and Callaghan, M. (2024). Towards effective teaching assistants: From intent-based chatbots to LLM-powered teaching assistants. Natural Language Processing Journal, 8, 100101.
Ashkinaze, J., Guan, R., Kurek, L., Adar, E., Budak, C. and Gilbert, E. (2024). Seeing like an AI: How LLMs apply (and misapply) Wikipedia neutrality norms. arXiv preprint arXiv:2407.04183.
Bellomarini, L., Fayzrakhmanov, R. R., Gottlob, G., Kravchenko, A., Laurenza, E., Nenov, Y. and Wu, L. (2018). Data science with Vadalog: Bridging machine learning and reasoning. In Model and Data Engineering: 8th International Conference, MEDI 2018, Marrakesh, Morocco, October 24-26, 2018, Proceedings 8. Springer International Publishing, pp. 3–21.
Biesta, G. (2015). What is education for? On good education, teacher judgement, and educational professionalism. European Journal of Education, 50, issue 1, pp. 75–87.
Brandtzaeg, P. B., Skjuve, M. and Følstad, A. (2022). My AI friend: How users of a social chatbot understand their human–AI friendship. Human Communication Research, 48, issue 3, pp. 404–429.
Brooks, C., Eggert, S. and Peskoff, D. (2024). The Rise of AI-Generated Content in Wikipedia. arXiv preprint arXiv:2410.08044.
Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K. and Xie, X. (2024).
A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15, issue 3, pp. 1–45.
Chen, J., Liu, Z., Huang, X., Wu, C., Liu, Q., Jiang, G. and Chen, E. (2024). When large language models meet personalization: Perspectives of challenges and opportunities. World Wide Web, 27, issue 4, pp. 42–54.
Chowdhary, K. and Chowdhary, K. R. (2020). Natural language processing. In Fundamentals of artificial intelligence, Springer, New Delhi, pp. 603–649.
Collective Council, E. U. (2018). Recommendation On Key Competences For Lifelong Learning. European Commission, Brussels, Belgium, 18.
Dewey, J. (1910). How we think. D C Heath. https://doi.org/10.1037/10903-000
Education: Aims and ideologies. (2024, November 22). In Wikipedia. Retrieved from: https://en.wikipedia.org/wiki/Education#Aims_and_ideologies (accessed on 15 November 2024).
Ferrara, E. (2023). Should ChatGPT be biased? Challenges and risks of bias in large language models. arXiv preprint arXiv:2304.03738.
Fessakis, G. and Zoumpatianou, M. (2013). Wikipedia uses in learning design: A literature review. Themes in Science and Technology Education, 5, issue 1–2, pp. 97–106.
Groq (2024). The Crucial Role of Context Length in Large Language Models for Business Applications. Retrieved from: https://groq.com/the-crucial-role-of-context-length-in-large-language-models-for-business-applications/ (accessed on 15 March 2024).
Hajikhani, A. and Cole, C. (2024). A critical review of large language models: Sensitivity, bias, and the path toward specialized AI. arXiv preprint arXiv:2307.15425, pp. 1–22.
Hovy, D. and Prabhumoye, S. (2021). Five sources of bias in natural language processing. Language and Linguistics Compass, 15, issue 8, e12432.
Introducing ChatGPT (2022).
Retrieved from: https://openai.com/index/chatgpt/ (accessed on 15 November 2024).
Iskender, A. (2023). Holy or unholy? Interview with open AI’s ChatGPT. European Journal of Tourism Research, 34, pp. 3414–3414.
Jeon, J. and Lee, S. (2023). Large language models in education: A focus on the complementary relationship between human teachers and ChatGPT. Education and Information Technologies, 28, issue 12, pp. 15873–15892.
Jordan, M. I. and Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349, issue 6245, pp. 255–260.
Ju, B. and Stewart, J. B. (2024). Empowering Users with ChatGPT and Similar Large Language Models (LLMs): Everyday Information Needs, Uses, and Gratification. Proceedings of the Association for Information Science and Technology, 61, issue 1, pp. 172–182.
Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F. and Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274.
Khurana, D., Koli, A., Khatter, K. and Singh, S. (2023). Natural language processing: state of the art, current trends and challenges. Multimedia Tools and Applications, 82, issue 3, pp. 3713–3744.
Kumar, H., Musabirov, I., Reza, M., Shi, J., Wang, X., Williams, J. J. and Liut, M. (2023). Impact of guidance and interaction strategies for LLM use on Learner Performance and perception. arXiv preprint arXiv:2310.13712.
Liao, Z., Antoniak, M., Cheong, I., Cheng, E. Y. Y., Lee, A. H., Lo, K. and Zhang, A. X. (2024). LLMs as Research Tools: A Large Scale Survey of Researchers’ Usage and Perceptions. arXiv preprint arXiv:2411.05025.
Liu, R., Jia, C., Wei, J., Xu, G. and Vosoughi, S. (2022). Quantifying and alleviating political bias in language models. Artificial Intelligence, 304, 103654.
Liu, Y., Li, D., Wang, K., Xiong, Z., Shi, F., Wang, J. and Hang, B. (2024). Are LLMs good at structured outputs?
A benchmark for evaluating structured output capabilities in LLMs. Information Processing & Management, 61, issue 5, 103809.
LLM Leaderboard – Comparison of GPT-4o, Llama 3, Mistral, Gemini and over 30 models (November 2024). Retrieved from: https://artificialanalysis.ai/leaderboards/models (accessed on 24 November 2024).
Manning, C. D. (2022). Human language understanding & reasoning. Daedalus, 151, issue 2, pp. 127–138.
Meyer, J. G., Urbanowicz, R. J., Martin, P. C., O’Connor, K., Li, R., Peng, P. C. and Moore, J. H. (2023). ChatGPT and large language models in academia: opportunities and challenges. BioData Mining, 16, issue 1, pp. 20–30.
Parthasarathy, V. B., Zafar, A., Khan, A. and Shahid, A. (2024). The ultimate guide to fine-tuning LLMs from basics to breakthroughs: An exhaustive review of technologies, research, best practices, applied research challenges and opportunities. arXiv preprint arXiv:2408.13296.
Bass, R. V. (1997). The Purpose of Education. The Educational Forum, 61, issue 2, pp. 128–132.
Roumeliotis, K. I. and Tselikas, N. D. (2023). ChatGPT and open-AI models: A preliminary review. Future Internet, 15, issue 6, 192.
Satpute, A., Gießing, N., Greiner-Petter, A., Schubotz, M., Teschke, O., Aizawa, A. and Gipp, B. (2024, July). Can LLMs master math? Investigating Large Language Models On Math Stack Exchange. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2316–2320.
Scaringi, G. and Loche, M. (2023). An interview with ChatGPT: Discussing artificial intelligence in teaching, research, and practice. Computer Science and Engineering. Preprint.
Schur, A. and Groenjes, S. (2023, July). Comparative Analysis for Open-Source Large Language Models. In International Conference on Human-Computer Interaction. Cham: Springer Nature Switzerland, pp. 48–54.
Selwyn, D. (2019). The purpose of education. Counterpoints, 529, pp. 81–100.
Solulab – A Detailed Comparison of Large Language Models (November 2024). Retrieved from: https://www.solulab.com/comparison-of-all-llm/ (accessed on 24 November 2024).
Stokel-Walker, C. (2022, December 9). AI Bot ChatGPT writes Smart Essays-should professors worry? Nature News. Retrieved from: https://www.nature.com/articles/d41586-022-04397-7 (accessed on 24 November 2024).
Stokel-Walker, C. (2023, January 18). CHATGPT listed as author on research papers: Many scientists disapprove. Nature News. Retrieved from: https://www.nature.com/articles/d41586-023-00107-z (accessed on 24 November 2024).
Takase, S., Ri, R., Kiyono, S. and Kato, T. (2024). Large Vocabulary Size Improves Large Language Models. arXiv preprint arXiv:2406.16508.
Thomas, P. A. (2023). Wikipedia and large language models: perfect pairing or perfect storm? Library Hi Tech News, 40, issue 10, pp. 6–8.
Tülübaş, T., Demirkol, M., Ozdemir, T. Y., Polat, H., Karakose, T. and Yirci, R. (2023). An interview with ChatGPT on emergency remote teaching: A comparative analysis based on human–AI collaboration. Educational Process: International Journal, 12, issue 2, pp. 93–110.
Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B. and Le, Q. V. (2021). Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
Zhang, Z., Zhang-Li, D., Yu, J., Gong, L., Zhou, J., Liu, Z. and Li, J. (2024). Simulating classroom education with llm-empowered agents. arXiv preprint arXiv:2406.19226.
Zheng, W., Yang, A., Lin, N. and Zhou, D. (2024, August). From Bias to Fairness: The Role of Domain-Specific Knowledge and Efficient Fine-Tuning. In International Conference on Intelligent Computing. Singapore: Springer Nature Singapore, pp. 354–365.
Matej URBANČIČ (Institute of the Republic of Slovenia for Vocational Education and Training, Slovenia)

ARTIFICIAL INTELLIGENCE IN EDUCATION: COMPARING THE RESPONSES OF DIFFERENT LARGE LANGUAGE MODELS

Abstract: This paper presents a comparison of the responses returned by various freely available large language models (LLMs). A qualitative research design was used to identify the structure of the responses and to analyse the information and key ideas related to the proposed questions about the purpose of education. The findings raise concerns about the reliability and relevance of the results, as they are neither equally informative nor consistent across different LLMs, with differences occurring even when the same questions are tested repeatedly. At present, there is no consensus on the optimal approach to integrating artificial intelligence into education, nor on the potential impact of artificial intelligence on learning, teaching, work and society. Although the risks associated with AI appear to be manageable, training in the use of these models is currently crucial, as they will significantly affect many areas of education.

Key words: education, large language models, digitalisation, teaching and learning, purpose of education

E-mail: matej.urbancic@cpi.si