An Evaluation Framework for LLMs That Considers Social Justice
In this article, I propose an evaluation framework for LLMs in the tertiary sector that also brings awareness to the ethical, social, and environmental impact of training such models.
As a lecturer in HE, I teach several modules on Generative AI (GenAI) to enhance learning and assessment. I teach these subjects to teaching staff, and a question I am frequently confronted with is how to evaluate GenAI tools and compare different LLMs. I use the word 'evaluate' in English intentionally, as the verb derives from Norman French 'évaluer', and from Latin 'valeō, valēre', meaning 'to give value to something'.
Evaluating an LLM means assessing its performance and abilities, but also developing a deeper understanding of its value: carrying out a comprehensive analysis of the LLM's implications and impact, considering aspects beyond technical performance or scores against particular metrics.
Evaluating an LLM comprises both quantitative evaluation, based on how well the LLM performs against standardized frameworks or according to the criteria of a matrix, and qualitative evaluation, which is often undertaken by humans.
What is the difference between evaluation metrics and benchmarking when assessing the performance of these models? I did not have a clear answer myself, so I asked Scholar AI about this. According to the AI chatbot, 'evaluation metrics are specific quantitative measures used to assess the performance of an LLM on a given task,' while benchmarking involves comparing the performance of different LLMs across a range of tasks designed to test various aspects of an LLM's capabilities. Benchmarking provides a baseline for comparison between different models, while metrics focus on the individual performance of a specific LLM.
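To make that distinction concrete, here is a minimal sketch (not from Scholar AI; all model names, tasks, and scores are hypothetical): a metric scores one model on one task, while a benchmark aggregates many tasks to compare models against each other.

```python
# Hypothetical illustration of the metric/benchmark distinction.

def exact_match_accuracy(predictions, references):
    """Evaluation metric: fraction of predictions that match the reference."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# A metric measures the individual performance of one model on one task.
preds = ["Paris", "4", "H2O"]
refs = ["Paris", "5", "H2O"]
print(exact_match_accuracy(preds, refs))  # 2 of 3 answers correct

# A benchmark scores several models on the same suite of tasks,
# giving a baseline for comparison rather than a single number.
benchmark = {
    "model_a": {"qa": 0.67, "summarisation": 0.80, "reasoning": 0.55},
    "model_b": {"qa": 0.72, "summarisation": 0.61, "reasoning": 0.70},
}
for model, tasks in benchmark.items():
    mean = sum(tasks.values()) / len(tasks)
    print(f"{model}: mean benchmark score {mean:.2f}")
```

Real benchmarks (MMLU, HellaSwag, and the like) are far larger, but the shape is the same: per-task metrics rolled up into a cross-model comparison.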
As a responsible user of GenAI chatbots, I verified the information provided by Scholar AI through reliable sources on the Internet. In terms of quantitative evaluation, there is a vast range of benchmarking standards to compare the performance of different LLMs, and there are, likewise, well-established evaluation metrics to measure the performance of a particular LLM.
Yet, the initial question of 'how to evaluate and compare different LLMs' that many lecturers in HE ponder is usually related to quantitative evaluation: how to set up the evaluation criteria and performance tasks to choose a particular GenAI content generator over another tool, and how to advise students and staff on comparing and selecting between different GenAI applications.
In this blog post, I propose an LLM evaluation framework for teaching staff and students who want to compare LLMs in the tertiary sector to enhance learning and teaching. The evaluation framework considers different evaluation categories drawn from the competencies, skills, and knowledge typically encompassed under the umbrella term 'AI literacy.'
The term 'AI literacy,' as applied to GenAI in the context of HE, can be defined as:
“the critical awareness of the potential (understood in the Aristotelian sense of what is latent but has capacity for growth and fulfilment), limitations, and social and ethical challenges that the use of GAI models brings to society.” (Garcia Vallejo, 2024)
Yes, I cite myself to define the concept of AI literacy. Partly because I always wanted to—ego calling—and partly because my definition of AI literacy includes elements of critical awareness and dissent against a vision of AI and LLMs in HE that is deeply rooted in Anglo-Saxon capitalism: the critical awareness of the social and ethical challenges that the use of GenAI brings not only to HE but to society as well.
The evaluation framework I propose in this blog post integrates elements associated with AI literacy from academic literature, such as:
A critical review of the existing digital pedagogies to adapt them to the opportunities and challenges that AI brings to teaching and learning (Bearman and Ajjawi, 2023; Okagbue et al., 2023).
Technical knowledge and understanding of how to use the most common AI multimodal agents, such as ChatGPT, Claude, Copilot, etc. (King's College London, 2023).
Critical debate focused on aspects such as AI ethics, human-centred considerations (egalitarian access, accountability, safety, etc.) (Chai et al., 2020), impact on copyright (Guadamuz, 2023; Narayanan and Kapoor, 2024; Marcus, 2024), and data protection (European Commission, 2021).
Higher-order thinking skills (Ng et al., 2021), including problem formulation and the ability to break down complex problems into smaller, manageable sub-problems (Acar, 2023).
An active awareness of the affordances and limitations of AI technologies that involves both critical thinking and digital skills (King's College London, 2023).
To develop a critical and equanimous awareness of the social and ethical challenges that GenAI and LLMs bring to a tertiary sector largely shaped by capitalist attributes such as speed and productivity, the framework I propose also integrates dissident voices against the current vision of AI in society and education (Morley, 2023; Klein, 2023; Beetham, 2024). That is why one of the categories in the evaluation framework is how well these LLMs score against basic principles of social justice and environmental impact:
Who owns these LLMs? Where do the providers pay taxes?
Who trains these LLMs? Which labour legislation regulates the working conditions of those who train these models, and does it guarantee them fair conditions?
What is the environmental impact of the large computing resources employed in the training of these models?
This is the final evaluation framework, integrating all the previous aspects:
An evaluation framework for GenAI LLMs in the tertiary sector
(LLM evaluation framework © 2024 by Mari Cruz Garcia Vallejo is licensed under CC BY-NC-SA 4.0)
General Information
Provider
LLM used, purpose and potential use
On what devices can it be used?
Social Justice and Ethical Considerations
Where is the provider’s legal entity located? Under what labour legislation does it operate, and where does it pay taxes?
Where are the data servers and computing infrastructures used by the LLM located? What natural resources do these infrastructures consume? What is the impact on local communities?
What human resources have been used to train the LLM? Where are these resources located, and what labour legislation has been observed?
Does the provider adhere to any specific ethical framework, and if so, what principles does this framework comprise?
Accessibility
Is the user interface compatible with screen readers and assistive technologies?
Does the user interface comply with international standards such as the Web Content Accessibility Guidelines (WCAG)?
Does the user interface comply with national legislation regarding software/web page accessibility?
Data Protection and Privacy
Does the LLM provider offer easy-to-find and accurate information on the internet regarding:
Privacy policy?
Terms and conditions of use?
Does the provider offer information on the data protection regulations that its LLM complies with (international or national)?
What type of personal information does the service provider store to access the LLM, and for what purpose? Where is this personal information stored? Is it necessary to provide sensitive personal information to access the LLM?
How does the LLM process the data and information provided by users? Is the data provided by users stored or saved as part of a query? What third parties have access to the data provided by users, and for what purpose?
What protection measures does the provider offer to guarantee the integrity and security of the data?
Price
What type of subscription and payment plans does the provider offer?
Is there a free version for users?
Managing, archiving and exporting results
Is there a maximum number of queries (per day/account)?
Can you access previous queries?
Can you search, catalogue, and archive previous queries?
Can the information be exported/downloaded, and in what formats?
Is there a maximum time limit to access previous queries?
Can the results be shared with other users and/or on the Internet?
User Interface
On a scale of 1 to 10, how easy is it to navigate the application’s interface to provide instructions to the LLM?
What types of formats does the LLM use for receiving instructions: text, audio, scanned images, etc.?
Can files be attached as part of the instructions in the prompt?
Is there a maximum character limit for text instructions provided to the chatbot?
What types of formats (text, images, video) does the LLM allow for the output?
Evaluation of LLM’s Limitations and Veracity
Task: Make a query or ask a question on an academic or scientific discipline that the user knows in depth.
Did you find inconsistencies and/or hallucinations in the results provided by the LLM? What are these inconsistencies like?
How would you rate the veracity of the results obtained from an academic/scientific knowledge perspective that the LLM’s response provides?
Is the LLM able to cite the bibliographic sources it uses?
Is there any cultural or anthropological bias in the answers/results generated by the LLM?
Is the LLM capable of changing language style, semantics and syntax according to the instructions provided?
Are there any limits on the number of characters/words provided in the answer?
On a scale of 1 to 10, how would you rate the originality and creativity of the results obtained? What is the basis of your score?
Effectiveness
Task: Make a query or formulate a complex problem using different prompting techniques (chain-of-thought prompting, meta prompting, few-shot prompting, etc.).
Is the LLM able to identify the errors it has made and correct the result through further prompting?
How many interactions were necessary for the LLM to produce an acceptable result?
Was it necessary to provide additional data or information to the LLM, such as examples, to improve the result?
Can the LLM be customized to adopt a certain perspective or persona by personalizing the following parameters?
Information sources
Personalization of text results (e.g., literary style, semantics, and syntax)
Personalization of multimedia results (e.g., aesthetics, type of filter, or colour palette)
Setting initial configuration for a particular context or scenario
Intellectual Property (Copyright)
Does the provider offer information about copyrighted data and information used to train the language model?
Does the provider offer information on how its LLM complies with current legislation (international and national) on copyright/intellectual property?
Does the LLM allow modifying copyrighted multimedia content as part of the query/task? Does the LLM provide copyrighted elements as part of the result in content generation?
Who retains copyright in the production of multimedia content by the LLM? Does the LLM provider inform whether it retains any copyright on the result generated by the LLM?
Institutional Integration
Are the terms of use and privacy policies of the LLM compatible with institutional policy on the use of new computer systems and software?
Is the use of the LLM for pedagogical purposes compatible with institutional policies such as:
Recommendations and position on the use of AI and GenAI?
Pedagogical model and learning and teaching (L&T) strategies?
Academic regulations and diversity and inclusion policies?
Can the LLM be integrated with other systems and computer platforms (LMS, Microsoft 365, etc.)?
Can the use of the LLM be scaled up within the organisation to staff and students? What costs would be involved in scaling up its use?
What teams/departments in the organization could provide technical and/or pedagogical assistance for the use of the LLM?
In my module, I use this framework in a learning activity in which participants are invited to compare several LLMs. First, I ask participants to decide which LLM is better based on their judgement. After that, I invite participants to evaluate and compare the same LLMs again based on the proposed framework.
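For the comparison step of this activity, participants sometimes find it helpful to record their judgements numerically. The sketch below is one hypothetical way to do that (the category keys abbreviate a subset of the framework's headings, and all weights and scores are invented for illustration): each LLM gets a 1-10 score per category, combined into a weighted total.

```python
# Hypothetical weighted scoring of an LLM against (a subset of) the
# framework's categories; participants would choose their own weights
# and assign their own 1-10 scores after working through the questions.
WEIGHTS = {
    "social_justice_ethics": 0.25,
    "accessibility": 0.15,
    "data_protection": 0.20,
    "veracity": 0.25,
    "effectiveness": 0.15,
}  # weights sum to 1.0

def weighted_score(scores, weights=WEIGHTS):
    """Combine per-category 1-10 scores into one weighted total out of 10."""
    assert set(scores) == set(weights), "score every category exactly once"
    return sum(scores[category] * weights[category] for category in weights)

# Example: scores one participant might assign to a given LLM.
llm_a = {
    "social_justice_ethics": 4,
    "accessibility": 8,
    "data_protection": 6,
    "veracity": 7,
    "effectiveness": 9,
}
print(f"LLM A: {weighted_score(llm_a):.2f} / 10")
```

Making the weights explicit is itself part of the exercise: a participant who weights social justice at 0.25 rather than 0.05 will often rank the same LLMs differently, which is precisely the shift in judgement the activity is designed to surface.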
_________________________________________________
References:
Acar, O.A. (2023) 'AI Prompt Engineering Isn't the Future', Harvard Business Review, 6th June 2023. Available at: https://hbr.org/2023/06/ai-prompt-engineering-isnt-the-future (Accessed: 16th April 2024).
ANECA (2021) 'DOCENTIA Academic Staff Teaching Evaluation Support Programme'. Available at: https://www.aneca.es/documents/20123/78401/DOCENTIA_Procedure.pdf/758d968c-b7ea-6653-d669-2a0f268d3fc1?t=1678105187535 (Accessed: 16th April 2024).
Bearman, M. and Ajjawi, R. (2023) ‘Learning to work with the black box: Pedagogy for a world with artificial intelligence’, British Journal of Educational Technology, 54, 1160-1173. Available at: https://doi.org/10.1111/bjet.13337 (Accessed: 19th March 2024).
Beetham, H. (2024) 'Never mind the quality, feel the speed', Imperfect Offerings, 2nd February 2024. (Accessed: 16th April 2024).
Garcia Vallejo, M.C. (2024) 'Prompting engineering or AI literacy?: Developing a critical AI literacy on HE lecturers', in Abegglen, S., Nerantzi, C., Martínez-Arboleda, A., Karatsiori, M., Atenas, J. and Rowell, C. (eds.) Towards AI Literacy: 101+ Creative and Critical Practices, Perspectives and Purposes. #creativeHE. Available at: https://doi.org/10.5281/zenodo.11613520.
Guadamuz, A. (2023) 'A Scanner Darkly: Copyright Liability and Exceptions in Artificial Intelligence Inputs and Outputs', GRUR International 2/2024 (Forthcoming). Available at: https://ssrn.com/abstract=4371204 or http://dx.doi.org/10.2139/ssrn.4371204 (Accessed: 16th April 2024).
King's College London (2023) 'Generative AI in HE'. Available at: https://www.kcl.ac.uk/short-courses/generative-ai-in-he (Accessed: 16th April 2024).
Klein, N. (2023) 'AI machines aren't 'hallucinating'. But their makers are', The Guardian, 8th May 2023. Available at: https://www.theguardian.com/commentisfree/2023/may/08/ai-machines-hallucinating-naomi-klein (Accessed: 16th April 2024).
Marcus, G. (2024) 'No, multimodal ChatGPT is not going to "trivially" solve Generative AI's copyright problems', Marcus on AI, 24th January 2024. (Accessed: 16th April 2024).
Morley, D. (2023) 'Artificial Intelligence: Doomsday for Humanity, or for Capitalism?', The Communist, 10th May 2023. Available at: https://socialistrevolution.org/artificial-intelligence-doomsday-for-humanity-or-for-capitalism/ (Accessed: 16th April 2024).
Narayanan, A. and Kapoor, S. (2024) 'Generative AI's end-run around copyright won't be resolved by the courts', AI Snake Oil, 22nd January 2024. (Accessed: 16th April 2024).
Ng, D.T.K., Leung, J.K.L., Chu, S.K.W. and Qiao, M.S. (2021) 'Conceptualizing AI literacy: An exploratory review', Computers and Education: Artificial Intelligence, 2, 100041. Available at: https://doi.org/10.1016/j.caeai.2021.100041 (Accessed: 20th March 2024).