ABC Lab

Emotional intelligence of Large Language Models

Xuena Wang, Xueting Li, Zi Yin, Yue Wu, Jia Liu

Abstract

Large Language Models (LLMs) have demonstrated remarkable abilities across numerous disciplines, primarily assessed through tasks in language generation, knowledge utilization, and complex reasoning. However, their alignment with human emotions and values, which is critical for real-world applications, has not been systematically evaluated. Here, we assessed LLMs' Emotional Intelligence (EI), which encompasses emotion recognition, interpretation, and understanding, and is necessary for effective communication and social interactions. Specifically, we first developed a novel psychometric assessment focusing on Emotion Understanding (EU), a core component of EI. This test is an objective, performance-driven, text-based evaluation that requires assessing complex emotions in realistic scenarios, providing a consistent assessment for both humans and LLMs. With a reference frame constructed from over 500 adults, we tested a variety of mainstream LLMs. Most achieved above-average Emotional Quotient (EQ) scores, with GPT-4 exceeding 89% of human participants with an EQ of 117. Interestingly, a multivariate pattern analysis revealed that some LLMs apparently did not rely on human-like mechanisms to achieve human-level performance, as their representational patterns were qualitatively distinct from those of humans. In addition, we discussed the impact of factors such as model size, training method, and architecture on LLMs' EQ. In summary, our study presents one of the first psychometric evaluations of the human-like characteristics of LLMs, which may shed light on the future development of LLMs aiming for both high intellectual and emotional intelligence.

This study aims to measure the EI capabilities of LLMs, particularly their competence in EU. To this end, we developed a novel standardized test, the Situational Evaluation of Complex Emotional Understanding (SECEU), which was structured to emulate real-life scenarios requiring complex emotional understanding, and used data from over 500 young adults to establish a norm. A wide variety of prominent LLMs were evaluated with the SECEU, and their scores were standardized against this norm to allow direct comparison with human responses. Our primary finding was that most of the LLMs tested achieved above-average EQ scores, although individual differences across models were substantial. GPT-4 stood out, scoring the highest EQ while also exhibiting human-like response patterns, suggesting that its high EU proficiency rests on a human-like mechanism. This research constitutes a comprehensive psychometric examination of LLMs' EI and illuminates the potential influence of factors such as model size, training method, and architecture on models' EU performance. Given the ever-growing role of LLMs in human-computer interaction, our study underscores the critical need for emotional intelligence in these systems. The insights gained here will inform the future development of LLMs, facilitating the creation of models that embody high levels of both intellectual and emotional intelligence.

The SECEU Test

Figure: A) Exemplars of the SECEU test and the standard scores from the population. B) LLMs' EQ. The light-grey histogram represents the distribution of human participants' EQ scores, with the y-axis indicating the EQ score and the x-axis showing the percentage of total participants.

The Situational Evaluation of Complex Emotional Understanding (SECEU) test is a novel standardized test developed to assess the Emotional Intelligence (EI) of Large Language Models (LLMs), focusing on their proficiency in understanding complex emotions. Normed using data from over 500 young adults, the SECEU features 40 items, each presenting a unique scenario designed to evoke a range of emotions. Participants rate the intensity of four probable emotions per scenario. The test's reliability and validity were established by administering it to undergraduate and postgraduate students; each item is scored as the Euclidean distance between the participant's individual ratings and the group's standard scores. The data can be found at the link below.
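For concreteness, here is a minimal sketch of this scoring rule in Python (array shapes follow the test's 40 items with four emotions each; the file names in the usage comments are hypothetical, not the released data):

```python
import numpy as np

def seceu_score(ratings, standard):
    """SECEU score: the mean Euclidean distance between a respondent's
    emotion ratings and the population's standard scores (lower = better EU).

    ratings, standard: arrays of shape (40, 4) -- 40 scenarios, 4 emotions.
    """
    distances = np.linalg.norm(ratings - standard, axis=1)  # one distance per item
    return distances.mean()

# Hypothetical usage:
# ratings = np.load("participant_ratings.npy")
# standard = np.load("seceu_standard_scores.npy")
# print(seceu_score(ratings, standard))
```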

Prompts

A detailed explanation of the prompts used in the SECEU test.
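As an illustration of how such prompting might be wired up, the sketch below poses one SECEU-style item to a chat model via the legacy openai v0.x Python SDK. The prompt wording and the `ask_seceu_item` helper are hypothetical, not the exact prompt used in the paper:

```python
import re
import openai  # legacy v0.x SDK; newer versions use a different interface

# Hypothetical prompt template -- the paper's exact wording may differ.
PROMPT_TEMPLATE = (
    "{scenario}\n"
    "Please rate how intensely the person in this scenario would feel each of "
    "these four emotions: {emotions}. Reply with four numbers only."
)

def ask_seceu_item(scenario, emotions, model="gpt-4"):
    """Pose one SECEU item to a chat model and parse four intensity ratings."""
    prompt = PROMPT_TEMPLATE.format(scenario=scenario, emotions=", ".join(emotions))
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep answers as deterministic as possible for scoring
    )
    text = response["choices"][0]["message"]["content"]
    return [float(x) for x in re.findall(r"\d+(?:\.\d+)?", text)[:4]]
```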

Results

Figure: The family tree of LLMs and their corresponding EQ. The models' SECEU scores were converted into standard Emotional Quotient (EQ) scores, following a normal distribution in which an average score of 100 represents an individual's EU ability relative to the population average. Each node in the tree represents an LLM, whose vertical position along the y-axis indicates its launch time. The size of each node corresponds to the parameter size of the LLM. Note that the sizes of models whose parameter counts are undisclosed (e.g., GPT-4) were estimated from publicly available information. Color denotes the EQ score, with red for higher scores and blue for lower scores; white indicates models that failed to complete the SECEU. The color of the branches distinguishes between open-source (light gray) and closed-source (dark gray) models.
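The conversion described in the caption can be sketched as follows; note that the 15-point standard deviation is our assumption (the convention for IQ-style scales), not a figure taken from the paper:

```python
import numpy as np

def seceu_to_eq(model_score, human_scores, mean=100.0, sd=15.0):
    """Convert a SECEU distance into an EQ score on the human norm.

    Lower distances mean better EU, so the z-score is negated before
    scaling. sd=15 is an assumption; the paper's scaling may differ.
    """
    z = (model_score - np.mean(human_scores)) / np.std(human_scores)
    return mean - sd * z

# Sanity check against the reported numbers (human_scores is hypothetical):
# human_scores = np.load("seceu_human_scores.npy")
# print(seceu_to_eq(1.89, human_scores))  # GPT-4's SECEU score
```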

| Base model | Model | SECEU score | EQ | EQ percentile | r | Pattern similarity | Size | Release date | SFT | RLHF |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI GPT series | DaVinci | 3.5 | 87 | 18% | 0.41** | 91% | 175B | 2020/05 | × | × |
| | Curie | 2.7 | 102 | 50% | 0.11 | 29% | 13B | Unknown | × | × |
| | Babbage | 2.78 | 100 | 44% | -0.12 | 4% | 3B | Unknown | × | × |
| | text-davinci-001 | 2.4 | 107 | 64% | 0.2 | 47% | <175B | Unknown | × | × |
| | text-davinci-002 | 3.3 | 91 | 23% | -0.04 | 8% | <175B | Unknown | ✓ | × |
| | text-davinci-003 | 2.01 | 114 | 83% | 0.31* | 73% | 175B | 2022/11/28 | ✓ | ✓ |
| | GPT-3.5-turbo | 2.63 | 103 | 52% | 0.04 | 17% | 175B | 2022/11/30 | ✓ | ✓ |
| | GPT-4 | 1.89 | 117 | 89% | 0.28 | 67% | Unknown | 2023/03/14 | ✓ | ✓ |
| LLaMA | LLaMA | FAILED | | | | | 13B | 2023/02/24 | × | × |
| | Alpaca | 2.56 | 104 | 56% | 0.03 | 15% | 13B | 2023/03/09 | ✓ | × |
| | Vicuna | 2.5 | 105 | 59% | -0.02 | 10% | 13B | 2023/03/30 | ✓ | × |
| | Koala | 3.72 | 83 | 13% | 0.43** | 93% | 13B | 2023/04/03 | ✓ | × |
| Flan-T5 | FastChat-T5 | FAILED | | | | | 3B | 2023/04/30 | ✓ | × |
| Pythia | Dolly | 2.89 | 98 | 38% | 0.26 | 62% | 13B | 2023/04/12 | ✓ | × |
| | Oasst | 2.41 | 107 | 64% | 0.24 | 59% | 13B | 2023/04/15 | ✓ | ✓ |
| GLM | ChatGLM | 3.12 | 94 | 28% | 0.09 | 24% | 6B | 2023/03/14 | ✓ | ✓ |
| RWKV | RWKV-v4 | FAILED | | | | | 13B | 2023/02/15 | ✓ | × |
| Claude | Claude | 2.46 | 106 | 61% | 0.11 | 28% | Unknown | 2023/03/14 | ✓ | ✓ |

Table: EQ scores, representational patterns, and model properties. r denotes the similarity between each model's item-by-item response pattern and the human norm; asterisks mark statistically significant correlations (*p < .05, **p < .01). ✓/× indicate whether the model underwent supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF).
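For the r and Pattern similarity columns, a plausible reconstruction (the paper's exact procedure may differ) correlates a model's 40 item-level distances with the mean human profile and then locates that correlation within the human distribution:

```python
import numpy as np
from scipy.stats import pearsonr

def pattern_similarity(model_items, human_items):
    """Compare a model's item-level SECEU profile with the human norm.

    model_items: shape (40,) -- the model's per-item distances.
    human_items: shape (n_participants, 40).

    Returns (r, pct): the Pearson r between the model's profile and the
    mean human profile, and the fraction of participants whose own r is
    below the model's -- one plausible reading of "Pattern similarity".
    """
    mean_profile = human_items.mean(axis=0)
    r_model, _ = pearsonr(model_items, mean_profile)
    r_humans = np.array([pearsonr(p, mean_profile)[0] for p in human_items])
    pct = (r_humans < r_model).mean()
    return r_model, pct
```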

Citing Emotional Intelligence

If you find the Emotional Intelligence benchmark or its data useful, please consider citing:

@article{doi:10.1177/18344909231213958,
  author  = {Xuena Wang and Xueting Li and Zi Yin and Yue Wu and Jia Liu},
  title   = {Emotional intelligence of Large Language Models},
  journal = {Journal of Pacific Rim Psychology},
  volume  = {17},
  pages   = {18344909231213958},
  year    = {2023},
  doi     = {10.1177/18344909231213958},
  url     = {https://doi.org/10.1177/18344909231213958}
}

Dataset License

Copyright 2023 ABC Lab

The emotional intelligence dataset is licensed under the arXiv.org perpetual, non-exclusive license.