Abstract
Large Language Models (LLMs) have demonstrated remarkable abilities across numerous disciplines, primarily assessed through tasks in language generation, knowledge utilization, and complex reasoning. However, their alignment with human emotions and values, which is critical for real-world applications, has not been systematically evaluated. Here, we assessed LLMs' Emotional Intelligence (EI), encompassing emotion recognition, interpretation, and understanding, which is necessary for effective communication and social interactions. Specifically, we first developed a novel psychometric assessment focusing on Emotion Understanding (EU), a core component of EI. This test is an objective, performance-driven, text-based evaluation that requires test-takers to evaluate complex emotions in realistic scenarios, providing a consistent assessment of both human and LLM capabilities. With a reference frame constructed from over 500 adults, we tested a variety of mainstream LLMs. Most achieved above-average Emotional Quotient (EQ) scores, with GPT-4 exceeding 89% of human participants with an EQ of 117. Interestingly, a multivariate pattern analysis revealed that some LLMs apparently did not rely on human-like mechanisms to achieve human-level performance, as their representational patterns were qualitatively distinct from humans'. In addition, we discussed the impact of factors such as model size, training method, and architecture on LLMs' EQ. In summary, our study presents one of the first psychometric evaluations of the human-like characteristics of LLMs, which may shed light on the future development of LLMs aiming for both high intellectual and emotional intelligence.
This study aims to measure the EI capabilities of LLMs, particularly their competence in EU. To this end, we developed a novel standardized test, the Situational Evaluation of Complex Emotional Understanding (SECEU), which is structured around real-life scenarios requiring complex emotional understanding and normed on data from over 500 young adults. A wide variety of prominent LLMs were evaluated with the SECEU, and their scores were standardized against the established norm to allow direct comparison with human responses. Our primary findings indicated that most of the LLMs tested achieved above-average EQ scores, although individual differences across models were substantial. GPT-4 stood out, scoring the highest EQ while also exhibiting human-like response patterns, suggesting that its high EU proficiency rests on a mechanism resembling human processing. This research constitutes a comprehensive psychometric examination of LLMs' EI and illuminates the potential influence of factors such as model size, training methods, and architecture on models' EU performance. Given the ever-growing role of LLMs in human-computer interaction, our study underscores the critical need for emotional intelligence in these systems. The insights gained here will inform future development of LLMs, facilitating the creation of models that embody high levels of both intellectual and emotional intelligence.
The SECEU Test
Figure 1: A) Exemplars of the SECEU test and the standard scores from the population. B) LLMs' EQ. The light-grey histogram represents the distribution of human participants' EQ scores, with the y-axis indicating the EQ score and the x-axis showing the percentage of total participants. Open-source (light gray) and closed-source (dark gray) models are distinguished by color.
The Situational Evaluation of Complex Emotional Understanding (SECEU) is a novel standardized test developed to assess the emotional intelligence (EI) of large language models (LLMs), focusing on their proficiency in understanding complex emotions. Normed on data from over 500 young adults, the SECEU comprises 40 items, each presenting a unique scenario designed to evoke a range of emotions. Test-takers rate the intensity of four probable emotions per scenario. The test's reliability and validity were established by administering it to undergraduate and postgraduate students; each item is scored as the Euclidean distance between the individual's ratings and the group's standard scores, so lower scores indicate closer agreement with the population norm. The data can be found via the link below.
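For concreteness, the scoring rule above can be sketched in Python as follows. This is a minimal illustration rather than the authors' released code; the array shapes and the averaging across items follow the description above.

```python
import numpy as np

def seceu_score(ratings: np.ndarray, norm: np.ndarray) -> float:
    """SECEU score: the Euclidean distance between an individual's
    emotion-intensity ratings and the population's standard scores,
    computed per item and averaged over the 40 items.

    ratings, norm: arrays of shape (40, 4) -- 40 scenarios, 4 emotions each.
    Lower scores indicate closer agreement with the human norm.
    """
    distances = np.linalg.norm(ratings - norm, axis=1)  # one distance per item
    return float(distances.mean())
```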
Results
Figure 2: The family tree of LLMs and their corresponding EQ. The models' SECEU scores were converted into standard Emotional Quotient (EQ) scores, which follow a normal distribution in which an average score of 100 represents an individual's EU ability relative to the population average. Each node in the tree represents an LLM, whose vertical position along the y-axis indicates its launch time. The size of each node corresponds to the parameter size of the LLM; note that the sizes of GPT-4 and other models without published parameter counts were estimated from publicly available information. Color denotes the EQ score, with red for higher scores and blue for lower scores; white indicates models that failed to complete the SECEU. The color of the branches distinguishes open-source (light gray) from closed-source (dark gray) models.
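The standardization described in the caption can be sketched as below, assuming the conventional IQ-style scale (mean 100, SD 15) and a sign flip because smaller SECEU distances indicate better emotion understanding; the exact constants are our assumptions, so treat this as illustrative.

```python
def seceu_to_eq(seceu: float, norm_mean: float, norm_sd: float) -> float:
    """Map a SECEU distance onto an EQ scale centered at 100.

    The z-score is inverted (norm_mean - seceu) so that models closer
    to the human norm receive higher EQs; 15 is the assumed SD unit.
    """
    z = (norm_mean - seceu) / norm_sd
    return 100.0 + 15.0 * z
```

For illustration only, a norm mean near 2.8 and SD near 0.8 (back-solved from the table below) reproduce the reported EQs, e.g. a SECEU score of 1.89 maps to roughly 117.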
| Model | SECEU Score | EQ | EQ Percentile | r | Pattern Similarity | Size | Release Date | SFT | RLHF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **OpenAI GPT series and sub-models** | | | | | | | | | |
| DaVinci | 3.5 | 87 | 18% | 0.41** | 91% | 175B | 2020/05 | × | × |
| Curie | 2.7 | 102 | 50% | 0.11 | 29% | 13B | Unknown | × | × |
| Babbage | 2.78 | 100 | 44% | -0.12 | 4% | 3B | Unknown | × | × |
| text-davinci-001 | 2.4 | 107 | 64% | 0.2 | 47% | <175B | Unknown | × | × |
| text-davinci-002 | 3.3 | 91 | 23% | -0.04 | 8% | <175B | Unknown | √ | × |
| text-davinci-003 | 2.01 | 114 | 83% | 0.31* | 73% | 175B | 2022/11/28 | √ | √ |
| GPT-3.5-turbo | 2.63 | 103 | 52% | 0.04 | 17% | 175B | 2022/11/30 | √ | √ |
| GPT-4 | 1.89 | 117 | 89% | 0.28 | 67% | Unknown | 2023/03/14 | √ | √ |
| **LLaMA** | | | | | | | | | |
| LLaMA | FAILED | — | — | — | — | 13B | 2023/02/24 | × | × |
| Alpaca | 2.56 | 104 | 56% | 0.03 | 15% | 13B | 2023/03/09 | √ | × |
| Vicuna | 2.5 | 105 | 59% | -0.02 | 10% | 13B | 2023/03/30 | √ | × |
| Koala | 3.72 | 83 | 13% | 0.43** | 93% | 13B | 2023/04/03 | √ | × |
| **Flan-T5** | | | | | | | | | |
| FastChat-T5 | FAILED | — | — | — | — | 3B | 2023/04/30 | √ | × |
| **Pythia** | | | | | | | | | |
| Dolly | 2.89 | 98 | 38% | 0.26 | 62% | 13B | 2023/04/12 | √ | × |
| Oasst | 2.41 | 107 | 64% | 0.24 | 59% | 13B | 2023/04/15 | √ | √ |
| **GLM** | | | | | | | | | |
| ChatGLM | 3.12 | 94 | 28% | 0.09 | 24% | 6B | 2023/03/14 | √ | √ |
| **RWKV** | | | | | | | | | |
| RWKV-v4 | FAILED | — | — | — | — | 13B | 2023/02/15 | √ | × |
| **Claude** | | | | | | | | | |
| Claude | 2.46 | 106 | 61% | 0.11 | 28% | Unknown | 2023/03/14 | √ | √ |
Table 1: EQ, representational patterns, and properties of the LLMs tested. SECEU scores are average distances from the human norm (lower is better); EQ Percentile is the proportion of human participants a model outperforms; r is the correlation between a model's item-level scores and the human group's (asterisks mark statistically significant correlations); SFT = supervised fine-tuning; RLHF = reinforcement learning from human feedback.
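One way to read the r and Pattern Similarity columns: r correlates a model's 40 item-level scores with the human group's, and the percentage locates that correlation within the distribution of individual humans' correlations with the group. The sketch below encodes this interpretation; it is our reconstruction with hypothetical names, not the authors' analysis code.

```python
import numpy as np
from scipy.stats import pearsonr

def pattern_similarity(model_items, group_items, human_rs):
    """Correlate a model's item-level scores with the human group's,
    then report the fraction of individual humans whose own
    correlation with the group falls below the model's r."""
    r, _ = pearsonr(model_items, group_items)
    percentile = float(np.mean(np.asarray(human_rs) < r))
    return r, percentile
```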
Citing Emotional Intelligence
If you find the SECEU test or the data useful, please consider citing:
```bibtex
@article{doi:10.1177/18344909231213958,
  author  = {Xuena Wang and Xueting Li and Zi Yin and Yue Wu and Jia Liu},
  title   = {Emotional intelligence of Large Language Models},
  journal = {Journal of Pacific Rim Psychology},
  volume  = {17},
  pages   = {18344909231213958},
  year    = {2023},
  doi     = {10.1177/18344909231213958},
  url     = {https://doi.org/10.1177/18344909231213958}
}
```
Dataset License
Copyright 2023 ABC Lab
The emotional intelligence dataset is licensed under the arXiv.org perpetual, non-exclusive license.