Purpose
This paper investigates users’ voice-switching behavior in voice assistants (VAs), their preferences for embodiments and their perceived trust in VAs’ information accuracy, usefulness and intelligence. The authors addressed four research questions: RQ1: What is the nature of users’ voice-switching behavior in VAs? RQ2: What are user preferences for embodied voice interfaces (EVIs), and do their preferred EVIs influence their decision to switch the voice on their VAs? RQ3: What are the users’ perceptions of their VAs concerning: a. information accuracy, b. usefulness, c. intelligence and d. the most important characteristics they must possess? RQ4: Do users prefer their voice interface to match their characteristics (age, gender, accent and race/ethnicity)?
Design/methodology/approach
The authors used a 52-question survey questionnaire to collect quantitative and qualitative data. The population was undergraduate students (freshmen and sophomores) at a research university in the USA. The students were enrolled in two required courses with a research participation assignment offered for credit. Students must register for research participation credits in the SONA Research Participation System (www.sona-systems.com/platform/research-management/). Registered students cannot be invited or sampled to participate in a research study. There were 1,700 students enrolled in both courses. After the survey’s URL was posted in SONA, the authors received 632 responses. Of these, 150 completed the survey and provided valid responses.
Findings
Forty-three percent of the participants switched the voice interface in their VAs. They preferred American and British accents but trusted the latter more. The British accent with a male voice was more trusted than the American accent with a female voice. Voice-switching decisions varied in the case of the most and least preferred EVIs. Participants preferred EVIs that matched their characteristics. Most trusted their VAs’ information accuracy because the VAs used the internet to find information, reflecting inadequate mental models. Lack of trust was attributed to VAs misunderstanding requests and failing to respond accurately. A significant correlation was found between the participants’ perceived intelligence of their VAs and trust in information accuracy.
Research limitations/implications
Due to the wide variability in the data (e.g. 84% White, 6% Asian and 6% Black), the authors did not perform a statistical test to identify the significance between the selected EVIs and participants’ races or ethnicities. The self-reported survey questionnaire may be prone to inaccuracy. The participants’ interest in earning research credit for participation in this study and using SONA is a potential bias. The EVIs the authors used as embodiments are limited in their representation of people from diverse backgrounds, races, ethnicities, ages and genders. However, they could be examples for building prototypes to test in VAs.
Practical implications
Educators and information professionals should lead the way in offering artificial intelligence (AI) literacy programs to enable young adults to form more adequate mental models of VAs and support their learning and interactions. VA designers should address the failures and other issues the participants experienced in VAs to minimize frustrations. They should also train machine learning models on large data sets of complex queries to augment success. Furthermore, they should consider augmenting VAs’ personification with EVIs to enrich voice interactions and enhance personalization. Researchers should use a mixed research method with data triangulation instead of only a survey.
Social implications
There is a dire need to teach young adults AI literacy skills to enable them to build adequate mental models of VAs. Failures in VAs could affect users’ willingness to use them in the future. VAs can be effective teaching and learning tools, supporting students’ autonomous and personalized learning. Integrating EVIs with diverse characteristics could advance inclusivity in designing VAs and support personalization beyond language, accent and gender.
Originality/value
This study advances research on user voice-switching behavior in VAs, which has hardly been investigated in VA research. It brings attention to users’ experiential learning and the need for exposure to AI literacy to enable them to form adequate mental models of VAs. This study contributes to research on personifying VAs through EVIs with diverse characteristics to visualize voice interactions. The reasons participants gave for not switching the voice interface, satisfaction with the current voice or a lack of knowledge of this feature, did not support the status quo theory. Incorporating satisfaction and lack of knowledge as new factors could advance this theory. Switching the voice interface to avoid visualizing the least preferred EVIs in VAs is a new theme emerging from this study. Users’ trust in VAs’ information accuracy is intertwined with perceived intelligence and usefulness, but perceived intelligence is the strongest factor influencing trust.
Introduction
Voice assistants (VAs) are software programs that use artificial intelligence (AI), natural language processing, speech recognition and large language models to provide information, perform various tasks, engage in conversations with users and generate responses based on text or speech inputs (Terzopoulos and Satratzemi, 2020). They offer flexible features, allowing users to switch the default voice interface by language, accent and gender or choose the preferred voice and gender while setting up the device (Bilal and Barfield, 2021a). These features have piqued researchers’ interest in exploring voice-switching behavior as a new form of interaction behavior with VAs. Nonetheless, we still know very little about this behavior, especially from young adult users’ (i.e. college students’) perspectives.
VAs have been implemented in many disciplines, including education, to enhance teaching and learning, provide students access to coursework, augment autonomous and personalized learning, and promote computational thinking (Al Shamsi et al., 2022). Students learn to use VAs through trial and error or other experiential means. Eliciting young adults’ perceptions of VAs’ information accuracy, intelligence, trust and usefulness will unveil their VA literacy knowledge and skills and whether interventions are needed to enable them to use these devices effectively and efficiently.
Studies revealed that users assign VAs anthropomorphic attributes (Calahorra-Candao and Martín-de Hoyos, 2024), treat them as humans (Ki et al., 2020; Seymour and Van Kleek, 2021; Pitardi and Marriott, 2021), personify them by ascribing friendship (Lopatovska et al., 2019; Pradhan et al., 2019; Schweitzer et al., 2019; Wienrich et al., 2021), trust (Girouard-Hallam and Danovitch, 2022; Seymour and Van Kleek, 2021; Wienrich et al., 2021) and personality traits (Bilal and Barfield, 2021a; Lopatovska and Williams, 2018; Snyder et al., 2023). Users also prefer their VA’s voice to match their gender, language and accent (e.g. Bilal and Barfield, 2021b; Brahnam and De Angeli, 2021). In the case of chatbots and embodied voice interfaces (EVIs), users prefer the agents’ or EVIs’ characteristics to match their race and ethnicity (e.g. Bilal and Barfield, 2021b; Bonfert et al., 2021; Liao and He, 2020).
With the growth of Generative AI (GenAI), such as ChatGPT and similar tools, VAs are increasing in sophistication by performing more conversational than transactional tasks (e.g. Amazon Alexa) or embedding advanced chatbot features (http://gemini.google.com). The aim is to make VAs more human-like and meet user needs. Nonetheless, voice interactions are still the primary form of embodiment in VAs (Bonfert et al., 2021; Pitardi and Marriott, 2021), necessitating more research that provides visual embodiments represented in people of various backgrounds, genders, races and ethnicities and specifically in the context of voice-switching behavior.
This study advances research on user voice-switching behavior in VAs, an essential aspect of human-AI interaction rarely examined in VA research. It brings attention to experiential learning and the role of AI literacy in enriching students’ conceptions of VAs. In addition, this study contributes to research on personifying VAs by incorporating EVIs to make voice interactions more tangible and human-like. This study’s findings have implications for improving students’ personalized learning experiences, interventions by educators and information professionals and enhancing the design of VAs to support users’ interactions and information needs.
Review of the literature
Voice-switching behavior
Bilal and Barfield (2021a) investigated user experience, voice-switching behaviors and preferences for the age, gender, accent and personality of VAs. Thirty-one students participated in the study and took a Qualtrics survey of closed and open-ended questions. The study showed that 39% switched the voice interface for various reasons. Most male and all female participants preferred a female-gendered voice. The participants preferred that the VAs’ perceived age and accent match their own. However, neither the age, accent nor gender of the VA influenced most participants’ decision to switch the voice interface in use. In another study, Bilal and Barfield (2021b) surveyed 210 Amazon mTurk workers aged 18–35, eliciting their voice-switching behavior in VAs, their preference for the voice’s accent, gender, age, race and ethnicity to match their characteristics, and their perceptions of EVIs. They found that 51% of the participants switched the default voice in their VAs. They slightly preferred a female voice over a male voice. Participants preferred EVIs that matched their races, ethnicities and ages.
Trust in voice assistants
Baughan et al. (2023) used a mixed-method approach to identify user failures in VAs and their influence on trust. They interviewed 12 VA users, created a data set of 199 failures from 107 Amazon mTurk participants, and surveyed 268 participants to evaluate the impact of failures on their trust. The authors found that certain failures, particularly lack of feedback, derailed their trust, caused frustration and influenced their willingness to use their VAs in the future.
Zhan et al. (2024) surveyed 300 users and measured their trust in health-care AI VAs. They found functional factors (perceived usefulness, content credibility and relative service quality), one personal factor (stance on technology) and risk factors (privacy risks and safety) significantly influenced users’ trust in their VAs and their intention to use them in the future. Wienrich et al. (2021) recruited 40 students from a university recruitment system offering course credit for participation in their study. They examined the students’ perceived expertise of the Amazon Echo Dot VA and its effects on trust. Students interacted with their VAs, answered 21 questions about health and completed a survey concerning the trustworthiness of the VA and its provider. Perceiving the VA as a specialist improved its perceived trustworthiness, and trust in the VA’s provider affected the VA’s perceived trustworthiness. Numerous studies also examined trust in VAs, including but not limited to Nasirian et al. (2017) and Lee et al. (2021), who found that interaction quality more significantly impacted user trust than information quality and system quality.
Pitardi and Marriott (2021) surveyed 466 people in the UK with at least some experience using VAs about trust in VAs. They found that social cognition (e.g. warmth and competence) and social presence (e.g. having a sense of social interaction, a human-like attribute) were the unique antecedents of developing trust in VAs. Tolmeijer et al.’s (2021) study showed that the VAs’ gender did not influence user trust. Girouard-Hallam and Danovitch (2022) found that children’s trust in VAs varied by age and the type of information asked. Children trusted VAs for factual questions and humans for personal questions.
Voice assistants’ intelligence
VAs’ perceived intelligence concerns their ability to understand the user’s natural language speech and provide useful information that meets their needs (Moussawi and Koufaris, 2019). Mahmood et al. (2022) surveyed 37 VA users to examine their perceptions of the errors committed by AI agents. Users perceived virtual agents who accepted the blame for errors as more intelligent, likable and effective than agents who disregarded the blame. Bawack (2021) surveyed 278 users to measure their perceived intelligence of VAs. Perception, comprehension, action and learning significantly influenced users’ intention to adopt VAs. Poushneh (2021) recruited 275 users to investigate the personality traits of Microsoft’s Cortana, Google’s Assistant and Amazon’s Alexa mobile applications and their impacts on user experiences and perceptions of intelligence. They concluded that among the personality traits, the VA’s functional intelligence (e.g. level of effectiveness, efficiency, reliability and usefulness of responses) and emotional intelligence (e.g. perceived ability of the VAs as human-like, humorous and modest) significantly enhanced the users’ perceived control to interact with the VAs. Thus, these studies explored VAs’ intelligence in various ways.
Information accuracy
Brewer (2023) interviewed 30 participants to assess how they evaluated information quality on health topics. Participants were presented with five scenarios and ambiguous VA responses to each scenario, which they played on their devices. They engaged with the scenarios in an online survey and randomly played a prerecorded VA voice. The author asked participants about perceived information quality and the cues they used to assess it (e.g. what they liked and disliked and good or poor VA responses). Participants used limited cues to assess information quality.
Pycha and Zellou (2024) investigated the effect of the VAs’ accent (British and American English) and frequency of use on the credibility assessment of responses retrieved by VAs. They recruited 139 native speakers of American English, who completed the experiment via a Qualtrics survey. The participants listened to statements in either an American English or British English accent and rated the statements’ accuracy by voice. The information provided by the British-accented voice was perceived as more credible than that delivered by the American English-accented voice, suggesting that the VA voice, rather than the content, can influence the perceived credibility or accuracy of the statements.
Data-driven methods have also been used to evaluate the information quality in VAs. For example, Dambanemuya and Diakopoulos (2021) focused on news queries in Alexa using a data set of queries from Google’s top 20 US daily trending topics. They recruited 111 Amazon mTurk workers to phrase 144 unique query topics, providing three ways of asking Alexa to find information for each topic and collecting 15 unique query phrasings for each query topic. Most of Alexa’s responses to understood queries were relevant, predominantly accurate and timely. VA responses varied by query phrasings and query categories (e.g. news and sports), making it challenging for users to evaluate information quality. Assessing information quality in VAs is more challenging than evaluating other sources or tools, mainly due to a lack of visual cues (Brewer, 2023), preference for specific accents and query phrasings or formulations.
Embodiments
Nunamaker et al. (2011) examined the gender effect on user perception of male and female agents embodied in a kiosk-automated system they developed. Users perceived the agents embodied as males to be more powerful than their female counterparts, but female-embodied agents were perceived as more likable than their male counterparts. Liao and He (2020) recruited 212 diverse participants from Amazon mTurk who interacted with a system the authors developed and completed a postinteraction survey. Racial mirroring had a significant main effect on the participants’ interpersonal closeness with the agent, self-disclosure and satisfaction.
Bonfert et al. (2021) designed three versions of a smart virtual display:
one with a disembodied agent (i.e. a status quo of a smart display with no agent);
one with an artificial embodied agent; and
one with a prerecorded, photorealistic embodied agent.
They created a female agent to represent female assistants in consumer products. They recruited 47 males and 13 females who interacted with the prototypes. The agent’s perceived age, appearance and social presence influenced the users’ experience.
Bilal and Barfield (2021b) investigated user voice-switching behavior in VAs, the effect of voice accents, genders and ages, and user preferences for EVIs of various backgrounds. They recruited 214 Amazon mTurk workers aged 18–35 and collected demographic, prior experience and VA usage data. They found that 51% of the participants aged 18–23 switched the voice interface, compared to 53% aged 24–29 and 44% aged 30–35. A two-way ANOVA test showed a main effect that switching the voice to female EVIs was more frequent than to male EVIs. Participants switched the voice interface for accents they could relate to, for “fun,” “curiosity” and to interact with a voice that matched their ethnicities. The participants’ high ratings for a voice matching their gender, accent and age confirmed Liao and He’s (2020) racial mirroring. The participants’ qualitative comments reflected biases toward certain EVIs’ appearance, ages, genders and ethnicities. The authors coined interface mirroring, which extends Liao and He’s racial mirroring notion.
Luo et al. (2023) investigated the effect of embodiment and voice gender on users’ perceived anthropomorphism of VAs. They recruited 130 participants who interacted with three types of embodiments (physical robot, virtual assistant with a facial expression on screen and voice-only without embodiment) and generated nine distinct VA samples. Participants rated the perceived anthropomorphism of each voice sample. The embodiment type significantly influenced the participants’ perceived anthropomorphism in VAs. VAs with a physical robotic appearance generated significantly higher anthropomorphic perceptions than voice-only assistants. The gender of the voice did not influence the participants’ anthropomorphic perception.
Voice search behavior
Xing et al.’s (2019) literature review of voice search behavior identified 14 relevant articles published from 2008 to 2018. The articles covered voice query characteristics, reformulation strategies and users’ perception/satisfaction. Sa and Yuan (2021) surveyed 64 users to elicit their opinions and preferences for voice search in VAs. They found that most users’ voice searches were on simple and routine tasks. Although the participants liked voice search for convenience, their success rate was low; 92.31% switched to text input to find needed information. Misunderstanding the users’ speech inputs was the most problematic issue.
Google Inc.’s (2014) voice search study involved 1,400 users aged 13–18, showing that 55% used voice search daily and 56% of older teens felt “tech savvy” using voice search. Thirty-eight percent of younger and older teens talked on their phones while watching TV; 51% of younger and 32% of older teens used voice search “just for fun,” and 40% used it for directions. The study concluded that teens were the most active users of voice search. This study needs updating to identify whether voice search is still predominant among teens.
Voice assistants in education
VAs have been programmed to serve as smart tutors to improve student learning. For example, Winkler et al. (2019) compared 21 groups of undergraduate students to examine the impact of a Smart Tutor they developed in the Alexa Echo Dot device and involved a human tutor (an experienced teacher) in solving a complex, open-ended task. They found that participants who interacted with the Smart Tutor achieved higher task-learning outcomes and better problem-solving than those who interacted with the human tutor. Other studies investigated the effect of using VAs on students’ performance. For example, Devkota et al.’s (2024) study showed significant differences in study efficacy, effectiveness and efficiency between students who used VAs and those who did not. VAs have also been used as foreign language learning tools. Alharthi (2024) found that integrating Apple’s Siri with in-class instruction resulted in better student learning than using only in-class and teacher instruction. Kita et al. (2019) implemented a voice user interface in the Moodle online learning management system to enhance students’ interaction, assist with learning activities and search MoodleDocs. The interface motivated the students, making taking quizzes, asking questions and searching for documents easier.
In summary, the reviewed literature showed abundant studies investigating various aspects of user perceptions of VAs, but limited studies exist on voice search behavior from the information retrieval perspective. VAs were convenient to users and improved student learning. There is evidence of users relying on VAs for simple tasks and resorting to other sources for personal or complex ones. When voice search failed, users switched to text input to find information. While studies investigated VA embodiments (e.g. chatbot agents, smart visual displays and kiosks), scarce research has used images of people of various races, ethnicities, genders and ages as EVIs, specifically in the context of voice switching. Users’ perceptions of information accuracy and the criteria they use to judge it have rarely been examined. Similarly, studies of users’ perceived trust in VAs’ information accuracy, usefulness and intelligence are scarce. To fill these gaps, this study addressed the four research questions presented earlier.
Theoretical framework
Three theories or notions inform this study:
the status quo theory;
embodied interaction; and
interface mirroring notion, which resonates with the similarity-attraction theory.
These theories explain users’ voice-switching behaviors, preferences for embodiments and EVIs.
The status quo theory (Samuelson and Zeckhauser, 1988) is a cognitive bias theory that ascribes people’s preference for keeping things unchanged when faced with many options, perceiving change as risky, costly, uncertain and possibly entailing a loss. In VAs, users may keep the device’s default voice settings as is or change the language, gender and accent of the voice. Users may prefer the status quo if they maintain the default settings despite their awareness of the options for switching the voice interface on their VA. In the context of this study, the status quo theory may explain and substantiate the participants’ reasons for not switching their VAs’ voice interface.
This study incorporated EVIs in the form of images of individuals of various ages, genders, races and ethnicities. “Embodiment constitutes the transition from the realm of ideas to the realm of everyday experience” (Dourish, 1999, p. 5), from intangible to tangible. Dourish notes that embodiment extends beyond the physical world to include other aspects of everyday life. Accordingly, a VA’s voice is a form of embodiment where voice interactions can create deep connections to technology and make it more tangible (Pitardi and Marriott, 2021).
Interface mirroring (Bilal and Barfield, 2021b) posits that VA users prefer the voice interface to match their characteristics (e.g. age, gender, race and ethnicity). Interface mirroring resonates with the similarity-attraction theory, which postulates that people like and are attracted to others who are similar, rather than dissimilar, to themselves (Berscheid et al., 1971; Byrne, 1971). Using EVIs will unveil whether users favor such embodiments and decide to switch their voice interface to visualize their preferred EVIs. Interface mirroring and the similarity-attraction theory could explain the users’ selections of the three most and three least preferred EVIs vis-à-vis their characteristics and whether these selections are due to similarities or other reasons.
Method
This study used a survey questionnaire consisting of closed and open-ended questions, collecting quantitative and qualitative data.
Population and sample
The population was undergraduate students (freshmen and sophomores) at a research university in the USA. The students were enrolled in two required courses with a research participation assignment offered for credit. Students in both courses must register for research participation credits in the SONA management system, a student research participation platform managed by a faculty member in a designated department. Registered students in SONA cannot be invited or sampled to participate in a research study. They choose the study of interest from a pool of studies for a research participation credit. The researchers did not have access to SONA. There were 1,700 students enrolled in both courses. We used Qualtrics software to create the survey. After the survey’s URL was posted in SONA, we received 632 responses; 241 were terminated due to the quota setting, 228 were automatically opted out because the students did not use VAs and 13 did not complete the survey. The final number of students who took the survey and provided valid responses was 150.
The quota we set in Qualtrics included an equal distribution of participants across age groups (i.e. 18–23, 24–29 and 30–35), requiring a maximum of 100 participants per group. As a result, Qualtrics terminated the survey for the 101st participant who fell within the first group (age 18–23). In February 2022, we found that many eligible participants within that age range were excluded after the quota was met. Therefore, we removed the quota setting to obtain more participants from that age group.
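The quota mechanism described above can be modeled as simple routing logic. This is an illustrative sketch only (Qualtrics implements quotas internally); the group labels and the threshold of 100 merely mirror the setup described in the text:

```python
# Illustrative model of a per-group survey quota: each respondent is
# routed to an age group, and the survey terminates once that group
# has reached its cap. Not Qualtrics code; a sketch of the behavior.
QUOTA = 100
GROUPS = ("18-23", "24-29", "30-35")

def route(age, counts):
    """Return 'admit' or 'terminate' for a respondent of the given age."""
    if 18 <= age <= 23:
        group = "18-23"
    elif 24 <= age <= 29:
        group = "24-29"
    elif 30 <= age <= 35:
        group = "30-35"
    else:
        return "terminate"      # outside the sampled age range
    if counts[group] >= QUOTA:
        return "terminate"      # quota for this group already met
    counts[group] += 1
    return "admit"

counts = {g: 0 for g in GROUPS}
# The first 100 respondents aged 18-23 are admitted; the 101st is not.
decisions = [route(20, counts) for _ in range(101)]
print(decisions[-2], decisions[-1])  # admit terminate
```

Removing the quota, as the authors did in February 2022, corresponds to dropping the `counts[group] >= QUOTA` check so that further eligible 18–23-year-olds are admitted.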
Instrument
We developed a survey questionnaire, adapted and expanded from the instrument used by Bilal and Barfield (2021b). The survey was designed using Qualtrics software; it consisted of 52 questions, including 25 closed-ended and 27 open-ended items covering four key aspects:
Demographic information (age, gender and race-ethnicity);
VA usage and experience (e.g. type of VAs, most frequently used VA, frequency and duration of VA use, the top five tasks performed in VAs, most and least satisfactory experiences and current VA voice settings (language, gender and accent));
Information about owned VA (current VA voice settings); and
Voice-switching behavior and experience (reasons for voice switching, feelings about the switched voice and questions about perceived trust, information accuracy, usefulness, intelligence and the most important characteristics VAs must possess).
The survey was pilot-tested, used twice in previous studies (Bilal and Barfield, 2021a; Bilal and Barfield, 2021b) and expanded in the present study.
The survey included ten diverse EVIs represented in images of individuals of varying races, ethnicities, genders and ages (Figure 1). These images were obtained from Unsplash.com, a Creative Commons website, and enhanced for visual quality (Bilal and Barfield, 2021b). Each image was randomly labeled with a letter from A to J. Questions regarding the EVIs consisted of ranking the three most and three least preferred ones and explaining the choices.
Our goal in using a survey was to collect data from many participants of various age ranges, genders, ethnicities and races. We aimed to confirm whether voice-switching behavior in VAs is still evolving, the reasons for switching and whether the most and least preferred EVIs would influence the participants’ decision to switch or not switch the voice. We also sought emerging themes in voice-switching behaviors, including feelings about the new voice. Thus, the survey approach collected data from a large sample of participants and allowed us to perform tests of statistical significance among various variables, including the participants’ perceptions of trust, information accuracy, usefulness and intelligence.
Procedure
Following the study’s approval by the University’s Institutional Review Board, the first author contacted the faculty member in charge of SONA and shared the survey’s URL. The faculty member reviewed, approved and posted the URL on SONA for students in the two courses to participate in. The survey was active from November 2021 to April 2022.
Results
Demographics and background
The participants were between 18 and 34 years old (Mean age = 19.45 years). Of the 150 participants, 96 (64%) identified as female, 53 (35.33%) as male and 1 participant (0.67%) chose not to disclose their gender. In terms of race or ethnicity, 126 participants (84%) self-identified as White, 9 (6%) as Black/African American, 9 (6%) as Asian and 6 (4%) as “other” races or ethnicities (i.e. Biracial-White and Filipino; non-Hispanic; Middle Eastern; Mixed; Multiracial). All participants attended high schools in the USA and were native English speakers.
Most participants (n = 147, 98%) owned an Apple iPhone and used Siri, while the rest (2%) owned an Android phone and used Google’s Assistant. Siri was the most frequently used VA (n = 91, 60.67%), followed by Amazon’s Alexa (n = 48, 32%) and Google’s Assistant (n = 11, 7.33%). Participants reported using VAs for an average of 4.34 years, with an average frequency of 15 times per week.
Voice switching in voice assistants
Most participants (n = 128, 85.33%) were familiar with changing the voice settings of their VAs, and a few (n = 22, 14.67%) were uncertain or unaware of how to change the voice interface. Of the 128 participants, 55 (43%) reported switching their VA voice settings and using a diverse range of preferred accents and genders (Table 1). We performed a Chi-square test to determine the effect of gender on participants’ knowledge of how to switch the voice interface. There was no significant difference between males and females regarding knowledge of switching the VA’s voice settings [χ2 (4,150) = 2.043, p = 0.728, V = 0.083] or voice-switching behavior [χ2(2,150) = 1.331, p = 0.514, V = 0.094].
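Tests like the one above can be computed directly from a contingency table of counts. The sketch below implements Pearson’s chi-square test of independence and Cramér’s V from scratch; the 2 × 2 table of gender by knowledge-of-switching counts is hypothetical, not the study’s data:

```python
import math

def chi_square_independence(table):
    """Pearson chi-square test of independence on a 2D table of counts.

    Returns (chi2, degrees_of_freedom, cramers_v).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    # Cramér's V: chi2 normalized by n and the smaller dimension minus 1
    v = math.sqrt(chi2 / (n * (min(len(table), len(table[0])) - 1)))
    return chi2, dof, v

# Hypothetical counts: rows = gender, columns = knows / does not know
# how to switch the voice interface (illustrative numbers only)
table = [[20, 10],
         [10, 20]]
chi2, dof, v = chi_square_independence(table)
print(f"chi2({dof}) = {chi2:.3f}, V = {v:.3f}")  # chi2(1) = 6.667, V = 0.333
```

A p-value would then be read from the chi-square distribution with the returned degrees of freedom (e.g. via `scipy.stats.chi2_contingency`, which performs the whole procedure in one call).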
Reasons for voice switching.
Of the 55 participants who switched the voice interface, 11 (20%) reported choosing a preferred accent. For example, one participant said: “I just really like the British accent” (P50). Two participants (3.64%) switched to a voice of a preferred gender: “I liked the female voice over the male voice” (P6). In contrast, most participants (n = 36, 65.45%) changed the voice interface for fun or to experience something new and different, as one participant noted: [I switched the voice] “just for fun and to create some variety” (P105).
Feelings about the switched voice.
Participants provided a range of responses when asked about their feelings toward the new voice of their VAs. Five (9.09%) enjoyed the new accent, as one participant noted, “It makes me feel smart because the dude is British, and most people think that British people are smarter and more proper than Americans” (P85). Four participants (7.27%) preferred the new voice gender, with one participant explaining that “the male one feels more neutral, I feel neutral about it” (P26). Thirty-two (58.18%) were positive about the new voice and found it more comfortable, soothing or interesting. For instance, one participant expressed: “The new voice is more relaxing and less serious than the old voice” (P82), while another reported, “I like how my new Siri says certain things. It makes me laugh” (P50). However, six participants (10.91%) felt no difference about the new VA voice. One participant noted: “No changes in mood in particular” (P64). Among those who did not like the new voice or could not get used to it, five reverted to the original voice, with one participant explaining: “I switched to Australian voice one time to see what it sounded like but switched back to American because I was used to that accent more” (P119).
Participants’ perceptions of their voice assistants
Trust in voice assistants’ information accuracy.
The participants rated their level of trust in their VA’s information accuracy on a five-point Likert scale (1 = not at all trusted and 5 = highly trusted) and explained their ratings. Most participants (n = 102, 68%) trusted the accuracy of the information their VA provided (ratings of 4 and 5), while 48 (32%) gave ratings of 3 or below (Figure 2). Those whose trust rating was 4 or 5 mentioned that their VAs could perform online searches using the internet. One participant stated, “Since she’s connected to the internet, I trust that she’ll find the most accurate answer” (P110). Another participant said, “I trust that Google wants to give correct information” (P74). Additionally, participants trusted their VAs because of the consistent and accurate information they retrieved for their queries. For instance, one participant stated, “She has never been wrong before” (P51). Some participants indicated they trusted the accuracy of answers to simple questions, as one participant explained, “Alexa is good at easy things like the weather, but it is usually a better bet to just look on Wikipedia for anything detailed” (P4). Another participant said, “I have a lot of trust in my digital assistant because I only use it for very simple tasks: weather, alarms, music, etc.” (P21). Finally, a few participants based their trust on the VA’s design, intended purpose and their expectations of the technology; as one participant noted, “I expect her to know how to do anything I ask because that is what she was made for” (P13); and another said, “It is AI so they know pretty much everything” (P75).
However, the participants did not trust their VAs when they failed to understand or process requests. For example, one participant rated their level of trust as 3 and stated that “sometimes (not very often) Siri will completely misunderstand what I am asking for and provide the wrong answer” (P26). Another reason was the VA retrieving incorrect information. One participant who rated their trust as 2 noted: “Sometimes it looks up incorrect stuff” (P106). Another participant, whose rating was 1 (not at all trusted), expressed concern about manufacturers manipulating VAs: “Jeff Bezos owns Amazon and the Washington Post. He has been caught altering what news has been put out, especially ones that would criticize him or draw attention to what he is doing. Just in general, search engines curve what we see, so I listen to everything with a pinch of salt” (P142).
Voice assistants usefulness.
VAs’ usefulness was examined based on users’ most satisfactory and least satisfactory experiences generated from responses to open-ended questions on the survey. We categorized the satisfactory experience into three types:
accurately processing users’ requests;
providing accurate or relevant information; and
making life more convenient and joyful.
Accurately processing requests.
Nearly half of the participants (48.67%) perceived their VAs as useful because they accurately processed requests and could accurately transcribe their messages. One participant noted, “Siri was able to accurately write out what I told it to text to someone” (P6). Other participants were satisfied when their VAs played the songs they requested, as one participant stated, “When I first learned that I could use my Alexa or Siri to play any song I wanted, I was very satisfied” (P30).
Providing accurate or relevant information.
Forty-five participants (30%) were satisfied with their VAs because they provided accurate or relevant information. For instance, one participant commented, “It would give me the weather every morning. It would give me an accurate description of the weather we were going to have throughout the day, which would accurately prepare me with what I had to wear” (P16).
Convenience and joy.
Thirty-one participants (20.67%) mentioned that their VAs made their lives more convenient or joyful because they allowed hands-free use and saved them time on routine tasks. One participant mentioned: “She helps a lot she it comes to the everyday stuff and allows me to be hands-free from my phone” (P141). Another said, “In high school I programmed Alexa to wake me up every morning at 7 am for school. This saved me time from having to continuously set numerous alarms” (P114).
Dissatisfactory experience with voice assistants.
The participants’ dissatisfied experiences with their VAs fell into four categories:
processing requests inaccurately;
misunderstanding the requests or queries;
providing irrelevant information; and
other reasons, including misrecognizing the accent, VA settings issues and performing unrequested tasks.
Fifty-two participants (36%) mentioned that their VAs failed to process a task after multiple attempts or made calls to the wrong people. One participant mentioned, “When Siri misheard what I said and called my ex-boyfriend” (P135). Another participant commented, “I tried multiple times to set a reminder, but it messed up over 4 times” (P49). Thirty-eight participants (25.33%) reported that their VAs could not understand their requests, which caused frustration. For example, one participant noted, “I hate when Siri cannot understand me, and after so many attempts, I just have to manually do what I wanted done” (P11). Thirty-one participants (20.67%) were dissatisfied because their VAs retrieved irrelevant results for their questions. For instance, one participant mentioned, “I had once asked for a word and its meaning; however, it pulled up a version from the Urban Dictionary which is mostly inappropriate” (P29).
Twenty-two participants (14.67%) reported issues with their VA, including creepy behavior or going off on their own. One participant said, “It was around 3 AM and my Google was in the other room from me and nobody else was talking, and all of a sudden, it said recipes for cookies here you go. It was very creepy” (P43). Another participant reported, “Often times my Alexa goes off on its own, or has trouble loading. This is always an inconvenient experience” (P74). Only five participants (3.33%) did not have issues with their VA.
Voice assistants’ intelligence.
We asked the participants to rate their VAs’ intelligence level on a five-point Likert scale ranging from one (far below average) to five (far above average) and explain their ratings. Eighty-three (55.33%) participants rated their VAs either “5” or “4,” while 55 (36.67%) rated them “3” (Figure 3). Only 12 (8%) rated their VAs “2” or “1.” Half of the participants (n = 75) provided more than one explanation of their ratings. We categorized the explanations into six categories (Figure 3).
Sixty-two participants believed their VAs were AI systems, used machine learning and connected to the internet. Nineteen participants rated their VAs’ intelligence “5,” 17 rated them “4,” 18 gave them “3” and 8 assigned them “2” (Figure 3). For instance, one participant whose rating was “5” noted, “It is very smart because it’s Siri she pulls from the internet. She also has numerous ways to tell you think and allows you to know exactly what you want to hear, [she] also pulling up websites for you to research” (P140). One participant rated their VA’s intelligence as “2” and noted that “it is intelligent because of internet abilities. I don’t really think she’s intelligent; rather, the programmers and designers make her intelligent. I do not know if that makes her intelligent by association, but I am leaning towards no” (P97).
Voice assistants’ perceived intelligence and trust.
We performed a Spearman’s rho correlation test to examine the association between the participants’ perceived intelligence and trust in VAs in providing accurate information. There was a significant correlation between these two variables (rs = 0.561, p < 0.001), meaning that the more intelligent the participants perceived their VAs, the more they trusted them to provide accurate information.
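A Spearman's rho test of this kind can be reproduced with SciPy; Spearman's rho is appropriate here because Likert ratings are ordinal, so ranks rather than raw values are correlated. The ratings below are simulated for illustration (generated so that trust co-varies with perceived intelligence) and are not the study's data.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical five-point Likert ratings for 150 participants (illustrative only).
# Trust is generated to co-vary with perceived intelligence, producing a
# positive monotonic relationship like the one reported in the study.
intelligence = rng.integers(1, 6, size=150)
trust = np.clip(intelligence + rng.integers(-1, 2, size=150), 1, 5)

rho, p = spearmanr(intelligence, trust)
print(f"rs = {rho:.3f}, p = {p:.4g}")
```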
Gender, perceived intelligence and voice assistant trust.
We performed an independent samples t-test, using gender as the independent variable, to examine its effect on the participants’ perceived intelligence of their VAs and level of trust. There was a statistically significant difference in perceived intelligence [t(147) = 0.758, p = 0.021] with a small effect size (d = 0.13): male participants (mean = 3.79, SD = 1.1) perceived their VAs as slightly more intelligent than female participants (mean = 3.67, SD = 0.89). There was also a statistically significant difference in the level of trust in VAs to provide accurate information [t(147) = −0.648, p = 0.043] with a small effect size (d = 0.099): females (mean = 3.83, SD = 0.8) had slightly higher trust in their VAs’ information accuracy than males (mean = 3.74, SD = 1).
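An independent-samples t-test with Cohen's d (the effect-size measure reported above) can be sketched as follows. The samples are simulated and only loosely echo the reported group means and standard deviations; they are not the study's data.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Simulated intelligence ratings by gender (illustrative only). Group sizes
# are chosen so the degrees of freedom match the reported t(147).
male = rng.normal(loc=3.79, scale=1.10, size=53)
female = rng.normal(loc=3.67, scale=0.89, size=96)

t, p = ttest_ind(male, female)  # independent-samples t-test

# Cohen's d using the pooled standard deviation.
n1, n2 = len(male), len(female)
pooled_var = ((n1 - 1) * male.var(ddof=1) + (n2 - 1) * female.var(ddof=1)) / (n1 + n2 - 2)
d = (male.mean() - female.mean()) / np.sqrt(pooled_var)

print(f"t({n1 + n2 - 2}) = {t:.3f}, p = {p:.3f}, d = {d:.3f}")
```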
Voice assistant’s most important characteristics.
Participants were asked to name the three most important characteristics their VAs must possess. We analyzed the open-ended responses and categorized them into six primary types (Figure 4). The first and most frequently mentioned important characteristic is being able to process tasks, such as playing music, answering phone calls and texting people. The second is that the VA provides information, such as the exact time, date and weather and answers their questions. The third is understanding and recognizing the participant’s voice. Other important characteristics include accuracy, quick response and user-friendliness (e.g. accessible, smart/intelligent, funny/uplifting, reliable/helpful and conversation-friendly).
Visual embodiment in voice assistants
We asked the participants to rank the three most and least preferred EVIs, images of individuals with varied characteristics we included in the survey. Participants were then asked whether their most preferred EVIs would influence their decisions to switch the voice interface to visualize either EVI in their VAs. They were also asked to describe the race and ethnicity of the individuals in the images.
Figure 5 illustrates the frequency distribution of the participants’ three most preferred EVIs and their explanations, showing EVI B (Black female, n = 101), J (young White female, n = 98) and G (mid-age White female, n = 82) as the most preferred. The participants were most attracted to these EVIs’ appearance and/or their visualization of the voices. For instance, one participant who chose EVI B as the most preferred stated: “She looks like she would have a sweet voice” (P50). Another participant, who chose EVI G, indicated: “[She looks] older and approachable and wise” (P113).
Another common explanation is that the EVIs aligned with the participants’ visualization of their current VA voice. For example, a participant who chose EVI J noted: “I selected J as my number one choice because she appears to be the most appropriate for the voice that I have on my phone” (P12). Other reasons include matching the age, gender and ethnicity of the EVIs with those of the participants. For instance, a participant who chose EVI J indicated: “I chose J because she looks about my age and we looked similar” (P52). Another participant who selected EVI C (Asian female) explained: “I’m Asian so I select the one I relate to, a woman” (P109). We performed a Chi-square test to examine whether there was a gender difference in the most preferred EVIs. The results showed no significant difference between males and females regarding EVI preferences [χ2(20,150) = 17.68, p = 0.608, V = 0.243].
Most preferred embodied voice interfaces and the decision to switch the voice interface.
We investigated whether the participants would be willing to switch their VAs’ voice interfaces based on their EVI preferences (Figure 6). Most participants (n = 118, 78.66%) reported they would not switch because they did not need to or were satisfied with their current voice. One participant stated: “I genuinely don’t care about what the imaginary person would look like. It’s just a machine, so I don’t really think about it” (P2). Another participant mentioned, “I like the voice of my Alexa and would not prefer to change it” (P14). A few participants reported that their current VA voice matched their preferred EVI. For example, one participant stated: “My first choice is very similar to how I envision the voice of my digital assistant now” (P38).
Conversely, 32 participants (21.33%) were willing to switch their VA voice to visualize their most preferred EVIs. For example, one participant mentioned: “Yes, because the voice of my digital assistant does not really match the images” (P3). Other participants wanted to make their VA voice more relatable. For example, one participant explained: “They all just seem relatable. I don’t know if this makes sense, but having a relatable voice assistant just seems like cool” (P65). Another reason for switching the VA voice was to make it sound more appealing or interesting. For instance, one participant noted: “Yeah, I think that it would be fun to change the voice of my Siri just to see which one I liked the most. Having different voice intensities, as well as accents, could be interesting” (P95).
Gender effect.
The Chi-square test we performed showed no significant association among the participants’ gender, their most preferred EVIs and the decision to switch the voice on their VAs [χ2(2,150) = 1.90, p = 0.387, V = 0.113]. It also revealed a weak but statistically significant association between the decision to switch the voice and the selected EVIs [χ2(10,150) = 18.70, p = 0.044, V = 0.353]. The few participants who preferred EVIs C (Asian female), I (young White male) and A (Black male) were slightly more willing to switch their VA voice.
Least preferred embodied voice interfaces.
The participants selected the three least preferred EVIs and provided explanations. Figure 7 presents the frequency distribution of the participants’ preferences and the associated explanations. The three least favored EVIs were H (mid-age White male, n = 141), F (Muslim male or male from another cultural background, n = 85) and G (mid-age White female, n = 57). These participants indicated they did not prefer the EVIs’ appearance or the voice they imagined to match that appearance. For instance, one participant who chose EVI H stated: “He looks like someone I had an unpleasant experience with, so I would not like them to be telling me information I do not know” (P29). Another participant thought EVI H “does not seem like a kind voice” (P3). Other common reasons concerned the perceived age or gender of the EVIs. For example, one participant who selected EVI G said: “The lady seems nice but old, and I want the voice to be young” (P14). A participant who selected EVI F noted: “Looks like he has a scruffy voice, a man” (P109). Other reasons included concerns about the EVIs’ envisioned accents and speech barriers. For instance, a participant who chose EVI D (Asian male) stated, “Based off of their appearance unless they are based in America, we may have a speech barrier.” Some participants also explained that their least preferred EVI did not match their current VA voice. For instance, one participant who chose EVI G said: “My Siri does not give the characteristics that an elder woman would have” (P130). We tested for a gender effect on the least preferred EVIs and found no significant difference between male and female participants [χ2(18,150) = 18.929, p = 0.396].
We asked the participants whether their least preferred EVIs would influence their decision to switch their VA’s voice interface. Thirty-two participants (21.33%) responded “Yes” and said they would switch to a preferable voice interface if their VA voice interfaces were associated with their least preferred EVIs. As participant (P7) noted, “If a female voice came out of those faces, I would have some concerns.” Another participant (P80) mentioned, “I would still want the voice to match the personality shown in the picture” (Figure 8).
Some participants explained that they did not prefer the appearance or the imagined voice of the EVIs, particularly their age, and thus were unwilling to switch voice interfaces. One participant noted, “Talking to older individuals would be weird for me because it would remind me of an older person talking down to me, as I would be communicating with Siri” (P96). The Chi-square test showed no significant association between voice switching and the participants’ least preferred EVIs [χ2(9,150) = 8.04, p = 0.530, V = 0.251]. There was also no gender effect on voice switching in relation to the least preferred EVIs [χ2(2,150) = 0.299, p = 0.861, V = 0.231].
RQ4: Do users prefer their VAs’ voice to match their characteristics (age, gender and accent)?
We asked participants about their current VA settings, including gender, language and accent, and we asked them to rate the importance of matching these settings to their characteristics on a five-point Likert scale (1 = not at all important; 5 = extremely important).
Voice assistants’ gender, language and accents settings.
Most participants (n = 122, 81.33%) had their VAs set on a female voice, while a few (n = 27, 18%) had a male voice. One participant had a nonbinary option. All participants set their voice language as English. As to the accent, 54 participants (36.67%) reported using the default setting, 64 (42.67%) set their accent as American, followed by British (n = 18, 12%), Australian (n = 8, 5.33%), Irish (n = 4, 2.67%) and Indian (n = 1, 0.67%). Regarding the importance of an accent match, most participants (n = 103, 68.67%) reported that it was not at all important, a few (n = 23, 15.33%) indicated it was slightly important, fewer (n = 12, 8%) mentioned it was moderately important and eight (5.33%) said it was very important. The rest (n = 4, 2.67%) stated it was extremely important.
Importance of gender match.
Most participants (n = 98, 65.33%) rated it as not at all important, few (n = 25, 16.67%) mentioned it was slightly important, fewer (n = 15, 10%) indicated it was moderately important and nine (n = 9, 6%) stated it was very important. The rest (n = 3, 2%) said it was extremely important. This finding supports the participants’ preference for a female-gendered voice over a male-gendered voice.
Importance of age match.
Most participants (n = 120, 80%) indicated that it was not at all important for the age of their VAs to match their age. A few (n = 17, 11.33%) mentioned it was slightly important, fewer (n = 9, 6%) rated it as moderately important and very few (n = 3, 2%) said it was very important. The rest (n = 1, 0.67%) indicated it was extremely important.
Significance of gender match with participants’ characteristics.
We performed an independent samples t-test, using gender as the independent variable, to examine its effect on the importance of matching the VAs’ gender, age and accent with the participants’ characteristics. There was a statistically significant difference in gender matching [t(147) = −3.37, p < 0.001] with a medium effect size (d = 0.616), age matching [t(147) = 3.87, p < 0.001] with a small effect size (d = 0.356) and accent matching [t(147) = −2.01, p = 0.004] with a medium effect size (d = 0.6). This means it was more important for female participants to have their VAs match their gender (mean = 1.83, SD = 1.13) and accent (mean = 1.71, SD = 1.09) than for male participants (means = 1.26 and 1.36, SDs = 0.65 and 0.89, for gender and accent match, respectively). However, males (mean = 1.62, SD = 0.99) indicated that having VAs match their age was more important than matching their gender (mean = 1.15, SD = 0.49).
Significance of gender, accent and age match with participants’ characteristics.
We performed a Spearman’s rho correlation test to examine the association among the variables of matching the participants’ gender, accent and age. There was a modest but statistically significant correlation between gender matching and accent matching (rs = 0.257, p = 0.002) and between gender matching and age matching (rs = 0.198, p = 0.015). The findings indicate that participants who viewed gender matching as an important factor in VAs may also value accent and age matching. These findings highlight the interrelationships between the participants’ characteristics and those provided in VAs.
Discussion
RQ1: What is the nature of users’ voice-switching behavior in voice assistants? a. For what reasons do users switch the voice interface? b. How does the new voice make users feel?
The results of this study revealed that 43% of the participants who knew how to change the voice interface in their VAs had switched the voice, confirming that voice switching occurs in VAs as a way to personalize the gender, language and accent of the voice. Those who switched their voice interface preferred American and British accents the most. Participants who selected the British accent favored both female and male voices, whereas those who chose the American accent selected a female voice. Notably, most participants (64%) were female, compared to 35.33% male. Across all selected languages and accents, a female voice was slightly preferred over a male voice, though preferences varied by language and accent. For example, the few participants who chose the British accent selected a male voice and perceived it as more trustworthy than a female voice. In contrast, those who favored the American accent chose a female voice and perceived it as more compassionate. Preference for a female voice can also be explained by the statistical tests on gender and age match with the participants’ characteristics. These tests showed that it was more important for female participants to have their VAs match their gender, whereas male participants indicated that it was more important to have VAs match their age instead of their gender. In contrast, Bilal and Barfield (2021b) found that most female and male participants favored a female voice. As VA technology continues to evolve, user preferences may change.
Interestingly, most participants switched the voice in their VAs to have fun, experience a new voice, accent or language of interest, or hear a differently gendered voice. Most participants had positive feelings about the new voice. Those who did not like the new voice switched back to the previous (default) voice. This VA feature is essential for personalizing these users’ experiences. The few participants (nearly 15%) who did not switch the default voice interface in their VAs were uncertain how to change it and/or were unaware of this feature. Status quo theory posits that people prefer that things stay unchanged to minimize the risks or losses associated with change, and uncertainty is a factor that contributes to the status quo cognitive bias (Samuelson and Zeckhauser, 1988). We suggest that a lack of awareness should also be considered part of status quo theory. We did not investigate whether uncertainty or perceived risks contributed to the participants’ not switching the voice interface. However, they indicated they were satisfied with the current voice, suggesting that satisfaction with the current voice should also be part of status quo theory. The participants’ lack of knowledge or awareness of this feature suggests a need for AI literacy. Based on the findings of this study, the participants’ knowledge of their VAs was gained through experiential learning, which is insufficient for conceptualizing their usage. Combining experiential learning with formal AI literacy could bridge this knowledge gap and enhance their learning experiences.
Preference for a female voice is unsurprising since most participants (81.33%) had their VAs set on a female voice by default, while a few (n = 27, 18%) had it on a male voice. One may attribute this finding to there being more female participants than males (64% and 35.33%, respectively). However, because most participants selected the female voice, all females and some males must have preferred the female-gendered voice, confirming the findings of Bilal and Barfield (2021a, b). Many studies have explored the gender effect in VAs and user perceptions of different voice genders (e.g. Tolmeijer et al., 2021). Researchers note that the common design of VAs is vastly represented as young and female (Watkins and Pak, 2021). In terms of EVIs, one study showed that switching the voice to female EVIs was more frequent than to male EVIs. Preferences for a female voice may be attributed to cultural and historical factors, which are beyond the scope of this study.
The gender and accent of the switched voice influenced the users’ perceived trust in the new voice, but most participants who switched the voice considered that trust was related not to the voice itself but to the functions of the VA. These participants considered information accuracy, understanding voice commands and performing simple tasks effectively to contribute to trust more than the gendered voice or accent:
RQ2: What are the users’ perceptions of their voice assistants concerning: a. trust in information accuracy, b. usefulness, c. intelligence and d. the most important characteristics they must possess?
Trust in information accuracy
Most users who highly trusted their VAs’ information accuracy indicated that the VAs performed online searches and used the internet to find information, reflecting their lack of VA literacy skills. Other participants trusted the information accuracy because their VAs answered their simple questions accurately and consistently. However, the one-third of participants who had a negative experience with their VAs (e.g. misunderstanding their voice inputs, retrieving incorrect responses, failing to process requests and “going whacky,” such as calling people on their own) did not trust them.
Trust in information accuracy varies by the types of queries users ask, such as simple versus complex. Some participants consulted other sources, such as Wikipedia, for detailed or complex questions. This behavior seems to be shared among VA users, regardless of age. For example, Girouard-Hallam and Danovitch (2022) found that children used their VAs to ask simple, routine questions and resorted to family members for personal and other questions. This finding has implications for improving the quality of responses in VAs, especially for complex queries. We did not ask the participants to provide examples of their queries to identify their types (e.g. simple versus complex) or learn about their query formulation strategies to evaluate their VAs’ information accuracy, successes and failures. Nonetheless, the participants’ demographic data showed that, on average, they had 4.5 years of experience using their VAs. This finding also affirms the participants’ experiential learning with VAs, necessitating exposure to AI literacy, including VAs, to enable them to build a more accurate mental model of these tools. As Ng et al. (2021) suggest, AI literacy should be based on pedagogical approaches and techniques that blend VA technology with educational practices.
Usefulness
The participants’ perceptions of their VAs’ usefulness were assessed based on their satisfaction and dissatisfaction. Their judgment of usefulness was attributed to satisfaction with information accuracy, indicating that trust is associated with information accuracy and usefulness. However, the issues some participants experienced (e.g. processing requests inaccurately, misunderstanding the requests or speech, retrieving irrelevant information, misrecognizing accents, setting issues and “creepy behavior” such as going off on their own) contributed to their dissatisfaction with VAs and caused frustration. Most of the issues are system-related and require intervention from VA designers.
Intelligence
Most participants considered their VAs intelligent because they used AI and machine learning and connected to the internet to find information. This perception is erroneous and based on the participants’ inaccurate mental models of VAs. In contrast, some participants perceived their VAs as intelligent because they successfully retrieved the requested information and provided accurate responses. Again, here, information accuracy is associated with intelligence, suggesting that information accuracy is crucial for judging VAs’ intelligence, usefulness and trust. However, failure to understand user requests, providing inaccurate responses and going “whacky” contributed to perceiving VAs as unintelligent. Failures affect users’ perceived trust (Mahmood et al., 2022) and the usefulness of VAs (Zhan et al., 2024). The significant correlation between the participants’ perceived intelligence of their VAs and trust suggests that these two variables are intertwined and should be explored further in future studies.
Characteristics voice assistants must possess
Some participants seemed to have programmed their VAs to perform tasks such as making phone calls, texting specific people and playing music, indicating experience using them. They also expected their VAs to provide real-time information such as weather, exact time and date, which are simple and routine tasks. Overall, the desired functional characteristics (e.g. understanding and recognizing speech inputs, intelligence and information accuracy) and social attributes (e.g. user-friendly, conversation-friendly, funny and uplifting) are important for VA users. As Pitardi and Marriott (2021) indicated, the functional and social factors embedded in VAs influence users’ experience, willingness to continue using them and loyalty to the brand. The “programming functions,” including texting specific people, making phone calls, playing music and providing real-time information, such as weather, time and date, are important for personalizing VAs:
RQ3: What are the user preferences for embodied voice interfaces (EVIs), and do their preferred EVIs influence their decision to switch the voice in their VAs?
Most participants were satisfied with the current voice interface; thus, they decided not to switch it. Some participants were uncertain whether they wanted to switch the voice interface or did not know this option was available; thus, they kept the voice interface unchanged. This finding contradicts Bilal and Barfield’s (2021b) study, in which most participants were willing to switch the voice interface to visualize their preferred EVIs. Lack of knowledge and satisfaction with the current state (i.e. the voice interface) could extend status quo theory.
Regarding the EVIs’ image-to-voice relationships, the participants selected EVIs of various ages, ethnicities and races. The three most preferred, EVI B (Black female), J (young White female) and G (mid-age White female), are all female, affirming preferences for female voices among female and male participants. Despite selecting these EVIs, some participants stated they would not switch the voice because they liked the current voice on their VAs or because the current voice matched how they envisioned the voice on their VAs. This finding supports the statistical test, which showed no significant association among gender, the most preferred EVIs and the decision to switch the voice. Although most participants were White and very few were Black, EVI B (Black female) was the most preferred, followed by EVI J (young White female). Although EVI G did not match the participants’ ages, she was preferred because she was perceived as approachable and wise. This suggests that, in the context of EVIs and voice switching, the participants who selected these three preferred EVIs did not perceive gender, age, race or ethnicity as important characteristics to match their own. Interestingly, the few participants who preferred EVIs C (Asian female), I (young White male) and A (Black male) were more willing to switch the voice interface because these EVIs matched their characteristics. Since the participants were predominantly White and very few were of diverse ethnic and cultural backgrounds, we did not test for a significant relationship between these variables and the EVIs’ races and ethnicities.
The novel finding is that 21.33% of the participants stated they would switch the voice on their VAs if their least preferred EVIs represented the voice interface. In contrast, the previous study (Bilal and Barfield, 2021b) showed that most participants were willing to switch the voice to visualize their most preferred EVIs. Additional research should elicit users’ voice-switching behaviors triggered by the least preferred EVIs:
RQ4: Do users prefer their voice interface to match their characteristics (age, gender, accent and race/ethnicity)?
Most female and male participants preferred a female-gendered voice, and matching the gender and age of their VAs was unimportant. This finding supports Bilal and Barfield’s (2021b) study, which revealed that nearly half (48.3%) of participants considered age matching slightly or not at all important. Nonetheless, female participants placed more importance on having their VAs’ voice match their gender, whereas male participants placed more importance on having the voice match their age rather than their gender. This finding highlights the interrelationships between the participants’ characteristics and those provided in VAs, as well as the differences between female and male participants. One exception is the shared preference for a female-gendered voice.
Implications
This study revealed that 43% of the participants switched the voice interface in their VAs. The current standard options in most VAs include selecting a preferred language, accent and gender. These features should be augmented by incorporating EVIs to advance personification and personalization in VAs. Although 43% switched the voice on their VAs, most decided they would not switch the voice to visualize their three preferred EVIs, either because they were satisfied with the current voice or because it already matched the characteristics of their envisioned preferred EVIs. This finding contradicts previous research, indicating that switching the voice interface in the context of EVIs may not align with status quo theory. Voice-switching behavior should be investigated with attention to user satisfaction with the current voice. In this study, most participants who switched their voices felt positive about the new voice, but a few who disliked it switched back to the previous voice setting. The finding that 21% of the participants would switch the current voice interface if it embodied their least preferred EVIs is a new theme that should be examined in future research. Using EVIs with diverse characteristics could enhance the personification and personalization of voice interactions and advance inclusivity in designing VAs.
Trusting VAs’ information accuracy is intertwined with the participants’ perceptions of their VAs’ usefulness and intelligence. However, the reasons most participants gave for trusting the information accuracy of their VAs (e.g. that VAs use the internet to find information) are problematic, especially since the participants had, on average, 4.5 years of experience using VAs. This suggests they learned to use their VAs through trial and error or other experiential learning and did not form adequate mental models of them. Formal AI literacy (including VAs) could enable users to build more adequate mental models that support their interactions, minimize frustration and enhance problem-solving. Such AI literacy requires interventions from educators and information professionals to provide customized student training.
The participants’ preference for a female-gendered voice is not surprising and confirms findings from previous studies. However, it may be attributable to stereotyping or to cultural assumptions embedded in the design of VAs. Much research has explored gender bias in VAs, which predisposes users to favor certain gendered voices (see, for example, Mahmood and Huang, 2024).
The novel finding that the participants’ perceived intelligence of their VAs significantly affected their trust in information accuracy advances our understanding of the interrelationships among trust, information accuracy and perceived intelligence of VAs, which should be investigated in future research. The breakdowns that some participants experienced seem to be persistent in VAs (Ammari et al., 2019; Sa and Yuan, 2021). System design issues (e.g. inaccurate responses to questions, lack of responses and failure to understand speech inputs) should be addressed by VA system designers. The participants’ expectations that their VAs be user-friendly, conversation-friendly, funny and uplifting reflect human-like social characteristics essential for supporting users’ personification of their VAs. As Pitardi and Marriott (2021) indicated, the social factors embedded in VAs influence users’ experience and willingness to continue using them (Ammari et al., 2019; Sa and Yuan, 2021).
Limitations
We collected data on the participants’ ages, genders, races and ethnicities to identify whether these variables affected their selection of the EVIs. Due to the wide variability in the data (see Demographics section), we did not perform a statistical test to identify significance. Another limitation is that we used the self-reported survey questionnaire for data collection, which may be prone to inaccuracy.
The participants’ interest in earning research credit for participation and using SONA is a potential bias; avoiding bias in this type of study may be difficult. However, participation was available to all students in both courses without exclusion criteria. Also, students who did not own or use VAs opted out at the outset of the study survey, leaving those interested in completing the survey, which should have minimized bias.
The EVIs we used as embodiments are limited in their representation of people from diverse backgrounds, races, ethnicities, ages and genders. However, such EVIs could be examples for building prototypes to test in VAs.
Conclusion
The findings from this study extend research on VAs in general, on voice-switching behavior, on user perceptions of VAs’ characteristics and on conceptual understanding of VAs, including how users judge information accuracy. They also introduce a form of embodiment in VAs that has rarely been investigated as a means of enriching user experience and voice interactions. This study confirms that voice switching continues to evolve in VAs, though additional research is needed in this area. Information accuracy in VAs has rarely been examined and should be investigated in future studies.
Although 43% of the participants switched the voice interface, most maintained the status quo (Samuelson and Zeckhauser, 1988), either because they were satisfied with the current voice, were uncertain how to change it or were unaware of the voice-switching options in their VAs. This finding advances status quo theory by incorporating user satisfaction with the current voice and lack of awareness of available options as factors to be tested in future studies. Nonetheless, the design of VAs should be more intuitive and transparent, allowing users to discover the various options available.
Formal AI literacy is clearly needed to enable users to build more adequate mental models of their VAs. Educators and information professionals could intervene to remedy misconceptions about judging information accuracy. In this age of AI, users should be educated in various aspects of AI systems, including VAs. As Gen-AI tools such as ChatGPT have become pervasive and are expected to evolve, educators and information professionals should offer effective AI literacy programs.
The new theme of willingness to switch the voice interface to avoid seeing the least preferred EVIs advances voice-switching behavior research. Additional research is needed to determine whether this theme holds, especially since a previous study (Bilal and Barfield, 2021b) showed that the most preferred EVIs influenced most participants’ decisions to switch the voice interface.
The preference for a female-gendered voice among female participants confirms interface mirroring and similarity attraction theory, but the same preference among male participants does not. Thus, gender is an important factor to examine in VAs, and interface mirroring and similarity attraction theory provided useful directions for explaining this study’s findings. Still, additional research is needed in this area.
From the design perspective, this study’s findings confirm that some users experienced breakdowns and frustrations when using VAs, suggesting that VA designers address these issues to enhance users’ interactions and mitigate frustrations. The findings also suggest incorporating EVIs to support inclusive design and personalization beyond language, accent and gender and to enhance personification in VAs. VA providers should leverage the capabilities of Gen-AI to augment users’ experiences and enhance personalized learning in VAs.
This research was partially funded by the University of Tennessee School of Information Sciences Faculty Research Fund. The authors thank the School for supporting this research.
Figure 1. Embodied voice interfaces (EVIs) as images
Figure 2. Perceived trust in VAs’ information accuracy and reasons
Figure 3. Perceived intelligence of VAs, ratings and reasons
Figure 4. Most frequently mentioned important characteristics VAs must possess
Figure 5. Top three most preferred EVIs
Figure 6. Tendency of voice switching and the most preferred EVIs
Figure 7. Top three least preferred EVIs
Figure 8. Reasons for voice switching and the least preferred EVIs
Table 1. Preferred accents and genders in VA voice switching
| Accent of switched voice | Female voice | Male voice | Gender unmentioned |
|---|---|---|---|
| American | 7 | 2 | 1 |
| Australian | 2 | 3 | 5 |
| British | 5 | 5 | 9 |
| Indian | 0 | 2 | 0 |
| Irish | 0 | 0 | 1 |
| Unmentioned | 5 | 1 | 0 |
| Other | 1 | 0 | 0 |
Source: Table by authors
© Emerald Publishing Limited.
