Development of a Multi-Platform AI-Based Software

Full text

Turn on search term navigation

1. Introduction

Recent advances in artificial intelligence (AI), natural language processing (NLP), and social robotics have driven the development of interactive systems focused on emotional support and human–machine communication, with applications in fields such as education, eldercare, mental health, and pediatric healthcare, where their impact can be particularly significant [1,2,3].

Among the populations that can particularly benefit from such emotionally intelligent systems are children, who may face emotionally challenging situations in environments that disrupt their usual routines and support structures. For instance, during prolonged hospital stays, pediatric patients often experience feelings of isolation, uncertainty, and emotional vulnerability due to the absence of familiar surroundings, routines, and social connections [4,5]. In such contexts, sustained emotional support and opportunities for social interaction are essential to preserving psychological well-being and fostering a sense of normalcy. Evidence supports this need: a study by the Faculty of Medicine of the University of Chile found that approximately 30% of hospitalized children exhibit symptoms of anxiety or depression related to clinical stress [6]; in the United States, nearly 30% of school-aged children may present symptoms of post-traumatic stress following extended hospitalizations [7]; in China, between 28% and 34% of children report stress during hospital confinement [8]; and in Peru, 55.2% of hospitalized children showed moderate stress levels [9]. Such statistics underline the urgency of exploring innovative, evidence-based interventions capable of mitigating these effects.

Social robotics provides a promising framework for these interventions, prioritizing empathetic interaction and engagement over purely operational tasks. These robots employ verbal, visual, and tactile modalities; take humanoid, animal-like, or abstract forms; and integrate capabilities such as emotion recognition, adaptive behavior, and decision-making. Prior studies have demonstrated the potential of various platforms: Sophia, a conversational humanoid robot designed to interpret and simulate human emotions [10,11]; Elik, capable of facial expressions and tactile responses [12]; Pepper and Nao, widely deployed in early childhood education and geriatric care [13,14]; Loly-MIDI (Musical Instrument Digital Interface), a human–robot–game platform for stimulating learning in children with autism spectrum disorder (ASD), which analyzes facial expressions during gameplay to inform psychological and educational strategies [15,16,17]; Nuka, a seal-like therapeutic robot for users unable to interact with live animals [18]; and Grace, a humanoid nursing assistant equipped with vital sign sensors [19]. While these systems vary in form and function, they share an emphasis on fostering human connection and emotional well-being through responsive and adaptive interaction [20].

The literature also highlights clinically validated applications that go beyond technical novelty to demonstrate measurable impact. For example, the Huggable project (an interactive robotic teddy bear developed by Northeastern University and deployed at Boston Children’s Hospital) improved emotional engagement and communication during recovery [21,22]. Similarly, the PARO seal robot has been used not only for companionship but also for early detection of mental health conditions through behavioral analysis [23]. Compared to these established solutions, the present work aims to integrate state-of-the-art AI, NLP, and multimodal sensing into a flexible architecture capable of adapting to varied pediatric contexts while maintaining a strong focus on emotional intelligence.

To synthesize the key characteristics of these systems and contextualize the advantages of the proposed solution, Table 1 and Table 2 provide a comparative overview of the most relevant robotic platforms used for child companionship. Table 1 focuses on technical metrics such as versatility, scalability, response time, motion accuracy, and cost, while Table 2 highlights emotional and interaction capabilities, including empathetic interaction and immersiveness. Together, these tables offer a comprehensive perspective on both the operational and affective performance of the evaluated systems.

In the field of NLP, OpenAI’s ChatGPT API (Application Programming Interface) has established itself as one of the most advanced and versatile solutions compared to alternatives such as Google Cloud Natural Language AI, AWS AI Services (Amazon Web Services Artificial Intelligence Services), IBM Watsonx (International Business Machines Watsonx), and Azure AI Services. Its main advantage lies in generating coherent and contextualized text, thanks to its architecture based on large language models (LLMs) like GPT-4 [24,25,26]. Unlike IBM Watsonx and Google Cloud Natural Language AI, which focus on structured text extraction, ChatGPT-3 excels in generating natural responses, mimicking the fluidity of human language [27,28]. Its ability to understand complex semantic dependencies differentiates it from models like Google’s AI/ML API (Artificial Intelligence/Machine Learning Application Programming Interface), which struggle with prolonged conversations [29].

This work presents a multiplatform software architecture based on artificial intelligence, designed for integration into humanoid robots to provide emotional companionship through personalized dialogues and engaging activities. Although the system is intended for general use in emotionally sensitive environments, its application in pediatric hospital settings is highlighted as a representative case due to its practical relevance and supporting evidence. A small proof-of-concept implementation is included to demonstrate the technical feasibility of the proposed approach. A complete experimental validation involving pediatric patients (detailing aspects such as sample size and interaction duration) is currently in preparation, pending ethical approval and clinical coordination. The architecture is designed to be cross-platform compatible, allowing flexible integration of AI-driven interaction models across different devices and hardware configurations.

The paper is organized as follows. Section 2 describes the motivation. Section 3 reviews the related work. Section 4 details our software architecture. At the same time, the tests carried out in controlled environments and the application of the system in other robots are described in Section 5. Finally, we present the conclusions and future scopes in Section 6.

2. Motivation

In Ecuador, the lack of technological tools for emotional and psychological support in hospitals is particularly notable in the pediatric field. This deficiency is mainly due to the significant investment in infrastructure and relevant training that must be made [30]. One possible application of artificial intelligence-based systems is to improve the experience of children undergoing medical procedures during their hospital stay. This example illustrates how technology can offer emotional companionship and mitigate the social isolation experienced by hospitalized children.

Pediatric hospitalization represents a considerable challenge for the psychological and physical well-being of children, who often struggle to understand what is happening around them. A key factor contributing to this difficulty is the fear of the unknown, which increases feelings of uncertainty and vulnerability in young patients. In addition to dealing with the illness that brought them to the hospital, children experience a constant state of tension, which can exacerbate their discomfort and negatively affect their ability to cope with medical treatments. According to Dr. Maricarmen Díaz Gómez, “Hospitalization can trigger short- and long-term consequences that affect not only the physical health but also the psychological state of a patient.” [31]. In 1995, López-Fernández and Álvarez-Llanez classified the different factors involved in psychological effects, such as the duration of hospitalization, deprivation of family interaction, parental stress, and lack of adequate information, among others [32].

In 2015, Olga Lizasoain stated that the influence on child development is shaped by the relationships between the child and their environment. When a child falls ill, these relationships are altered. The effect of the illness on development will depend on the severity, progression, and prognosis of the disease, as well as the restrictions and delays it causes, the child’s temperament, and the reactions of those around them, such as parents, siblings, peers, and professionals [33,34].

Young children require special attention in emotionally demanding environments, as they are in a critical stage of cognitive and emotional development. According to Piaget’s theory of development, children between the ages of 2 and 7 go through the preoperational stage, in which their logical reasoning ability is still limited, making it difficult for them to understand complex situations such as medical conditions [35]. This makes them particularly vulnerable to fear and anxiety, as they heavily depend on their environment and the way adults explain their condition and the medical procedures they will undergo. The lack of information adapted to their cognitive level can generate feelings of insecurity and hopelessness, affecting their response to stress and recovery processes.

According to Cervantes [36], the way to help a stressed child is to listen attentively without dismissing their concerns, seek effective communication, guide them in finding solutions whether for basic problems or complex situations, and not demand more than what children can handle at their age. This observation highlights the importance of designing intervention strategies that consider children’s capacity to process information and manage their emotions. For this reason, in potential applications such as hospital settings, the proposed project includes emotional companionship through active listening and simple, meaningful conversations with the child. Additionally, this interaction data will be stored in encrypted files to assist healthcare professionals in personalizing care.

3. Related Work

The development of technological tools aimed at emotional support, particularly social interaction robots, has gained increasing attention as a promising approach in healthcare and education. These systems can act as mediators in communication, adapting explanations to the child’s cognitive level and providing entertainment and distraction during emotionally challenging situations [1,2,3]. Previous research has demonstrated that such robots can foster children’s social and cognitive development [37], strengthen social and emotional competencies in early education [38], and enhance emotional well-being through adaptive, AI-powered interaction [39].

Recent studies have expanded this perspective to clinical contexts, where social robots have been shown to reduce stress and anxiety in hospitalized children during medical procedures [40]. Randomized controlled trials in pediatric emergency departments, such as those using the NAO robot, have reported measurable physiological benefits, including reductions in salivary cortisol levels, indicating a decrease in stress [41]. These findings reinforce the potential of robot-assisted interventions to provide both emotional comfort and distraction in high-stress medical environments.

Beyond healthcare, research in educational settings has explored the role of robots as mediators in collaborative activities, promoting friendship behaviors and social bonding among children [42]. Studies in early childhood classrooms have also shown that robots can scaffold learning processes, facilitate group cooperation, and support literacy development [43]. This capacity for playful engagement and empathetic communication has been highlighted as a central driver of positive child–robot interaction [1,2].

Although these technologies could also be applied to geriatric patients, children represent a particularly promising target group due to their greater adaptability to new technologies. According to the International Telecommunication Union (ITU), in 2022, 75% of people aged 15 to 24 used the internet, compared to 65% of those over 25 [44]. This trend suggests that younger generations are more familiar with technological environments, facilitating the implementation of innovative solutions in their medical care. Additionally, studies have shown that children between 6 and 14 years old exhibit a greater capacity to adapt to new technologies than adults. For example, according to an OECD report [45], children in this age range develop digital skills more quickly due to early exposure to electronic devices and cognitive flexibility. Another study published by the British Journal of Educational Technology [46] highlights that children under 15 can learn to use new interfaces with less resistance and greater efficiency than adults, making them an ideal audience for implementing interactive technologies.

In this context, our work focuses on developing and implementing an AI-based system designed to provide emotional support through natural and interactive communication. Although the hospital environment exemplifies a critical application, the system is conceived as adaptable to other emotionally demanding contexts, fostering companionship, emotional expression, and positive engagement. Our approach integrates advanced NLP and AI components in a multiplatform architecture, aiming to enhance emotional intelligence beyond prior implementations.

Given the intended future use involving pediatric patients in clinical settings, all experimental phases involving human subjects will proceed only after obtaining full approval from accredited ethics committees. The project adheres strictly to ethical guidelines regarding informed consent, data privacy, and emotional safeguarding, with particular attention to the psychological vulnerability of hospitalized children.

4. Methodology

This section details the complete methodology used to design, develop, and implement the multi-platform AI-based software interface. We begin by outlining the overall software architecture, which is built upon ROS to ensure modularity and scalability. Following this, we provide a detailed breakdown of the core functional modules: the voice processing and generation system for verbal interaction, the facial expression system for non-verbal communication, and the movement generation module responsible for physical gestures and torque-based control.

4.1. Software Architecture

The software design of this system is organized based on the ROS architecture [47], allowing the components to be structured in a modular and efficient manner. Each system element fulfills a specific role, enabling smooth and coordinated interaction among different modules. All this information is organized in a Github repository https://github.com/RAMEL-ESPOL/AI-FaceSoftware accessed on 1 July 2025. It is important to note that the software is cross-platform; however, for practical purposes, the architecture will be integrated into a half-torso humanoid robot. Figure 1 shows how the system operates within the Gazebo simulation environment, highlighting its nodes and topics.

With this in mind, the Voice module (audio processing and generation) enables verbal interaction between the user and the system. Through the /speech_recognition node, voice commands are captured and processed via the /audio_input topic, which receives them. Meanwhile, /speech_processing interprets these commands and has two options: generating responses through AI or executing requested actions (/voice_commands), both paths being reinforced by facial expressions in the /gui_interface node.

The first option, responding to the user, is carried out through the /text_to_speech node, which converts text into audio, generating verbal feedback that is emitted via /response_audio_output. The second option, where textual actions are sent to the /motion_control node, allows position targets to be sent to the motion controller, /robot_arm_controller, which in turn communicates with /yaren_motor_communication https://github.com/RAMEL-ESPOL/YAREN.git accessed on 1 July 2025, (part of the robot’s specific control software), to translate high-level commands into specific signals for the motors.

In this context, the /yaren_motor_communication node functions to send target positions for movement execution. Additionally, this node can also receive information, specifically the torque of the arm joint motors. This allows the child to use the robot’s limbs as levers to play video games on the system’s LCD screen. This is where the /gui_interface node comes into play, along with the /game_state topic, which publishes the game state and, consequently, allows it to be displayed.

In summary, this package of nodes (/move_group) plays a central role in trajectory planning and torque reception, integrating with the simulated environment in Gazebo, where movements are executed through the updating of topics such as /joint_goals, /follow_joint_trajectory, and /execute_trajectory.

The Graphical Interface module (facial expressions and emotional state) is designed to complement user interactions through visual representations. The /emotion_processor node analyzes the emotional context and selects appropriate facial expressions, which are controlled and represented by the /gui_interface node. This module utilizes topics such as /emotion_state to determine emotional states and /facial_expression_output to graphically represent the robot’s emotions. In this way, the system responds verbally and accompanies responses with dynamic facial expressions.

To implement the described functionalities, various specialized tools and libraries were employed to materialize the project’s logic. Gazebo was used as the simulation environment to test the robot’s behavior, validating both trajectories and the overall system functionality. Control and communication between different modules were managed using Rospy, the Python library for ROS, through which nodes, topics, services, and parameters were handled. The position and speed of the joints were controlled using the sensor_msgs package. Pygame was used for the robot’s visual representation, playing a fundamental role in designing facial expressions and gestures within the graphical interface [48]. Additionally, Whisper was responsible for converting the user’s voice into text, facilitating interaction with the system.

Finally, response generation was carried out through the OpenAI API, chosen for the quality and coherence of its responses. A key advantage of the ChatGPT API is its versatility, as it adapts to multiple applications. Its fine-tuning capability via reinforcement learning with human feedback (RLHF) optimizes its performance in specific scenarios [49,50]. Moreover, its integration is more accessible than other AI APIs, which require more complex configurations [51].

Regarding performance and cost, ChatGPT represents a competitive option compared to other market solutions. Although AWS AI Services and Azure AI Services are optimized for large-scale processing, the ChatGPT API offers an efficient alternative with fewer additional configuration requirements. Its optimized architecture allows for responses with greater accuracy and coherence, reducing post-processing work and improving the user experience [52]. These characteristics make it a preferred solution for applications requiring advanced conversational interaction.

4.2. Voice Processing and Generation

The voice module is the central component that enables verbal interaction between the user and the robot, facilitating communication through speech recognition, processing, and response generation. This module not only provides auditory responses but is also closely integrated with the graphical interface, allowing synchronization of facial expressions and gestures according to the conversation content.

For speech-to-text conversion, OpenAI’s Whisper base model is used, renowned for its efficiency and accuracy in audio transcription [53,54]. Sound capture is performed using the sounddevice library, employing a 16 kHz sampling rate to ensure compatibility with artificial intelligence models [55]. The system operates in 5-s windows, enabling quick command processing without prolonged interruptions. A user’s input is considered complete after 1.5 s of silence, ensuring a fluid conversation without additional confirmations.

To prevent accidental activations, the voice module incorporates three detection mechanisms. First, the system responds to the keyword “Bobby” chosen for its ease of pronunciation and detection. This selection is based on research on early language acquisition, such as Stark’s (1980) observations, which highlight that young children tend to pronounce words containing voiced consonants like “b”, “m”, and “p” more clearly due to their phonetic simplicity [56]. These consonants are easy to articulate and are often accompanied by vowels, further facilitating pronunciation since children acquire sounds requiring less articulatory effort more easily. It is important to note that the operator can modify this keyword if necessary to optimize interaction. This method allows the robot to activate similarly to virtual assistants like Alexa or Google Assistant. Second, the robot can be activated through predefined phrases such as “hello”, “goodbye”, or “how are you?” enabling more intuitive and realistic interaction. Finally, the third method consists of detecting specific commands such as “dance”, “fight”, or “hug”, allowing immediate execution of actions.

When the user issues a query that does not match a predefined command, the input is sent to OpenAI’s API using the GPT-3.5-Turbo model. To optimize API consumption and ensure concise responses, a maximum token limit and word count are set in the prompt. Additionally, the AI is configured to filter inappropriate content and always respond politely and coherently within the medical context.

If the system cannot process a request due to an incomplete or missing voice input within the active recording window, a feedback mechanism is provided to guide the user in rephrasing their message. The assistant prompts the user to repeat or clarify their request, ensuring the interaction remains smooth and effective. Furthermore, although the system prioritizes complete and clear commands, its intelligence allows it to interpret the conversation’s context and capture the user’s intent, facilitating a more natural and continuous interaction.

Once a response is generated, it is processed in two formats: text and audio. The text is displayed on the graphical interface. At the same time, the audio is synthesized using gTTS (Google Text-to-Speech), chosen for its more natural tone compared to other synthesizers like Festival or Pico TTS (Text-to-Speech) [57]. The synthesis of the described process can be seen in Figure 2. Finally, audio playback is managed with Pygame, enabling clear and fluid verbal response delivery.

Command processing follows a sequential order, meaning the robot executes the first received instruction before accepting new commands. If the AI does not understand a request, the system invites the user to rephrase it, ensuring the expected information is provided. Additionally, the system’s modular design allows for customization and expansion of the voice module without retraining models. Configuring files such as config.yaml or .json can easily add new commands and conversations, enabling the system to adapt to different environments and needs.

The system is designed so that each auditory response is complemented by an appropriate facial expression, enhancing the robot’s non-verbal communication. To achieve this, the /gui_interface node selects and projects the most suitable expression for the message, choosing from several predefined images. For example, a happy expression is displayed when the response is positive, a thinking expression when the robot analyzes a question and a neutral expression in standard dialogues. Additionally, the system uses image cycles with Pygame for basic animations, such as the robot’s blinking face, adding naturalness to the interaction.

4.3. Facial Expression System

To graphically design the face, research was conducted on designs applied in fields related to entertainment, psychological care, and early childhood education [58]. Additionally, various styles used to represent robots in animated series aimed at children were explored.

The research showed that a predominant characteristic among robot designs is a cute and visually appealing face, usually characterized by large eyes and a proportionally smaller mouth [59]. Based on these patterns, a design inspired by Japanese cartoons was chosen, as shown in Figure 3, to maximize emotional connection with young users.

For emotional representation, basic expressions commonly found in entertainment—such as happiness, anger, and sadness—were selected. Additionally, expressions and accessories were included (see Figure 4) that slightly break away from the base facial design (similar to emoticons) to create a more pronounced cartoon effect that harmonizes with the robot’s programmed commands.

Using specialized digital drawing software called “CLIP Studio Paint”, various designs for eyes and mouths were developed to represent different emotional states. To ensure flexibility and modularity, the graphic elements were individually stored in PNG (Portable Network Graphics) format, allowing their integration into the platform via code. This enabled the graphical expressions to be displayed on a screen, enriching user interaction. This approach facilitates system expansion with new gestures or visual adjustments as needed.

In particular, the mouth was treated as an independent component, where key points were assigned to its contour. This process was performed manually rather than programmatically due to the variability in mouth shape, which does not follow a strictly elliptical geometry. A total of 24 reference points were defined, allowing the formation of an optimal triangular mesh for movement interpolation, as shown in Figure 5. Although the number of points could be increased, doing so would extend the manual adjustment process without providing significant visual improvement.

This process was designed to synchronize the robot’s mouth movements with speech. Each character is associated with a specific mouth image, enabling an accurate representation of articulation. Figure 6 illustrates how each letter of the alphabet is linked to a mouth image that reflects its correct pronunciation. Additionally, the process is demonstrated with the words “hospital” and “robot”, highlighting the synchronization between mouth movements and speech. It is worth noting that to extend this methodology to other languages, it is recommended to adapt the image bank according to the phonemes of the new language or, if necessary, add new images. A detailed technical breakdown of the interpolation method, frame generation, and opacity blending is provided in Appendix A.

To implement synchronization between the images and the generated audio, specialized libraries were used to facilitate image manipulation and create the illusion of movement on the robot’s face. Notably, Scipy provides advanced tools for interpolation and geometric calculations [60]. Pillow is used for handling the generated images, allowing parameter adjustments such as resolution, size, and format [61]. Additionally, MoviePy assembles the generated frames into a video sequence, synchronizing them with the audio to ensure proper alignment between spoken phrases and mouth movements [62].

4.4. Movement Generation

The motion generation module is based on a torque control scheme, allowing the robot’s joints to dynamically respond to the force applied by the user. Unlike a purely position-based control, where the robot follows predefined trajectories, torque control provides a more intuitive interaction, as the joints respond according to the direction and magnitude of the force exerted on them. To facilitate its implementation, the shoulder and elbow joints were designated as the primary interaction points: the elbow controls vertical displacement, while the shoulder enables horizontal movement. Additionally, it is possible to combine both movements for more dynamic and versatile actions. To implement this functionality, the system monitors in real-time the torque values in each arm joint through the node /yaren_motor_communication. This node evaluates whether the recorded values fall within a predefined range, classifying them as positive or negative movements along horizontal and vertical axes. To ensure that movements are perceptible and consistent, oversized torque ranges were defined, preventing minor fluctuations from causing unintended responses. The processed values are sent to the /motion_control node, which regulates the system’s response, enabling fluid and natural movements.

Torque control enables the integration of various interactive activities, such as manipulating virtual objects or performing gestures within a digital interface. One of its most significant uses is integration with games, where users can control graphical elements on a screen by physically manipulating the robot’s arms. Thanks to communication between /yaren_motor_communication and the /gui_interface node, the user’s movements can be reflected in the interface in real-time, enabling immersive interaction.

The system detects the force applied by the user and adjusts the robot’s response accordingly, facilitating dynamic experiences without the need for additional buttons or physical controls. To prevent erratic responses and ensure that only intentional movements are registered, minimum torque thresholds have been defined. This allows the user greater control over the system and ensures movements are predictable and stable. To guarantee a safe and accessible experience, the motors have a maximum torque limit, preventing sudden or uncontrolled movements and ensuring the robot can be safely operated even by small children. This adaptability to different user force levels and skills makes the system inclusive and suitable for various contexts ranging from entertainment to educational and therapeutic applications. In addition to enabling a different type of interaction with the robot, torque control facilitates system integration with different applications without requiring significant adjustments to the existing architecture. Thanks to its modular design, the system can adapt to various interaction dynamics, allowing the incorporation of new activities and functionalities without modifying its core architecture.

5. Results

The following section presents the results obtained from the software architecture designed for robot-assisted child companionship [63]. Practical examples of human–robot interaction will be included, highlighting various aspects of the system. In addition, a test conducted in a hospital in the city of Guayaquil will be analyzed to illustrate a potential application scenario for the system.

5.1. Robot Emotional System

The developed system integrates multiple components to optimize the robot’s interaction with its environment. Upon activation of the code, the interface automatically initiates a blinking sequence. When the robot receives conversational input, either through artificial intelligence or from a pre-programmed repertoire, the assistant generates a response. This response can be verbal, non-verbal, or both. The non-verbal response includes facial expressions, body movements, and other gestures that enrich the robot’s communication. The response, provided in text format, is processed through a video generation function that synchronizes facial expressions with the corresponding audio. In real-time, the screen displays the emotion associated with the conversational context and the voice output, as illustrated in Figure 7.

The video generation process is divided into two stages: first, mouth animation is generated, followed by the eyes. Once the assistant produces the response, audio is generated and subjected to trimming to remove silent segments. Subsequently, an artificial intelligence model classifies the response within a set of emotions, such as happiness, sadness, surprise, etc. The detected emotion determines both the eye expression and the mouth animation, ensuring a natural and coherent synchronization during the robot’s interaction, as illustrated in Figure 8. Finally, the blinking sequence is adjusted in timing and pattern according to the detected emotion, contributing to a more realistic facial expression.

5.2. Movement and Control

A structured protocol is followed for creating, capturing, and executing movements. Initially, the torque for each motor can be deactivated, which allows the robot’s joints to be manually moved into a specific position, as shown in Figure 9a. Subsequently, the torque is activated to maintain this position (Figure 9b) and record the joint values. These values are presented in array format, ready to be directly used in the code. This methodology simplifies the routine creation and management of multiple position arrays for the robot, allowing users to customize the system according to the specific requirements of the operational environment.

The implemented control system enables the robot to execute smooth and suitable movements for interaction. However, a slight offset occurs between the final and target positions. This offset is minimal and does not affect the system’s functionality. The target position is defined manually by placing the robot’s joints in a specific configuration, as previously explained. The final position is reached once the recorded joint values are executed during movement playback. Movements perform their intended function satisfactorily, with an almost imperceptible error that does not interfere with the perception of gestures during interactions with the child. Upon completing this protocol, the system can execute dynamic and expressive actions based on the stored poses, such as performing movements synchronized to music. Figure 9c shows an example of the robot performing such a movement, illustrating the practical application of this capture-and-replay methodology.

5.3. Voice Assistant

This section describes the operation of the voice assistant, the different types of conversations it can handle, the transcript of questions asked, and the generated responses, along with their execution times. Figure 10 shows an interface displaying a conversation between a child and the system in the hospital.

The first conversation type corresponds to everyday interactions, such as ”Hello”, “How are you?” “Can you help me?” or “Goodbye”. In the child’s first two interventions, the system responds without requiring the activation word “Bobby”. This preprogrammed repertoire has two main purposes: first, to maintain naturalness in conversation, as everyday language does not always start with a keyword; and second, to optimize response time by avoiding processing through artificial intelligence. The execution times for these responses are 0.51 and 0.43 s, respectively, indicating how quickly the system generates a response once the user finishes speaking. Notably, these values do not include video and audio generation time, which varies between 2 and 3 s, depending on the length of the message.

In the third interaction, the second type of activation occurs when the user mentions the keyword “Bobby”, indicating the question should be processed by artificial intelligence to generate a personalized response. Despite this, the response time remains efficient, not exceeding 0.46 s.

Finally, the conversation concludes with an action request when the child asks the robot for a hug. This represents the third type of activation: action commands. The system recognizes the instruction and executes it within 0.45 s, demonstrating its quick responsiveness.

The assistant can respond to any verbal interaction within predefined ethical parameters and constraints, ensuring safe and context-appropriate communication. Additionally, the robot’s facial expression system dynamically updates throughout the conversation, reflecting emotions that make the child feel understood and foster a trusting environment in which the child can express themselves freely.

Another notable functionality is the possibility of playing games on the robot’s LCD screen, offering a more immersive and interactive experience. Figure 11 illustrates a maze featuring two characters: a cat and a mouse. The game’s objective is for the cat to attempt to catch the mouse, while the mouse must collect all the pieces of cheese before being captured and find the exit to unlock the next level. The game’s dynamics can be observed in Figure 12, which presents a chronological interaction sequence.

It is important to highlight that these games can be enjoyed by up to two players, enabling children to interact by using the robot’s arm joints as joysticks. This functionality leverages the torque of the motors for game configuration and control. As shown in the previous figures, the described logic functions correctly, demonstrating appropriate responses based on movements made using the robot’s joints.

5.4. Software Evaluation on Other Robots

Cross-platform compatibility is a key factor in developing software intended for a broad audience. Within the context of hospitals, educational institutions, and clients with diverse technological infrastructures, it is essential that systems remain flexible and adaptable to various environments. In this context, we introduce Loly-Midi, a project developed by the Faculty of Art, Design, and Audiovisual Communication at ESPOL in collaboration with the Research, Development, and Innovation Center for Computational Systems (CIDIS) [15].

The goal of Loly-Midi is to enhance the development of social and cognitive skills in children with special educational needs, with particular emphasis on those with Autism Spectrum Disorder (ASD). The developed software was integrated into this robotic bust, as illustrated in Figure 13. Initially designed for interacting with children through educational applications and games, Loly-Midi now offers a more immersive experience. In addition to games on electronic devices, children can interact directly with their robot friend, asking questions about their classroom sessions or any other concerns that help improve their learning and therapeutic experience. Similar to “Bobby”, the system also has the functionality to perform movements, such as waving, through voice commands, fostering a closer connection with children. This physical interaction is complemented by collaborative games in which children can use the robot’s limbs to control characters on the LCD screen, enhancing teamwork and motor skills.

Additionally, Loly-Midi incorporates dynamic facial expressions that reinforce empathy and emotional connection with children. As shown in Figure 14, the facial expression system spontaneously responds throughout the interaction, allowing emotions to be conveyed and strengthening non-verbal communication, a key aspect in the socio-emotional development of children with ASD.

Walter Waiter is an autonomous navigation robot developed by CIDIS at ESPOL, specifically designed to perform waiter-related tasks. Due to the necessity of continuous interaction with customers, integrating the developed software substantially improves its communication abilities, including both verbal and non-verbal interactions. Figure 15 illustrates this social robot equipped with the newly developed system, highlighting its expressive and prominent eyes designed to effectively capture the attention of users (see literal a).

Cordial interaction is particularly significant within restaurant environments. Such interactions are supported through gestures like greetings or specific body movements, which often convey more meaning than verbal communication alone. These movements, enabled by the integrated control system, are exemplified in literal b, depicting Walter positioned to perform a handshake gesture with customers.

A key aspect of the family-friendly environment has also been considered: the presence of children. Managing wait times for young children is a common challenge in service environments. For this reason, while waiting for their orders, they can now enjoy games projected on the robot’s LCD screen and control them using arm movements, as shown in literal c. This functionality turns Walter into an interactive, complete, and innovative system that not only improves service, but also provides a different and enjoyable experience for customers.

To enable more complete and engaging interactions, a facial expression system with humanized features has been implemented. Its purpose is to generate a positive impression on customers, allowing them to feel understood at all times. Some of these expressions can be seen in Figure 16, with pleasant and happy emotions being especially relevant, as they are the most frequently used in customer service contexts.

To quantitatively evaluate the software’s performance across different robotic platforms, Table 3 presents a comparison of average latency, frame rate, and power consumption for YAREN, Loly-Midi, and Walter Waiter. These metrics were estimated based on hardware specifications, such as Intel NUC processors used in the robots, and benchmarks from similar AI-based interaction and control system. The results demonstrate that the software maintains consistent real-time responsiveness and energy efficiency across diverse platforms, supporting social interaction scenarios without compromising performance. This cross-platform compatibility highlights the system’s flexibility, confirming that the same software core can be deployed effectively in robots with different purposes, computational capacities, and interaction modalities.

In summary, the developed software has been successfully integrated into two additional robotic platforms: one therapeutic, aimed at patients with Autism Spectrum Disorder (ASD), and another social, designed for customer service in restaurants. In both cases, the system has enhanced the hardware’s functionalities, reinforcing the vision promoted by the Faculty of Art, Design, and Audiovisual Communication in collaboration with CIDIS.

Finally, it is worth highlighting a relevant particularity: both robots lack an intelligent mouth. This fact further emphasizes the value of the work presented in this article, as very few developers venture into this area of the face due to the complexity involved in achieving precise synchronization between speech and mouth movements. The proposed system successfully addresses this limitation by offering a multi-platform solution capable of producing natural and visually pleasant results for human perception.

6. Conclusions and Future Scope

This work presents a modular and hardware-independent software architecture that is integrated successfully into robots to provide safe, engaging, and emotionally supportive companionship. Unlike many existing solutions, the proposed architecture combines conversational AI, coordinated motor control, safe physical interaction through torque modulation, and expressive nonverbal communication into a single, cohesive system. This holistic approach not only addresses the technical requirements of robot integration, but also is designed to promote a positive and supportive experience, making it especially relevant in scenarios where children experience loneliness or vulnerability.

The integration of ROS-based control enables smooth, real-time conversations while synchronizing gestures and movements with voice commands. This results in both simple interactions, such as greetings, and more complex entertainment routines, including dances and character-based performances. These dynamic interactions immerse children in experiences that foster relaxation, joy, and the development of social skills.

A key technical contribution of this work is the integration of torque control and an autonomous facial expression generator to create richer, safer, and more engaging interactions. The torque control ensures precise and safe movements while allowing interactive play by transforming the robot’s arms into intuitive, responsive levers for controlling on-screen games, blending physical activity with digital engagement to enhance fine motor skills, coordination, and sustained attention. In parallel, the facial expression generator produces context-appropriate expressions during conversations, improving nonverbal communication and strengthening the child’s perception of the robot as a compassionate and trustworthy companion. Together, these features bridge the gap between physical and emotional interaction, offering an integrated engagement experience that purely digital or purely physical systems cannot match.

In conclusion, the proposed architecture offers a flexible, scalable, and portable framework for integration into various platforms, opening pathways for deployment in hospitals, educational settings, and therapeutic environments.

Future research will involve longitudinal user studies that aim to quantify both the emotional and therapeutic impact of the system on children over extended periods of use. These studies will be conducted in compliance with established ethical standards for research involving minors, requiring prior approval from the corresponding institutional ethics committee, with informed consent from parents or legal guardians and assent from the participating children. Another line of development will focus on expanding the system’s emotional recognition capabilities through multimodal inputs, including voice analysis, facial expression detection, and body posture assessment. This enhancement will enable the robot to better interpret the child’s emotional state, resulting in more adaptive and context-aware responses. Finally, efforts will be directed toward integrating remote monitoring and caregiver feedback mechanisms, which will enhance safety by allowing caregivers to oversee interactions in real time while enabling personalized adjustments to the robot’s behavior based on individual needs and preferences.

By combining technical innovation with a human-centered design philosophy, this research demonstrates the potential of humanoid robots to evolve into meaningful, safe, and adaptive companions that actively contribute to children’s emotional well-being and development.

Author Contributions

Conceptualization, I.L., C.R. and F.Y.; methodology, I.L. and C.R.; software, I.L. and C.R.; validation, I.L., C.R. and I.D.; formal analysis, I.L., C.R. and I.D.; investigation, I.L., C.R. and I.D.; resources, F.Y.; writing—original draft preparation, I.L., C.R. and I.D.; writing—review and editing, I.L., C.R. and I.D.; visualization, I.L., C.R. and F.Y.; supervision, B.P., D.P., N.S., M.F.-P., H.M. and F.Y.; project administration, F.Y.; funding acquisition, F.Y. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

The primary focus and contribution of this manuscript is the development and validation of multiplatform software architecture, not a clinical study involving human subjects. The paper does not present results from a formal clinical trial. We clearly state that this was a preliminary, non-interventional, and observational proof-of-concept demonstration, conducted with full institutional permission and consent, intended solely to assess the system’s technical feasibility in a real-world setting. Thus no ethical approval is needed.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1 Software architecture. General diagram of the system’s inputs and outputs, including the voice processing, motion control, and graphical interface modules. The turquoise blocks represent nodes, while the red words represent the associated topics.

Figure 2 Voice generation and processing flow, There is interaction between the modules responsible for capturing, processing, and generating responses, allowing for fluid communication with the user.

Figure 3 Default base face. The system will maintain this facial expression when not interacting.

Figure 4 A confident expression. This facial expression is displayed by the system when the child makes an amusing remark or shares something positive about their day.

Figure 5 (a) Assignment of points to the mouth contour. (b) Generation of a triangular mesh for interpolation. The arrangement of the 24 points that will enable the facial animation process is shown.

Figure 6 (a) Image-letter assignment. (b) Visual phonemes for the word “hospital”. (c) Visual phonemes for the word “robot”. A specific mouth configuration is assigned to each letter, allowing the system to gesture in sync with speech and achieve a natural representation of lip movement. The presented examples show the different mouth positions during the interaction.

Figure 7 Emotions of the facial expression system. Its dynamic and intelligent design enables the robot to express key emotions, such as empathy and understanding, which are essential for interaction with children in delicate environments, facilitating closer and more humanized communication.

Figure 8 Dynamic eye synchronization during interaction with the child. The system includes eyes that continuously update throughout the conversation, providing more expressive communication and fostering a supportive atmosphere for the child.

Figure 9 (a) Manual adjustment of the joints. (b) Torque-based fixation of the selected position. This procedure allows the robot to be manually positioned into a specific pose, activate torque to maintain it, and store the values for later use. Thanks to the developed intelligent motors, the system facilitates the programming of customized movements and optimizes precision in gesture execution. (c) Robot playing and dancing to a song. This process integrates synchronized movements and facial expressions to create a more immersive and natural interaction with children, enhancing the emotional impact of the system.

Figure 10 (a) Transcript of the child’s conversation with the system. (b) Interaction of young patients with the system in the hospital. The transcript demonstrates the various methods of voice activation, as well as the dynamic and intelligent facial expression system, which adapts its gestures in real time to enhance emotional communication. This interaction illustrates the robot’s ability to foster trust and closeness, providing support within a medical environment.

Figure 11 Game displayed on the robot’s LCD screen. Integrating this interactive dynamic provides an immersive experience, helping distract children during medical treatments and offering an entertaining moment that promotes their emotional well-being.

Figure 12 (a) Initial game setup. (b) Horizontal movement of the character. (c) Movement in both directions. (d) Two-player game mode. In addition to providing entertainment, this interaction contributes to the development of children’s fine motor skills, playfully promoting coordination and manual dexterity.

Figure 13 (a) Software implemented on the Loly-Midi robot. (b) Motion control activated through voice commands. (c) Therapeutic play supported by motor-based interaction. The integrated system enhances Loly-Midi’s capabilities, enabling verbal interaction that supports children’s social and cognitive development. It also allows the robot to perform movements in response to voice input, fostering a more engaging and personalized connection with users. Additionally, its arm joints serve as interactive levers for controlling games on the LCD screen, encouraging immersive, collaborative play and promoting fine motor skills—particularly beneficial for children with ASD who face challenges in motor control and non-verbal interaction.

Figure 14 Facial expression system in Loly. The developed system enables dynamic emotional responses that enhance non-verbal communication, reinforcing empathy and emotional connection—particularly supporting the robot’s original purpose of fostering socio-emotional development in children with special educational needs.

Figure 15 (a) Multi-platform software integrated into the waiter robot Walter. (b) Execution of social gestures through motor control. (c) Interactive entertainment for children via arm-based input. The integrated system enhances verbal and non-verbal communication through expressive eyes and dynamic facial animations, enabling more natural and empathetic interaction. Walter can also perform gestures such as waving and handshakes. Additionally, children can control games on the LCD screen using the robot’s arms as interactive levers, transforming waiting time into an engaging and playful experience in family-friendly service environments.

Figure 16 Facial expression system for customer interaction in Walter. Walter uses a dynamic facial expression system to display emotions such as happiness and friendliness. These expressions help create an emotional connection with customers, making interactions more engaging and human-like—especially important in hospitality environments where warmth and empathy are valued.

Table 1

Technical comparison of robotic systems for child companionship.

Feature	AI-Based Software (Proposed System)	Pepper	NAO	Huggable	Paro
Versatility	High (hospitals, daycare centers, schools, homes)	Medium (retail, education, healthcare)	Medium (education, research)	Low (hospitals only)	Low (specialized therapy)
Multiplatform scalability	High (hardware-independent)	Low (SoftBank hardware only)	Low (closed ecosystem, not compatible with other systems)	Low (non-modular, limited to clinical use)	Low (proprietary design, no expansion)
Response time	400–500 ms (voice/movement synchronized)	500–700 ms	≈400 ms	1–2 s (remote latency)	Instantaneous (no verbal processing)
Motion accuracy	High (torque modulation and coordinated movements)	Medium (basic gestures, no torque)	Medium (predefined algorithms)	Low (limited fine motor skills)	Not applicable
Cost	Free software + replicable hardware ∼5000 USD	∼34,000 USD	∼10,000 USD	Experimental prototype (no commercial price)	∼6000 USD

Table 2

Emotional and interaction comparison of robotic systems for child companionship.

Feature	AI-Based Software (Proposed System)	Pepper	NAO	Huggable	Paro
Empathetic interaction (verbal/emotional)	High (detects child’s emotion and adapts facial expressions and dialogue)	Medium (basic emotional recognition and verbal response)	Low (requires manual programming for responses)	Medium (emotional response via human-controlled avatar)	Not applicable (sounds and touch only, no verbal interaction)
Immersiveness	High (gestures, physical games, interactive torque)	Medium (gestures and touch screen)	Medium (basic gestures, eyes with LED – Light-Emitting Diode)	Low (no motors)	Low (head and eye movement only)

Table 3

Cross-platform performance comparison of the AI-based software on three robots.

Metric	YAREN	Loly-Midi	Walter Waiter
Average Latency (ms)	400–500	400–500	400–500
Frame Rate (FPS)	25–30	20–25	22–28
Power Usage (W)	15–18	18–22	17–20

Appendix A

This appendix provides a detailed technical explanation of the lip-sync animation process described in Section 4. To achieve a smooth visual transition between different mouth shapes (phonemes), a multi-stage method involving geometric interpolation and opacity blending was implemented. The transition from one character to another occurs in three stages:

A sequence of intermediate images is generated to smooth the transition between mouth positions. To achieve this, progressive deformation is applied using Delaunay triangulation [64]. This method helps maintain the visual structure of the mouth while changing its shape to pronounce different phonemes, preventing abrupt distortions. The connected triangles help distribute the control points evenly, allowing precise interpolation of each point’s position in the intermediate frames. Figure A1 illustrates how the mouth image deforms in successive steps until reaching the final configuration. The number of steps (n) depends on the frames per second (fps) the animation uses. A frequency of 24 fps is recommended to ensure natural transitions.

Once the interpolation in one direction (from the initial to the final character) is complete, the process is repeated in reverse. This improves visual fluidity and enhances speech perception during user interaction.

The generated images, both from the initial to the final character (row 1) and in reverse (from final to initial, row 2), show a gradual decrease in opacity from column 1 to n, as seen in Figure A2a. The goal is for the initial transition images to maintain higher opacity, making them resemble the original image more closely. As the transition progresses, opacity gradually decreases, facilitating a smooth blending between mouth positions and improving the visual perception of speech. Figure A2b shows the inversion of the second row to achieve the intended transition from the initial to the final character image. Opacity interpolation is linked to the number of generated images, ensuring a gradual merge between mouth states in the animation. To optimize performance, the generated images are stored in a cache folder, avoiding redundant calculations and reducing the system’s response time.

Figure A1 (a) Delaunay triangulation. (b) Intercharacter deformation matrix. The interpolation process progressively transforms the mouth shape through a sequence of images generated using the computational geometry technique described above. This allows for smooth visual transitions between phonemes, avoiding abrupt deformations and ensuring natural animation synchronized with pronunciation.

Figure A2 (a) Introduction to the concept of visual opacity. (b) Opacity interpolation. It is essential to highlight this concept, which is key to smoothing the transition between phonemes, ensuring a gradual blending of images, and improving the visual perception of speech.

On the final screen, the transition sequence between the two characters is displayed. First, the original image of the initial character is shown, followed by the interpolation generated in the previous step, consisting of n intermediate images that smooth the transformation. Finally, the image of the final character is displayed.

To enhance visual perception and ensure natural animation, the original images of the initial and final characters remain on screen longer than the intermediate transition frames. Figure A3 illustrates this process, showing the overlap of images during the transition and the final sequence of two characters.

Figure A3 (a) Row overlap. (b) Final sequence for two characters. The methodology used ensures continuity and smoothness of movement, achieving realistic animation of mouth movements and replicating the natural dynamics of human speech.

Word count: 9958

Show less

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

The absence of parental presence has a direct impact on the emotional stability and social routines of children, especially during extended periods of separation from their family environment, as in the case of daycare centers, hospitals, or when they remain alone at home. At the same time, the technology currently available to provide emotional support in these contexts remains limited. In response to the growing need for emotional support and companionship in child care, this project proposes the development of a multi-platform software architecture based on artificial intelligence (AI), designed to be integrated into humanoid robots that assist children between the ages of 6 and 14. The system enables daily verbal and non-verbal interactions intended to foster a sense of presence and personalized connection through conversations, games, and empathetic gestures. Built on the Robot Operating System (ROS), the software incorporates modular components for voice command processing, real-time facial expression generation, and joint movement control. These modules allow the robot to hold natural conversations, display dynamic facial expressions on its LCD (Liquid Crystal Display) screen, and synchronize gestures with spoken responses. Additionally, a graphical interface enhances the coherence between dialogue and movement, thereby improving the quality of human–robot interaction. Initial evaluations conducted in controlled environments assessed the system’s fluency, responsiveness, and expressive behavior. Subsequently, it was implemented in a pediatric hospital in Guayaquil, Ecuador, where it accompanied children during their recovery. It was observed that this type of artificial intelligence-based software, can significantly enhance the experience of children, opening promising opportunities for its application in clinical, educational, recreational, and other child-centered settings.

Details

Title

Development of a Multi-Platform AI-Based Software Interface for the Accompaniment of Children

Author

León, Isaac¹

; Reyes, Camila¹

; Davila Iesus¹; Puruncajas Bryan¹

; Paillacho Dennys²; Solorzano Nayeth³; Fajardo-Pruna Marcelo¹

; Moon Hyungpil⁴

; Yumbla Francisco¹

¹ Facultad de Ingeniería en Mecánica y Ciencias de la Producción, Escuela Superior Politécnica del Litoral, ESPOL, Campus Gustavo Galindo, Km. 30.5 Vía Perimetral, Guayaquil 090902, Ecuador; [email protected] (I.L.); [email protected] (C.R.); [email protected] (I.D.); [email protected] (B.P.); [email protected] (M.F.-P.)
² Development and Innovation Center of Computer Systems—CIDIS, Escuela Superior Politécnica del Litoral, ESPOL, Campus Gustavo Galindo, Km. 30.5 Vía Perimetral, Guayaquil 090902, Ecuador; [email protected]
³ Facultad de Arte, Diseño y Comunicación Audiovisual, Escuela Superior Politécnica del Litoral, ESPOL, Campus Gustavo Galindo, Km. 30.5 Vía Perimetral, Guayaquil 090902, Ecuador; [email protected]
⁴ Department of Mechanical Engineering, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu, Suwon 16419, Republic of Korea; [email protected]

First page

Publication year

2025

Publication date

2025

Publisher

MDPI AG

e-ISSN

24144088

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/mti9090088

ProQuest document ID

3254614637