This paper describes a real-time system for recognizing voice commands on resource-constrained embedded devices, specifically a PIC microcontroller. While most existing voice-command recognition solutions rely on high-performance processors or cloud computation, the system described here performs all processing locally on a low-power device. Sound is captured through a low-cost MEMS microphone, segmented into short audio frames, and time-domain features, namely Zero-Crossing Rate (ZCR) and Short-Time Energy (STE), are extracted. These features were chosen for their low power draw, computational efficiency, and suitability for real-time processing on a microcontroller. For this experimental system, a small vocabulary of four command words (i.e., “ON”, “OFF”, “LEFT”, and “RIGHT”) was used to simulate realistic voice-control interfaces. The main contribution is the combination of lightweight signal-processing techniques with embedded neural network inference, completing a classification cycle in real time (under 50 ms). Classification accuracy above 90% was demonstrated using confusion matrices and timing analysis of the classifier’s performance across vocabularies with varying levels of complexity. The method is well suited to IoT and portable embedded applications, offering a low-latency alternative to more complex and resource-intensive classification architectures.
1. Introduction
Today, voice recognition technologies can be found in modern applications such as smart homes [1,2], wearable devices [3], robots [4], and car interfaces [5]. They provide an easy, hands-free way to control devices, and their growing popularity is driven by improvements in speech recognition [6,7]. However, traditional solutions often involve high-performance processors or cloud-based computation, which can be problematic for small, battery-powered embedded devices in terms of latency, privacy, and connectivity [8,9].
Developing voice command recognition on microcontrollers poses many challenges, including limited CPU cycles, memory, and power budgets [10,11]. While previous researchers have achieved embedded speech processing using DSP solutions on ARM Cortex-M platforms [12,13], an efficient method for real-time command recognition has remained elusive. More recent studies have reduced the computational burden by using lightweight audio features such as MFCCs [14,15] or simple time-domain features like signal energy and zero-crossing rate (ZCR) [16,17]. Some researchers have reduced processing loads further by using simple classifiers such as SVMs [18] or small neural networks [19,20].
PIC microcontrollers now include DSP blocks and floating/fixed-point arithmetic, enabling advanced on-chip signal processing [21]. Some studies have demonstrated small neural networks running on PICs for limited-vocabulary recognition [22,23]; however, real-time performance for multiple commands has not been fully demonstrated.
In this work, we propose a real-time voice command recognition system using a PIC microcontroller. Audio is captured via a low-cost MEMS microphone, and simple time-domain features (ZCR and STE) are extracted. These features feed a small multilayer perceptron (MLP) neural network that classifies four basic commands: “ON”, “OFF”, “LEFT”, and “RIGHT” (See Figure 1).
The system design focuses on lightweight, adaptable, and resource-efficient keyword detection with latency management that remains robust across acoustic conditions. The neural network is trained offline and quantized to fixed point to fit the constraints of the PIC platform. Our experimental results show over 90% accuracy, latency under 50 ms, and low power consumption, verified through confusion matrices and measured timing and power. Although consumer devices support complex voice commands, our work shows that real-time voice command recognition can be executed even on ultra-low-power microcontrollers. The presented approach is well suited to IoT devices, embedded robotics, and smart home use cases, where latency, memory, and power constraints rule out cloud-based or higher-performance solutions. In this way the present work establishes a baseline framework for lightweight on-device speech commands that can extend to larger vocabularies or more complex command sets.
The main contributions of this work are the following:
- Design, implement, and evaluate a voice-controlled system on a PIC microcontroller.
- Structure a keyword-spotting framework that is lightweight, versatile, and deployable in resource-limited environments.
- Ensure low-latency communication between system components so the system remains functional and responsive.
- Use progressive latency management to adapt to changes in ambient noise.
- Test and verify that the system works as expected.
- Provide a modular and expandable design that supports future extensions without constraining other aspects of the system.
This paper is structured as follows: In Section 2 we provide a review of the literature relevant to voice command recognition systems and their embedded implementations. In Section 3 we describe the methodology, covering dataset preparation, feature extraction, and model training. Section 4 details the experimental setup and reports the results of testing the system implemented on embedded hardware. Section 5 evaluates the performance of the system, including accuracy, inference time, and power consumption, while identifying its present limitations. Section 6 provides a broader discussion of the findings, implementation challenges, and opportunities for improvement. Finally, Section 7 concludes and outlines future work.
2. Literature Review
This section looks at the history and current developments in real-time voice command recognition, with special reference to the implementation of this technology in PIC microcontrollers. Subjects of relevance are system architecture, algorithms, signal processing techniques, hardware constraints, and optimization methods. Voice command recognition has gained an important place in embedded systems, especially applications that require real-time processing and low resource environments.
2.1. Voice Recognition Technology’s Progression
Voice recognition technologies have evolved from being dependent on powerful cloud-based resources that are not very portable to lighter and more efficient options that can be embedded. Traditional systems required a combination of algorithms, along with the computational expense of running them on powerful servers or processors. In recent years, advancements in technology have started to allow lighter-weight voice recognition to happen directly onboard the device with embedded devices [24], as there is an increasing demand for hands-free control options, especially in assistive technology, industrial automation, and smart homes.
Contemporary systems rely on signal capture, preprocessing, feature extraction, and classification, followed by execution of the command [25]. Higher-end systems use deep learning models that provide reliable recognition; however, they fall short on resource-constrained devices such as microcontrollers unless heavily optimized.
2.2. Role of Microcontrollers in Voice Recognition
Microcontrollers are valued in embedded applications for their low power consumption, real-time processing, and cost efficiency. The PIC family (e.g., PIC16F877A, PIC18F45K22) exemplifies devices capable of real-time operation despite their limited processing power and memory [26]. In this context, microcontrollers are chosen to implement voice-controlled systems where simplicity, reliability, and resource efficiency are critical. The challenge lies in adapting standard voice recognition pipelines to fit these constraints, which often requires optimization at both algorithmic and hardware levels [27].
2.3. Embedded Systems Feature Extraction Techniques
Feature extraction, which converts raw audio data into a simpler and more tractable representation for later classification, is an essential part of voice recognition. Mel-Frequency Cepstral Coefficients (MFCC) are a widely used method for capturing the timbral information of speech and are popular due to their high accuracy and efficiency [28]. Linear Predictive Coding (LPC), which models the human vocal tract, is an efficient technique well suited to low-power devices. The Fast Fourier Transform (FFT) is another important technique that transforms time-domain data into the frequency domain, enabling more in-depth spectral analysis.
Lightweight and real-time techniques are important in embedded systems and microcontrollers, as they have limited processing capabilities. Recent work has focused on improving MFCC for microcontroller platforms by optimizing the filter bank, making the frame sizes smaller while retaining accuracy and ensuring real-time performance [29].
2.4. Classification of Voice Commands
Once the command characteristics are captured, classification methods determine the spoken command. KNN is simple to use and efficient with small datasets. Decision trees are light in memory and quick in execution. SVMs may need more resources, but generalize better. ANNs incur more CPU resources but can model complex patterns. Fast models are preferred for the real-time requirements of embedded systems [30,31].
2.5. Experimental Implementations on PIC Microcontrollers
Multiple studies have experimentally verified the voice recognition abilities of PIC microcontrollers under laboratory conditions. For example, a PIC16F877A was paired with an ISD1820 module to categorize five voice commands for a home automation project in relatively quiet ambient surroundings [32]. A PIC18F4550 was also shown to perform simple MFCC-based recognition from external memory, achieving 85% accuracy [33]. These studies highlight the value of external speech-processing modules, both to reduce load on the processor and to enable real-time operation. The trials also confirmed that several variables affect recognition performance, including ambient noise level, the number of voice commands, and the sampling rate. To improve efficiency, “lightweight” approaches have been proposed that use a simplified feature extraction method (such as reduced MFCC or FFT) to obtain features from voice commands while preserving recognition accuracy and respecting the processing and memory limits of the PIC microcontroller. In conclusion, despite limited resources, PIC microcontrollers can be practically used for voice command recognition when combined with optimized algorithms and modules.
Table 1 is a comparative summary of embedded speech recognition systems, listing the microcontroller, voice module, feature extraction and classification methods, real-time capability, and the primary goal of each study. Unlike many systems that rely on more complex feature extraction methods such as MFCC, the work presented here combines a low-complexity neural network classifier, operating in real time on a low-power PIC microcontroller, with only the most basic time-domain features: energy and zero-crossing rate (ZCR).
The PIC microcontroller was chosen as a low-cost, low-power, and widely available component typically used in embedded systems. ARM or Raspberry Pi-class solutions deliver more resources but at higher cost and power consumption. Thus, the PIC microcontroller is more representative of the resource-constrained context in which TinyML solutions must operate. In comparison to the prior work in Table 1, our system improves upon previous designs. For example, compared to [26,27], it implements full real-time voice command recognition rather than basic triggering or playback. Unlike [28,29], it performs recognition using internal PIC microcontroller resources, without additional external components such as a DSP or a high-performance processing board. From a processing-efficiency perspective, [30,31] require FFT or complex hybrid rules, while our approach uses lightweight ZCR and STE features and executes recognition in under 50 ms. Finally, unlike [33], where a static lookup table is applied, our design features a quantized neural network classifier that has the potential to generalize across different commands. In summary, these contributions reflect the novelty of the proposal for real-time TinyML-based voice recognition on ultra-low-power microcontrollers and its broader implications.
3. Methodology
This section outlines the design and implementation of the real-time voice command recognition system on a constrained PIC microcontroller. The audio signal is collected through a simple microphone, and two low-complexity features, signal energy and zero-crossing rate, characterize the speech signal. These features are input to a small neural network, which is trained offline and implemented in fixed-point arithmetic to respect the microcontroller’s limitations. The system recognizes commands entirely in the embedded software, enabling rapid, low-power recognition of four basic voice commands. This approach allows real-time operation without sacrificing accuracy in an embedded context.
3.1. System Architecture
The proposed system enables the PIC microcontroller to perform real-time voice command recognition. The system consists of two phases: offline training and embedded inference, as illustrated in Figure 2.
During offline training, a small quantized Multi-Layer Perceptron (MLP) is trained using a dataset of labeled voice commands. While common feature extraction techniques such as MFCC, LPC, and FFT are widely used in the literature, they were not employed in this work due to their high computational demand. Instead, the system relies on Zero-Crossing Rate (ZCR) and Short-Time Energy (STE), lightweight features that enable real-time execution on resource-constrained devices such as the PIC microcontroller.
In real-time operation, analog voice signals are captured by a low-cost MEMS microphone, sampled at 8 kHz (consistent with Tables 3 and 8), and digitized by the PIC’s ADC. The signals are framed and windowed, and the ZCR and STE features are computed. The resulting feature vectors are fed into the quantized MLP model deployed on the PIC, which classifies the commands. Finally, the recognized command is translated into a control signal for the actuation system, such as turning lights or motors on or off.
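As an illustration of this acquisition stage, the sketch below polls the PIC’s 10-bit ADC at a timer-paced 8 kHz rate and centers the samples around zero. It is a minimal sketch assuming an XC8-style register interface (ADCON0, ADRESH/ADRESL, and TMR2IF are standard PIC18 register names) and a right-justified ADC result; it is not the authors’ firmware.

```c
#include <xc.h>        /* XC8 device header (assumed toolchain) */
#include <stdint.h>

/* Read one 10-bit ADC sample and remove the mid-rail DC bias so the
 * microphone signal becomes signed around zero (assumes ADFM = 1,
 * i.e., a right-justified conversion result). */
static int16_t adc_read_sample(void)
{
    ADCON0bits.GO = 1;                        /* start a conversion      */
    while (ADCON0bits.GO) { }                 /* GO clears when finished */
    uint16_t raw = ((uint16_t)ADRESH << 8) | ADRESL;
    return (int16_t)(raw & 0x03FF) - 512;
}

/* Fill a buffer at the 8 kHz rate; Timer2 is assumed to be configured
 * to roll over every 125 us, pacing exactly one sample per period. */
void capture_audio(int16_t *buf, uint16_t n)
{
    for (uint16_t i = 0; i < n; i++) {
        while (!PIR1bits.TMR2IF) { }          /* wait for the next tick */
        PIR1bits.TMR2IF = 0;
        buf[i] = adc_read_sample();
    }
}
```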
The functional architecture developed for real-time voice command recognition on a PIC microcontroller is summarized in Table 2, which presents the main processing blocks and their interactions; each block is responsible for a significant step in the recognition pipeline.
Role and Novelty of Motor Control in the Proposed System: The motor control component in the proposed system is not mandatory for the TinyML model or the voice recognition flow itself, but it serves as a practical demonstration of the system’s capability to translate recognized voice commands into real-world actions. By integrating motor or actuator control, the system showcases how voice commands can directly interact with embedded hardware, which is critical for applications such as smart home automation, robotics, or IoT devices. The novelty lies in combining ultra-low-power and real-time voice recognition with direct actuation, demonstrating that a resource-constrained PIC microcontroller can reliably handle both classification and control tasks simultaneously. This integration emphasizes the efficiency, low latency, and modularity of the proposed system, highlighting its applicability to embedded systems requiring hands-free control of physical devices.
3.2. Flowchart of the Microcontroller Software
The flowchart of the microcontroller software is depicted in Figure 3 and executes as a continuous real-time loop. After initializing the peripherals (ADC, timer, GPIO) and loading the MLP weights, the MCU samples one second of audio, preprocesses the signal, and extracts features. These features are classified by the quantized MLP, which activates the control signals to the actuators. The loop then returns to sample the next utterance.
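The loop in Figure 3 can be summarized by the following sketch. The helper names are assumptions standing in for the firmware modules described above, not the authors’ actual API; note also that on a RAM-limited PIC the frame features would in practice be computed on the fly during capture rather than from a full 1 s buffer, which is shown here only for clarity.

```c
#include <stdint.h>

#define NUM_SAMPLES  8000u   /* 1 s of audio at 8 kHz        */
#define NUM_FEATURES  200u   /* MLP input size (see Table 4) */

/* Hypothetical module interfaces mirroring the Figure 3 blocks. */
extern void    capture_audio(int16_t *buf, uint16_t n);
extern void    extract_features(const int16_t *buf, int16_t *feat);
extern uint8_t mlp_classify_q15(const int16_t *feat);  /* returns 0..3 */
extern void    actuate(uint8_t cmd);

void control_loop(void)
{
    static int16_t buf[NUM_SAMPLES];
    static int16_t feat[NUM_FEATURES];

    for (;;) {                           /* continuous real-time loop  */
        capture_audio(buf, NUM_SAMPLES); /* sample one second of audio */
        extract_features(buf, feat);     /* framing + ZCR/STE          */
        actuate(mlp_classify_q15(feat)); /* classify, then drive I/O   */
    }
}
```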
3.3. Voice Command Processing Pipeline on Microcontroller
This section describes the whole data processing pipeline running on the microcontroller. It discusses the acquisition and preprocessing of the raw audio signal, and extraction of pertinent features from the voice input.
3.3.1. Audio Capture and Preprocessing
The complete process of recording and digitizing the voice signal and preparing it for feature extraction is described in this subsection.
As shown in Table 3, the system chains a MEMS microphone, ADC conversion, framing and windowing, and denoising into an efficient signal-preprocessing sequence for real-time voice command recognition on microcontroller-based platforms.
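As a concrete sketch of the windowing step in Table 3, each frame can be multiplied by a precomputed Q15 Hamming window before feature extraction. The coefficient array below is a placeholder that would be generated offline; this is an illustration, not the authors’ implementation.

```c
#include <stdint.h>

#define FRAME_LEN 160u   /* 20 ms frame at 8 kHz (Table 3) */

/* Q15 Hamming coefficients w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1)),
 * scaled by 2^15 and generated offline (placeholder array). */
extern const int16_t hamming_q15[FRAME_LEN];

/* Apply the window in place; one 16x16 multiply per sample. */
void apply_window(int16_t *frame)
{
    for (uint16_t n = 0; n < FRAME_LEN; n++)
        frame[n] = (int16_t)(((int32_t)frame[n] * hamming_q15[n]) >> 15);
}
```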
3.3.2. Feature Extraction
In this work, the implemented features on the PIC microcontroller are Zero-Crossing Rate (ZCR) and Short-Time Energy (STE). These features were selected because they are computationally simple, require minimal memory, and are suitable for real-time execution on resource-constrained embedded systems. While MFCC, LPC, and FFT are commonly used in embedded voice recognition systems, they were not used here due to the limited processing capability of the PIC microcontroller. ZCR and STE provide sufficient discriminative power for the small vocabulary of four commands (‘ON’, ‘OFF’, ‘LEFT’, ‘RIGHT’) and allow for rapid inference below 50 ms per command.
The Zero-Crossing Rate of a frame of $N$ samples is defined as

$$\mathrm{ZCR} = \frac{1}{2N} \sum_{n=1}^{N-1} \left| \operatorname{sgn}(x[n]) - \operatorname{sgn}(x[n-1]) \right| \tag{1}$$

where $x[n]$ represents the value of the audio signal at the $n$-th sample, and $\operatorname{sgn}(\cdot)$ is the sign function defined as

$$\operatorname{sgn}(x[n]) = \begin{cases} 1, & x[n] \ge 0 \\ -1, & x[n] < 0 \end{cases} \tag{2}$$

This expression counts the number of sign changes between adjacent samples, which is equivalent to the number of zero-crossings, and then normalizes it by the total number of samples in the frame. The division by $2N$ constrains the ZCR value to the range [0, 1].
In addition, the Short-Time Energy (STE) computes the energy of each signal frame, indicating voiced regions:

$$\mathrm{STE} = \sum_{n=0}^{N-1} x[n]^{2} \tag{3}$$

where $x[n]$ is the $n$-th sample in the frame and $N$ is the total number of samples in the frame.
This equation is essentially summing the squared amplitude for every sample in that frame and returning a value proportional to the energy of that segment.
Equations (1) and (3) show that both features can be computed directly from the input signal over short overlapping frames. Their simplicity makes them suitable for real-time execution on a PIC microcontroller, which has limited memory and processing power.
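A minimal fixed-point sketch of Equations (1) and (3) is shown below; it illustrates the computation rather than reproducing the authors’ exact implementation. The ZCR is returned in Q15, and the STE accumulator is pre-shifted so the sum stays within 32 bits.

```c
#include <stdint.h>

/* Zero-Crossing Rate per Eq. (1): each sign change contributes 2 to the
 * sum, so ZCR = crossings / N; the result is scaled to Q15 ([0, 1]). */
int16_t frame_zcr_q15(const int16_t *x, uint16_t n)
{
    uint16_t crossings = 0;
    for (uint16_t i = 1; i < n; i++)
        if ((x[i] >= 0) != (x[i - 1] >= 0))
            crossings++;
    return (int16_t)(((uint32_t)crossings * 32767u) / n);
}

/* Short-Time Energy per Eq. (3); each squared sample is pre-shifted by
 * 6 bits so a 160-sample frame cannot overflow the 32-bit accumulator. */
uint32_t frame_ste(const int16_t *x, uint16_t n)
{
    uint32_t acc = 0;
    for (uint16_t i = 0; i < n; i++)
        acc += (uint32_t)(((int32_t)x[i] * x[i]) >> 6);
    return acc;
}
```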
3.3.3. Frame Processing and Feature Vector Construction
The audio input is divided into overlapping frames (typically 20–30 ms in length with 50% overlap) to extract these features in real time. The ZCR and STE are computed and combined into a two-dimensional feature vector for every frame. Let the feature vector extracted from the $i$-th frame be represented as in Equation (4):

$$\mathbf{f}_i = \left[\, \mathrm{ZCR}_i,\ \mathrm{STE}_i \,\right] \tag{4}$$
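The framing logic can be sketched as below, reusing the per-frame helpers from the previous sketch; the array layout and energy rescaling are assumptions made for illustration. Notably, a 1 s clip at 8 kHz with 20 ms frames and 50% overlap yields 99 frames, i.e., 198 ≈ 200 features, which is consistent with the MLP input size in Table 4.

```c
#include <stdint.h>

#define FRAME_LEN 160u   /* 20 ms at 8 kHz */
#define HOP_LEN    80u   /* 50% overlap    */

extern int16_t  frame_zcr_q15(const int16_t *x, uint16_t n);
extern uint32_t frame_ste(const int16_t *x, uint16_t n);

/* Slide a half-overlapping window over the utterance and emit the
 * [ZCR_i, STE_i] pair of Eq. (4) per frame; returns the frame count. */
uint16_t build_feature_vector(const int16_t *audio, uint16_t n_samples,
                              int16_t *features, uint16_t max_frames)
{
    uint16_t k = 0;
    for (uint16_t s = 0;
         s + FRAME_LEN <= n_samples && k < max_frames;
         s += HOP_LEN, k++) {
        features[2 * k]     = frame_zcr_q15(&audio[s], FRAME_LEN);
        /* coarse rescale of the 32-bit energy into int16 range */
        features[2 * k + 1] =
            (int16_t)(frame_ste(&audio[s], FRAME_LEN) >> 17);
    }
    return k;
}
```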
3.3.4. Neural Network Classifier
This section describes the structure and training of the neural network used for classification. A small Multi-Layer Perceptron (MLP) architecture was trained offline and later quantized for efficient fixed-point inference on the target embedded PIC microcontroller. The model parameters were exported as C arrays for deployment. The architecture and deployment details are summarized in Table 4.
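Table 4 specifies a 200-input, 10-hidden, 4-output MLP with Q15 parameters. A minimal fixed-point inference sketch under those dimensions is given below; the per-term pre-shift in the dot product is an assumption made to keep sums within 32 bits on an 8-bit device, and the weight array names are placeholders for the exported C arrays.

```c
#include <stdint.h>
#include <limits.h>

#define N_IN  200
#define N_HID  10
#define N_OUT   4

/* Q15 parameters exported offline as C arrays (placeholder names). */
extern const int16_t W1[N_HID][N_IN], B1[N_HID];
extern const int16_t W2[N_OUT][N_HID], B2[N_OUT];

/* Q15 dot product; each Q30 term is pre-shifted to Q22 so that up to
 * 200 terms fit a 32-bit accumulator, then the sum returns to Q15. */
static int32_t dot_q15(const int16_t *w, const int16_t *x, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += ((int32_t)w[i] * x[i]) >> 8;
    return acc >> 7;
}

/* Forward pass with ReLU hidden layer. Softmax is omitted at run time
 * because it does not change the argmax over the output scores. */
uint8_t mlp_classify_q15(const int16_t *x)
{
    int16_t h[N_HID];
    for (int j = 0; j < N_HID; j++) {
        int32_t y = dot_q15(W1[j], x, N_IN) + B1[j];
        if (y > 32767) y = 32767;           /* saturate to Q15 */
        h[j] = (y > 0) ? (int16_t)y : 0;    /* ReLU            */
    }
    uint8_t best = 0;
    int32_t best_v = INT32_MIN;
    for (int k = 0; k < N_OUT; k++) {
        int32_t v = dot_q15(W2[k], h, N_HID) + B2[k];
        if (v > best_v) { best_v = v; best = k; }
    }
    return best;  /* assumed order: 0=ON, 1=OFF, 2=LEFT, 3=RIGHT */
}
```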
4. Experimental Results
This section reviews and evaluates the proposed voice command recognition system. It describes the embedded software architecture and the PIC-based hardware platform used to process command input at run time. The section also presents performance measures, including classification accuracy, confusion matrices, inference time, and total power consumption, drawn from the completed tests. These results show the system’s usefulness and viability for embedded voice control applications.
4.1. Hardware Description
This section presents the physical setup and embedded software architecture used to implement the real-time voice command recognition system.
4.1.1. Hardware Setup
The system is implemented using the PIC18F4550 microcontroller controlling two power arms, each comprising an IR2112 high- and low-side driver and two IRG4PC50KD IGBT transistors. The microcontroller generates PWM control signals driving the inputs of the IR2112, which switches the IGBTs to generate the desired voltage waveform after passing through the protective resistors on the gate lines. The circuit is powered by 5 V for the microcontroller and a separate, higher DC voltage for the power stage (see Figure 4).
4.1.2. Command-to-Action Mapping
Table 5 shows the mapping of voice commands to motor actions. The PIC16F877A microcontroller takes a voice command and generates the appropriate signals on PORTD, which drives the L293D driver to energize the motors. For instance, the command “ON” energizes both motors to run forward, while “LEFT” and “RIGHT” rotate the system in opposite directions.
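A hedged sketch of this mapping follows; PORTD and TRISD are standard PIC register names, but the specific bit patterns driving the L293D inputs are assumptions for illustration, not the authors’ exact wiring.

```c
#include <xc.h>
#include <stdint.h>

enum cmd { CMD_ON = 0, CMD_OFF, CMD_LEFT, CMD_RIGHT };

/* Translate a recognized command into L293D input levels on PORTD
 * (illustrative bit patterns per the Table 5 actions). */
void actuate(uint8_t cmd)
{
    TRISD = 0x00;                         /* PORTD pins as outputs   */
    switch (cmd) {
    case CMD_ON:    PORTD = 0x05; break;  /* both motors forward     */
    case CMD_OFF:   PORTD = 0x00; break;  /* both motors stopped     */
    case CMD_LEFT:  PORTD = 0x01; break;  /* motor A forward: left   */
    case CMD_RIGHT: PORTD = 0x02; break;  /* motor A reverse: right  */
    default:        PORTD = 0x00; break;  /* fail safe: stop         */
    }
}
```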
4.2. Dataset and Training Procedure
The dataset used for training and testing consisted of recorded utterances of the four command words “ON”, “OFF”, “LEFT”, and “RIGHT”. Samples were collected from multiple speakers under varying acoustic conditions to support generalization. The audio data was separated into 1 s clips and processed offline to extract the time-domain features, energy and zero-crossing rate. The Multi-Layer Perceptron (MLP) model was trained with standard supervised learning using cross-entropy loss and early stopping to avoid overfitting, then quantized to fixed point to minimize memory size for deployment on the PIC microcontroller.
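The fixed-point export step can be illustrated by the small host-side utility below, which rounds trained float weights to Q15 and prints them as a C array. This mirrors the deployment flow described in Table 4, but it is a sketch, not the authors’ actual tooling.

```c
#include <stdint.h>
#include <stdio.h>
#include <math.h>

/* Round a float weight to Q15 (value * 2^15) with saturation. */
static int16_t float_to_q15(float w)
{
    float q = roundf(w * 32768.0f);
    if (q >  32767.0f) q =  32767.0f;
    if (q < -32768.0f) q = -32768.0f;
    return (int16_t)q;
}

/* Emit a weight vector as a const C array for inclusion in firmware. */
void export_q15_array(const char *name, const float *w, int n)
{
    printf("const int16_t %s[%d] = {", name, n);
    for (int i = 0; i < n; i++)
        printf("%s%d", i ? ", " : " ", float_to_q15(w[i]));
    printf(" };\n");
}
```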
4.3. Classification Performance
This section addresses the validity and reliability of the voice command recognition system. It includes per-command accuracy rates and a confusion matrix to clarify the successes and failures of classification. The evaluation demonstrated that the system recognizes commands with a high degree of accuracy, while also showing the common misclassifications, providing a view of the system’s practical capabilities.
4.3.1. Accuracy per Command
A summary of individual command categorization accuracy can be seen in Figure 5. While there was strong recognition of all of the commands, “ON” and “OFF” did slightly better compared to the directional commands. This may be due to their distinct phonemes.
4.3.2. Confusion Matrix
Table 6 presents the classification results for the four target commands as a confusion matrix. The matrix shows that most commands were recognized accurately with high confidence. The commands “LEFT” and “RIGHT” were the most likely to be confused, probably because of the phonetic similarity between them.
Figure 6 shows the confusion matrix for the four voice commands “ON”, “OFF”, “LEFT”, and “RIGHT”. Correct classifications appear on the diagonal of the matrix, and the off-diagonal entries indicate misclassifications. The overall performance of the model is strong; consistent with Table 6, the most frequent confusions occur between the phonetically similar “LEFT” and “RIGHT”, while “ON” and “OFF”, which are acoustically more distinct, are rarely confused.
4.3.3. Inference Time and Power Consumption Measurements
The inference time per command on the PIC microcontroller remained below 50 milliseconds (mean 45 ms), ensuring real-time responsiveness to the user. The fixed-point implementation enabled efficient, low-latency computation with accuracy comparable to the original floating-point model. Furthermore, power consumption remained minimal, with less than 10 mA drawn during active recognition. The combination of low latency and low power consumption makes this approach suitable for battery-powered embedded devices (see Table 7).
Figure 7a shows the measured inference time for each voice command, including the observed range across multiple cycles. Figure 7b presents the corresponding power consumption in milliamperes for each command. Displaying the data in two separate subplots clearly differentiates computational latency from energy usage, providing an accurate and standardized representation of the system’s performance for real-time embedded applications.
4.4. Real-Time System Integration and Response Evaluation
In this section, we present the configuration of the real-time voice command recognition system with the embedded hardware and consider the response of the system to voice commands by collecting and analyzing audio and electrical control signals.
4.4.1. Audio Signal Acquisition and Visualization
To enhance the accuracy of voice command recognition under various acoustic environments, the captured audio signal is further processed using real-time digital filtering on the microcontroller. In this study, a band-pass filter is applied to isolate the frequencies associated with human speech (typically 300 Hz to 3400 Hz) and to reduce background noise and interference (see Table 8).
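A streaming fixed-point FIR band-pass of this kind can be sketched as follows. The tap count is illustrative, and the Q15 coefficients (designed offline for the 300–3400 Hz passband at 8 kHz, e.g., by the windowed-sinc method) are assumed to be supplied as a placeholder array whose magnitudes sum below 1.0 so the accumulator stays in range.

```c
#include <stdint.h>

#define N_TAPS 31   /* illustrative filter order */

/* Q15 band-pass coefficients generated offline (placeholder array;
 * assumed normalized so sum(|taps|) < 1.0 in Q15). */
extern const int16_t bp_taps_q15[N_TAPS];

/* One output sample per input sample; Q15 multiply-accumulate. */
int16_t fir_bandpass(int16_t in)
{
    static int16_t delay[N_TAPS];

    for (int i = N_TAPS - 1; i > 0; i--)  /* shift the delay line;    */
        delay[i] = delay[i - 1];          /* a circular index would   */
    delay[0] = in;                        /* be cheaper in firmware   */

    int32_t acc = 0;
    for (int i = 0; i < N_TAPS; i++)
        acc += (int32_t)bp_taps_q15[i] * delay[i];
    return (int16_t)(acc >> 15);          /* Q30 sum back to Q15 */
}
```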
Figure 8 shows sample audio waveforms of the spoken command “LEFT”. The top figure displays the raw unfiltered audio signal, which includes background noise, while the bottom figure shows the signal after band-pass filtering. Filtering effectively reduces background noise, resulting in a clearer speech component and improved recognition performance.
4.4.2. Voltage Signal and Control Output
The microcontroller issues PWM signals as input to the power drivers which modulate the output voltage to the load. To verify the hardware behavior for different voice commands, the output voltage waveform for each command was measured using an oscilloscope.
Table 9 displays an overview of all voice command PWM duty cycles and their relationship to output voltage.
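The duty-cycle mapping in Table 9 can be expressed directly in firmware. The sketch below assumes the standard PIC18 CCP PWM interface, where the 10-bit duty value is split between CCPR1L (upper 8 bits) and the DC1B1:DC1B0 bits of CCP1CON; the wrapper itself is illustrative, not the authors’ code.

```c
#include <xc.h>
#include <stdint.h>

/* Duty cycles from Table 9, indexed 0=ON, 1=OFF, 2=LEFT, 3=RIGHT. */
static const uint8_t duty_percent[4] = { 80, 0, 40, 60 };

/* Load the 10-bit CCP duty registers for the given command.
 * pwm_period_counts is the full-scale duty count, i.e., 4*(PR2+1). */
void set_pwm_duty(uint8_t cmd, uint16_t pwm_period_counts)
{
    uint16_t duty = (uint16_t)(((uint32_t)pwm_period_counts *
                                duty_percent[cmd]) / 100u);
    CCPR1L = (uint8_t)(duty >> 2);               /* upper 8 bits */
    CCP1CONbits.DC1B0 = (uint8_t)(duty & 1);     /* lower 2 bits */
    CCP1CONbits.DC1B1 = (uint8_t)((duty >> 1) & 1);
}
```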
Figure 9 shows the raw and filtered audio data that provided voice command signals for voice command recognition, as well as the PWM voltage control signals for the voice commands “ON,” “OFF,” “LEFT,” and “RIGHT.” This figure shows the pre-processing and real-time control signals from the embedded system.
4.4.3. Precision, Recall, and F1-Score Analysis
To provide a more comprehensive evaluation of the voice command recognition system, we calculated precision, recall, and F1-score for each of the four commands (“ON”, “OFF”, “LEFT”, and “RIGHT”). These metrics complement the accuracy and confusion matrix by quantifying the model’s ability to correctly identify each command while accounting for false positives and false negatives.
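For reference, these metrics are computed one-vs-rest from the confusion-matrix counts using the standard definitions:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

For example, “LEFT” in Table 6 has TP = 91 and FN = 9, giving the 91% recall reported in Table 10.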
Table 10 presents the precision, recall, and F1-score for each command based on the experimental results obtained from the PIC microcontroller implementation. The metrics confirm that the classifier performs consistently across different commands, with slightly lower performance for phonetically similar commands (“LEFT” and “RIGHT”), which is consistent with the confusion matrix analysis.
Figure 10 shows the raw audio signal, filtered audio, and corresponding PWM voltage output for each command. It illustrates the real-time mapping of recognized voice commands to actuator control, confirming that correct classification reliably triggers the expected voltage output even in the presence of minor misclassifications.
4.5. Comparison with Previous TinyML Voice Recognition Works
To better evaluate the performance of the proposed system, a comparison with previous TinyML voice recognition implementations on PIC and ARM Cortex-M microcontrollers was conducted. Unlike many prior works that rely on computationally intensive features such as MFCC or large neural network models, this study utilizes simple time-domain features—Zero-Crossing Rate (ZCR) and Short-Time Energy (STE)—to achieve lightweight and fast inference suitable for real-time embedded applications.
Table 11 shows that the proposed system achieves comparable or better accuracy than previous works while significantly reducing inference time and power consumption, demonstrating the feasibility of running TinyML voice recognition on ultra-low-power devices. This comparison highlights the novelty of the proposed approach: efficient real-time performance, low computational cost, and the integration of motor control outputs, which were not commonly addressed in previous TinyML voice recognition works.
The advantages presented in Table 11 result from both the hardware configuration and the proposed methodological improvements. The PIC18F4550 microcontroller provides a hardware advantage due to its integrated ADC module, low instruction cycle latency, and optimized memory architecture, which together contribute to faster inference and reduced computational overhead. However, the main performance improvement comes from the proposed algorithmic approach, which combines ZCR and STE features. These features are computationally efficient and particularly suitable for short command recognition, allowing the model to maintain accuracy while minimizing processing time.
Additionally, the fixed-point implementation of the neural network further reduces energy consumption by avoiding floating-point operations, which are more demanding in embedded systems. The results demonstrate that the proposed system achieves a significantly lower power consumption (~8 mA) compared with ARM Cortex-M (12–15 mA) and Raspberry Pi-based systems (>200 mA), confirming the suitability of the design for low-power, real-time embedded applications.
Figure 11 presents a bar chart comparing the classification accuracy (left axis) and inference time (right axis) of previous TinyML voice recognition systems and the proposed PIC-based system. The chart highlights the trade-off between accuracy, latency, and power consumption: the proposed system maintains high accuracy while achieving significantly lower inference time and power consumption than prior implementations, demonstrating its suitability for ultra-low-power, real-time embedded applications.
5. Performance and Limitations
The voice command recognition system achieves a good compromise among accuracy, computational complexity, and power consumption, which are important parameters for embedded real-time applications. The experimental results show more than 90% correct identification of four simple commands under good acoustic conditions. Performance declines under difficult conditions, such as unfamiliar speaker accents or noisy environments.
Although basic features such as Zero-Crossing Rate and Short-Time Energy offer computational efficiency, they constrain how richly the system can represent speech signals at the level of classes or phonemes. This efficiency leaves room for error, especially between commands that are phonetically similar, i.e., “LEFT” and “RIGHT.”

Beyond noise, speaker variability, and the limits of the feature representation, the fixed-point quantized and compact MLP architecture was designed to minimize memory size and CPU inference time, which in turn limits model complexity and expressiveness. In our tests, the most significant constraints on robustness and accuracy came from ambient noise, the limited feature representation, and variance among speaker characteristics.
6. Discussion
This section provides a critical review of the landscape of voice command recognition systems developed on embedded microcontroller platforms compared to the other studies reported in Table 1. The variety in the literature demonstrates differences in potential hardware choices, features for extraction, classifiers, and real-time capabilities, illustrating the trade-off decisions still being made in the field.
A few studies [24,25] provide reviews and surveys of voice processing techniques in general, focusing on algorithmic advances such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), and recurrent neural networks; these and many similar studies do not include any practical implementation on microcontroller platforms. In contrast, implementations using PIC microcontrollers [26,27,33] range in sophistication from simple hardware-triggered systems [26] and a limited voice playback module [27] to more advanced modular recognition systems relying on LPC features and lookup tables [33]. These implementations illustrate what PIC voice recognition can do but lack real-time, low-latency command recognition. By comparison, platforms such as the Raspberry Pi [29], STM32 [31], and ARM Cortex-M [28] support more complex feature extraction (e.g., MFCC with zero-crossing rate) and classification (e.g., support vector machines and deep neural networks). These systems are acceptable for real-time applications, at the cost of more complex hardware, higher power usage, and heavier implementation. In some work, hybrid or partial real-time capabilities have been employed to balance performance expectations against the limitations of embedded processors. This work advances the field by implementing lightweight voice recognition on a PIC microcontroller, leveraging simple but effective time-domain features (signal energy and zero-crossing rate) with a compact neural network classifier that runs entirely in fixed point. Unlike other PIC-based approaches, we achieve more than 90% accuracy with inference times under 50 ms, meeting the power and timing requirements of real-time processing for IoT and mobile devices.
The distinction highlights the trade-offs associated with embedded speech recognition design: for example, more complex feature extraction and classifier models can (usually) improve accuracy but often require resources that most ultra-low-power microcontrollers cannot support. Our approach, which has a strong emphasis on practical deployment ability and low computing complexity, shows that effective real-time voice command recognition can be possible without additional DSP modules or cloud-supported services.
7. Conclusions and Future Works
This paper presented a viable and efficient approach for real-time voice command recognition on a resource-limited microcontroller (PIC). By utilizing simple time-domain features (energy and zero-crossing rate) and a small fixed-point neural network, the system achieved classification of four common voice commands with over 90% accuracy. The low inference latency (<50 ms) and very low power consumption demonstrate that the solution is suitable for embedded IoT and portable applications. While the current system was designed to validate real-time command recognition on a lightweight device and does not use a wake word, it could adopt a two-stage architecture in which a lightweight wake-word detector activates the command classifier, lowering false activation rates and improving robustness in noisy settings. Future work will also focus on vocabulary expansion and on increasing noise robustness through better preprocessing, improved feature extraction, or a small deep learning model, without adding significant complexity. Additional capabilities could be obtained through integration with wireless communication modules to enable remote control uses.
M.S. conceived the idea and prepared the initial draft; S.H. validated the results and prepared the final draft; formal analysis, S.H.; writing, M.S. and S.H.; supervision, A.G. and K.N. All authors have read and agreed to the published version of the manuscript.
Not applicable.
Not applicable.
Data are available upon request from the authors.
The authors declare no conflicts of interest.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 1 Functional diagram of real-time voice command recognition system on a PIC microcontroller with commands “ON,” “OFF,” “LEFT,” and “RIGHT”.
Figure 2 Architecture of real-time voice command recognition system on PIC microcontroller.
Figure 3 Flowchart of the real-time microcontroller software for voice command recognition.
Figure 4 Circuit diagram of a PIC16F877A-based DC motor driver system for real-time voice command control (Proteus simulation).
Figure 5 Command-wise classification accuracy.
Figure 6 Confusion Matrix of Voice Command Classification.
Figure 7 Inference time and power consumption per command for the PIC-based voice recognition system.
Figure 8 Audio signal before and after real-time filtering.
Figure 9 Audio and voltage control signals for voice commands.
Figure 10 Audio and voltage signal visualization for real-time command execution.
Figure 11 TinyML voice recognition: accuracy, inference time, and power consumption.
Comparison of embedded voice recognition systems in the literature.
| Ref. | Microcontroller | Voice Module | Feature Extraction | Classifier/Processing | Real-Time Capability | Objectives of Study |
|---|---|---|---|---|---|---|
| [24] | Not specified | No | General overview | General techniques | No | Survey voice recognition methods and architectures |
| [25] | Not specified | No | Survey on MFCC and LPC | None | No | Review and compare feature extraction techniques |
| [26] | PIC16F877A | No | No | Hardware interrupt only | No | Develop basic voice-triggered hardware system |
| [27] | PIC16F877A | ISD1820 | None (pre-recorded audio) | Hardware playback | Partial | Enable audio playback and limited voice recognition |
| [28] | ARM Cortex-M | External DSP | MFCC | Deep Neural Network | No | Develop accurate speech recognition with deep learning on embedded platforms |
| [29] | Raspberry Pi | USB Mic | MFCC | SVM | Yes | Implement voice recognition on affordable hardware |
| [30] | Arduino Uno | No | FFT | Decision Tree | Yes | Simplify voice command recognition for low-cost microcontrollers |
| [31] | STM32F103 | I2S Microphone | MFCC + ZCR | Rule-based hybrid | Partial | Combine multiple features for better recognition |
| [33] | PIC18F4550 | HM2007 | LPC | Lookup Table | Yes | Design modular voice command recognition |
| This Work | PIC18F4550 | Analog MEMS Mic + ADC | ZCR + STE (time-domain) | Quantized MLP (fixed-point) | Yes (Real-Time) | Implement fully embedded real-time voice command recognition system with low resources and scalability |
Functional architecture of real-time voice command recognition on PIC microcontroller.
| No. | Component | Role | Input | Output |
|---|---|---|---|---|
| 1 | Speech Dataset (Offline) | Provides labeled commands for model training | Recorded audio | Feature vectors |
| 2 | Feature Extraction (ZCR/STE) | Converts audio into numerical features | Audio signals | Feature vectors |
| 3 | MLP Classifier | Classifies features into commands | Feature vectors | Recognized command |
| 4 | PIC Microcontroller | Runs inference and issues control signals | Real-time audio and features | Control signal |
| 5 | Actuation System | Executes the recognized command | Control signal | Physical action (light, motor) |
| 6 | Microphone | Captures spoken commands | Human voice | Real-time audio |
Signal acquisition and preprocessing modules for real-time voice command recognition on PIC microcontroller.
| Module | Description |
|---|---|
| Microphone | An inexpensive analog MEMS microphone picks up the speech; the analog signal is sampled at 8 kHz, which is appropriate for short command recognition. |
| ADC | The conversion of the analog input to digital samples is accomplished with the integrated ADC of the PIC micro controller. |
| Framing and Windowing | The audio signal is split into frames of 20 ms, overlapped by 50%, using Hamming windowing functions. Also provided is some reduction in edge effects and artifacts. |
| Denoising | Lightweight denoising, such as normalization or low-pass filtering, removes background noise at low computational cost, which matters because computational resources on embedded devices are limited in most applications. |
Neural network classifier: architecture and deployment.
| Element | Details |
|---|---|
| Type | Feedforward MLP |
| Input | 200 features |
| Hidden Layer | 10 neurons, ReLU activation |
| Output Layer | 4 neurons, softmax (“ON”, “OFF”, “LEFT”, “RIGHT”) |
| Training | Offline (Python 3.9.9, TensorFlow 2.20.0/PyTorch 2.7.0), local or Google Speech dataset |
| Quantization | Fixed-point (Q15), exported as C arrays for PIC deployment |
Mapping of Voice Commands to System Actions.
| Voice Command | Serial Input (ASCII) | Microcontroller Action | Motor Driver Response (L293D) | System Output |
|---|---|---|---|---|
| ON | “ON” | Sets PORTD to logic HIGH | Activates Motor A and B | Motors Start Running |
| OFF | “OFF” | Sets PORTD to logic LOW | Deactivates Motor A and B | Motors Stop |
| LEFT | “LEFT” | Set Motor A: FORWARD | Motor A turns forward | Robot turns LEFT |
| RIGHT | “RIGHT” | Set Motor A: REVERSE | Motor A turns backward | Robot turns RIGHT |
Confusion matrix showing the number of samples classified as each command.
| Actual\Predicted | ON | OFF | LEFT | RIGHT |
|---|---|---|---|---|
| ON | 95 | 1 | 2 | 2 |
| OFF | 2 | 94 | 1 | 3 |
| LEFT | 1 | 2 | 91 | 6 |
| RIGHT | 0 | 3 | 5 | 92 |
Inference Time and Power Consumption Metrics.
| Metric | Mean Value | Range | Unit |
|---|---|---|---|
| Inference Time per Command | 45 | 40–50 | milliseconds |
| Power Consumption during Recognition | 8 | 7–10 | milliamperes |
Audio Signal Filtering Parameters.
| Parameter | Value |
|---|---|
| Sampling Rate | 8 kHz |
| Filter Type | Bandpass FIR |
| Passband Frequency Range | 300 Hz–3400 Hz |
| Target Commands | “ON,” “OFF,” “LEFT,” “RIGHT” |
| Noise Reduction | Background noise suppressed |
PWM and Voltage Response per Voice Command.
| Voice Command | PWM Duty Cycle (%) | Description of Output Signal |
|---|---|---|
| ON | 80 | High voltage output indicating system activation |
| OFF | 0 | No voltage output indicating system deactivation |
| LEFT | 40 | Moderate voltage output corresponding to left command |
| RIGHT | 60 | Intermediate voltage output corresponding to right command |
Precision, recall, and F1-Score for voice command Classification.
| Command | Precision (%) | Recall (%) | F1-Score (%) | Description |
|---|---|---|---|---|
| ON | 97 | 95 | 96 | High accuracy for activation command; few false positives. |
| OFF | 95 | 94 | 94.5 | Reliable detection of deactivation command. |
| LEFT | 90 | 91 | 90.5 | Slightly lower due to phonetic similarity with “RIGHT”. |
| RIGHT | 88 | 92 | 90 | Lower precision; occasional misclassification as “LEFT”. |
Power consumption comparison of TinyML voice recognition systems.
| Work | Microcontroller | Features | Average Power Consumption (mA) | Remarks |
|---|---|---|---|---|
| This Work | PIC18F4550 | ZCR + STE | 8 mA | Lowest power due to efficient time-domain processing |
| [32] | PIC16F877A | MFCC | 10–12 mA | Higher ADC overhead; limited optimization |
| [28] | ARM Cortex-M | MFCC | 12–15 mA | Moderate power; higher due to MFCC computation |
| [29] | Raspberry Pi | MFCC + SVM | >200 mA | Not suitable for low-power embedded applications |
1. Liu, Y.; Gan, Y.; Song, Y.; Liu, J. What influences the perceived trust of a voice-enabled smart home system: An empirical study. Sensors; 2021; 21, 2037. [DOI: https://dx.doi.org/10.3390/s21062037] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33805702]
2. Venkatraman, S.; Overmars, A.; Thong, M. Smart home automation—Use cases of a secure and integrated voice-control system. Systems; 2021; 9, 77. [DOI: https://dx.doi.org/10.3390/systems9040077]
3. Velasco-Álvarez, F.; Fernández-Rodríguez, Á.; Ron-Angevin, R. Brain-computer interface (BCI)-generated speech to control domotic devices. Neurocomputing; 2022; 509, pp. 121-136. [DOI: https://dx.doi.org/10.1016/j.neucom.2022.08.068]
4. Yu, C.; Zhang, H.; Shangguan, Z.; Hei, X.; Cangelosi, A.; Tapus, A. Speech-Driven Robot Face Action Generation with Deep Generative Model for Social Robots. Lecture Notes in Computer Science, Proceedings of the 14th International Conference, ICSR 2022, Florence, Italy, 13–16 December 2022, Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2022; Volume 13817 LNAI, pp. 61-74. [DOI: https://dx.doi.org/10.1007/978-3-031-24667-8_6]
5. Renuka, M.; Kondekar, P.; Mulani, A.O. Raspberry pi based voice operated Robot. Int. J. Recent Eng. Res. Dev.; 2017; 2, pp. 69-76.
6. Saryazdi, R.; DeSantis, D.; Johnson, E.K.; Chambers, C.G. The Use of Disfluency Cues in Spoken Language Processing: Insights from Aging. Psychol. Aging; 2021; 36, pp. 928-942. [DOI: https://dx.doi.org/10.1037/pag0000652]
7. Alluhaidan, A.S.; Saidani, O.; Jahangir, R.; Nauman, M.A.; Neffati, O.S. Speech Emotion Recognition through Hybrid Features and Convolutional Neural Network. Appl. Sci.; 2023; 13, 4750. [DOI: https://dx.doi.org/10.3390/app13084750]
8. Lee, S.H.; Park, J.; Yang, K.; Min, J.; Choi, J. Accuracy of Cloud-Based Speech Recognition Open Application Programming Interface for Medical Terms of Korean. J. Korean Med. Sci.; 2022; 37, e144. [DOI: https://dx.doi.org/10.3346/jkms.2022.37.e144]
9. Talebi, S.M.S.; Sani, A.A.; Saroiu, S.; Wolman, A. MegaMind: A platform for security & privacy extensions for voice assistants. Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys 2021); Madison, WI, USA, 24 June–2 July 2021; Association for Computing Machinery, Inc.: New York, NY, USA, 2021; pp. 109-121. [DOI: https://dx.doi.org/10.1145/3458864.3467962]
10. Trabelsi, R.; Nouri, K.; Ammari, I. Enhancing traffic sign recognition through Daubechies discrete wavelet transform and convolutional neural networks. Proceedings of the 2023 IEEE International Conference on Advanced Systems and Emergent Technologies (IC ASET); Hammamet, Tunisia, 29 April–1 May 2023; pp. 1-6.
11. Manor, E.; Greenberg, S. Using HW/SW Codesign for Deep Neural Network Hardware Accelerator Targeting Low-Resources Embedded Processors. IEEE Access; 2022; 10, pp. 22274-22287. [DOI: https://dx.doi.org/10.1109/ACCESS.2022.3153119]
12. Snider, R.K.; Casebeer, C.N.; Weber, R.J. An open computational platform for low-latency real-time audio signal processing using field programmable gate arrays. J. Acoust. Soc. Am.; 2018; 143, (Suppl. 3), 1737. [DOI: https://dx.doi.org/10.1121/1.5035667]
13. Shome, N.; Sarkar, A.; Ghosh, A.K.; Laskar, R.H.; Kashyap, R. Speaker Recognition through Deep Learning Techniques. Period. Polytech. Electr. Eng. Comput. Sci.; 2023; 67, pp. 300-336. [DOI: https://dx.doi.org/10.3311/PPee.20971]
14. Meister, H.; Walger, M.; Lang-Roth, R.; Müller, V. Voice fundamental frequency differences and speech recognition with noise and speech maskers in cochlear implant recipients. J. Acoust. Soc. Am.; 2020; 147, pp. EL19-EL24. [DOI: https://dx.doi.org/10.1121/10.0000499]
15. Fariselli, M.; Rusci, M.; Cambonie, J.; Flamand, E. Integer-Only Approximated MFCC for Ultra-Low-Power Audio NN Processing on Multi-Core MCUs. Proceedings of the 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS); Virtual, 3–7 May 2021; [DOI: https://dx.doi.org/10.1109/AICAS51828.2021.9458491]
16. Constantinescu, C.; Brad, R. An Overview on Sound Features in Time and Frequency Domain. Int. J. Adv. Stat. ITC Econ. Life Sci.; 2023; 13, pp. 45-58. [DOI: https://dx.doi.org/10.2478/ijasitels-2023-0006]
17. Tsujikawa, M.; Kajikawa, Y. Low-Complexity and Accurate Noise Suppression Based on an a Priori SNR Model for Robust Speech Recognition on Embedded Systems and Its Evaluation in a Car Environment. IEICE Trans. Fundam. Electron. Commun. Comput. Sci.; 2023; E106.A, pp. 1224-1233. [DOI: https://dx.doi.org/10.1587/transfun.2022EAP1130]
18. Jana, D.K.; Bhunia, P.; Adhikary, S.D.; Mishra, A. Analyzing salient features and classification of wine type based on quality through various neural network and support vector machine classifiers. Results Control Optim.; 2023; 11, 100219. [DOI: https://dx.doi.org/10.1016/j.rico.2023.100219]
19. Prasetyo, E.; Purbaningtyas, R.; Adityo, R.D.; Suciati, N.; Fatichah, C. Combining MobileNetV1 and Depthwise Separable convolution bottleneck with Expansion for classifying the freshness of fish eyes. Inf. Process. Agric.; 2022; 9, pp. 485-496. [DOI: https://dx.doi.org/10.1016/j.inpa.2022.01.002]
20. Yi, L.; Wu, Y.; Tolba, A.; Li, T.; Ren, S.; Ding, J. SA-MLP-Mixer: A Compact All-MLP Deep Neural Net Architecture for UAV Navigation in Indoor Environments. IEEE Internet Things J.; 2024; 11, pp. 21359-21371. [DOI: https://dx.doi.org/10.1109/JIOT.2024.3359662]
21. Rahman, M.; Nicolici, N. Estimating Word Lengths for Fixed-Point DSP Implementations Using Polynomial Chaos Expansions. Electronics; 2025; 14, 365. [DOI: https://dx.doi.org/10.3390/electronics14020365]
22. Mishra, J.; Malche, T.; Hirawat, A. Embedded Intelligence for Smart Home Using TinyML Approach to Keyword Spotting. Eng. Proc.; 2024; 82, 30. [DOI: https://dx.doi.org/10.3390/ecsa-11-20522]
23. Di Leo, K.; Biagetti, G.; Falaschetti, L.; Crippa, P. Microcontroller Implementation of LSTM Neural Networks for Dynamic Hand Gesture Recognition. Sensors; 2025; 25, 3831. [DOI: https://dx.doi.org/10.3390/s25123831] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/40573717]
24. Yang, C. Design of Smart Home Control System Based on Wireless Voice Sensor. J. Sens.; 2021; 2021, 8254478. [DOI: https://dx.doi.org/10.1155/2021/8254478]
25. Deshmukh, A.M. Comparison of Hidden Markov Model and Recurrent Neural Network in Automatic Speech Recognition. Eur. J. Eng. Res. Sci.; 2020; 5, pp. 958-965. [DOI: https://dx.doi.org/10.24018/ejers.2020.5.8.2077]
26. Mazumdar, D.; Raulkar, J.; Vaidya, P.; Gajare, A.; Shinde, D.M. A Survey Paper on Refrigeration Monitoring Systems using PIC Microcontroller, PT100 Temperature Sensor and FT811 Display Driver. Int. J. Eng. Technol. Manag. Sci.; 2023; 7, pp. 121-126. [DOI: https://dx.doi.org/10.46647/ijetms.2023.v07i02.015]
27. Manhas, M.; Sanduja, D.; Aggarwal, N.; Vashisth, R. Design and Implementation of Artificial Intelligence (AI) Based Home Automation. Proceedings of the IEEE International Conference on Signal Processing, Computing and Control (ISPCC); Online, 7–9 October 2021; pp. 122-126. [DOI: https://dx.doi.org/10.1109/ISPCC53510.2021.9609482]
28. Nantasri, P.; Phaisangittisagul, E.; Karnjana, J.; Boonkla, S.; Keerativittayanun, S.; Rugchatjaroen, A.; Usanavasin, S.; Shinozaki, T. A Light-Weight Artificial Neural Network for Speech Emotion Recognition using Average Values of MFCCs and Their Derivatives. Proceedings of the 17th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON); Phuket, Thailand, 24–27 June 2020; pp. 41-44. [DOI: https://dx.doi.org/10.1109/ECTI-CON49241.2020.9158221]
29. Efat, M.I.A.; Hossain, M.S.; Aditya, S.; Setu, J.H.; Imtiaz-Ud-Din, K.M. Identifying Optimised Speaker Identification Model using Hybrid GRU-CNN Feature Extraction Technique. Int. J. Comput. Vis. Robot.; 2022; 12, pp. 662-685. [DOI: https://dx.doi.org/10.1504/IJCVR.2022.126508]
30. Triwiyanto, T.; Yulianto, E.; Luthfiyah, S.; Musvika, S.D.; Maghfiroh, A.M.; Mak’RUf, M.R.; Titisari, D.; Ichwan, S. Hand Exoskeleton Development Based on Voice Recognition Using Embedded Machine Learning on Raspberry Pi. J. Biomim. Biomater. Biomed. Eng.; 2022; 55, pp. 81-92. [DOI: https://dx.doi.org/10.4028/p-ghjg94]
31. Park, J.; Noh, H.; Nam, H.; Lee, W.-C.; Park, H.-J. A Low-Latency Streaming On-Device Automatic Speech Recognition System Using a CNN Acoustic Model on FPGA and a Language Model on Smartphone. Electronics; 2022; 11, 1831. [DOI: https://dx.doi.org/10.3390/electronics11121831]
32. Torad, M.A.; Bouallegue, B.; Ahmed, A.M. A Voice-Controlled Smart Home Automation System using Artificial Intelligence and Internet of Things. Telkomnika (Telecommun. Comput. Electron. Control); 2022; 20, pp. 808-816. [DOI: https://dx.doi.org/10.12928/telkomnika.v20i4.23763]
33. Swamidason, I.T.J.; Tatiparthi, S.; Arul Xavier, V.M.; Devadass, C.S.C. Exploration of Diverse Intelligent Approaches in Speech Recognition Systems. Int. J. Speech Technol.; 2023; 26, pp. 1-10. [DOI: https://dx.doi.org/10.1007/s10772-020-09769-w]
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).