Abstract
Family engagement in STEM design activities plays a pivotal role in shaping children's STEM interests and skills. Understanding the conversations that unfold during these activities is crucial for improving informal learning processes and outcomes. However, comprehensive analysis of the large volumes of data generated in STEM design activities is typically done manually, which is time-consuming. This study applied machine learning ensemble techniques to classify conversations between parents and children during STEM design activities. The dataset was generated by transcribing video recordings of family-based STEM sessions, followed by annotation to categorize conversations according to distinct design stages: identify, understand, ideate, design, build, and test. Standard preprocessing and feature extraction techniques were employed, and the data was split into 80% for training and 20% for testing. Model performance was evaluated using accuracy, precision, recall, F1-score, and confusion matrix metrics. Accuracy ranged from 70% to 80%. Findings indicate that feature overlap between the 'Build' and 'Design' stages impacts model accuracy. Overall, the results demonstrate the potential of machine learning to classify conversational stages in a STEM design activity. The study highlights the viability of ML-driven approaches for identifying structured design discussions within parent-child conversations, providing a means of analyzing large conversational datasets and supporting educators in fostering more engaging and effective learning environments. Recommendations for further model improvement and refinement are also discussed.
Keywords:
Machine Learning Classification, Natural Language Processing, STEM Education, Parent-Child Conversations, Design Stages in STEM Education
1. Introduction
STEM (Science, Technology, Engineering, and Mathematics) education is essential for preparing children to engage in design thinking and problem-solving tasks [1]. While formal educational settings play a central role in STEM learning, informal learning environments such as museums, makerspaces, and at-home activities provide additional opportunities for cognitive and socio-emotional development [2]. These settings often involve parent-child collaboration, enabling parents to actively participate in their children's learning as they explore STEM concepts and scientific inquiry outside the classroom [1]. Studies show that parent-child engagement in informal learning environments positively influences children's curiosity, critical thinking, creativity, collaboration skills, and understanding of STEM concepts [3]. As families engage in collaborative problem solving, large amounts of conversational data are generated, characterized by the complex linguistic features of natural conversation.
Analyzing these data could provide valuable insights into the learning process and inform strategies for enhancing STEM education. However, the qualitative analysis of large volumes of parent-child conversational data is labor-intensive and subjective [4]. Natural language processing (NLP) and machine learning (ML) techniques allow for comprehensive analysis of large volumes of educationally relevant qualitative text [5]. This study investigates the application of ML models to classify parent-child conversation segments into distinct stages of the engineering design process. Specifically, it addresses the research question: How can machine learning models be applied to classify parent-child conversations into distinct design stages? By answering this question, the study contributes to the growing field of AI-supported learning analytics and provides methodological insights for advancing research in informal STEM education.
2. Literature Review
ML approaches in education include predictive modeling for student performance, automated grading systems, intelligent tutoring systems, adaptive learning platforms, natural language processing for analyzing student interactions, and data-driven insights for curriculum design and instructional improvements [6]. ML techniques facilitate the analysis of student engagement, learning trajectories, classroom dynamics, and the effectiveness of educational interventions [7], [8]. Data-driven insights help tailor educational content and instructional strategies to individual student needs, ensuring that students receive customized support based on their progress and performance [9]. Complex datasets in educational research can be analyzed to reveal underlying patterns using ML and NLP techniques [9], [10]. ML and NLP techniques help identify themes for classifying text, thereby enabling insights into effective dialogue in educational environments [11].
Supervised ML techniques train on labeled data to model complex patterns through advanced mapping functions [12]. Text classification enhances educational resource organization, retrieval, and application, supporting more targeted and effective teaching and learning strategies [13]. NLP provides valuable insights into the dynamics of educational discourse and its impact on learning. In recent years, the BERT model has become a prevalent machine learning model that can handle a variety of NLP applications, including supervised text categorization [14]. Prior studies have focused on using NLP to analyze structured educational text (e.g., essays, assessments, forum discussions), ML-based sentiment analysis, engagement tracking, and feedback generation rather than spontaneous, spoken conversations in informal STEM learning. Parent-child conversations in STEM design activities have mostly been analyzed through qualitative coding of behaviors and conversations. NLP and ML techniques present a potential means of analyzing qualitative data in STEM education [4]. This reliance on qualitative methods presents a gap that can be addressed using computational methods.
3. Methodology
a. Informal Activity Setup: Fifteen families with children aged 9-12 participated in a hands-on STEM design activity in a museum. Participating parents had attained a minimum of a high school diploma. All participants were introduced to the engineering design process prior to beginning the activity. A brief orientation ensured that both children and parents had a shared understanding of the design stages, namely: Identify, Understand, Ideate, Design, Build, and Test. This instruction ensured consistency across sessions and enabled participants to engage with the task more effectively using design-based language and thinking. All families were presented with the same design challenge and were required to work together to develop a prototype of their proposed solution. The task lasted approximately 45-60 minutes. Ethical protocols ensuring participant confidentiality, voluntary participation, and written consent were observed.
b. Dataset: Conversational data were generated during the parent-child STEM design activities. To create the dataset, the verbal conversations of participating families were transcribed. The data was structured so that each segment represents a single utterance by a parent or child and serves as an individual instance assigned a corresponding design stage for supervised ML classification. The dataset consisted of 1,500 rows organized into four columns: Family ID, a unique code representing the family from which the segment originated; Speaker, indicating who is speaking in the segment (e.g., child or parent); Transcript Text, the words spoken by the individual; and Design Stage Label, the labeled stage of the engineering design process (e.g., Understand, Ideate, Build). This structured format enables each row to function as a self-contained data instance suitable for machine learning tasks involving natural language classification.
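The row structure described above can be sketched as a minimal, self-contained example; the family IDs, utterances, and stage labels below are invented for illustration only:

```python
# Hypothetical rows matching the four-column schema described in the paper:
# Family ID, Speaker, Transcript Text, Design Stage Label (values invented).
rows = [
    {"family_id": "F01", "speaker": "parent",
     "transcript_text": "What problem are we trying to solve?",
     "design_stage": "Identify"},
    {"family_id": "F01", "speaker": "child",
     "transcript_text": "Maybe we could use a ramp here.",
     "design_stage": "Ideate"},
]

# Each row is one utterance: the transcript text is the model input and
# the design stage is the supervised classification target.
texts = [r["transcript_text"] for r in rows]
labels = [r["design_stage"] for r in rows]
print(labels)  # ['Identify', 'Ideate']
```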
c. Design Label Generation: Each dialogue segment was labeled with a corresponding design stage, based on a coding scheme derived from the commonly accepted stages of the engineering design process. All labeling was conducted manually by two trained graduate students. A coding guide was developed through iterative discussion and alignment sessions to ensure consistency in the application of labels based on the tasks involved in each design stage. Segments were reviewed independently, and disagreements were resolved through discussion. Inter-rater reliability between the two raters, calculated using Cohen's Kappa, was 0.80. A sample transcript and the corresponding design stages are presented in Table 1 below.
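Cohen's Kappa, used above for inter-rater reliability, corrects raw agreement for the agreement expected by chance. A minimal from-scratch sketch (with invented toy labels, not the study's data) is:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater1)
    # Observed agreement: fraction of segments given the same label.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum((c1[lab] / n) * (c2[lab] / n)
              for lab in set(rater1) | set(rater2))
    return (p_o - p_e) / (1 - p_e)

# Toy example: two raters labeling four segments.
r1 = ["Build", "Build", "Design", "Test"]
r2 = ["Build", "Design", "Design", "Test"]
print(round(cohens_kappa(r1, r2), 3))  # 0.636
```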
d. Preprocessing and Feature Engineering: The data was cleaned and preprocessed using text normalization, tokenization, stop word removal, and lemmatization. A word cloud was created to visualize the frequency of design-related language used by parents and children in the STEM design activity. Text data were converted into numerical features using the Term Frequency-Inverse Document Frequency (TF-IDF) method, with the vocabulary capped at the top 5,000 tokens to reduce sparsity; from these, the 500 most important features were selected, balancing representational accuracy with computational efficiency. TF-IDF provided a strong baseline for capturing semantic content, while lemmatization helped the models generalize across varied phrasing. The TF-IDF matrix served as the primary input feature for the text-based classification models. The categorical target variable "Design Stage" was encoded by mapping each class to a numeric label for classification. Families spent more time in the Build and Design stages, leading to an imbalanced dataset. The Synthetic Minority Over-sampling Technique (SMOTE) was employed to address this class imbalance: SMOTE generated synthetic samples for the underrepresented classes, producing a balanced dataset for training the models.
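The TF-IDF vectorization and label-encoding steps can be sketched with scikit-learn; the utterances below are invented stand-ins for the transcribed segments, and the SMOTE step is shown only in comments since it requires the separate imbalanced-learn package:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Toy utterances standing in for the transcribed segments (invented).
texts = [
    "what problem are we trying to solve",
    "let us sketch the ramp before building",
    "tape the straws to the base",
    "does the bridge hold the weight",
]
stages = ["Identify", "Design", "Build", "Test"]

# TF-IDF features; max_features caps the vocabulary, as in the paper's
# feature-selection step (500 retained features).
vectorizer = TfidfVectorizer(max_features=500)
X = vectorizer.fit_transform(texts)

# Encode the categorical design-stage target as numeric labels.
encoder = LabelEncoder()
y = encoder.fit_transform(stages)

# Class imbalance would then be handled with SMOTE, e.g. (requires the
# imbalanced-learn package):
#   from imblearn.over_sampling import SMOTE
#   X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

print(X.shape[0], list(encoder.classes_))  # 4 ['Build', 'Design', 'Identify', 'Test']
```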
e. Model Training: The resampled dataset was divided into training and testing subsets using an 80-20 split, ensuring adequate data for model training and evaluation. Four machine learning methods were deployed: (1) a Random Forest Classifier, a versatile model that aggregates the results of multiple decision trees; (2) Gradient Boosting, which sequentially improves weak learners; (3) a Voting Classifier that combines the predictions of the Random Forest and Gradient Boosting models using a hard voting scheme; and (4) the BERT model (Bidirectional Encoder Representations from Transformers) for context-aware analysis. For comparison, the BERT model was trained iteratively for up to 25 epochs in mini-batches. During each epoch, the model parameters were updated using backpropagation with the AdamW optimizer set to a learning rate of 2e-5. Loss and accuracy were monitored after each epoch, and an early stopping mechanism based on validation accuracy was implemented with a patience of 3 epochs to avoid overfitting: training was halted if no improvement in validation accuracy was observed after three consecutive epochs. Model performance was evaluated using precision, recall, F1-score, and accuracy, and confusion matrices were used to reveal misclassifications.
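The hard-voting ensemble of the Random Forest and Gradient Boosting models, the 80-20 split, and the confusion-matrix evaluation can be sketched with scikit-learn. Synthetic random features stand in for the TF-IDF matrix here, so the resulting scores are illustrative only, not the paper's results:

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the TF-IDF matrix and encoded design stages
# (sizes and values are illustrative, not the study's data).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 20))
y = rng.integers(0, 6, size=300)  # six design stages

# 80-20 train/test split, as in the study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
gb = GradientBoostingClassifier(n_estimators=50, random_state=42)
# Hard voting: the ensemble predicts the majority class label of its members.
vote = VotingClassifier(estimators=[("rf", rf), ("gb", gb)], voting="hard")

vote.fit(X_train, y_train)
y_pred = vote.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=list(range(6)))
print(acc, cm.shape)
```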
4. Results
Figure 1 below is a word cloud that visualizes the frequency of design-related language used by parents and children in the STEM design activity. The word cloud illustrates which engineering concepts were most prominent across conversations as families navigated problem-solving, iterated on ideas, and co-constructed design solutions.
The classification results across the Random Forest, Gradient Boosting, and Voting Classifier models demonstrate varying levels of performance, with Random Forest achieving the highest accuracy (80%), followed by the Voting Classifier (75%) and Gradient Boosting (70%). The accuracy of the BERT model was 78%. The table below shows the comparative analysis of the simple classifiers' performance for each design stage.
The Identify category consistently achieved near-perfect classification (F1-score ≈ 1.00), indicating that this category has distinct features that all models recognize well. Similarly, the Ideate and Test categories performed relatively well across models, with Ideate consistently achieving high precision and recall. However, the Build and Design categories exhibited lower classification performance, with F1-scores ranging from 0.47 to 0.65, suggesting overlapping characteristics or less distinguishable features. Understand showed high recall but lower precision across models, indicating frequent misclassification. The Gradient Boosting model performed the weakest overall, particularly struggling with the Design category (F1 = 0.47), suggesting it may not generalize well to this dataset. The Voting Classifier balanced recall and precision better than Gradient Boosting but remained slightly behind Random Forest in overall accuracy. The BERT model performed well on categories with higher representation ("Build" and "Design") while significantly underperforming on minority classes. Overall, while Random Forest emerged as the best-performing model, refining feature selection and exploring alternative models such as deep learning could further enhance classification accuracy.
5. Discussion
Family conversations are diverse, with each family exhibiting unique communication styles, language, and dynamics. Traditional qualitative methods of analysis are labor- and time-intensive, yet analyzing these data could yield insights into patterns that promote STEM learning. This study investigated the effectiveness of ML techniques in classifying parent-child conversations in a STEM design activity into design stages. Models such as Random Forest, Gradient Boosting, and Voting Classifiers were initially chosen for their interpretability and strong baseline performance on structured datasets. In addition, advanced models like BERT offer advantages for NLP tasks and semantic understanding. The accuracy of ML and NLP classification ranged from 70% to 80%.
For the Random Forest model, the "Identify" stage consistently achieved near-perfect classification (F1-score ≈ 1.00), indicating that this design stage had distinct features that all models recognize well. Similarly, the "Ideate" and "Test" stages performed relatively well across models, with "Ideate" consistently achieving high precision and recall. However, the "Build" and "Design" stages exhibited lower classification performance, with F1-scores ranging from 0.47 to 0.65, suggesting overlapping characteristics or less distinguishable features. The "Understand" stage showed high recall but lower precision across models, indicating frequent misclassification. The Gradient Boosting model performed the weakest overall, particularly struggling with the "Design" stage (F1 = 0.47), suggesting that it may not generalize well to this dataset. The Voting Classifier balanced recall and precision better than Gradient Boosting but remained slightly behind Random Forest in overall accuracy.
Random Forest's ensemble of decision trees captures these design stages better than boosting models, which rely on sequential adjustments. Random Forest reduces the risk of overfitting to specific patterns in the conversational dataset, whereas Gradient Boosting sequentially corrects errors but is more prone to overfitting on noisy or overlapping linguistic features, particularly between the Build and Design stages. The overlap in vocabulary across the "Build" and "Design" stages led to frequent misclassifications, and design stages with comparable phrases and syntactic structures made it difficult for all models to attain high precision and recall. The BERT model demonstrated classification performance comparable to that of the Random Forest, which may be due to the relatively small size of the dataset. Improving model performance will require data augmentation and feature engineering that better distinguish the "Build" and "Design" stages.
Future work could incorporate sequence modeling using attention models, speaker embeddings, and context-aware deep learning models such as GPT-based classifiers. Advanced feature engineering strategies, such as phrase-level embeddings or attention mechanisms, could be tailored to distinguish stages with overlapping characteristics [15]. Integrating domain-specific features or incorporating sequence analysis techniques can improve model accuracy and robustness in dealing with complex conversational data [14]. Expanding the dataset to include more families, different STEM tasks, and varied socio-cultural backgrounds would improve model generalization. Refining labels for better class separation would improve recall and precision for underrepresented classes, and enhanced feature engineering could better capture the nuances of parent-child STEM design conversations for more robust classification.
This study is significant for the classification of text data for qualitative research in STEM education [4]. ML models that accurately classify conversational data offer educators new tools for enhancing instructional strategies, enriching learning experiences, and managing extensive text data [5], [16]. ML models that accurately classify parent-child conversations can also power real-time adaptive learning systems that provide personalized guidance, helping to scaffold learning and maintain engagement during informal STEM design activities. Analysis of educational data using ML and NLP techniques can reveal insights into the words families use most frequently as they navigate the design process. Implementing such a model may allow an educator to intervene in a timely manner, tailored to specific moments in the STEM design process, to better support families during STEM design projects, thereby fostering improved engagement and learning outcomes in informal environments. The study focused on an informal STEM design activity, which may not fully capture conversational dynamics in unstructured or free-play settings. In addition, speaker roles (parent vs. child) were not incorporated into the classification model. Expanding the dataset would enhance generalizability.
6. Conclusions
STEM education plays a key role in fostering critical thinking and problem-solving skills in children. Engaging in hands-on STEM design activities with parents enhances learning experiences. Conversations between parents and children provide rich data that can be analyzed to gather insights into the design and critical thinking process. The ability to automatically classify conversational segments into engineering design stages offers significant implications for educators and learning tool developers. Educators can use these insights to provide timely, stage-specific guidance that supports deeper engagement with the design process. Similarly, developers of educational platforms and apps can embed these classification models to create intelligent systems that monitor, visualize, and support real-time learning trajectories. Such systems can dynamically scaffold learning, foster reflection, and ensure that learners engage with all critical phases of the engineering design process.
Traditional methods of analyzing parent-child conversations in STEM activities rely heavily on labor-intensive and subjective qualitative methods. Advances in machine learning (ML) and natural language processing (NLP) provide opportunities to automate the classification of educational conversations, enabling scalable and data-driven insights. This study performed a structured classification of STEM design stages from conversations. ML models show promise in classifying STEM-related parent-child conversations. Data augmentation is needed to improve generalization, and enhanced feature engineering, incorporating speaker roles, and sequence modeling may improve model performance.
References
[1] S. Alexandre, Y. Xu, M. Washington-Nortey, and C. Chen, "Informal STEM Learning for Young Children: A Systematic Literature Review," International Journal of Environmental Research and Public Health, vol. 19, no. 14, Art. no. 14, Jan. 2022, doi: 10.3390/ijerph19148299.
[2] D. M. Sobel, "Science, Technology, Engineering, and Mathematics (STEM) Engagement From Parent-Child Interaction in Informal Learning Environments," Curr Dir Psychol Sci, vol. 32, no. 6, pp. 454-461, Dec. 2023, doi: 10.1177/09637214231190632.
[3] O. Sadka and O. Zuckerman, "From Parents to Mentors: Parent-Child Interaction in Co-Making Activities," in Proceedings of the 2017 Conference on Interaction Design and Children (IDC '17), New York, NY, USA: Association for Computing Machinery, Jun. 2017, pp. 609-615. doi: 10.1145/3078072.3084332.
[4] C. G. P. Berdanier, E. Baker, W. Wang, and C. McComb, "Opportunities for Natural Language Processing in Qualitative Engineering Education Research: Two Examples," in 2018 IEEE Frontiers in Education Conference (FIE), Oct. 2018, pp. 1-6. doi: 10.1109/FIE.2018.8658747.
[5] B.-M. Hsu, "Comparison of Supervised Classification Models on Textual Data," Mathematics, vol. 8, no. 5, Art. no. 5, May 2020, doi: 10.3390/math8050851.
[6] H. Kishan Das Menon and V. Janardhan, "Machine learning approaches in education," Materials Today: Proceedings, vol. 43, pp. 3470-3480, Jan. 2021, doi: 10.1016/j.matpr.2020.09.566.
[7] D. Sansone, "Beyond Early Warning Indicators: High School Dropout and Machine Learning," Oxford Bulletin of Economics and Statistics, vol. 81, no. 2, pp. 456-485, 2019, doi: 10.1111/obes.12277.
[8] D. Kucak, V. Juricic, and G. Dambic, "Machine Learning in Education - a Survey of Current Research Trends," in DAAAM Proceedings, 1st ed., vol. 1, B. Katalinic, Ed., DAAAM International Vienna, 2018, pp. 0406-0410. doi: 10.2507/29th.daaam.proceedings.059.
[9] R. R. Halde, "Application of Machine Learning algorithms for betterment in education system," in 2016 International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT), Sep. 2016, pp. 1110-1114. doi: 10.1109/ICACDOT.2016.7877759.
[10] P. Leliopoulos, S. Stavridis, and A. Drigas, "Big data and machine learning and the impact on education," World Journal of Advanced Research and Reviews, vol. 18, no. 3, pp. 670-683, 2023, doi: 10.30574/wjarr.2023.18.3.1054.
[11] S. Knight and K. Littleton, "Dialogue as Data in Learning Analytics for Productive Educational Dialogue," Learning Analytics, vol. 2, no. 3, pp. 111-143, Feb. 2016, doi: 10.18608/jla.2015.23.7.
[12] A. Lindh, K. Quille, A. Mooney, K. Marshall, and K. O'Sullivan, "Supervised Machine Learning for Modelling STEM Career and Education Interest in Irish School Children," presented at the Proceedings of the 15th International Conference on Educational Data Mining, 2022, p. 565. doi: 10.5281/zenodo.6853026.
[13] V. B. Kobayashi, S. T. Mol, H. A. Berkers, G. Kismihók, and D. N. Den Hartog, "Text Classification for Organizational Researchers: A Tutorial," Organizational Research Methods, vol. 21, no. 3, pp. 766-799, Jul. 2018, doi: 10.1177/1094428117719322.
[14] S. González-Carvajal and E. C. Garrido-Merchán, "Comparing BERT against traditional machine learning text classification," JCCE, vol. 2, no. 4, pp. 352-356, Apr. 2023, doi: 10.47852/bonviewJCCE3202838.
[15] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, "Learning from class-imbalanced data: Review of methods and applications," Expert Systems with Applications, vol. 73, pp. 220-239, May 2017, doi: 10.1016/j.eswa.2016.12.035.
[16] Y.-H. Chien and C.-K. Yao, "Development of an AI Userbot for Engineering Design Education Using an Intent and Flow Combined Framework," Applied Sciences, vol. 10, no. 22, Art. no. 22, Jan. 2020, doi: 10.3390/app10227970.
Copyright Institute of Industrial and Systems Engineers (IISE) 2025