Abstract

GitHub has become a central platform for software development and is widely adopted by both industry professionals and computer science students. However, the lack of interactive tools for efficiently mining GitHub user contribution data limits researchers, educators, and practitioners who wish to analyze GitHub activity without extracting data from scratch. This dissertation presents GitHub-Mole, a novel tool designed to automate the collection, aggregation, and analysis of GitHub user contributions, enabling data-driven insights into software engineering education, team formation, and developer assessment.

Leveraging GitHub-Mole, this research explores three key areas. The first set of studies examines the relationship between GitHub contributions and educational outcomes, including the impact of formal education on open-source software (OSS) engagement, gender disparities in preparedness and performance, and the predictive power of pre-class GitHub metrics on academic outcomes such as exam performance and failure risk. These studies establish that pre-class GitHub activity can serve as a meaningful indicator of student engagement and technical preparedness.

The second set of studies focuses on optimizing software engineering team formation through GitHub metrics. This research identifies the key pre-class GitHub contributions that influence team project performance and investigates how prior GitHub activity correlates with individual contributions and overall team success. Furthermore, applying the Constrained K-means algorithm to data-driven team formation demonstrates that teams assembled strategically from GitHub metrics exhibit more balanced skill distributions and better project outcomes than randomly formed or self-formed teams.

The third set of studies explores the role of GitHub and LeetCode data in technical interview preparation. By analyzing large-scale datasets, this research provides empirical insights into the effort required to succeed at major tech companies, outlining problem-solving patterns among candidates. Additionally, it identifies only a weak correlation between GitHub contributions and LeetCode problem-solving performance, suggesting that coding interview proficiency does not necessarily translate to sustained OSS engagement. These findings highlight the need for a more holistic approach in technical hiring that considers both algorithmic skills and real-world development experience.

The primary contribution of this dissertation lies in demonstrating the potential of GitHub-Mole as a scalable and practical tool for collecting and analyzing GitHub contribution data. By bridging the gaps among academic education, software engineering teamwork, and professional developer assessment, this research underscores the transformative role of data-driven methodologies in software engineering education, industry recruitment, and OSS engagement. Future directions for this work include expanding GitHub-Mole’s capabilities, refining algorithmic team formation strategies, and exploring cross-platform developer behaviors to provide a more comprehensive understanding of modern software engineering practices.

Details

Title
GitHub-Mole: A Tool for Mining GitHub Contribution Data for Software Engineering Education and Team Formation
Number of pages
301
Publication year
2025
Degree date
2025
School code
0155
Source
DAI-B 87/4(E), Dissertation Abstracts International
ISBN
9798297622791
Committee member
Jiang, Shiyan; Stolee, Kathryn; D’Amorim, Marcelo
University/institution
North Carolina State University
University location
United States -- North Carolina
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32331589
ProQuest document ID
3264404535
Document URL
https://www.proquest.com/dissertations-theses/github-mole-tool-mining-contribution-data/docview/3264404535/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic