GitHub has become a central platform for software development and is widely adopted by both industry professionals and computer science students. However, the lack of interactive tools for efficiently mining GitHub user contribution data limits researchers, educators, and practitioners who wish to analyze GitHub activity without extracting data from scratch. This dissertation presents GitHub-Mole, a novel tool designed to automate the collection, aggregation, and analysis of GitHub user contributions, enabling data-driven insights into software engineering education, team formation, and developer assessment.
Leveraging GitHub-Mole, this research explores three key areas. The first set of studies examines the relationship between GitHub contributions and educational outcomes, including the impact of formal education on open-source software (OSS) engagement, gender disparities in preparedness and performance, and the power of pre-class GitHub metrics to predict academic outcomes such as exam performance and failure risk. These studies establish that pre-class GitHub activity can serve as a meaningful indicator of student engagement and technical preparedness.
The second set of studies focuses on optimizing software engineering team formation through GitHub metrics. This research identifies the key pre-class GitHub contributions that influence team project performance and investigates how prior GitHub activity correlates with individual contributions and overall team success. Furthermore, applying the Constrained K-means algorithm to data-driven team formation demonstrates that teams strategically assembled from GitHub metrics exhibit more balanced skill distributions and improved project outcomes compared to randomly formed or self-formed teams.
The third set of studies explores the role of GitHub and LeetCode data in technical interview preparation. By analyzing large-scale datasets, this research provides empirical insights into the effort required to succeed at major tech companies, outlining problem-solving patterns among candidates. Additionally, it identifies only a weak correlation between GitHub contributions and LeetCode problem-solving performance, suggesting that coding interview proficiency does not necessarily translate to sustained OSS engagement. These findings highlight the need for a more holistic approach in technical hiring that considers both algorithmic skills and real-world development experience.
The primary contribution of this dissertation lies in demonstrating the potential of GitHub-Mole as a scalable and practical tool for collecting and analyzing GitHub contribution data. By connecting academic education, software engineering teamwork, and professional developer assessment, this research underscores the transformative role of data-driven methodologies in software engineering education, industry recruitment, and OSS engagement. Future directions for this work include expanding GitHub-Mole’s capabilities, refining algorithmic team formation strategies, and exploring cross-platform developer behaviors to provide a more comprehensive understanding of modern software engineering practices.