Code comments are crucial artifacts for understanding computer programs, yet relatively few studies examine their form, frequency, or authorship. Comments are human-readable text that a compiler or interpreter ignores when executing a program. They serve multiple purposes, including describing a program’s functionality, explaining bugs or pending updates, and communicating with other developers. Although writing good comments is considered a best practice in software engineering, the style and practice of comment writing remain understudied, especially for non-English comments. The Russian Comment Corpus (RCC) grew out of a desire to understand how Russian-speaking programmers write comments in code. This project introduces an original methodology for constructing code comment corpora and implements it as a Python program that processes, filters, and stores files containing Russian comments. The resulting corpus contains 95,538 code comments from programs written in C#, Java, JavaScript, Kotlin, PHP, Python, Ruby, and SQL. The RCC methodology serves as a blueprint for developing future comment corpora to support studies of code comments, developer cognition, and natural language usage in programming. As a dataset, the Russian Comment Corpus is a foundational resource for studying the Russian language as used in computer programming.
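To make the filtering step concrete, the following is a minimal sketch of how comments containing Russian text might be identified. It is not the RCC implementation: the function names, the Cyrillic-range heuristic, and the simplified comment pattern (only `#` and `//` line comments) are assumptions chosen for illustration. A real pipeline would need language-aware parsing, since this pattern would also match `#` or `//` inside string literals and URLs.

```python
import re

# Assumption: a comment is treated as "Russian" if it contains at least one
# character from the Unicode Cyrillic block. This is an illustrative
# heuristic, not the criterion used to build the RCC.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")

# Hypothetical, simplified matcher for "#" and "//" line comments only.
LINE_COMMENT = re.compile(r"(?:#|//)\s?(.*)$")


def extract_line_comments(source: str) -> list[str]:
    """Return the text of every line comment found in the source string."""
    comments = []
    for line in source.splitlines():
        match = LINE_COMMENT.search(line)
        if match:
            comments.append(match.group(1).strip())
    return comments


def russian_comments(source: str) -> list[str]:
    """Keep only the comments that contain Cyrillic characters."""
    return [c for c in extract_line_comments(source) if CYRILLIC.search(c)]
```

Under these assumptions, calling `russian_comments` on a file's contents keeps comments such as `# счётчик` and discards comments written entirely in Latin script.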