Since 2020, GitGuardian has been detecting hard-coded secrets checked into GitHub repositories. From 2020 to 2023, GitGuardian observed an upward annual trend and a four-fold increase in hard-coded secrets, with 12.8 million exposed in 2023. However, removing all secrets from software artifacts is not feasible due to time constraints and technical challenges. Additionally, the security risks of the secrets are not equal: secrets protect assets of differing value, accessible through asset identifiers (a DNS name or a public or private IP address). The value of the protected asset can vary from a database with mock data to medical data whose breach can incur fines. In addition, the ease of accessing the asset can vary, since an attacker may need to be on the same network or have physical access to the host server to reach the asset. Thus, secret removal should be prioritized by security risk reduction, which existing secret detection tools do not support.
To this end, we conducted six original studies. In the first study, we extracted 779 questions related to checked-in secrets on Stack Exchange and applied qualitative analysis to determine the challenges practitioners face and the solutions proposed by others for each challenge. We identified 27 challenges and 13 solutions. We also observed an increasing trend of questions about checked-in secrets that lack accepted answers.
In the second study, we conducted a grey literature review of 54 Internet artifacts and identified 24 secret management practices grouped into six categories. Our findings indicate that using secret detection tools and employing short-lived secrets are the most recommended practices to avoid accidentally committing secrets and limit secret exposure.
In the third study, we curated SecretBench, the first publicly available labeled dataset of source code containing 97,479 secrets (of which 15,084 are true secrets) of seven secret types, extracted from 818 public GitHub repositories.
In the fourth study, we evaluated five open-source and four proprietary secret detection tools against SecretBench. We observed that the tools generate many false positives (25%-99%) and miss many secrets (false negative rates of 14%-99%). Our manual analysis of the reported secrets reveals that false positives are due to generic regular expressions and ineffective entropy calculation, whereas false negatives are due to faulty regular expressions, skipped file types, and insufficient rulesets.
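The entropy-related false positives above can be illustrated with a minimal sketch of entropy-based detection. The threshold and length cutoff here are illustrative assumptions, not the configuration of any evaluated tool:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Shannon entropy in bits per character of a string."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_like_secret(token: str, threshold: float = 4.0) -> bool:
    """Naive entropy-only detector: flag any long, high-entropy token.

    This is exactly the kind of rule that misfires: high-entropy
    non-secrets (e.g., commit hashes) become false positives, while
    low-entropy true secrets (e.g., weak passwords) are missed.
    """
    return len(token) >= 16 and shannon_entropy(token) >= threshold
```

Such a detector cannot distinguish a random-looking identifier from a credential, which is one reason purely entropy-driven rules inflate false positive counts.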
In the fifth study, we presented AssetHarvester, a static analysis tool to detect secret-asset pairs in a repository. Since the asset can be located far from where the secret is defined, we investigated secret-asset co-location and found four patterns. To identify secret-asset pairs across the four patterns, we utilized three approaches: pattern matching, data flow analysis, and fast-approximation heuristics. To evaluate AssetHarvester, we curated a benchmark of 1,791 secret-asset pairs of four database types, extended from SecretBench. AssetHarvester demonstrates 97% precision, 90% recall, and a 94% F1-score in detecting secret-asset pairs. Our findings indicate that the data flow analysis employed in AssetHarvester detects secret-asset pairs with 0% false positives and aids in improving the recall of secret detection tools. Additionally, AssetHarvester shows a 43% increase in precision for database secret detection compared to existing detection tools through the detection of assets, thus reducing developers' alert fatigue.
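One common co-location case is a database connection URL, where the secret and the asset identifier appear in the same string. A minimal pattern-matching sketch of extracting such a pair (the regex and function names are illustrative assumptions, not AssetHarvester's actual ruleset):

```python
import re

# Simplified pattern for connection URLs of the form
# scheme://user:password@host:port/dbname
CONN_URL = re.compile(
    r"(?P<scheme>mysql|postgres(?:ql)?|mongodb)://"
    r"(?P<user>[^:@/\s]+):(?P<secret>[^@/\s]+)@"
    r"(?P<host>[^:/\s]+)(?::(?P<port>\d+))?"
)

def extract_secret_asset(source_line: str):
    """Return (secret, asset host) if the line contains a connection URL."""
    m = CONN_URL.search(source_line)
    if m:
        return m.group("secret"), m.group("host")
    return None

extract_secret_asset('DB_URL = "postgres://admin:s3cr3t@db.example.com:5432/app"')
# → ("s3cr3t", "db.example.com")
```

Pattern matching alone only covers the case where secret and asset sit on the same line; when they are assigned to separate variables and combined later, data flow analysis is needed to connect them.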
In the final study, we presented RiskHarvester, a risk-based tool that computes a security risk score from the value of the asset and the ease of attack on a database. We calculated the value of the asset by identifying the sensitive data categories present in a database from the database keywords in the source code, utilizing data flow analysis and SQL and Object-Relational Mapper (ORM) parsing to identify the keywords. To calculate the ease of attack, we utilized passive network analysis to retrieve the database host information. To evaluate RiskHarvester, we curated RiskBench, a benchmark of 1,791 database secret-asset pairs with sensitive data categories and host information manually retrieved from 188 GitHub repositories. RiskHarvester demonstrates 95% precision and 90% recall in detecting database keywords for the value of the asset, and 96% precision and 94% recall in detecting valid hosts for the ease of attack. Finally, we conducted a survey (52 respondents) to understand whether developers prioritize secret removal based on the security risk score. We found that 86% of the developers prioritized secrets for removal in descending order of security risk score.
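The two-factor scoring described above can be sketched as follows; the category weights, host classification, and multiplicative formula are hypothetical assumptions for illustration, not RiskHarvester's actual scoring model:

```python
# Hypothetical weights for sensitive data categories (value of asset).
SENSITIVE_DATA_WEIGHTS = {
    "mock": 0.1,
    "personal": 0.6,
    "financial": 0.8,
    "medical": 1.0,
}

# Hypothetical weights for how reachable the database host is (ease of attack).
EASE_OF_ATTACK = {
    "localhost": 0.1,     # attacker needs access to the host itself
    "private_ip": 0.5,    # attacker must be on the same network
    "public_host": 1.0,   # reachable from the Internet
}

def risk_score(data_categories, host_kind):
    """Combine value of asset (max category weight) with ease of attack."""
    value = max(
        (SENSITIVE_DATA_WEIGHTS.get(c, 0.0) for c in data_categories),
        default=0.0,
    )
    return value * EASE_OF_ATTACK.get(host_kind, 0.0)
```

Under this sketch, a publicly reachable database holding medical data scores higher than a private-network database with mock data, matching the prioritization order the surveyed developers preferred.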