A scan of billions of files from 13 percent of all GitHub public repositories over a period of six months has revealed that over 100,000 repos have leaked API tokens and cryptographic keys, with thousands of new repositories leaking new secrets on a daily basis.
The scan was carried out as academic research by a team from North Carolina State University (NCSU), and the study's results were shared with GitHub, which acted on the findings to accelerate its work on a new security feature called Token Scanning, currently in beta.
Academics scanned billions of GitHub files
The NCSU study is the most comprehensive and in-depth GitHub scan to date and exceeds any previous research of its kind.
NCSU academics scanned GitHub accounts for a period of nearly six months, between October 31, 2017, and April 20, 2018, and looked for text strings formatted like API tokens and cryptographic keys.
They didn’t just use the GitHub Search API to look for these text patterns, like other previous research efforts, but they also looked at GitHub repository snapshots recorded in Google’s BigQuery database.
Across the six-month period, researchers analyzed billions of files from millions of GitHub repositories.
In a research paper published last month, the three-man NCSU team said they captured and analyzed 4,394,476 files representing 681,784 repos using the GitHub Search API, and another 2,312,763,353 files from 3,374,973 repos that had been recorded in Google’s BigQuery database.
NCSU team scanned for API tokens from 11 companies
Inside this gigantic pile of files, researchers looked for text strings that were in the format of particular API tokens or cryptographic keys.
Since not all API tokens and cryptographic keys are in the same format, the NCSU team decided on 15 API token formats (from 15 services belonging to 11 companies, five of which were from the Alexa Top 50), and four cryptographic key formats.
This included API key formats used by Google, Amazon, Twitter, Facebook, Mailchimp, MailGun, Stripe, Twilio, Square, Braintree, and Picatic.
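The matching step the researchers describe can be sketched with a few regular expressions. The patterns below are illustrative approximations based on publicly documented key formats (AWS access key IDs begin with "AKIA", Google API keys with "AIza", Stripe live secret keys with "sk_live_"); they are not the paper's exact rules, and the sample key in the test is fabricated.

```python
import re

# Illustrative patterns only -- approximations of publicly documented
# key formats, not the exact regular expressions used in the NCSU paper.
TOKEN_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "google_api_key":    re.compile(r"\bAIza[0-9A-Za-z\-_]{35}\b"),
    "stripe_secret_key": re.compile(r"\bsk_live_[0-9a-zA-Z]{24}\b"),
}

def find_candidate_secrets(text):
    """Return (kind, match) pairs for strings shaped like known API keys."""
    hits = []
    for kind, pattern in TOKEN_PATTERNS.items():
        for match in pattern.findall(text):
            hits.append((kind, match))
    return hits
```

Matching on format alone is why the approach scales to billions of files: it needs no API calls to validate a candidate, only a pass over the text.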
Results came back right away, with thousands of leaked API tokens and cryptographic keys being found every day of the research project.
In total, the NCSU team said they found 575,456 API and cryptographic keys, of which 201,642 were unique, all spread over more than 100,000 GitHub projects.
An observation the research team made in their academic paper was that the "secrets" found via the GitHub Search API and those found via the Google BigQuery dataset had little overlap.
“After joining both collections, we determined that 7,044 secrets, or 3.49% of the total, were seen in both datasets. This indicates that our approaches are largely complementary,” researchers said.
Furthermore, most of the API tokens and cryptographic keys (93.58 percent) came from single-owner accounts, rather than multi-owner repositories.
What this means is that the vast majority of API tokens and cryptographic keys found by the NCSU team were most likely valid tokens and keys used in the real world, as multi-owner repositories tend to contain test tokens used in shared-testing environments and in-development code.
Leaked API and crypto keys to hang around for weeks
Because the research project also took place over a six-month period, researchers also had a chance to observe if and when account owners would realize they’ve leaked API and cryptographic keys, and remove the sensitive data from their code.
The team said that six percent of the API and cryptographic keys they’ve tracked were removed within an hour after they leaked, suggesting that these GitHub owners realized their mistake right away.
Over 12 percent of keys and tokens were gone after a day, while 19 percent stayed for as much as 16 days.
“This also means 81% of the secrets we discover were not removed,” researchers said. “It is likely that the developers for this 81% either do not know the secrets are being committed or are underestimating the risk of compromise.”
Research team uncovers some high-profile leaks
The extraordinary quality of these scans became evident when researchers started looking at what some of these leaks exposed and where they originated.
“In one case, we found what we believe to be AWS credentials for a major website relied upon by millions of college applicants in the United States, possibly leaked by a contractor,” the NCSU team said.
“We also found AWS credentials for the website of a major government agency in a Western European country. In that case, we were able to verify the validity of the account, and even the specific developer who committed the secrets. This developer claims in their online presence to have nearly 10 years of development experience.”
In another case, researchers also found 564 Google API keys being used by an online service to skirt YouTube rate limits and download YouTube videos that it would later host on another video-sharing portal.
“Because the number of keys is so high, we suspect (but cannot confirm) that these keys may have been obtained fraudulently,” NCSU researchers said.
Last, but not least, researchers also found 7,280 RSA keys inside OpenVPN config files. By looking at the other settings found inside these configuration files, researchers said that the vast majority of the users had disabled password authentication and were relying solely on the RSA keys for authentication, meaning anyone who found these keys could have gained access to thousands of private networks.
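The OpenVPN finding boils down to two checks on each config file: does it embed a private key inline, and does it skip password authentication? A heuristic sketch of that analysis, assuming a typical `.ovpn` layout (real configs have more cases, such as `key` directives pointing at external files):

```python
def openvpn_config_risk(config_text):
    """Flag an OpenVPN config that embeds a private key inline while
    password authentication is disabled -- a heuristic sketch of the
    condition the researchers describe, not their actual tooling."""
    lines = [ln.strip() for ln in config_text.splitlines()]
    has_inline_key = "<key>" in lines
    uses_password = any(ln.startswith("auth-user-pass") for ln in lines)
    return has_inline_key and not uses_password
```

A config that embeds a `<key>` block but never declares `auth-user-pass` would be flagged; adding the directive clears the flag.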
The high quality of the scan results was also apparent when researchers used other API token-scanning tools to analyze their own dataset, to determine the efficiency of their scan system.
“Our results show that TruffleHog is largely ineffective at detecting secrets, as its algorithm only detected 25.236% of the secrets in our Search dataset and 29.39% in the BigQuery dataset,” the research team said.
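TruffleHog's detection at the time relied on flagging high-entropy strings in commit history. A quick Shannon-entropy estimate (per-character, from character frequencies) illustrates the general idea, and why a structured, lower-entropy token can slip under a naive threshold that a format-based regex would still catch:

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Estimated bits of entropy per character, from character frequencies."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A string drawn from 16 distinct hex characters tops out at 4 bits per character, while repetitive or short prefixed tokens score far lower, so an entropy-only scanner must trade false negatives against false positives when picking its threshold.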
GitHub is aware and on the job
In an interview with ZDNet today, Brad Reaves, Assistant Professor at the Department of Computer Science at North Carolina State University, said they shared the study’s results with GitHub in 2018.
“We have discussed the results with GitHub. They initiated an internal project to detect and notify developers about leaked secrets right around the time we were wrapping up our study. This project was publicly acknowledged in October 2018,” Reaves said.
“We were told they are monitoring additional secrets beyond those listed in the documentation, but we weren’t given further details.
“Because leakage of this type is so pervasive, it would have been very difficult for us to notify all affected developers. One of the many challenges we faced is that we simply didn’t have a way to obtain secure contact information for GitHub developers at scale,” Reaves added.
“At the time our paper went to press, we were trying to work with GitHub to do notifications, but given the overlap between our token scanning and theirs, they felt an additional notification was not necessary.”
API key leaks: a known issue
The problem of developers leaving their API and cryptographic keys in the source code of apps and websites is not a new one. Amazon has urged web devs to search their code and remove any AWS keys from public repos as far back as 2014, and has even released a tool to help them scan repos before committing any code to a public repo.
Some companies have taken it upon themselves to scan GitHub and other code-sharing repositories for accidentally exposed API keys, and revoke the tokens even before the API key owners notice the leak or abuse.
What the NCSU study has done is provide the most in-depth look at this problem to date.
The paper that Reaves authored together with Michael Meli and Matthew R. McNiece is titled “How Bad Can It Git? Characterizing Secret Leakage in Public GitHub Repositories,” and is available for download in PDF format.
“Our findings show that credential management in open-source software repositories is still challenging for novices and experts alike,” Reaves told us.
Work from Home Security
Spin Master is a leading global children’s entertainment company that invents toys and games, produces dozens of television and studio series that are distributed in 160 countries, and creates a variety of digital games played by more than 30 million children. What was once a small private company founded by childhood friends is now a public global supply chain with over 1,500 employees and 28 offices around the world.
Like most organizations in 2020, Spin Master had to adapt quickly to the new normal of remote work, shifting most of its production from cubicles in regional and head offices to hundreds of employees working from home and other remote locations.
This dramatic shift created potential security risks, as most employees were no longer behind the firewall on the corporate network. Without the implementation of hardened endpoint security, the door would be open for bad actors to infiltrate the organization, acquire intellectual property, and ransom customer information. Additionally, the potential downtime caused by a security breach could harm the global supply chain. With that in mind, Spin Master created a self-imposed 30-day deadline to extend its network protection capabilities to the edge.
- Think Long Term: The initial goal of establishing a stop-gap work-from-home (WFH) and work-from-anywhere (WFA) strategy has since morphed into a permanent strategy, requiring long-term solutions.
- Gather Skills: The real urgency posed by the global pandemic made forging partnerships with providers that could fill all the required skill sets a top priority.
- Build Momentum: The compressed timeline left no room for delay or error. The Board of Directors threw its support behind the implementation team and gave it broad budget authority to ensure rapid action, while providing active guidance to align strategy with action.
- Deliver Value: The team established two key requirements that the selected partner must deliver: implementation support and establishing an ongoing managed security operations center (SOC).
Key Criteria for Evaluating Privileged Access Management
Privileged Access Management (PAM) enables administrative access to critical IT systems while minimizing the chances of security compromises through monitoring, policy enforcement, and credential management.
A key operating principle of all PAM systems is the separation of user credentials for individual staff members from the system administration credentials they are permitted to use. PAM solutions store and manage all of the privileged credentials, providing system access without requiring users to remember, or even know, the privileged password. Of course, all staff have their own unique user ID and password that they use to complete everyday tasks such as accessing email and writing documents. Users who are permitted to handle system administration tasks that require privileged credentials log into the PAM solution, which provides and controls such access according to predefined security policies. These policies control who is allowed to use which privileged credentials when, where, and for what tasks. An organization’s policy may also require logging and recording of the actions undertaken with the privileged credentials.
Once implemented, PAM will improve your security posture in several ways. The first is by segregating day-to-day duties from duties that require elevated access, reducing the risk of accidental privileged actions. Secondly, automated password management reduces the possibility that credentials will be shared while also lowering the risk if credentials are accidentally exposed. Finally, extensive logging and activity recording in PAM solutions aids audits of critical system access for both preventative and forensic security.
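The policy check at the heart of the PAM model described above can be reduced to a small sketch: a lookup table mapping (user, system) pairs to permitted tasks, with every checkout attempt logged. The roles, systems, and policy entries below are hypothetical; real PAM products layer approval workflows, session recording, and credential rotation on top of this idea.

```python
from datetime import datetime, timezone

# Hypothetical policy table: (user, target system) -> permitted tasks.
POLICY = {
    ("alice", "prod-db"): {"backup", "schema-migration"},
    ("bob",   "prod-db"): {"backup"},
}

def checkout_credential(user, system, task, audit_log):
    """Grant the privileged credential only if policy allows, and record
    every attempt (allowed or denied) in the audit log."""
    allowed = task in POLICY.get((user, system), set())
    audit_log.append(
        (datetime.now(timezone.utc).isoformat(), user, system, task, allowed)
    )
    return allowed
```

The key property is that the user never sees the privileged password itself; the PAM system brokers access and keeps the audit trail, which is what enables the preventative and forensic auditing mentioned above.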
How to Read this Report
This GigaOm report is one of a series of documents that help IT organizations assess competing solutions in the context of well-defined features and criteria. For a fuller understanding, consider reviewing the following reports:
Key Criteria report: A detailed market sector analysis that assesses the impact that key product features and criteria have on top-line solution characteristics—such as scalability, performance, and TCO—that drive purchase decisions.
GigaOm Radar report: A forward-looking analysis that plots the relative value and progression of vendor solutions along multiple axes based on strategy and execution. The Radar report includes a breakdown of each vendor’s offering in the sector.
Vendor Profile: An in-depth vendor analysis that builds on the framework developed in the Key Criteria and Radar reports to assess a company’s engagement within a technology sector. This analysis includes forward-looking guidance around both strategy and product.
Adventist Risk Management Data Protection Infrastructure
Companies always want to enhance their ability to quickly address pressing business needs. Toward that end, they look for new ways to make their IT infrastructures more efficient—and more cost effective. Today, those pressing needs often center around data protection and regulatory compliance, which was certainly the case for Adventist Risk Management. What they wanted was an end-to-end, best-in-class solution to meet their needs. After trying several others, they found the perfect combination with HYCU and Nutanix, which provided:
- Ease of deployment
- Outstanding ROI
- Overall TCO improvement
Nutanix Cloud Platform provides a software-defined hyperconverged infrastructure, while HYCU offers purpose-built backup and recovery for Nutanix. Compared to the traditional infrastructure and data protection solutions previously in use at Adventist Risk Management, Nutanix and HYCU simplified processes, speeding day-to-day operations by up to 75%. Migration and update activities typically scheduled for weekends can now be performed during working hours, improving quality of life for IT staff and management. HYCU further increased savings by providing faster and more frequent recovery points, improving disaster recovery Recovery Point Objective (RPO) and Recovery Time Objective (RTO) by raising backup frequency from one to four per day.
Furthermore, the recent adoption of Nutanix Objects, which provides secure and performant S3 storage capabilities, enhanced the infrastructure by:
- Improving overall performance for backups
- Adding security against potential ransomware attacks
- Replacing components difficult to manage and support
In the end, Nutanix and HYCU enabled their customer to save money, improve the existing environment, and, above all, meet regulatory compliance requirements without any struggle.