Abstract: Data breaches remain a common occurrence affecting companies and individuals alike, despite data protection legislation having been promulgated worldwide. Factors that cause data breaches, such as incorrect device configuration or negligence, are unlikely to stop unless the relevant legislation is effectively enforced. While several information privacy regulators exist, the dominant norm is to respond reactively to reported incidents. Reactive response is useful for cleaning up detected breaches but does not give a clear indication of the level of personal information available on the internet, since only reported incidents are taken into account. The possibility of pro-active automated breach detection has previously been discussed as a capability augmentation for existing privacy regulators. By pro-actively detecting leaked information, detection times can potentially be reduced to limit the exposure time of Personal Identifiable Information (PII) on publicly accessible networks. At present the average time for data breach detection is in excess of three months internationally, and breaches are most often discovered not by the data owner but by an external third party, which increases the exposure of leaked information. The length of time that data is exposed on the internet has severe negative implications, since a significant portion of the information disclosed in data breaches has been shown to be used for cybercrime. It could then be argued that any reduction in data breach exposure time should directly reduce the opportunity for associated cybercrime. While previous work has shown pro-active breach detection to be potentially viable, numerous aspects of such a system remain in question. Legality, detection accuracy, communication with affected parties and alignment with privacy regulator operating procedures are all unexplored.
The research presented in this paper considers the results obtained from two iterations of such an experimental system, conducted on the South African .co.za domain. The first iteration, conducted in early 2014, served as a baseline for the second iteration, conducted one year later in 2015. While the experiment was conducted on the South African cyber domain, the concepts are applicable to the international environment.
Keywords: data breach, legislation, privacy, personal identifiable information, protection of personal information act, pro-active security
1. Introduction/background
Data breaches are at present an unavoidable global risk that occurs regularly and typically affects a large number of individuals (Shackleford, 2012). Furthermore, Personal Identifiable Information (PII) has become a key target for cyber criminals and hackers and is exploited in many ways, including identity theft, spamming, phishing and cyber-espionage (Rozenberg, 2012). A classic example is the data breach at the health insurer Anthem. Within hours of the breach, cyber criminals started sending phishing messages to the victims affected in the first attack. Fortunately, Anthem was aware of the phishing campaigns and could warn its customers of the potential risks to limit losses (Paganini, 2015). Detecting and removing exposed data is thus of utmost importance, and yet the average breach remains undetected for 90 days (Global Space, 2013). Data breaches occur for a number of reasons, such as negligence, theft or malicious actors hacking the information from a company's databases. A clear example of a stolen company database is the recent Ashley Madison data breach, in which the PII of approximately 32 million users (9.7 GB in size) was posted online in August 2015. The hack was an attempt to force the website to close down by threatening to expose all of its registered users, which would in turn expose the scandals of the parties involved. While the main target was Ashley Madison, one analysis indicated that more than 15000 email addresses from .mil or .gov domains were included in the breach. This large volume of PII for enlisted personnel should be worrying, since disclosed personal information can be used as leverage against individuals in key positions via social engineering attacks (Zetter, 2015).
With the inception of a new privacy law in South Africa, the Protection of Personal Information (PoPI) Act (South African Government Gazette, 2013), the need for more effective data breach enforcement has been identified due to the increased visibility of data breaches. The possibility of enhancing regulatory compliance by means of pro-active automated breach detection is discussed in this paper. By pro-actively discovering leaked information, detection times can potentially be reduced to limit the exposure time of PII on publicly accessible networks, thus increasing privacy regulator effectiveness. The data presented in this paper is derived from the results of an experimental system that makes use of public search engines to identify documents and files on the Internet that contain personal information. The accuracy of such a tool and its potential to reduce the detection time of data breaches, and as an indirect result reduce cybercrime, are examined along with its potential use by existing privacy regulators. The South African PoPI Act is almost a direct copy of the European Data Protection Directive (EU DPD) that was implemented in 1995 (Birnhack, 2008). As such, while the results of the experiment discussed in this research are focussed on the South African landscape, they are internationally applicable and could aid privacy regulators internationally.
2. Pro-active data breach detection
In order to gauge the level of existing PII disclosure before the promulgation of the PoPI Act in South Africa, a custom-developed software application was used to pro-actively scan the Internet in search of leaked PII. The architecture of the application is further discussed in section 2.1. Two datasets were collected using the custom application, the first in 2014 and the second in 2015. An in-depth analysis of these datasets is performed in sections 3 and 4 with the aim of comparing the amounts of PII disclosed each year as well as the validity of the data detected. The comparison results are examined for indicators of how effective the PoPI legislation has been, since its inception, in reducing the amounts of PII detected online.
2.1 Application architecture and potential benefits
The application is capable of scanning the Internet for disclosed personal information. A scan on the Internet, in this context, refers to the application making use of public data sources, provided by Google, Bing and Twitter, in search for leaked PII being disclosed in electronic documents. These documents are stored within the website domain space. The application serves as a PII detection and collection service. Application program interface (API) calls are being used in search of personal information being leaked within electronic documents found freely available on the Internet. For the scope of this experiment, the scans were limited to the .co.za domain. Further data processing takes place in order to obtain an Internet Protocol (IP) address and approximate geo location for each website found responsible for disclosing personal information. Figure 1 shows the architecture of the custom application.
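The scoping step described above can be illustrated with a minimal sketch. It is not the application's actual code: the function names, the extension list and the `site:`/`filetype:` operator syntax are assumptions for illustration, and the exact query syntax accepted by each data source's API may differ. The sketch builds one search string per document type and keeps only result URLs whose host falls under the chosen domain suffix.

```python
from urllib.parse import urlparse

# Document extensions to search for; this list is an assumption based on
# the file types named in section 2.2.
FILE_TYPES = ["txt", "doc", "xls", "pdf", "sql", "xml"]


def build_queries(domain_suffix, file_types=FILE_TYPES):
    """Build one search string per file type, restricted to a domain suffix."""
    return [f"site:{domain_suffix} filetype:{ext}" for ext in file_types]


def in_scope(url, domain_suffix):
    """Return True if the URL's host is the suffix itself or a sub-domain of it."""
    host = (urlparse(url).hostname or "").lower()
    suffix = domain_suffix.lower().lstrip(".")
    return host == suffix or host.endswith("." + suffix)


queries = build_queries("co.za")
# Hypothetical result URLs, used only to demonstrate the filter.
hits = ["http://example.co.za/list.xls", "http://example.com/list.xls"]
scoped = [u for u in hits if in_scope(u, "co.za")]
```

Filtering on the parsed host rather than on the raw URL string avoids counting a URL that merely contains ".co.za" somewhere in its path.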
2.2 Information processing
The types of personal information extracted were limited to identity (ID) numbers, land line numbers, cell phone numbers, email addresses, credit card numbers and addresses. Other personal information such as names, job location or religious beliefs requires significant lexical analysis since no dominant standard of identification currently exists. The electronic documents scanned were limited to the most commonly used file types such as text, MS Word, MS Excel and PDF files. SQL scripts and XML files were also included. While information might reside in other formats, each document type has unique characteristics that need to be catered for on a technical level, which adds significant overhead to development time and will be addressed in future revisions.
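As a rough illustration of pattern-based extraction for three of the PII types listed above, the following sketch applies regular expressions to text already extracted from a document. The patterns are simplified assumptions and not the expressions used by the application: the ID pattern only looks for 13 consecutive digits, and the cell phone pattern assumes a ten-digit number starting with 06, 07 or 08.

```python
import re

# Illustrative patterns only; real-world formats vary more than this.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # 13 consecutive digits that are not part of a longer digit run.
    "id_number": re.compile(r"(?<!\d)\d{13}(?!\d)"),
    # Assumed mobile format: 0, then 6, 7 or 8, then eight more digits.
    "cell_phone": re.compile(r"(?<!\d)0[6-8]\d{8}(?!\d)"),
}


def extract_pii(text):
    """Return a dict mapping each PII type to the list of matches in text."""
    return {name: rx.findall(text) for name, rx in PATTERNS.items()}


# Synthetic sample text; none of these values belong to a real person.
sample = "Contact jane@example.co.za or 0821234567, ID 9001015009087."
found = extract_pii(sample)
```

The look-behind and look-ahead guards stop a ten-digit phone pattern from matching inside a longer digit run such as an ID number.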
2.3 Information visualisation
The application follows a staged approach in which information is first collected, then processed and lastly visualised. Figure 2 illustrates how a single server responsible for the leakage of PII is represented by a blue antenna icon at its approximate geo-location. When multiple servers are found in the same area they are grouped together, with the number of servers displayed in the grouped icon. As shown in Figure 2, 234 servers were found in South Africa.
Zooming into the map allows specific server nodes to be opened to display more details on that particular server. Figure 3 shows the details of a website URL that has been opened in the application, indicating that the particular website is responsible for leaking 171 telephone numbers and 9 cell phone numbers. The approximate geo-location of this web server is identified as Durban, South Africa. The complete URL of the file found is hidden for privacy reasons.
The experimental system at this point in time only made use of the limited free queries provided by each of the data sources. This was a deliberate limitation on the research team's part to investigate what could be achieved with little to no resources in terms of pro-active data breach detection. A further study comparing the current free results with those of a funded approach will be conducted in future work.
3. Experiment results
The first iteration of collecting disclosed personal information was done over a three-month period from May to July 2014, six months after the PoPI Act was signed into law in November 2013 (South African Government Gazette, 2013). A second three-month iteration was performed from July to the end of September 2015, one year and eight months after the instantiation of the Act. Since the Act has now been signed into law for quite some time, a popular hypothesis would be that the amount of leaked personal information should have decreased (Romanosky, Telang & Acquisti, 2011). The amounts of leaked personal data found in both datasets are presented in Table 1. The count of PII is listed for 2014 and 2015, as well as the percentage increase or decrease.
Table 2 indicates further processing on specifically emails detected to reveal the number of times an email domain was detected in the datasets. The focus on email domains was simply to illustrate how individuals make use of resources given to them by one company to obtain services from another company. Should a data breach occur, it might not just be the individual affected but it could potentially affect another company that the individual was affiliated with.
Upon examination of the results listed in Table 1, it was found that the amounts of PII disclosed in 2015 have slightly increased as the country progresses further into the PoPI Act compliance timeframe; all PII types have increased except for addresses, which have decreased by 99%. The results thus show that although the South African privacy and data breach legislation has been in effect since November 2013, it seems to have had little effect on the amounts of PII being disclosed over the Internet.
The information presented in Table 2 compares the email domains counted in the 2014 dataset with those detected in 2015. The number of domains decreased significantly in 2015 even though the total number of email addresses detected has increased. A possible reason for the significantly reduced number of domains is that particular large files found in 2014 that were responsible for leaking the information have been removed. New files were detected in 2015, thus the overall count of leaked PII has increased as seen in Table 1.
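For reference, the percentage increase or decrease between two yearly counts can be computed as in the sketch below. The function name is illustrative and the counts in the example are hypothetical; they are not values from Table 1.

```python
def percent_change(count_before, count_after):
    """Relative change from the first count to the second, in percent."""
    if count_before == 0:
        raise ValueError("percentage change is undefined for a zero baseline")
    return (count_after - count_before) / count_before * 100.0


# Hypothetical counts, not values from Table 1.
drop = percent_change(2000, 20)    # a fall from 2000 to 20 items
rise = percent_change(400, 450)    # a rise from 400 to 450 items
```

A negative result indicates a decrease and a positive result an increase relative to the earlier count.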
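The distinction drawn above between the number of distinct email domains and the total number of addresses can be made concrete with a short sketch. The function name and the sample addresses are made up for illustration and do not come from the datasets.

```python
from collections import Counter


def domain_counts(emails):
    """Count how often each email domain appears, ignoring letter case."""
    domains = (e.rsplit("@", 1)[1].lower() for e in emails if "@" in e)
    return Counter(domains)


# Synthetic addresses for illustration only.
emails = ["a@Example.co.za", "b@example.co.za", "c@mail.example"]
counts = domain_counts(emails)
distinct_domains = len(counts)            # number of different domains
total_addresses = sum(counts.values())    # number of addresses counted
```

Comparing `distinct_domains` and `total_addresses` across two datasets shows how the number of domains can fall while the number of addresses rises.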
Some curious findings in the datasets examined were that the youngest person whose personal information was detected is 16 years of age and the eldest is 86 years of age. These ages were calculated from the national identification numbers. Another noteworthy finding is the number of times certain individual pieces of PII were found. For example, a certain ID number was detected 101 times, whereas a certain email address was detected 5244 times (the specific ID number, email address or any other PII value is not mentioned for privacy reasons). See Table 3 for the number of times the most frequently detected item of each PII type was found. Further investigation into the reasons for the high volume of disclosure of individual PII artefacts would have to be conducted on a case-by-case basis.
Using the custom application, it was possible to geo-locate the web servers or hosts found responsible for the disclosure of PII. A notable finding is the substantial number of hosts or servers found responsible for the leakage of personal information, namely 10718. If it is assumed that each server represents an individual or company, this points to 10718 transgressions of privacy protection legislation. Figure 4 shows a breakdown of the number of hosts found per country. Although the focus is only on South Africa and the .co.za domain, the websites can be hosted outside the borders of South Africa. This could highlight non-compliance issues, as the PoPI Act states that PII may not be stored at international locations without the consent of the data subject, according to condition 6, section 18(g) (South African Government Gazette, 2013).
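Deriving an age from an ID number can be sketched as follows, using the fact that South African ID numbers begin with the date of birth in YYMMDD form. The function name and the century rule are assumptions made for this sketch, not the application's documented logic, and the ID values in the example are synthetic.

```python
from datetime import date


def age_from_id(id_number, on_date):
    """Estimate age in whole years from the YYMMDD prefix of an ID number."""
    yy, mm, dd = int(id_number[0:2]), int(id_number[2:4]), int(id_number[4:6])
    # Assumed century rule: two-digit years not later than the reference
    # year are read as 20YY, all others as 19YY.
    century = 2000 if yy <= on_date.year % 100 else 1900
    born = date(century + yy, mm, dd)
    had_birthday = (on_date.month, on_date.day) >= (born.month, born.day)
    return on_date.year - born.year - (0 if had_birthday else 1)


# Synthetic ID numbers; only the first six digits matter here.
reference = date(2015, 9, 30)
young = age_from_id("9907150000000", reference)
old = age_from_id("2903010000000", reference)
```

The two-digit year is ambiguous, so under this rule a person older than 100 would be misread as a child; any such result would need manual review.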
Taking into consideration the time period the Act has been instantiated and the amount of PII data leaked in 2014, and still being disclosed in 2015, one can see that the privacy legislation has very little effect on the amounts of PII data being leaked. It seems that South Africa still has a long road ahead towards the compliance process and possible improvements on the Act to better control the amounts of personal information being leaked.
Since a number of countries already enforce privacy laws, one can learn from the measures they have in place to reduce the amounts of personal information being disclosed online. In Australia, ALRC Report 108 states that a 28-month inquiry was conducted into the extent to which the Privacy Act and related laws provide an effective framework for the protection of privacy in Australia. It was found that the Privacy Act is working well and is up to date (The Australian Law Reform Commission, 2012).
4. Reflection on data results
This section reflects on the accuracy of the data presented in section 3. It also considers how the methods used in this experiment align with the regulators, and how the parties affected by these data breaches can be notified.
4.1 Accuracy of data and PII classification
The custom application detected a substantial amount of PII. However, the accuracy of the data detected remains a concern. With the use of regular expressions (Regular-expressions.info, N.D.), detecting certain PII types such as telephone, cell phone, identity (ID) numbers and credit card numbers is reasonably straightforward. Detecting whether the email addresses or cell phone numbers can indeed be classified as PII is not as simple and also not implemented in the custom application. Manual or human intervention is required to be able to classify the collected data as PII at this point in time.
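One plausible way to reduce false positives among digit-based matches, offered here as an assumption rather than as something the application implements, is to apply a checksum after the regular expression match. Payment card numbers carry a Luhn check digit, and the final digit of a South African ID number is commonly documented as a Luhn check digit as well. The sketch below implements the standard Luhn test; the function name is illustrative.

```python
def luhn_valid(number):
    """Return True if a string of ASCII digits passes the Luhn checksum."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# A well-known card test number and a synthetic 13-digit example.
card_ok = luhn_valid("4111111111111111")
card_bad = luhn_valid("4111111111111112")
id_ok = luhn_valid("8001015009087")
```

A checksum only confirms that a digit string is well formed. It does not establish that the number belongs to a real person, so the manual classification step described above would still be needed.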
Case studies were done by randomly selecting detected PII entries in the database and manually investigating the accuracy of the data. Large scale verification is at present not possible since it would require a large number of resources to be allocated and would require significant funding overheads. Once a data link is randomly selected, the verification methodology is to revisit the selected location. The exposed data is retested and the data validated against the PII type it was previously classified as, such as cell phone number or email address. Should the link no longer be available, it is considered removed. The randomly selected locations to revisit are presented in Table 4 along with the particulars recorded for each record.
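The spot-check procedure described above can be sketched as follows. This is an illustrative outline under stated assumptions: the record layout (a URL paired with the PII value found there), the `fetch` callable and the "changed" status are all hypothetical, and the real verification was carried out manually.

```python
import random


def spot_check(records, sample_size, fetch, seed=None):
    """Re-test a random sample of (url, pii_value) records.

    fetch is a caller-supplied function that returns the document text for
    a URL, or None when the link is no longer available.
    """
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    results = []
    for url, pii_value in sample:
        text = fetch(url)
        if text is None:
            status = "removed"          # link no longer available
        elif pii_value in text:
            status = "still exposed"    # value found again at the location
        else:
            status = "changed"          # page exists but value is gone
        results.append((url, status))
    return results


# Synthetic records and a fake fetcher backed by a dict, for illustration.
pages = {"http://a.example/x": "call 0821234567", "http://b.example/y": "nothing"}
records = [
    ("http://a.example/x", "0821234567"),
    ("http://b.example/y", "0830000000"),
    ("http://c.example/z", "0840000000"),
]
outcome = dict(spot_check(records, 3, pages.get, seed=1))
```

Passing the fetcher in as a parameter keeps the sampling logic testable without network access. Whether a re-found value really is PII remains a human judgement.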
4.1.1 Case study 1
A cell phone number was found in an insert statement of a leaked SQL file. The link is still exposed, but the data belongs to a website advertising businesses. The number is that of a business and therefore cannot be seen as PII. However, the specific business cannot be found using the search function of the website, but can be found in the exposed SQL script file. This indicates that the record is no longer available through the website's search functionality but that its details are still exposed via the script. The information was found to be accurate and still exposed at the time of the manual check, but cannot be classified as leaked PII.
4.1.2 Case study 2
A cell phone number was found in an Excel spreadsheet and belongs to a Pilates instructor. The spreadsheet lists a number of Pilates instructors in various regions of South Africa. The information was not found on the website itself as an advertisement, but the document could have been intended as public information, provided that the parties involved gave consent. It is not clear whether this is private or public information. Other information found with this number includes the email address, website address, name, surname and address. Enough information is exposed to mount a phishing attack on the people listed in the spreadsheet. The data was found to be accurate and still exposed. It can also be classified as leaked PII.
4.1.3 Case study 3
Personal information was found in a text file exposing the details of individuals who obtained a cycle tour certificate. At the time of the spot check the file had been removed. It is therefore difficult to state that the information found was accurate and was in fact disclosed personal information.
4.1.4 Case study 4
An email address was found in a SQL file that exposes the complete SQL database structure, with a significant amount of PII being disclosed. The domain responsible for leaking the PII seems to be a sub-domain of jdconsulting.co.za. jdconsulting.co.za redirects to another website called talooma.com, which is a digital marketing company. The file seems to be a script used for storing a list of email addresses used for marketing purposes. The email address was found next to messages indicating that it is an advert for software packages. This kind of marketing campaign is precisely what the PoPI Act is attempting to control. There is no evidence of whether or not the target users gave consent to receive these marketing emails. The data was deemed to be accurate, still exposed and classified as leaked PII.
4.1.5 Case study 5
An email address was found in a SQL script file. This file exposes the database structure of a website advertising accommodation, wine routes and tours. The information was found to be accurate, but cannot be classified as leaked PII, since it is public contact information.
4.1.6 Case study 6
An ID number was found in a text file. Exposing an ID number can be seen as exposing someone's personal information. However, the file has been removed and therefore no longer discloses the information.
Having the information collected by the application classified as PII is one of the main objectives of this investigation. However, how these findings align with the regulators and how they are communicated to the affected parties remain matters of concern and possibly prevent the proposed application from being used as a pro-active PII detection system. See section 4.2 below for more details on these factors.
4.2 Factors preventing the adoption of pro-active PII detection
Certain factors are preventing the adoption of a pro-active detection system as proposed in this paper. Two of the main factors are the alignment with regulators and the restrictions on communicating with the affected parties.
4.2.1 Alignment with regulators
Governments world-wide have been assigned the responsibility of protecting the personal and sensitive data of consumers and companies (Noblet, N.D.). Having a system in place that could detect the leakage of personal information could be very beneficial, but it might not be aligned with the privacy law regulators' objectives and mandate. One needs to ensure adherence to all the principles of the privacy laws of the particular country.
The proposed system collects personal information from the Internet with the use of public data sources. The purpose is to detect leaked or disclosed personal information online. To accomplish this, personal data needs to be collected and analysed. The collection and processing of personal data should be done in such a way that it still complies with the privacy law. This particular experiment was performed in South Africa, focusing on the .co.za domain; therefore it has to comply with the principles of the PoPI Act (Botha, Eloff, & Swart, 2015; South African Government Gazette, 2013).
4.2.2 Communication with affected parties
Obtaining an overview of disclosed personal information distributed on the internet is interesting in the sense that it provides at least a quantitative measurement indicator. While the measurement is useful to express the seriousness of the problem or to act as a baseline for future experimentation, the question of how to use this information to reduce the amounts of data being leaked is difficult to answer.
The website owners where the information resides could be notified of the PII data leakage but they might merely be service providers and not the data custodians responsible for the data. According to the privacy legislation it is the responsibility of the data custodian to safeguard personal information. Another concern is that notification might raise unwanted questions on why this kind of data is collected and what methods were used to obtain this information. Collection could be perceived as Government interference or raise spying concerns. Notification of such an incident might also lead to the assumption that it is the responsibility of the person who notified the third party to take action on the matter and to help remove the leaked information from the responsible websites.
An alternative approach to consider is to notify the person whose personal information is being disclosed, but this approach is also not perfect. The PII disclosed might be an ID number without any additional information. Contacting the individual will then require co-operation between more than one department, adding complexity to the potential solution. This approach also does not mitigate concerns over Government data collection. Furthermore, the question remains of how to notify individuals that their data was breached without disclosing the breach to other parties as well. Providing a link to where the data is located and leaving it up to the individuals to contact the service provider to remove the offending data will expose all other parties included in the data breach to each other.
One possible solution would be to work in close partnership with law firms, where the law firm could use the personal data being disclosed to force the service provider to take responsibility and notify the parties involved. The financial viability of this option is uncertain at present and should be further explored. At present, one of the best ways to reduce the amount of personal information being disclosed is to raise awareness regarding the leaked information as well as the requirements for compliance with the PoPI Act (Botha et al., 2015).
4.3 Time period that data was available in South Africa vs. International averages
The main objectives of this section were to determine whether data is removed without notification after some time and, if so, how long the data had been exposed. In the current process, the experimental application runs a background service to determine whether the data is still available or has been removed. A total of 1541 individual records detected by the system, containing multiple types of PII such as email, cell phone and credit card details, were leaked but have since been removed and are no longer available. An extract of data that has been removed, as well as the number of days that it was exposed on the Internet, is presented in Table 5. In some instances the data was removed within a few days, but in others nearly a year passed before the disclosed data was removed.
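The exposure duration reported per record can be computed as sketched below. The function and parameter names are assumptions for illustration, as is the choice to measure from the date of first detection to the date of the first background check that found the data gone.

```python
from datetime import date


def exposure_days(first_seen, removed_on):
    """Days between first detection and the first check that found it gone."""
    if removed_on < first_seen:
        raise ValueError("removal date precedes first detection")
    return (removed_on - first_seen).days


# Hypothetical dates, not taken from Table 5.
short_exposure = exposure_days(date(2014, 6, 1), date(2014, 6, 4))
long_exposure = exposure_days(date(2014, 6, 1), date(2015, 5, 20))
```

A figure obtained this way is best read as a lower bound: the data may have been online before it was first detected, and the removal date is only as precise as the interval between background checks.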
At the time of writing this paper, the domains listed in Table 6 are still responsible for the disclosure of PII. Some of these domains are leaking a substantial amount of data with numbers such as 32953 telephone numbers, 32763 cell phone numbers, 1094 ID numbers and 10515 Credit Card numbers. The amounts disclosed are deliberately withheld to limit the exposure of the domains responsible for such high volumes of PII leakage.
5. Conclusion
The frequency of data breaches is increasing, and the problem is that the data disclosed is used for various criminal activities and cyber-attacks (Paganini, 2015; Zetter, 2015). The current approach of reactively clearing up data breaches can most definitely be improved. The experiment documented in this paper examined the amount of PII publicly available at a point in time when privacy and data breach legislation is being introduced in South Africa. The expectation was that the recently enacted South African privacy legislation would lead to a reduction in the amounts of PII being publicly disclosed. However, upon examination of the temporal data gathered from 2014 to 2015, only a slight improvement could be observed. Instead it is clear that the amounts of personal information being disclosed are still substantial. One potential reason for the slow reduction in leaked information could be that while the Act has been promulgated, the South African privacy regulator has not yet been appointed. There have thus been no formal charges or cases presented in a court of law against an individual or company that could spur greater public compliance. The data observed was analysed for accuracy via a number of case studies and for the length of time it was disclosed online. International averages of up to six months are stated in prior work, but in this experiment certain instances were recorded where PII was still available after nearly two years. This is well above the international averages and most certainly a cause for concern. A limitation identified during accuracy tests was that not all PII detected is meant to be private. Contact information such as cell phone numbers and email addresses is not always private personal information, but is often public information disclosed on purpose. Determining what PII is deliberately disclosed as opposed to accidentally leaked remains one of the more challenging problems for fully automated pro-active data breach detection.
References
Birnhack, M. D. (2008). The EU Data Protection Directive: An engine of a global regime. Computer Law & Security Review, 24(6), 508-520.
Botha, J., Eloff, M., & Swart, I. (2015). Evaluation of Online Resources on the Implementation of the Protection of Personal Information Act in South Africa. Paper presented at ICCWS 2015: The Proceedings of the 10th International Conference on Cyber Warfare and Security, South Africa. 39.
Global Space. (2013). Data Breaches Often not Detected for Months, report finds. Retrieved from https://www.globalscape.com/blog/2013/2/14/data-breaches-often-not-detected-for-months-report-finds
Noblet, T. (N.D.). Business IT: Understanding Regulatory Compliance. Retrieved from https://technet.microsoft.com/enus/magazine/2006.09.businessofit.aspx
Paganini, P. (2015). Cybercrime Exploits Anthem Data Breach in Phishing Campaigns. Retrieved from http://securityaffairs.co/wordpress/33278/cyber-crime/anthem-phishing-campaigns.html
Regular-expressions.info. (N.D.). Welcome to Regular-expressions.info - the premier website about regular expressions. Retrieved from http://www.regular-expressions.info/
Romanosky, S., Telang, R., & Acquisti, A. (2011). Do Data Breach Disclosure Laws Reduce Identity Theft? Journal of Policy Analysis and Management, 30(2), 256-286.
Rozenberg, Y. (2012). Challenges in PII Data Protection. Computer Fraud & Security, 2012(6), 5-9.
Shackleford, D. (2012). When Breaches Happen: Top Five Questions to Prepare for. SANS whitepaper, SANS Institute.
South African Government Gazette. (2013). Protection of Personal Information Act.
The Australian Law Reform Commission. (2012). Privacy law and practice. Retrieved from http://www.alrc.gov.au/inquiries/privacy
Zetter, K. (2015). Hackers Finally Post Stolen Ashley Madison Data. Retrieved from http://www.wired.com/2015/08/happened-hackers-posted-stolen-ashley-madison-data/
Johnny Botha1, 2, M. Eloff1, and Ignus Swart2
1Institute for Corporate Citizenship UNISA, Pretoria, South Africa
2CSIR, Pretoria, South Africa
Johnny Botha is a software developer and researcher at the Council for Scientific and Industrial Research (CSIR). He is studying towards a master's (MTech) degree in Information Technology at the University of South Africa (UNISA). Topic: "Personal Identifiable Information Disclosure Since the Protection of Personal Information Act Adoption in South Africa". He obtained an NDip and a BTech degree in Computer Systems Engineering at the Tshwane University of Technology (TUT).
Copyright Academic Conferences International Limited 2016