Challenges of Data Privacy in AI and Machine Learning

On 11/06/2020, I discussed “Data Privacy and AI” at the “Analytics and Data Privacy” conclave organized by Symbiosis University.

The audience asked some excellent questions. Below are my answers, which address the “Challenges of AI and Data Privacy”.

Can AI and data privacy co-exist?

Yes. For most AI/ML analysis, we anonymise the privacy data appropriately before carrying out the analysis.
For some AI/ML analysis, such as a recommendation engine where some privacy data is required to provide the recommendation, we ensure that only the data actually required is used and nothing beyond it.
We ensure acceptance of the Privacy Policy by the client where applicable.
Additionally, the CISO and the Privacy Officers exercise the required controls on all activities carried out by the organization, including AI projects.
In some cases, we make sure privacy data such as SSN, Protected Health Information (PHI), and other personal data is masked before providing data to AI/Analytics teams. For example, we mask the first 5 digits of an SSN or the first 12 digits of a credit card number.
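The masking rule described above (first 5 digits of an SSN, first 12 digits of a card number) can be sketched in a few lines of Python. This is only an illustration: the function names, the `X` mask character, and the assumption of 9-digit SSNs and 16-digit card numbers are mine, not a description of any particular production system.

```python
import re


def _mask_leading_digits(value: str, count: int, mask_char: str = "X") -> str:
    """Replace the first `count` digits of `value`, leaving separators intact."""
    out = []
    masked = 0
    for ch in value:
        if ch.isdigit() and masked < count:
            out.append(mask_char)
            masked += 1
        else:
            out.append(ch)
    return "".join(out)


def mask_ssn(ssn: str) -> str:
    """Mask the first 5 digits of an SSN (assumed to contain 9 digits)."""
    if len(re.sub(r"\D", "", ssn)) != 9:
        raise ValueError("expected an SSN with 9 digits")
    return _mask_leading_digits(ssn, 5)


def mask_card(card: str) -> str:
    """Mask the first 12 digits of a card number (assumed to contain 16 digits)."""
    if len(re.sub(r"\D", "", card)) != 16:
        raise ValueError("expected a card number with 16 digits")
    return _mask_leading_digits(card, 12)


if __name__ == "__main__":
    print(mask_ssn("123-45-6789"))           # XXX-XX-6789
    print(mask_card("4111 1111 1111 1234"))  # XXXX XXXX XXXX 1234
```

Masking of this kind is best applied before the data leaves the source system, so that the AI/Analytics team never receives the full values.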

What steps are you taking in your organization to protect customers’ privacy while providing data access for AI and Machine Learning projects?

The following are some of the steps to consider:

Knowing exactly where “privacy data” is located on our systems, how it is stored and processed, and where it is duplicated (e.g. for backup purposes).
Using proper encryption of privacy data, both when stored on the systems and when transmitted.
Ensuring that privacy data is not duplicated unless it is absolutely required for business purposes.
Providing access to privacy data only on an absolute business-need basis.
Anonymising the privacy data while carrying out any analysis (as far as possible).
Taking utmost care to process the data only for purposes the customer has consented to.
Ensuring that the data is deleted/archived on a timely basis, based on the IT security policies and procedures.
Knowing exactly what data is considered “privacy data” under the various statutes and regulations, e.g. HIPAA and GDPR (General Data Protection Regulation).
“Anonymizing” data is the common practice to protect “privacy data”.
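As one possible illustration of the anonymisation step above, the sketch below drops direct identifiers from a record and replaces the customer ID with a keyed hash before the record is handed to an analytics team. Strictly speaking this is pseudonymisation, which is weaker than full anonymisation, because the remaining fields may still allow re-identification. The field names, the `DIRECT_IDENTIFIERS` set, and the function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical list of fields treated as direct identifiers (an assumption).
DIRECT_IDENTIFIERS = {"name", "email", "ssn"}


def pseudonymize(value: str, key: bytes) -> str:
    """Return a keyed SHA-256 hash (HMAC) of `value` as a hex string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def prepare_for_analysis(record: dict, key: bytes, id_field: str = "customer_id") -> dict:
    """Drop direct identifiers and replace the ID field with a pseudonym."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned[id_field] = pseudonymize(str(record[id_field]), key)
    return cleaned


if __name__ == "__main__":
    secret_key = b"store-this-key-outside-the-analytics-environment"
    record = {
        "customer_id": 1001,
        "name": "Jane Doe",
        "email": "jane@example.com",
        "ssn": "123-45-6789",
        "basket_total": 42.50,
    }
    print(prepare_for_analysis(record, secret_key))
```

Because the same key always produces the same pseudonym, records for one customer can still be joined for analysis, while anyone without the key cannot easily map the pseudonym back to the original ID.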

Why is Data Privacy such a concern when it comes to AI?

First of all, organizations are concerned about adhering to specific standards and government regulations.
Certain information is private, sensitive, or secret, and access to such data is restricted.
Such information is classified as “confidential” and needs to be protected.
Compromising sensitive personal data may lead to damage to the organization’s reputation, violation of privacy rights, and loss of business.
If a breach of confidential data happens, senior executives are answerable to investors, government regulators, and consumers.
For instance, sharing medical data with pharmaceutical companies and hospitals for medical research purposes, or for finding patterns in the usage of a particular drug, can violate a patient’s privacy even when sensitive information is masked.
Hence, data privacy is a big challenge not just for AI projects but in general.

How much is the end user aware of what personal data is being collected and shared?

The answer is “it depends”. However, over time, awareness among users about the privacy of personal data has increased.
Sometimes users simply accept an organization’s privacy policy (for example, during a software download) without reading it, thus allowing the organization to use their personal data beyond what they expected.
How many of us actually read End User License Agreements (EULAs)?
Because of recent incidents such as Cambridge Analytica, consumers are becoming aware of how their data is being used.
Also, various laws are being brought in across the globe to support the privacy of personal data. These regulations include the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act of 2018 (CCPA).

What is your view on the data privacy regulations that have been recently announced?

The most recent regulations are the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act of 2018 (CCPA). Both aim to protect customer “privacy”.
Some personal data is collected by organizations without giving users a real choice, and some is not.
For example, Google Maps, Google Search, and Facebook collect a large amount of data about you.
Personal data is any information that relates to an individual who can be directly or indirectly identified. Names and email addresses are obviously personal data. Location information, ethnicity, gender, biometric data, religious beliefs, web cookies, and political opinions can also be personal data.
Key regulatory points of GDPR include providing transparency on how personal data is used, how it is stored, who has access to it, and the purpose of collecting it.
GDPR applies to the personal data of individuals in the EU.
Similarly, the California Consumer Privacy Act of 2018 (CCPA) applies to residents of the state of California.
In July 2019, New York passed the Stop Hacks and Improve Electronic Data Security (SHIELD) Act. This law amends New York’s existing data breach notification law and creates more data security requirements for companies that collect information on New York residents.
The consumer has the right to access, the right to rectify, the right to delete, and the right to object.
These regulations give consumers more control over the personal information that organizations collect.

Do you think the evolution of AI technology in the next 5-10 years will be blocked by Data Privacy?

Yes, very much. It is true that data “privacy” sometimes blocks the use of data for AI purposes.
In particular, organizations dealing with sensitive and private data, such as those in healthcare or finance, have lagged in AI adoption due to regulatory constraints that protect users’ data.
An AI team may comprise temp workers and contractors, not just employees. Organizations are concerned about data leaks by people who are working on AI projects. Hence, organizations exercise caution and provide data access only to the right people.
A leak of privacy data can lead to huge penalties and damage to the company’s reputation, and the consequences can be severe.
Hence, AI projects normally require data access approval from C-suite executives to ensure that the data is made available only for AI purposes that are intended to solve critical business problems or challenges.
I strongly suggest that the AI team report to one of the C-suite executives to ensure that there is both authority and accountability.

How do Data Privacy regulations affect businesses and innovation as a whole?

One thing you have to understand is that AI is leading to significant benefits for organizations in terms of increasing productivity, optimizing processes, reducing downtime, increasing product reliability, increasing proactive preventive maintenance, enabling prescriptive analysis, and generating additional business.
But “privacy” is equally important. To avoid any possible breach of data privacy, organizations have to invest money to put proper controls in place so that their IT systems are secured and protected from threats.
Though there is an initial cost, the penalty for a breach of privacy data or for negligence is much larger.
Hence, it is to an organization’s advantage to spend additional money and protect its data by implementing proper controls, rather than paying a huge penalty later on.
Once data privacy meets the standards, and all the regulations and proper controls are in place, approval can be provided for accessing the data required to solve AI/Machine Learning/Analytics business challenges.

Should we as consumers be given more choice on how our PII data is used by AI algorithms?

Yes, consumers should be aware of what “personally identifiable information” (PII) is.
Consumers should also be aware of how their data is collected and the purpose for which it is collected.
They should know whether the data is used for research purposes or for monetary benefit.
Based on all this information, the consumer should be able to decide whether or not to give consent.
The consumer has to read the data agreement carefully to check whether their data can be sold to others for monetary benefit.
The consumer should have the right to modify the data, erase the data, or decline the collection of such data.
The consumer should be aware that most AI and analytics work can be done by anonymising the privacy data appropriately.

Other important factors to consider:

Privacy-preserving techniques include cryptographic methods such as Homomorphic Encryption, Garbled Circuits, and Secret Sharing, as well as Secure Processors and, ultimately, the generation of “synthetic data”.
When generated carefully, “synthetic data” can be used with much lower risk to the privacy of users.
Anonymization removes some of the informational value of the data; it can distort or completely destroy important correlations.
Also, Intel’s SGX provides a hardware-based Trusted Execution Environment (TEE) that is designed to protect data while it is being processed.
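To give a feel for one of the techniques listed above, here is a minimal sketch of additive Secret Sharing. A value is split into random-looking shares that individually reveal nothing, yet parties holding shares can add them locally so that only the total is ever reconstructed. The modulus, the function names, and the salary example are assumptions chosen for illustration; real deployments would rely on a vetted multi-party computation library.

```python
import secrets
from typing import List

# Modulus for all share arithmetic: the Mersenne prime 2**61 - 1 (an assumption).
PRIME = 2**61 - 1


def split(secret: int, n: int) -> List[int]:
    """Split `secret` into `n` additive shares modulo PRIME."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in the range [0, PRIME)")
    if n < 2:
        raise ValueError("need at least two shares")
    shares = [secrets.randbelow(PRIME) for _ in range(n - 1)]
    # The last share is chosen so that all shares sum to the secret.
    shares.append((secret - sum(shares)) % PRIME)
    return shares


def reconstruct(shares: List[int]) -> int:
    """Recover the secret by summing all shares modulo PRIME."""
    return sum(shares) % PRIME


def add_shares(a: List[int], b: List[int]) -> List[int]:
    """Add two shared values share by share, without reconstructing either."""
    return [(x + y) % PRIME for x, y in zip(a, b)]


if __name__ == "__main__":
    salary_a = split(52000, 3)
    salary_b = split(61000, 3)
    total = reconstruct(add_shares(salary_a, salary_b))
    print(total)  # 113000
```

Each party only ever sees its own shares, so the sum of the two salaries can be computed without any single party learning either individual value.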