We Raised $8M Series A to Continue Building Experiment Tracking and Model Registry That “Just Works”

Using Differential Privacy to Build Secure Models: Tools, Methods, Best Practices

The COVID-19 pandemic in 2020 made us confront many challenges, some painful and others exposing the flaws of society. It also reminded us how important proper data is, and how often it is lacking.

This article by WHO states that “the collection, use, sharing and further processing of data can help limit the spread of the virus and aid in accelerating the recovery, especially through digital contact tracing.”

It further states that “Mobility data derived from people’s usage of mobile phones, emails, banking, social media, postal services, for instance, can assist in monitoring the spread of the virus and support the implementation of the UN System Organizations’ mandated activities.” WHO later released ethical considerations to guide the use of digital proximity tracking technologies for COVID-19 contact tracing.

This might sound like a simple answer, but it is a complex problem to solve. Medical data is among the most confidential kinds of data and, unlike most other personal information, can be used both for and against an individual. For example, a healthcare data breach may lead to COVID fraud scams, which we have heard of this past year.

In this article, we’ll stay in the lane of ML privacy: we’ll talk about certain issues and dive deeper into Differential Privacy (DP), one of the ways to address privacy issues. We’ll also list five open-source differential privacy libraries or tools that you can use or contribute to.

What is data and how is it created?

Data are the facts or statistics collected for reference or analysis.

We create data almost every day, both online and offline.

For example, patient health records kept by a hospital, student information kept by schools or colleges, internal company logs of employee information and project performance, or even simple note taking can all be considered offline data.

Data collected from online platforms or apps while connected to the internet is considered online data: posting a tweet, a YouTube video or a blog post, or mobile apps collecting user performance data, etc.

Privacy vs Security

Though sensitive personal data such as cancer patients’ records or contact tracing data may seem like a gold mine for data scientists and analysts, it also raises concerns about the methods used to collect such data. And who will ensure that the data will not be used for malicious purposes?

The terms “Privacy” and “Security” are often confused, but there’s a difference. Security controls “who” can access the data, whereas Privacy is more about “when” and “what” type of data can be accessed. “You can’t have privacy without security, but you can have security without privacy.”

For example, we’re all familiar with the term “Login Authentication and Authorization”. Here, authentication is about who can access the data, so it’s a matter of security. Authorization is all about what, when, and how much of the data is accessible to that specific user, so it’s a matter of privacy.

Private and Secure Machine Learning (ML)

The risks from data leaks and data misuse have led a lot of governments to legislate data protection laws. To abide by data privacy laws and to minimize risks, ML researchers have come forward with techniques for solving these privacy and security issues, called Private and Secure Machine Learning (ML).

As this blog post from PyTorch puts it:

“Private and secure machine learning (ML) is heavily inspired by cryptography and privacy research. It consists of a collection of techniques that allow models to be trained without having direct access to the data and that prevent these models from inadvertently storing sensitive information about the data.”

The same blog post lists some common techniques to cope with different privacy issues: 

Federated learning means training your ML model on data that is stored on different devices or servers across the world, without having to centrally collect the data samples.

Sometimes, AI models can memorize details about the data they’ve trained on and could ‘leak’ these details later on. Differential privacy is a framework for measuring this leakage and reducing the risk of it happening.

Homomorphic encryption lets you make your data unreadable, but you can still do computations on it.

Secure multi-party computation allows multiple parties to collectively perform some computation, and receive the resulting output without ever exposing any party’s sensitive input.

When two parties want to test if their datasets contain a matching value, but don’t want to ‘show’ their data to each other, they can use private set intersection (PSI) to do so.

  • Protecting the model

While Federated Learning and Differential Privacy can be used to protect data owners from loss of privacy, they’re not enough to protect a model from theft or misuse by the data owner. Federated learning, for example, requires the model owner to send a copy of the model to many data owners, putting the model at risk of IP theft or sabotage through data poisoning. Encrypted computation can be used to address this risk by allowing the model to train while in an encrypted state. The most well-known methods of encrypted computation are homomorphic encryption, secure multi-party computation, and functional encryption.

We’ll focus on differential privacy – let’s see how it works, and what tools you can use. 

What is Differential Privacy? 

“Differential Privacy describes a promise, made by a data holder, or curator, to a data subject (owner), and the promise is like this: You will not be affected adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, datasets or information sources are available.” 

Cynthia Dwork, “The Algorithmic Foundations of Differential Privacy”.

The intuition behind differential privacy is that “we limit how much the output can change if we change the data of a single individual in the database”. 

That is, if someone’s data is removed from the database, does the output of the same query change? If yes, then the chances that an adversary can analyze the difference and find some auxiliary information are high. Simply put – Privacy is compromised!

For example:

Adam would like to find the average age of the donors donating to his XYZ Organisation. At this point, this might seem OK! But this query can also potentially be used to find the age of one specific donor. Say, for simplicity, that 1000 people made a donation, and the average age of the donors was found to be 28 years.

Now, just by excluding John Doe’s data from the database and running the same query, let us assume that the average changes to 28.007 for the remaining 999 donors. From this, Adam could easily find that John Doe is 21 years old (1000*28 – 999*28.007 = 21.007). Similarly, Adam can repeat the process for other donors to find their actual ages.

Note: The result would be the same even if each of the donor ages were encrypted (e.g., homomorphic encryption), if Adam could reverse engineer and get their values.
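
The arithmetic in the example above can be reproduced in a few lines. This is only a sketch of the attack on exact (noise-free) averages; the function name `recover_removed_value` is made up for illustration, and the numbers are the ones from the donor example.

```python
def recover_removed_value(mean_with, n_with, mean_without):
    """Recover one record's value from two exact averages.

    mean_with    -- average over all n_with records
    mean_without -- average over the same records minus one individual
    """
    total_with = mean_with * n_with               # sum over all donors
    total_without = mean_without * (n_with - 1)   # sum without the target
    return total_with - total_without             # what the target contributed


# Numbers from the donor example: 1000 donors average 28,
# the 999 donors other than John Doe average 28.007.
john_age = recover_removed_value(mean_with=28.0, n_with=1000, mean_without=28.007)
print(round(john_age, 3))
```

Two harmless-looking aggregate queries are enough here; no individual record was ever queried directly.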

Differential privacy

In order to avoid such data leaks, we add a controlled amount of statistical noise to obscure the data contributions from individuals in the data set.

Meaning, the donors are asked to add a random value between, say, -100 and 100 to their original age before submitting it or before encrypting it. Say John Doe added -30 to his original age, i.e., 21: the age registered before encryption would be -9.

This might sound crazy?! But, interestingly, by the Law of Large Numbers in probability and statistics, when the average of this noisy data is taken over many donors, the noise (which averages to zero) largely cancels out, and the average obtained is close to the true average (the average of the data without the random noise).

Now, even if Adam were to reverse engineer John Doe’s age, -9 would not make any sense, thus preserving John Doe’s privacy while still allowing Adam to find the average age of the donors.
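
A quick simulation makes the noise-cancelling claim concrete. This is a minimal sketch under stated assumptions: the donor ages are synthetic, the population size of 100,000 is chosen so the effect is easy to see, and the noise is uniform on [-100, 100] as in the example. It illustrates the intuition only; bounded uniform noise like this is not claimed to meet a formal differential privacy guarantee.

```python
import random


def noisy_report(true_age, rng, noise_range=100.0):
    """Each donor adds zero-mean random noise to their age before reporting it."""
    return true_age + rng.uniform(-noise_range, noise_range)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic donor ages (an assumption for the demo, not real data).
true_ages = [rng.randint(18, 80) for _ in range(100_000)]
reported_ages = [noisy_report(age, rng) for age in true_ages]

true_mean = sum(true_ages) / len(true_ages)
noisy_mean = sum(reported_ages) / len(reported_ages)

print(f"true mean:  {true_mean:.2f}")
print(f"noisy mean: {noisy_mean:.2f}")
print(f"one donor:  true={true_ages[0]}, reported={reported_ages[0]:.1f}")
```

The individual reports are nearly meaningless on their own, while the two means stay close because the zero-mean noise averages out over many donors. With only a handful of donors the gap would be much larger.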

In other words, Differential privacy is not a property of databases, but a property of queries. It helps provide the OUTPUT Privacy i.e., how much insight can someone gain on the input by reverse engineering from the output.

In the case of AI model training, noise is added while ensuring that the model still gains insight into the overall population, and thus provides predictions that are accurate enough to be useful – at the same time making it tough for anyone to make any sense from the data queried.
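
The article does not say how the noise is added during training. One widely used recipe is DP-SGD: clip each example's gradient to a maximum norm, sum the clipped gradients, and add Gaussian noise scaled to that norm. The sketch below assumes that recipe and uses plain Python lists instead of a real ML framework; the names `clip_gradient`, `private_gradient`, `max_norm` and `noise_multiplier` are illustrative choices, not taken from the article.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a per-example gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Average per-example gradients after clipping and adding Gaussian noise."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise is calibrated to max_norm, the most any single example can contribute.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients
update = private_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
print(update)
```

Clipping bounds how much any single training example can move the model, and the noise hides what remains of that bounded contribution. A larger `noise_multiplier` gives more privacy and a less accurate update.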

Note: for more details about differential privacy, check out my Differential Privacy Basics Series.

Who’s using Differential Privacy?

Top tech companies, such as the FAANGs and IBM, are using differential privacy, and they often release open-source tools and libraries.

The most interesting examples are:

  1. RAPPOR, where Google used local differential privacy to collect data from users, such as other running processes and Chrome home pages.
  2. Private Count Mean Sketch (and variants), where Apple used local differential privacy to collect emoji usage data, word usage and other information from iPhone users (iOS keyboard).
  3. The “Privacy-preserving aggregation of personal health data streams” paper, which develops a novel mechanism for privacy-preserving collection of personal health data streams (temporal data collected at fixed intervals) by leveraging local differential privacy (Local DP).
  4. “Census Bureau Adopts Cutting Edge Privacy Protections for 2020 Census”, i.e., the US Census will use differential privacy to anonymize the data before publication.

And many more. To know more, check out my Local vs Global DP blog from the Differential Privacy Basics Series.
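
To give a feel for local differential privacy, here is a sketch of classic randomized response, the coin-flipping idea that mechanisms such as RAPPOR build on. It is not RAPPOR or Apple's algorithm, just the simplest local mechanism, run on a synthetic population in which 30% of users have the sensitive attribute.

```python
import random


def randomized_response(truth, rng):
    """Answer a yes/no question with plausible deniability.

    First coin: heads -> answer truthfully.
    Tails -> flip a second coin and report that instead.
    """
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(responses):
    """Undo the known bias: P(yes) = 0.5 * p + 0.25, so p = 2 * (P(yes) - 0.25)."""
    observed = sum(responses) / len(responses)
    return 2 * (observed - 0.25)


rng = random.Random(1)
population = [True] * 30_000 + [False] * 70_000  # synthetic: 30% say "yes"
responses = [randomized_response(t, rng) for t in population]
estimate = estimate_true_rate(responses)
print(f"estimated rate: {estimate:.3f}")
```

Each user randomizes on their own device, so the collector never sees a trustworthy individual answer, yet the population rate can still be estimated. With these coin probabilities a "yes" is three times as likely from a true "yes" as from a true "no", which corresponds to ε = ln 3 per answer.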

Do we really need it? Why does it matter?

From the above example of Adam attempting to find the average age of the donor, it can be seen that encryption alone cannot protect the individual’s data Privacy as de-anonymization is possible.

One such real-world example would be the de-anonymization of the Netflix Prize Dataset where an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset. Using the Internet Movie Database (IMDb) as the source of background knowledge, it was possible to successfully identify the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.

Sometimes, Machine Learning models can also inadvertently memorize individual samples due to the over-parameterization of deep-neural networks leading to unwanted data leaks.

For example, a language model designed to emit predictive text (such as next-word suggestions seen on smartphones) can be probed to release information about individual samples that were used for training (“my ID is …”).

Research in this field lets us calculate the degree of privacy loss, and evaluate it based on the concept of a privacy ‘budget’. Ultimately, the use of differential privacy is a careful tradeoff between privacy preservation and model utility or accuracy.  
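
The privacy/utility tradeoff can be seen with the Laplace mechanism, a standard way to answer a numeric query with ε-differential privacy. The article does not name a mechanism, so treat this as one illustrative choice: noise with scale sensitivity/ε is added to a counting query (sensitivity 1), and the average error is measured for two values of ε. The true count of 1000 and the helper names are assumptions for the demo.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(epsilon, trials=5000, seed=0):
    """Average absolute error of the noisy count over many releases."""
    rng = random.Random(seed)
    true_count = 1000
    errors = [abs(private_count(true_count, epsilon, rng) - true_count)
              for _ in range(trials)]
    return sum(errors) / trials


for eps in (0.1, 1.0):
    print(f"epsilon={eps}: mean abs error = {mean_abs_error(eps):.2f}")
```

A smaller ε means a stronger privacy guarantee but a noisier answer; the expected absolute error is 1/ε for this query.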

Differential Privacy best practices

1. It is always important to know what Differential Privacy does and does not promise. (ref: Cynthia Dwork, “The Algorithmic Foundations of Differential Privacy”)

  • Differential privacy promises to protect individuals from any additional harm that they might face due to their data being in the private database x that they would not have faced had their data not been part of x.
  • Differential privacy does not guarantee that what one believes to be one’s secrets will remain secret. That is, it promises to make the data differentially private and not disclose it BUT not to protect it from attackers! 
    Ex: Differential attack is one of the most common forms of privacy attack.
  • It merely ensures that one’s participation in a survey will not in itself be disclosed, nor will participation lead to the disclosure of any specifics that one has contributed to the survey if kept differentially private.

2. Differential privacy is not a property of databases, but a property of queries. (as mentioned earlier)

3. The amount of noise added matters: the more noise is added to make the data private, the lower the model utility or accuracy.

4. Know the limitations, such as:

  • Differential privacy has been widely explored by academia and the research community, but less so in industry, due to its strong privacy guarantee.
  • If the original guarantee is to be kept across k queries, noise must be injected k times. When k is large, the utility of the output is destroyed.
  • A series of queries needs more noise and eventually exhausts the privacy budget, which might ultimately lead to “killing the user”: as soon as the privacy budget is exhausted, that user is no longer allowed to ask any more queries. And if you start to allow for collusion between users, it becomes unclear what the privacy budget means on a per-user basis.
  • Differential Privacy alone cannot protect user data as privacy attacks can happen!

Varied approaches can, of course, be taken to work around these limitations, but that’s beyond the scope of this blog.
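
The budget-exhaustion limitation above can be sketched with a tiny accountant. It assumes basic sequential composition, where the ε values of successive queries simply add up; the class name `PrivacyBudget` and its methods are hypothetical and not taken from any of the libraries below.

```python
class PrivacyBudget:
    """Track cumulative epsilon for one user under sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self):
        return max(0.0, self.total - self.spent)

    def spend(self, epsilon):
        """Charge one query; refuse it if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:  # small float tolerance
            raise RuntimeError("privacy budget exhausted: query refused")
        self.spent += epsilon


budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.4)      # first query
budget.spend(0.4)      # second query
try:
    budget.spend(0.4)  # third query would exceed the total of 1.0
except RuntimeError as err:
    print(err)
print(f"remaining: {budget.remaining:.1f}")
```

Once the budget is gone the user is locked out, which is the “killing the user” situation described in the list. The sketch tracks a single user and does nothing about colluding users who pool their separate budgets.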

5 open-source Differential Privacy libraries/tools (alphabetical order)

1. Facebook – Opacus

Facebook’s Opacus is a library for anyone who would like to train a model with differential privacy with minimal code changes, or to quickly prototype their ideas in PyTorch or pure Python code. It is also well documented, and since it is an open-source library, you may also contribute to its code base if interested.

Join the PyTorch Forum to drop in any questions.


2. Google – Differential Privacy or TensorFlow Privacy

Google provides two open-source libraries (or repositories) for Differential Privacy.

The Differential Privacy repository contains 3 building block libraries to generate ε- and (ε, δ)-differentially private statistics over datasets, supported in C++, Go, and Java, and suitable for research, experimental, or production use cases.

Other tools it provides, such as Privacy on Beam, a stochastic tester, a differential privacy accounting library, and a command line interface for running DP queries with ZetaSQL, are fairly experimental.

Just like Opacus, this repository is open to contributions and has a public discussion group.

TensorFlow Privacy could be called the TensorFlow counterpart of the PyTorch Opacus library mentioned above, with implementations of TensorFlow optimizers for training machine learning models with differential privacy. It too accepts contributions and is well documented.


3. IBM – Diffprivlib v0.4

Yet another general-purpose library for experimenting with, investigating and developing applications in differential privacy, such as exploring the impact of differential privacy on machine learning accuracy using classification and clustering models. It is intended for expert-level users with knowledge of Differential Privacy.


4. OpenMined – PyDP

This is a Python wrapper around Google’s C++ Differential Privacy library, providing a set of ε-differentially private algorithms used to produce aggregate statistics over numeric data sets containing private or sensitive information. It’s now supported by the PySyft library of OpenMined.

The PyDP team is actively recruiting members for further development of the library beyond Google’s Differential Privacy library.

Join OpenMined Slack #lib_pydp to interact with the team and start contributing.


5. Harvard and Microsoft –  OpenDP – SmartNoise Core Differential Privacy

This is a collaboration between the SmartNoise Project and OpenDP to bring academic knowledge to practical real-world deployments.

It provides differentially private algorithms and mechanisms for releasing privacy-preserving queries and statistics, as well as APIs for defining an analysis and a validator for evaluating these analyses and composing the total privacy loss on a dataset.

It is also open to contributions, so feel free to join in if interested.


Bonus – Uber – sql-differential-privacy

Though this project is deprecated and not maintained, it can be used for educational purposes. It was built as a query analysis and rewriting framework to enforce differential privacy for general-purpose SQL queries.

Summary

As you can see, differential privacy is an important topic in today’s data science landscape, and it’s something that all of the top tech giants are concerned with.

No wonder, because in a world that runs on data, it’s in our interest to do the best we can to protect that data.

If you’re interested in updates about differential privacy, follow me on Twitter, and give Neptune a follow too, if you haven’t done it yet.

Thanks for reading!

Researcher and RecSys Team Member @OpenMined
