A small disclaimer

This work is in Alpha stage and should not be used in production.

What is Differential Privacy?

Privacy seems like a loosely defined concept, one that resides outside the scope of mathematical rigor. This was in fact true for a long time. In 2006, however, Cynthia Dwork et al. released a seminal paper formalizing privacy in terms of ‘plausible deniability’: by injecting very carefully selected noise into computations performed on a dataset, we can bound what anyone can learn about any individual in it. This framework is called Differential Privacy.

Without delving too deep, Differential Privacy presents a framework for asking statistical questions about a dataset while provably maintaining the privacy of the entities within that dataset. We can ask a question like:

What is the average age of the people reading this post?

You’re reading this post, so how old are you? Whether or not you answer, our differentially private system will compute the true response and add a very carefully selected amount of noise on top. Even though you are reading this post, and have therefore contributed to the average age, this carefully selected noise ensures that no one can prove that you have read it. You have plausible deniability.

If we can ensure that this property holds for not just yourself, but every single person to have read this post, we have Differential Privacy.
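
For readers who want the formal statement (this is the standard definition from Dwork et al., not anything specific to Parallax): a randomized mechanism M satisfies ε-differential privacy if, for every pair of datasets D and D′ that differ in a single individual’s data, and for every set of possible outputs S,

\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]

In other words, no single person’s presence or absence changes the probability of any outcome by more than a factor of e^ε. That bounded factor is exactly the plausible deniability described above.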

Parallax

Ok, so this sounds cool! Using differential privacy we can do privacy-preserving analytics on our database. This means we can give an analyst access to a database without them ever seeing any raw or sensitive data! It also opens up new use cases: suddenly you can give third parties access to your database without having to build bespoke pipelines to anonymise data, and without infringing on your clients’ privacy!

At OpenQuery our goal is to make state-of-the-art privacy available to non-experts. Since its conception, differential privacy has been largely unapproachable, leaving companies and individuals with an all-or-nothing choice when granting access to information (e.g. RBAC). To change this, we have been developing an open-source Virtual Private Data Warehouse called Parallax. It sits between your databases and the people asking questions about your data. From the outside, Parallax looks just like a plain old SQL database. It integrates with your favorite BI tools (coming soon 🤷‍♂️) so you don’t have to change your workflows!

Parallax provides a differentially private entrypoint to your sensitive datasets

More than that, Parallax will scale to meet your data processing requirements and allow you to join data across private datasets. You can find the full source for Parallax here.

Step 0 — Prerequisites

Some things you need to follow along:

  • Docker

  • A GCP account

  • A BigQuery dataset. We’ll be using a public dataset for this demo, but you will also need a separate dataset of your own which Parallax will use as a cache (one way to create it is shown after this list). Our main dataset will be cms_synthetic_patient_data_omop, a synthetic dataset released by the Centers for Medicare & Medicaid Services (CMS).
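
If you don’t already have a dataset to use as the cache, a minimal way to create one (assuming you have the bq CLI installed and authenticated; the project and dataset names below are the placeholders used later in this post):

$ bq mk --dataset <my-staging-project-id>:<my-staging-dataset-id>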

Step 1 — Install Parallax

To install Parallax simply pull and run the docker image from our container registry:

$ docker run -it --rm eu.gcr.io/openquery/parallax

The image will start a single parallax-worker node in the background. This lets you issue commands against it using the parallax CLI. The first thing you will want to do is initialize the deployment with:

$ /opt/work# parallax init

That’s it! We’re up and running.

Step 2 — Connect Parallax to BigQuery

Parallax uses state files: a declarative configuration language which specifies everything that Parallax understands. (This is very similar in principle to Terraform.)

First we need to pull the existing state from Parallax. This state was created automatically when you ran parallax init and it contains your root user resource and the root group, called wheel. The root account is a super_user, meaning that root can change the state of Parallax as well as query it.

To pull the state run:

$ /opt/work# parallax --disable-tls gov fetch --output=state.yaml

The --disable-tls flag is required because the CLI uses TLS by default, but your instance is currently running locally, where TLS hasn’t been set up.

This results in a state.yaml file in your working directory which looks a bit like:

- user:
    name: root
    email: ""
    public_keys:
      - "..."
    auth_provider: ~
    primary_group: resource.group.wheel
    super_user: true
- group:
    name: wheel
    members:
      - resource.user.root

(To modify text files inside the container you can apt-get install vim or apt-get install nano.)

To add BigQuery as a backend, let’s create a file in the same working directory called backends.yaml. You will also need to download a GCP service account key which has the roles “BigQuery Job User” and “BigQuery Read Session User”.
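
If you haven’t created such a key before, the sketch below shows one way to do it with the gcloud CLI. The service account name parallax-demo and the key path are placeholders; adapt them to your setup.

$ gcloud iam service-accounts create parallax-demo
$ gcloud projects add-iam-policy-binding <my-staging-project-id> \
    --member="serviceAccount:parallax-demo@<my-staging-project-id>.iam.gserviceaccount.com" \
    --role="roles/bigquery.jobUser"
$ gcloud projects add-iam-policy-binding <my-staging-project-id> \
    --member="serviceAccount:parallax-demo@<my-staging-project-id>.iam.gserviceaccount.com" \
    --role="roles/bigquery.readSessionUser"
$ gcloud iam service-accounts keys create /path/to/secret/key.json \
    --iam-account="parallax-demo@<my-staging-project-id>.iam.gserviceaccount.com"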

To create a resource we specify the resource type, in this case backend, along with all the other fields Parallax requires to locate the backend.

- backend:
    big_query:
      name: "synthetic_patient_data"
      service_account_key: "/path/to/secret/key.json"
      project_id: "bigquery-public-data"
      dataset_id: "cms_synthetic_patient_data_omop"
      staging_project_id: "<my-staging-project-id>"
      staging_dataset_id: "<my-staging-dataset-id>"

The staging fields tell Parallax where to store intermediate results while computing. The staging dataset can live in the same project, but you should pick a dataset separate from your source data.

We can now add the backend to Parallax by running:

$ /opt/work# parallax --disable-tls gov apply

Parallax recursively scans all yaml files in your working directory and treats them as state files. You should see a state mutation plan like so:

An example of a state mutation plan

Finally, just type y.

Step 3 — Virtual Dataset

We’ve defined a backend, but we can’t really interact with it yet. The Virtual Dataset is the main abstraction that external stakeholders interact with. It consists of two main components:

  1. A list of columns from the underlying backend (in this case our synthetic_patient_data backend)

  2. A set of policies stating how a group in Parallax interacts with the backend.

Let’s create a Virtual Dataset in a new file called datasets.yaml:

- dataset:
    name: "patient_data"
    data:
    - resource.backend.big_query.synthetic_patient_data.data["concept"]
    entities: []
    policy_bindings:
    - name: "wheel_policies"
      groups:
      - resource.group.wheel
      policies:
      - policy.whitelist.whitelist_concept
      priority: 10
    policies:
    - whitelist:
        name: "whitelist_concept"
        fields:
        - "concept_id"
        - "concept_name"

Since our concept table has no sensitive information, the Virtual Dataset simply allows unfettered access to columns concept_id and concept_name in the underlying dataset.

We then need to re-apply the new state:

$ /opt/work# parallax --disable-tls gov apply

Congratulations! You now have your first Virtual Dataset.

Step 4 — Query Parallax

At this stage, Parallax is simply acting as a proxy to our underlying BigQuery dataset. To query Parallax via the CLI we can use the jobs API:

$ /opt/work# cat <<EOF | parallax --disable-tls jobs query
SELECT concept_id, concept_name
FROM patient_data.concept
LIMIT 10
EOF

If you get an error like:

Error: reason: an error occurred dealing with one of the backends involved in this job (details: Backend(BackendError { kind: Io, source: "", description: "No such file or directory (os error 2)" }))

Parallax most likely can’t find your GCP service account key. Make sure the path in backends.yaml is correct.
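
As a quick sanity check on the whitelist policy from Step 3, a query touching a column outside the whitelist should be refused. This is a hypothetical example (the exact rejection message is Parallax’s own); domain_id is a real column of the OMOP concept table that we did not whitelist:

$ /opt/work# cat <<EOF | parallax --disable-tls jobs query
SELECT domain_id
FROM patient_data.concept
LIMIT 10
EOF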

Step 5 — Virtual Private Dataset

A Virtual Private Dataset (VPD) is a Virtual Dataset as above, but with a Differentially Private policy. This is where the magic happens.

First let’s modify our datasets.yaml:

- dataset:
    name: "patient_data"
    data:
    - resource.backend.big_query.synthetic_patient_data.data["death"]
    - resource.backend.big_query.synthetic_patient_data.data["concept"]
    entities: []
    policy_bindings:
    - name: "wheel_policies"
      groups:
      - resource.group.wheel
      policies:
      - policy.differential_privacy.patients_have_differential_privacy
      - policy.whitelist.whitelist_concept
      priority: 10
    policies:
    - differential_privacy:
        name: "patients_have_differential_privacy"
        entity: "person_id"
        epsilon: 1
        bucket_size: 15
    - whitelist:
        name: "whitelist_concept"
        fields:
        - "concept_id"
        - "concept_name"

There’s quite a lot going on here, so let’s break it down:

  1. priority — used to disambiguate between policies. If a query matches two policies, Parallax uses priority to decide which one takes precedence.

  2. The entity field tells Parallax which column is the primary identifier of the entity we are trying to protect, in this case a person. The entity could just as well be a company, a vehicle, etc.

  3. epsilon — One thing to remember about differential privacy is that we are still releasing information about the underlying entities; we are just very careful about how we go about doing it. There is always a trade-off between the usability of the information released and the privacy of the underlying entities. epsilon is a value between 0 and +∞ which scales the amount of noise added to query results (see the formula after this list): a value of 0 means perfect privacy, i.e. no information is released, while +∞ means perfect information, i.e. no noise is added. In academia we see values of epsilon ranging from 0.01 all the way to 7. We have an upcoming article on choosing values of epsilon, even though it is more art than science at this stage.

  4. bucket_size — The bucket_size is a threshold that prevents leakage of sensitive information through the keys of a GROUP BY. The intuition is that a single individual’s contribution can change which groups appear. The bucket size is the minimum number of entities that must have contributed to a group for that group to appear in the output of the query.
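
To make the role of epsilon concrete, here is the textbook Laplace mechanism (background only; Parallax’s exact calibration may differ): for a query f with sensitivity Δf, the largest amount any one entity can change the true answer, the mechanism releases

M(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right)

The noise scale is inversely proportional to epsilon, so halving epsilon doubles the typical noise.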

We can now fire a differentially private query against Parallax:

$ /opt/work# cat <<EOF | parallax --disable-tls jobs query
SELECT death_date, COUNT(death_date)
FROM patient_data.death
GROUP BY death_date
EOF
+------------+--------------------+
| death_date | f1_                |
+------------+--------------------+
| 2009-10-01 | 3049.9739256459347 |
+------------+--------------------+
| 2009-04-01 | 3014.2262523270997 |
+------------+--------------------+
| 2010-01-01 | 3075.405497481715  |
+------------+--------------------+
| 2010-02-01 | 2994.98778172814   |
+------------+--------------------+
| 2009-02-01 | 3055.7674453237755 |
+------------+--------------------+
| 2010-11-01 | 3051.6319423704194 |
+------------+--------------------+
| 2008-12-01 | 2972.5665251618207 |
+------------+--------------------+
| 2010-09-01 | 2898.6731450982707 |
+------------+--------------------+
| 2008-03-01 | 2980.9890624515233 |
+------------+--------------------+
| ...        | ...                |
+------------+--------------------+

Behind the scenes

You may have noticed this query took a bit longer than usual to run, and for good reason. Behind the scenes, Parallax is computing the domain of the columns in question and gathering their statistics. This makes sense: the amount of noise that needs to be added depends on the context of the underlying data. For example, if our dataset contains human ages, with a domain of roughly 0–100 years in increments of 1, the amount of noise to add is different than if we were considering account balances, which could span roughly -$1,000 to +$1,000,000 in increments of $0.01.
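
As a back-of-the-envelope illustration (assuming a SUM aggregate under the textbook Laplace mechanism, not necessarily Parallax’s exact calibration): a single record can change a SUM by at most the largest magnitude in the column’s domain, so

\Delta f_{\text{ages}} \approx 100 \quad\Rightarrow\quad \text{noise scale } 100/\varepsilon

\Delta f_{\text{balances}} \approx 1{,}000{,}000 \quad\Rightarrow\quad \text{noise scale } 1{,}000{,}000/\varepsilon

This is why the domain must be established before the noise can be calibrated.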

In our case Parallax does this by probing BigQuery with queries like the following:

-- The domain (min and max) of the protected entity column
SELECT MAX(person_id), MIN(person_id)
FROM `bigquery-public-data`.`cms_synthetic_patient_data_omop`.`death`

-- The total number of rows in the table
SELECT COUNT(*)
FROM `bigquery-public-data`.`cms_synthetic_patient_data_omop`.`death`

-- The most rows any single person contributes, which bounds how much
-- one entity can influence the result
SELECT MAX(freq)
FROM (
  SELECT person_id AS key, COUNT(*) AS freq
  FROM `bigquery-public-data`.`cms_synthetic_patient_data_omop`.`death`
  GROUP BY person_id
)

...

And so on.

After these queries have finished executing and the domain of the columns has been established, our original query:

SELECT death_date, COUNT(death_date)
FROM patient_data.death
GROUP BY death_date

is rewritten into a Differentially Private equivalent:

SELECT death_date, f1_ FROM (
  SELECT * FROM (
    SELECT
      death_date AS death_date,
      COUNT(death_date) + 0 - 1 * (SIGN(RAND() - 0.5) * LN(1 - 2 * ABS(RAND() - 0.5))) AS f1_,
      COUNT(1) + 0 - 1 * (SIGN(RAND() - 0.5) * LN(1 - 2 * ABS(RAND() - 0.5))) AS __bucket_count
    FROM `bigquery-public-data`.`cms_synthetic_patient_data_omop`.`death`
    GROUP BY death_date
  )
  WHERE __bucket_count > 15
)
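
The SIGN/LN expression is the standard inverse-CDF trick for sampling Laplace noise in plain SQL: if U is uniform on (−1/2, 1/2), then

-b \cdot \operatorname{sgn}(U) \cdot \ln(1 - 2\lvert U \rvert) \sim \mathrm{Lap}(0, b)

Here RAND() - 0.5 plays the role of U, and the scale b is 1, consistent with a counting query at epsilon = 1.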

Result Accuracy

Below we can see how different values of epsilon affect the accuracy of the result for our query:

[Three charts comparing the accuracy of the query result at different values of epsilon]

Final remarks

As I mentioned at the beginning, Parallax is very much alpha-stage tech. If you use it a bit you will notice that only a small subset of all possible queries is supported, but we are working hard to expand its functionality and correctness.

If you would like your use case covered or your backend supported, feel free to open an issue on our repo.

Fin.