
Creating a synthetic version of a real dataset to facilitate data sharing

How to make a synthetic dataset which mimics the characteristics of a real dataset

I recently started live-streaming the creation of a tutorial paper describing how to create synthetic versions of real datasets, which can be shared in place of the originals to protect participant privacy.

Here's me in a previous blogpost on why I'm interested in creating synthetic datasets:

First, readers and reviewers can better understand the data, as they can explore dataset properties such as distributions, variance, and outliers. Second, other researchers can explore the data and fit hypothesis-generating models, which can be verified by the original authors using the real dataset. Third, synthetic datasets can be used to accompany analysis scripts, which can be difficult to follow without the dataset. Finally, the publication of a synthetic dataset would incentivize scholars to carefully audit their analysis scripts, given that others might inspect them and attempt to recreate the published analysis.

One open dataset I recently explored is linked with this paper on the effects of oxytocin administration on receiving help, which used a clever experimental design. In short, participants were randomized to self-administer either oxytocin or placebo nasal spray, after which they were instructed to complete several computer-based tasks. Participants completed these tasks individually in groups of two, but one member of each pair was a research confederate.

After three trials, the experimenters programmed the computer to appear to crash. Confederates were randomized to either help the genuine participant with their computer problem (which they magically fixed) or to not help them. In the non-help condition, the experimenter eventually "fixed" the issue.

Among a few other outcomes, the researchers were interested in observer-rated reactions to receiving help. That is, how happy participants were to have received help and how grateful they appeared towards the helper.

The authors reported a significant interaction effect of treatment condition (oxytocin vs. placebo) and whether participants received help or not on observer-rated gratitude. That is, participants in the oxytocin condition expressed greater observer-rated gratitude in response to receiving help [F(1, 63) = 4.32, p < .05]. Participants in the oxytocin condition also expressed greater observer-rated happiness, but this effect was on the border of conventional statistical significance [F(1, 63) = 3.73, p < .06].

Before we get into the analyses, let's load up our required packages and import our dataset, which we're going to call 'h_dat'.

Using the open dataset and analysis script, I was able to recreate these two outcomes in R.

To draw inferences from the synthesized data, the synthpop package provides a function for estimating linear regression models, lm.synds(). Although the original analyses used ANOVAs, these can be easily converted into equivalent linear regression models. Here, I'm going to demonstrate this for the happiness outcome.
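To see why the ANOVA and the regression are interchangeable here, consider this minimal sketch in base R. It uses simulated stand-in data, not the study's data, and the column names (treatment, help_cond, happiness) are hypothetical. For a 2 x 2 between-subjects design, the interaction F statistic from the ANOVA should equal the square of the interaction t statistic from the regression.

```r
# Simulated stand-in for a 2 x 2 between-subjects design.
# All names and values are hypothetical, not taken from the open dataset.
set.seed(2019)
n_per_cell <- 17
sim_dat <- expand.grid(
  id = seq_len(n_per_cell),
  treatment = c("placebo", "oxytocin"),
  help_cond = c("no_help", "help")
)
sim_dat$happiness <- rnorm(nrow(sim_dat), mean = 4, sd = 1)

# The same model, fitted as an ANOVA and as a linear regression
fit_aov <- aov(happiness ~ treatment * help_cond, data = sim_dat)
fit_lm <- lm(happiness ~ treatment * help_cond, data = sim_dat)

# Third row of the ANOVA table is the treatment x help_cond interaction
f_interaction <- summary(fit_aov)[[1]][["F value"]][3]

# Fourth regression coefficient is the interaction term
t_interaction <- coef(summary(fit_lm))[4, "t value"]

# With a single-df interaction, F equals t squared
all.equal(f_interaction, t_interaction^2)
```

Because the interaction has one degree of freedom, the two tests also yield the same p-value, so the regression output can be read in place of the ANOVA table for this effect.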

Now let's create our synthetic dataset and look at how it stacks up against our original dataset.
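As a rough sketch of this step, the code below synthesizes a small simulated stand-in dataset with synthpop's syn() function and compares it against the source data with compare(). The stand-in data frame (demo_dat) and its column names are assumptions for illustration only; with the real data you would pass 'h_dat' instead.

```r
library(synthpop)

# Simulated stand-in data; names and values are hypothetical,
# used only so that this sketch runs without the open dataset.
set.seed(2019)
n <- 68
demo_dat <- data.frame(
  treatment = factor(sample(c("placebo", "oxytocin"), n, replace = TRUE)),
  help_cond = factor(sample(c("no_help", "help"), n, replace = TRUE)),
  happiness = round(rnorm(n, mean = 4, sd = 1), 1),
  gratitude = round(rnorm(n, mean = 4, sd = 1), 1)
)

# Generate one synthetic dataset (syn() uses CART-based synthesis by default).
# Setting a seed makes the synthesis reproducible.
syn_obj <- syn(demo_dat, seed = 2019, print.flag = FALSE)

# The synthetic data frame itself
h_syn <- syn_obj$syn

# Plot and tabulate synthetic vs. observed distributions for each variable
compare(syn_obj, demo_dat)
```

The synthetic data frame has the same columns and, by default, the same number of rows as the data it was generated from.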

Here's the output from this analysis, which visualizes the distribution of our four variables of interest. We can see that synthetic counts (dark blue) are similar to the observed counts (light blue).

Let's now construct a linear regression model from our synthesized dataset and compare this model against the model from the observed dataset.
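A minimal sketch of that comparison, again assuming a simulated stand-in dataset with hypothetical column names rather than the real data, might look like this. It fits the model to the synthetic data with lm.synds() and then passes the fitted object and the source data to compare(), which refits the same model on the source data.

```r
library(synthpop)

# Simulated stand-in data; names and values are hypothetical.
set.seed(2019)
n <- 68
demo_dat <- data.frame(
  treatment = factor(sample(c("placebo", "oxytocin"), n, replace = TRUE)),
  help_cond = factor(sample(c("no_help", "help"), n, replace = TRUE)),
  happiness = round(rnorm(n, mean = 4, sd = 1), 1)
)

# Synthesize the stand-in data
syn_obj <- syn(demo_dat, seed = 2019, print.flag = FALSE)

# Fit the regression (equivalent to the 2 x 2 ANOVA) on the synthetic data
fit_syn <- lm.synds(happiness ~ treatment * help_cond, data = syn_obj)

# Refit the same model on the source data and compare coefficient estimates;
# the printed summary includes the confidence interval overlap per coefficient
model_check <- compare(fit_syn, demo_dat)
print(model_check)
```

Since the stand-in data are random noise, the numbers printed by this sketch will not match the values reported for the real dataset.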

When we compare our models, we can see that there's no significant difference between our synthetic and observed data for the interaction (p = .22 with a CI overlap of 0.67). This code also generates a useful plot.

In sum, the authors of the observed dataset could have shared a synthetic dataset in the place of the observed dataset if there were privacy concerns because the model in the synthetic dataset closely approximated the model in the observed data. Of course, it's always better to share the real data, but a synthetic dataset is better than no dataset.


UPDATE: A preprint of this tutorial is now available, along with an RStudio Server instance of the primary analysis and results, which recreates the complete computational environment used for this paper.