Generating Synthetic Data Sets with ‘synthpop’ in R

Synthpop – a great music genre and an aptly named R package for synthesising population data. I recently came across this package while looking for an easy way to synthesise unit record data sets for public release. The goal is to generate a data set which contains no real units and is therefore safe for public release, yet retains the structure of the original data, so that any inference drawn from it returns the same conclusions as the original. This post is a quick look at synthesising data, some challenges that can arise from common data structures and some things to watch out for.

Synthesising data

This example will use the same data set as the synthpop documentation and will cover similar ground, in an abridged form with a few additions that weren't mentioned there. SD2011 contains 5000 observations on 35 variables describing social characteristics of Poland. A subset of 12 of these variables is considered here.

The objective of synthesising data is to generate a data set which resembles the original as closely as possible, warts and all, which means also preserving the missing-value structure. There are two ways to deal with missing values: 1) impute/treat missing values before synthesis, or 2) synthesise the missing values and deal with them later. The second option is generally better, since the purpose the data will ultimately support may influence how the missing values should be treated.

Missing values can be simply NA or some numeric code specified by the collection. Usefully, the syn function allows for different NA types; for example the income, nofriend and nociga variables use -8 as a missing-value code. A list is passed to the function in the following form.
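
A minimal sketch, assuming the 12-variable subset is stored in a data frame called dat (the variable names are from SD2011; cont.na is the relevant syn argument):

    library(synthpop)

    # -8 codes a missing value for these continuous variables
    cont.na.list <- list(income = -8, nofriend = -8, nociga = -8)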

By not including this, the -8's would be treated as genuine numeric values and could distort the synthesis.

After synthesis, there is often a need to post-process the data to ensure it is logically consistent. For example, anyone who is married must be over 18, and anyone who doesn't smoke shouldn't have a value recorded for 'number of cigarettes consumed'. These rules can be applied during synthesis rather than requiring ad hoc post-processing.

The variables appearing in a rule's condition need to be synthesised before the rule is applied, otherwise the function will throw an error. In this case age should be synthesised before marital, and smoke before nociga.
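
A sketch of how these rules can be passed, via syn's rules and rvalues arguments (the level names SINGLE and NO are assumptions based on SD2011's coding):

    # anyone under 18 must be single; non-smokers get the missing-value
    # code for number of cigarettes
    rules.list   <- list(marital = "age < 18", nociga = "smoke == 'NO'")
    rvalues.list <- list(marital = "SINGLE", nociga = -8)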

There is one person with a bmi of 450.

Their weight is missing from the data set, and it would need to be known for this value to be checked. I don't believe it is correct! So, any bmi over 75 (which is still very high) will be treated as a missing value and corrected before synthesis.
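
A simple correction along these lines (dat as above):

    # treat implausible bmi values (> 75) as missing before synthesis
    dat$bmi[!is.na(dat$bmi) & dat$bmi > 75] <- NA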

Consider a data set with $p$ variables $x_1, x_2, \dots, x_p$. In a nutshell, synthesis follows these steps:

  1. Take a simple random sample of $x_1$ and set it as $x_1^{syn}$.

  2. Fit the model $f(x_2 \mid x_1)$ and draw $x_2^{syn}$ from $f(x_2 \mid x_1^{syn})$.

  3. Fit the model $f(x_3 \mid x_1, x_2)$ and draw $x_3^{syn}$ from $f(x_3 \mid x_1^{syn}, x_2^{syn})$.

  4. And so on, until $x_p^{syn}$ is drawn from $f(x_p \mid x_1^{syn}, \dots, x_{p-1}^{syn})$.

The data can now be synthesised using the following code.
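
A sketch of the call, pulling together the lists defined above (the seed is arbitrary; the default visit sequence is assumed to place age before marital and smoke before nociga):

    # synthesise the subset, passing the missing-value codes and the
    # consistency rules
    mysyn <- syn(dat,
                 rules   = rules.list,
                 rvalues = rvalues.list,
                 cont.na = cont.na.list,
                 seed    = 2018)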

The compare function allows for easy checking of the synthesised data.
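
For example, comparing each synthesised variable with its original (mysyn and dat as above):

    # tabulates and plots the synthetic distributions against the observed ones
    compare(mysyn, dat)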

Solid. The distributions are very well preserved. Did the rules work on the smoking variable?

They did. All non-smokers have missing values for the number of cigarettes consumed.

compare can also be used for model output checking. A logistic regression model will be fit to find the important predictors of depression. The depression score ranges from 0-21 and will be converted to a binary indicator:

  • 0-7 – no evidence of depression (0)

  • 8-21 – evidence of depression (1)

This split leaves 3822 (0)’s and 1089 (1)’s for modelling.
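
A sketch of this check using synthpop's glm.synds and compare; the predictor set here is illustrative (the post doesn't list the exact model), and depress is the depression score in SD2011:

    # binary indicator: evidence of depression if the score is 8 or more
    dat$dep       <- as.integer(dat$depress >= 8)
    mysyn$syn$dep <- as.integer(mysyn$syn$depress >= 8)

    # fit the model to the synthetic data, then compare the coefficients and
    # confidence intervals with the same model fit to the observed data
    fit.syn <- glm.synds(dep ~ sex + age + edu + smoke,
                         family = "binomial", data = mysyn)
    compare(fit.syn, dat)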

While the model needs more work, the same conclusions would be drawn from both the original and synthetic data sets, as can be seen from the confidence intervals. Occasionally there may be contradictory conclusions about a variable, for example accepting it in the observed data but not in the synthetic data. This scenario could be corrected by using different synthesis methods (see the documentation) or altering the visit sequence.

Preserving count data

Released population data are often counts of people in geographical areas by demographic variables (age, sex, etc.). Some cells in the table can be very small, e.g. <5, and for privacy reasons these cells are suppressed to protect people's identities. With synthetic data, suppression is not required given it contains no real people, assuming there is enough uncertainty in how the records are synthesised.

The existence of small cell counts raises a few questions:

  1. If very few records exist in a particular grouping (1-4 records in an area) can they be accurately simulated by synthpop?

  2. Is the structure of the count data preserved?

To test this, 200 areas will be simulated to replicate possible real-world scenarios. Area size is randomly allocated to ensure a good mix of large and small population sizes. Population sizes are drawn from a Poisson distribution with mean $\lambda$: for large areas $\lambda$ is drawn from a uniform distribution on the interval [20, 40], and for small areas $\lambda$ is set to 1.
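
A sketch of that simulation (the share of large areas is an assumption):

    set.seed(2018)
    n.areas <- 200

    # flag each area as large or small, then draw its population size
    large     <- runif(n.areas) < 0.5
    lambda    <- ifelse(large, runif(n.areas, 20, 40), 1)
    area.size <- rpois(n.areas, lambda)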

The sequence of synthesising variables and the choice of predictors is important when there are rare events or low-sample areas. If area is synthesised very early in the procedure and used as a predictor for the following variables, it's likely the subsequent models will over-fit. Synthetic data sets require a level of uncertainty to reduce the risk of statistical disclosure, so this is not ideal.

Fortunately syn allows for modification of the predictor matrix. To avoid over-fitting, 'area' is the last variable to be synthesised and will only use sex and age as predictors. This is reasonable for capturing the key population characteristics. Additionally, syn throws an error unless maxfaclevels is increased to at least the number of areas (the default is 60). This guards against poorly synthesised data from factors with many levels, and a warning message suggests checking the results, which is good practice.
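
A sketch of the setup: calling syn with m = 0 returns the default settings without synthesising anything, so the predictor matrix can be edited and passed back in (sim.dat is a hypothetical data frame holding the simulated sex, age and area records):

    # retrieve the default settings without generating data
    visit <- c("sex", "age", "area")    # area is synthesised last
    s0 <- syn(sim.dat, visit.sequence = visit, m = 0, maxfaclevels = 200)

    # area is predicted by sex and age only
    pred <- s0$predictor.matrix
    pred["area", ] <- 0
    pred["area", c("sex", "age")] <- 1

    mysyn.area <- syn(sim.dat, visit.sequence = visit,
                      predictor.matrix = pred, maxfaclevels = 200)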

The area variable is simulated fairly well from just age and sex. The synthesis captures the large and small areas, although the large areas are relatively more variable. This could use some fine tuning, but will do for now.

The method does a good job of preserving the structure of the areas. How much variability is acceptable is up to the user and the intended purpose. Using more predictors may provide a better fit. The errors are distributed around zero, a good sign that no bias has leaked into the data through the synthesis.

At higher levels of aggregation the structure of the tables is maintained even better.

Build your own methods

‘synthpop’ is built in a similar way to the ‘mice’ package, where user-defined methods can be specified and passed to the syn function using the naming convention syn.newmethod. To demonstrate this we’ll build our own neural net method.

At a minimum the function takes as input (a sketch follows the list):

  • y – observed variable to be synthesised

  • x – observed predictors

  • xp – synthesised predictors
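
Below is a minimal sketch of what such a method might look like for factor variables, built on the nnet package (an assumption; the post doesn't show its implementation). syn resolves the method name "nn" to a function called syn.nn:

    library(nnet)

    # user-defined method: fit a single-hidden-layer neural net to the observed
    # data, then draw each synthetic value from the predicted class
    # probabilities at the synthesised predictors
    syn.nn <- function(y, x, xp, size = 5, ...) {
      fit   <- nnet(y ~ ., data = data.frame(y = y, x), size = size, trace = FALSE)
      probs <- predict(fit, newdata = data.frame(xp), type = "raw")
      # nnet returns a single column for a two-level factor
      if (ncol(as.matrix(probs)) == 1) probs <- cbind(1 - probs, probs)
      res <- apply(probs, 1, function(p) sample(levels(y), 1, prob = p))
      list(res = factor(res, levels = levels(y)), fit = "nn")
    }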

Set the method vector to apply the new neural net method to the factors and ctree to the others, and pass it to syn.
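
For example (dat and syn.nn as above):

    # "nn" for factor variables, "ctree" for everything else
    meth <- ifelse(sapply(dat, is.factor), "nn", "ctree")
    meth[1] <- "sample"    # the first variable has no predictors, so sample it
    mysyn.nn <- syn(dat, method = meth)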

The results are very similar to those above, with the exception of ‘alcabuse’, but this demonstrates how new methods can be applied.

Takeaways

The ‘synthpop’ package is great for synthesising data for statistical disclosure control or creating training data for model development. Other things to note:

  • Synthesising a single table is fast and simple.

  • Watch out for over-fitting particularly with factors with many levels. Ensure the visit sequence is reasonable.

  • You are not constrained by only the supported methods, you can build your own!

Future posts

Future posts will tackle the complications that arise when multiple tables at different grains are to be synthesised, and when relationships in the database also need to be preserved. Ideally the data is synthesised and stored alongside the original, enabling any report or analysis to be conducted on either the original or the synthesised data. This will require some trickery to get synthpop to do the right thing, but it is possible.
