A Python Package to Generate Synthetic Data: SDV – Example with Gaussian Copula
Creating synthetic data is becoming more and more important, for privacy reasons among others. Synthetic data are expected to de-identify individuals while preserving the distributional properties of the original data. In this post I introduce a Python package, SDV, which includes various methods for generating synthetic data, such as copulas and deep learning algorithms. You can generate synthetic data for relational databases as well. See the sdv.dev web page for documentation.
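For reference, the high-level SDV workflow looks roughly like the sketch below (hedged: class names and import paths have changed across SDV releases; this assumes the pre-1.0 sdv.tabular interface, and real_data stands for any pandas DataFrame). In this post, however, I will use the lower-level copulas package directly.

# Sketch of the high-level SDV tabular API (pre-1.0 style is assumed here;
# newer versions expose a similar GaussianCopulaSynthesizer under sdv.single_table)
from sdv.tabular import GaussianCopula

model = GaussianCopula()              # single-table Gaussian copula model
model.fit(real_data)                  # real_data: any pandas DataFrame
synthetic_data = model.sample(1000)   # draw 1,000 synthetic rows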
I will show an example of a multivariate copula today. I am downloading a data set used in David Card’s famous return-to-schooling paper, in which Card estimates the return to schooling using two-stage least squares (2SLS). I am going to create a synthetic data set for a selection of the variables and then evaluate the quality of the synthetic data. Let’s get the data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from copulas.multivariate import GaussianMultivariate
from statsmodels.regression.linear_model import OLS
from stargazer.stargazer import Stargazer
df = pd.read_stata('http://fmwww.bc.edu/ec-p/data/wooldridge/card.dta')
For simplicity, I am going to fill missing values with the column medians.
df = df.fillna(df.median())
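To see which columns are actually being imputed, a quick check (run before the fill):

# Count missing values per column before imputation
missing_counts = df.isna().sum()
print(missing_counts[missing_counts > 0].sort_values(ascending=False))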
I would like to use the variables from the first-stage estimation of the 2SLS (excluding a few of them). I will regress education on a set of exogenous variables. Here is the list of variables:
X = ['married', 'exper', 'expersq',
     'nearc2', 'nearc4', 'fatheduc', 'motheduc',
     'weight', 'black', 'smsa', 'south', 'IQ']
Y = ['educ']
all_X = X + Y
I use a Gaussian multivariate copula to create the synthetic data:
s_df = GaussianMultivariate()
s_df.fit(df[all_X])
Fitting triggers a series of RuntimeWarnings from SciPy and copulas while candidate univariate distributions are fitted to each column; a representative excerpt:

C:\Users\alfur\Miniconda3\lib\site-packages\scipy\stats\_continuous_distns.py:4798: RuntimeWarning: divide by zero encountered in true_divide
  return c**2 / (c**2 - n**2)
C:\Users\alfur\Miniconda3\lib\site-packages\copulas\univariate\truncated_gaussian.py:43: RuntimeWarning: invalid value encountered in double_scalars
  a = (self.min - loc) / scale
C:\Users\alfur\Miniconda3\lib\site-packages\copulas\univariate\truncated_gaussian.py:44: RuntimeWarning: divide by zero encountered in double_scalars
  b = (self.max - loc) / scale
C:\Users\alfur\Miniconda3\lib\site-packages\scipy\optimize\minpack.py:162: RuntimeWarning: The iteration is not making good progress, as measured by the
  improvement from the last ten iterations.
  warnings.warn(msg, RuntimeWarning)
...
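Conceptually, a Gaussian copula model transforms each marginal to standard-normal scores, estimates the correlation of those scores, samples from the implied multivariate normal, and maps the draws back through the inverse marginal CDFs. Below is a self-contained sketch of that idea using empirical marginals (the helper name gaussian_copula_sample is mine; the copulas library instead fits parametric univariate distributions to each column, which is what produces the warnings above):

from scipy import stats

def gaussian_copula_sample(data, n_samples, seed=0):
    # Illustrative Gaussian copula with empirical marginals, not the
    # library's implementation (which fits parametric univariate models).
    rng = np.random.default_rng(seed)
    # 1. Map each column to normal scores via its empirical CDF (ranks).
    u = data.rank(method='average') / (len(data) + 1)
    z = pd.DataFrame(stats.norm.ppf(u), columns=data.columns)
    # 2. Estimate the correlation matrix of the normal scores.
    corr = z.corr().values
    # 3. Sample from a multivariate normal with that correlation.
    z_new = rng.multivariate_normal(np.zeros(len(data.columns)), corr, size=n_samples)
    u_new = stats.norm.cdf(z_new)
    # 4. Map back through each column's empirical quantile function.
    samples = {col: np.quantile(data[col], u_new[:, i])
               for i, col in enumerate(data.columns)}
    return pd.DataFrame(samples)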
I draw a sample from the fitted Gaussian copula with the same size as my data:
s_data = s_df.sample(len(df))
Let’s compare the two datasets with basic summary statistics:
df[all_X].describe().transpose().round(2)
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| married | 3010.0 | 2.27 | 2.07 | 1.0 | 1.0 | 1.0 | 4.0 | 6.0 |
| exper | 3010.0 | 8.86 | 4.14 | 0.0 | 6.0 | 8.0 | 11.0 | 23.0 |
| expersq | 3010.0 | 95.58 | 84.62 | 0.0 | 36.0 | 64.0 | 121.0 | 529.0 |
| nearc2 | 3010.0 | 0.44 | 0.50 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| nearc4 | 3010.0 | 0.68 | 0.47 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| fatheduc | 3010.0 | 10.00 | 3.27 | 0.0 | 8.0 | 10.0 | 12.0 | 18.0 |
| motheduc | 3010.0 | 10.54 | 3.03 | 0.0 | 9.0 | 12.0 | 12.0 | 18.0 |
| weight | 3010.0 | 321185.25 | 170645.80 | 75607.0 | 122798.0 | 365200.0 | 406024.0 | 1752340.0 |
| black | 3010.0 | 0.23 | 0.42 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| smsa | 3010.0 | 0.71 | 0.45 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| south | 3010.0 | 0.40 | 0.49 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| IQ | 3010.0 | 102.62 | 12.76 | 50.0 | 98.0 | 103.0 | 108.0 | 149.0 |
| educ | 3010.0 | 13.26 | 2.68 | 1.0 | 12.0 | 13.0 | 16.0 | 18.0 |
s_data[all_X].describe().transpose().round(2)
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| married | 3010.0 | 2.30 | 2.13 | -0.52 | 0.84 | 1.23 | 4.19 | 7.62 |
| exper | 3010.0 | 8.83 | 4.25 | -1.14 | 5.90 | 8.20 | 11.39 | 25.02 |
| expersq | 3010.0 | 95.54 | 88.27 | -38.26 | 34.08 | 68.51 | 129.27 | 568.98 |
| nearc2 | 3010.0 | 0.44 | 0.51 | -0.35 | -0.01 | 0.13 | 0.98 | 1.32 |
| nearc4 | 3010.0 | 0.69 | 0.48 | -0.27 | 0.07 | 0.95 | 1.03 | 1.37 |
| fatheduc | 3010.0 | 10.02 | 3.21 | -1.42 | 8.32 | 10.23 | 11.91 | 19.85 |
| motheduc | 3010.0 | 10.61 | 3.01 | -0.24 | 8.72 | 11.44 | 12.30 | 18.92 |
| weight | 3010.0 | 318721.50 | 167928.40 | -6296.75 | 138158.44 | 358169.97 | 418018.70 | 1422281.72 |
| black | 3010.0 | 0.23 | 0.42 | -0.31 | -0.04 | 0.03 | 0.16 | 1.26 |
| smsa | 3010.0 | 0.71 | 0.46 | -0.28 | 0.11 | 0.95 | 1.03 | 1.30 |
| south | 3010.0 | 0.39 | 0.50 | -0.32 | -0.02 | 0.09 | 0.96 | 1.33 |
| IQ | 3010.0 | 102.77 | 12.85 | 48.59 | 97.98 | 103.10 | 108.55 | 149.97 |
| educ | 3010.0 | 13.03 | 2.46 | 3.68 | 11.36 | 13.16 | 14.87 | 17.97 |
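The two tables can also be compared programmatically, for example by looking at the differences in means and standard deviations (an illustrative check):

# Differences between real and synthetic summary statistics
diff = (df[all_X].describe() - s_data[all_X].describe()).transpose().round(2)
print(diff[['mean', 'std']])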
Looking at the synthetic summary statistics, we immediately find two problems with the Gaussian copula output: (1) dummy variables are not recognized, so their synthetic values can fall below 0 or exceed 1; (2) integer variables are not recognized, so impossible values appear. For example, the experience variable cannot be negative.
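One simple, if crude, workaround is to post-process the synthetic sample: round and clip the dummy variables to {0, 1} and round the integer-valued variables. A sketch (the column groupings below are my own; SDV's higher-level tabular models are designed to handle column types through metadata):

# Post-process the synthetic sample: dummies back to {0, 1}, integers rounded
dummies = ['nearc2', 'nearc4', 'black', 'smsa', 'south']
integers = ['married', 'exper', 'expersq', 'fatheduc', 'motheduc', 'IQ', 'educ']

s_clean = s_data.copy()
s_clean[dummies] = s_clean[dummies].round().clip(0, 1)
s_clean[integers] = s_clean[integers].round().clip(lower=0)
# expersq could also be recomputed as s_clean['exper'] ** 2 to preserve the
# deterministic relationship between experience and its square.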
These artifacts are not surprising. Let’s compare the distributions of some variables in 3D plots:
fig = plt.figure(figsize=(12, 8))
# Left panel: real data
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax1.scatter(df.educ, df.exper, df.IQ)
# Right panel: synthetic data
ax2 = fig.add_subplot(1, 2, 2, projection='3d')
ax2.scatter(s_data.educ, s_data.exper, s_data.IQ)
plt.show()
The education variable is discrete. We see that the Gaussian copula turns it into a continuous variable, just like the others. It looks like chaos on the right-hand side! I would like to investigate whether the distributional properties are preserved.
Since the dependence structure is modeled as Gaussian, the covariance matrices are very similar:
df[all_X].cov().round(2).iloc[1:8,1:8]
| | exper | expersq | nearc2 | nearc4 | fatheduc | motheduc | weight |
|---|---|---|---|---|---|---|---|
| exper | 17.15 | 338.97 | -0.01 | -0.12 | -4.12 | -3.31 | -1.847862e+04 |
| expersq | 338.97 | 7160.26 | -0.24 | -2.52 | -78.35 | -64.72 | -5.315395e+05 |
| nearc2 | -0.01 | -0.24 | 0.25 | 0.03 | 0.14 | 0.09 | 1.989730e+03 |
| nearc4 | -0.12 | -2.52 | 0.03 | 0.22 | 0.19 | 0.10 | 7.425300e+03 |
| fatheduc | -4.12 | -78.35 | 0.14 | 0.19 | 10.67 | 5.18 | 1.138799e+05 |
| motheduc | -3.31 | -64.72 | 0.09 | 0.10 | 5.18 | 9.21 | 1.064642e+05 |
| weight | -18478.62 | -531539.47 | 1989.73 | 7425.30 | 113879.94 | 106464.16 | 2.911999e+10 |
s_data[all_X].cov().round(2).iloc[1:8,1:8]
| | exper | expersq | nearc2 | nearc4 | fatheduc | motheduc | weight |
|---|---|---|---|---|---|---|---|
| exper | 18.04 | 362.92 | -0.08 | -0.14 | -4.18 | -3.47 | 3.414371e+04 |
| expersq | 362.92 | 7791.49 | -1.81 | -3.26 | -81.23 | -68.28 | 6.499460e+05 |
| nearc2 | -0.08 | -1.81 | 0.26 | 0.03 | 0.11 | 0.04 | 6.308000e+01 |
| nearc4 | -0.14 | -3.26 | 0.03 | 0.23 | 0.16 | 0.10 | 8.522730e+03 |
| fatheduc | -4.18 | -81.23 | 0.11 | 0.16 | 10.27 | 4.98 | 9.190515e+04 |
| motheduc | -3.47 | -68.28 | 0.04 | 0.10 | 4.98 | 9.03 | 8.162761e+04 |
| weight | 34143.71 | 649945.98 | 63.08 | 8522.73 | 91905.15 | 81627.61 | 2.819995e+10 |
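The comparison can also be summarized in a single number by looking at the largest absolute difference between the real and synthetic correlation matrices:

# Largest absolute difference between the two correlation matrices
corr_diff = (df[all_X].corr() - s_data[all_X].corr()).abs()
print(round(corr_diff.max().max(), 3))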
I am more interested in linear regression results:
fit1 = OLS(df[Y], df[X]).fit()
fit2 = OLS(s_data[Y], s_data[X]).fit()
Stargazer([fit1, fit2])
Dependent variable: educ

| | (1) | (2) |
|---|---|---|
| IQ | 0.087*** | 0.091*** |
| | (0.002) | (0.002) |
| black | 0.575*** | 0.582*** |
| | (0.114) | (0.106) |
| exper | 0.151*** | -0.023 |
| | (0.031) | (0.033) |
| expersq | -0.020*** | -0.006*** |
| | (0.002) | (0.002) |
| fatheduc | 0.135*** | 0.089*** |
| | (0.014) | (0.015) |
| married | -0.004 | 0.007 |
| | (0.019) | (0.019) |
| motheduc | 0.175*** | 0.210*** |
| | (0.014) | (0.015) |
| nearc2 | 0.086 | 0.229*** |
| | (0.077) | (0.078) |
| nearc4 | 0.439*** | 0.340*** |
| | (0.087) | (0.087) |
| smsa | 0.406*** | 0.242*** |
| | (0.090) | (0.089) |
| south | 0.251*** | 0.189** |
| | (0.083) | (0.083) |
| weight | 0.000*** | 0.000*** |
| | (0.000) | (0.000) |
| Observations | 3,010 | 3,010 |
| R2 | 0.977 | 0.974 |
| Adjusted R2 | 0.977 | 0.974 |
| Residual Std. Error | 2.039 (df=3002) | 2.132 (df=2998) |
| F Statistic | 16193.365*** (df=8; 3002) | 9452.387*** (df=12; 2998) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |
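One caveat on the specification: statsmodels' OLS does not add an intercept automatically, so both regressions above are estimated without a constant, which also inflates the reported R². If you prefer the conventional specification, add a constant explicitly:

import statsmodels.api as sm

# Same regressions with an intercept added explicitly
fit1c = OLS(df[Y], sm.add_constant(df[X])).fit()
fit2c = OLS(s_data[Y], sm.add_constant(s_data[X])).fit()
Stargazer([fit1c, fit2c])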
I must say it performs better than I expected. I believe one could generate even better synthetic data with a deep learning algorithm.