Creation of the SSB:
The SSB is created from several data sources. The survey data are drawn from multiple panels of the Survey of Income and Program Participation (SIPP): the 1990, 1991, 1992, 1993, 1996, and 2004 panels. The administrative data are drawn from the following SSA files: the Master Earnings File, the Master Beneficiary Records (MBR), the Supplemental Security Records (SSR), the 831 Disability File (F831), and the Payment History Update System (PHUS).
The creation of the SSB begins with the construction of the Gold Standard File (GSF). To construct the GSF, a set of variables from the 1990-2004 SIPP panels is standardized to produce consistent measures across panels. The SIPP respondent identifiers are mapped to Social Security Numbers (SSNs) using the Census Bureau's Person Information Validation System (PVS). Using the list of SSNs for the sample of SIPP respondents, SSA creates Summary Earnings Record (SER) and Detailed Earnings Record (DER) extracts from the Master Earnings File. SSA also creates extracts for the four benefit files (MBR, SSR, F831, and PHUS) from the corresponding master files. Using the mapping between the SIPP identifiers and SSNs, Census then links these extracts to the SIPP data. The GSF consists of person-level research variables created from these linked data.
The next step in the creation of the SSB is to impute missing values in the GSF multiple times. This process results in four files (implicates) referred to as the Completed Data implicates. Each of these implicates contains original GSF values where non-missing and imputed values where the original value is missing. The imputations across Completed Data implicates are independent of each other.
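The rule described above (keep the original GSF value where it exists, fill in an imputed value where it is missing, and repeat independently four times) can be sketched as follows. This is only an illustration: the record layout, the `complete_implicates` and `toy_impute` names, and the uniform draw are all assumptions made for the example, and the actual imputation models are not described in this section.

```python
import random

N_COMPLETED = 4  # number of Completed Data implicates stated in the text


def toy_impute(var, record, rng):
    # Hypothetical placeholder draw; NOT the imputation model used for the SSB.
    return rng.randint(20000, 40000)


def complete_implicates(gsf_records, impute, n_implicates=N_COMPLETED, seed=0):
    """Return n_implicates copies of the GSF with missing values (None) filled."""
    implicates = []
    for k in range(n_implicates):
        rng = random.Random(seed + k)  # separate random stream per implicate
        completed = []
        for record in gsf_records:
            completed.append({
                # Keep the original value where non-missing, impute otherwise.
                var: (value if value is not None else impute(var, record, rng))
                for var, value in record.items()
            })
        implicates.append(completed)
    return implicates


# Hypothetical two-person GSF extract: person 2 has missing earnings.
gsf = [{"person": 1, "earnings": 31000}, {"person": 2, "earnings": None}]
completed = complete_implicates(gsf, toy_impute)
```

In this sketch, person 1 carries the same observed earnings in every implicate, while person 2 receives a separately drawn value in each one.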
The Completed Data implicates form the basis of the data synthesis that
produces the SSB files. From each Completed Data file, four synthetic
datasets are created by synthesizing variables conditional on the values in
the Completed Data file. Thus, the SSB consists of sixteen files (implicates).
All but the following data are synthesized in the SSB implicates: gender,
OASDI benefit type, and spouse link (specific variables described in the
data items section below). Detailed documentation of the process of data
synthesis is available in the publication "
The Completed Data and SSB implicates need not all have the same number of records. In order to be included in a Completed Data or SSB implicate, an individual's (possibly imputed or synthesized) age must be at least fifteen years as of January 1 in the first year of his or her SIPP panel. The interaction between this restriction and the variation in imputed and synthesized ages across implicates causes the exclusion of a slightly different set of individuals from each Completed Data and SSB implicate.
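A minimal sketch of this eligibility rule shows how one person can be kept in one implicate and dropped from another. It assumes that age is derived from a birth date stored on each record and that the first year of the 1996 panel is 1996; the function names and the birth dates are hypothetical.

```python
from datetime import date

MIN_AGE = 15  # minimum age stated in the text


def age_on(birth_date, as_of):
    """Age in completed years on the date as_of."""
    years = as_of.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in as_of's year.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def eligible(birth_date, panel_first_year):
    """True if the person is at least MIN_AGE on January 1 of the panel's first year."""
    return age_on(birth_date, date(panel_first_year, 1, 1)) >= MIN_AGE


# Hypothetical: the same person with two different synthesized birth dates.
birth_in_implicate_a = date(1980, 12, 20)
birth_in_implicate_b = date(1981, 2, 3)

keep_a = eligible(birth_in_implicate_a, 1996)  # age 15 on 1996-01-01, included
keep_b = eligible(birth_in_implicate_b, 1996)  # age 14 on 1996-01-01, excluded
```

Because the synthesized birth date differs by only a few weeks, the person appears in implicate A but not in implicate B, which is why record counts vary slightly across files.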
We request that researchers who publish results from analyses done using these data cite the SSB as their data source and acknowledge the use of the SDS server at Cornell and the support of Census staff in running any validation programs. These citations will help ensure continued funding for the SDS server and the creation of the Gold Standard File and the SSB.
Suggested acknowledgement:
Using SSB:
The GSF and Completed Data implicates contain personally identifiable information protected by Titles 13, 26, and 42. They cannot be accessed without Census Bureau Special Sworn Status or outside of Census Bureau facilities. The SSB files, however, have been cleared by the Census Bureau Disclosure Review Board, SSA, and IRS for use by individuals without Census Bureau Special Sworn Status and outside of Census Bureau facilities.
Researchers interested in using the SSB can submit an application to the Census
Bureau. The application form and instructions can be downloaded from
The SSB is designed to be analytically valid in the sense that point estimates should be unbiased and estimated variances should lead to inferences similar to those that would be drawn from an identical analysis on the Completed Data implicates. Initial tests of the analytic validity of the SSB have been promising. All SSB users are invited to help further test the analytic validity of the SSB by submitting programs used to analyze the SSB to be run on the Completed Data and/or Gold Standard files. Users need only inform Census Bureau staff of the location of such programs on the server and work with Census Bureau staff to ensure that the programs run without error. Census Bureau staff will run the programs on the confidential data and release to the user the resulting output that is cleared for release by the Census Bureau Disclosure Review Board. To evaluate the effects of the data synthesis separately from the effects of imputing missing data, comparisons should be made between results from the SSB and the Completed Data. To evaluate the effects of missing data imputation, comparisons should be made between results from the Completed Data and the Gold Standard.
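As a rough illustration of the SSB versus Completed Data comparison, the sketch below averages a point estimate over the sixteen SSB implicates and over the four Completed Data implicates, then takes the difference. All numbers are invented for the example, the `(completed, synthetic)` keying and the function name are assumptions, and simple averaging is used only as a plain way to combine point estimates; variance estimation is not covered here.

```python
from statistics import mean

# Hypothetical point estimates of one statistic (made-up values).
# SSB keys are (Completed Data implicate, synthetic implicate), each 1..4.
ssb = {(m, r): 100.0 + m + 0.5 * r for m in range(1, 5) for r in range(1, 5)}
# One estimate per Completed Data implicate, as returned after validation.
completed = {m: 100.5 + m for m in range(1, 5)}


def average_by_completed(ssb_estimates):
    """Average the synthetic estimates within each Completed Data implicate."""
    groups = {}
    for (m, _r), estimate in ssb_estimates.items():
        groups.setdefault(m, []).append(estimate)
    return {m: mean(values) for m, values in groups.items()}


ssb_by_completed = average_by_completed(ssb)
ssb_overall = mean(list(ssb_by_completed.values()))
completed_overall = mean(list(completed.values()))

# Difference attributable to synthesis in this toy example.
synthesis_gap = ssb_overall - completed_overall
```

The same pattern would apply to the Completed Data versus Gold Standard comparison, with the Gold Standard estimate taking the place of the Completed Data average.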
Protocol for Validation of Results:
Census will validate results obtained from the SSB on the internal, confidential version of these data (Completed Gold Standard Files). Users who wish to obtain validated results should follow the protocol outlined here.
The restricted access site will provide SAS and Stata analysis software and a computing environment similar to the one used to analyze the confidential Completed Gold Standard data on Census Bureau internal computers. Researchers should follow the Census Bureau programming requirements described in SSB Validation Request Guidelines to ensure that the programs will successfully transfer to internal Census computers for validation. Researchers should plan to share their results and programs from the synthetic data analysis with Census, ORES/SSA and SOI/IRS.
After programs have successfully run without error on the synthetic data, researchers may request that Census run these programs on the Completed Gold Standard Files. Only programs successfully run without error on the SDS will be eligible to be run on the confidential data by Census staff. Any programs that produce errors on the Completed Gold Standard Files will be returned to users for correction.
Once an analysis has been repeated on the Completed Gold Standard File, the results will be reviewed by Census staff for disclosure concerns. Researchers should familiarize themselves with standard Census disclosure rules for outside projects (See the