Lori B. Reeder and Martha Stinson and Kelly E. Trageser and Lars Vilhuber. Codebook for the SIPP Synthetic Beta 6.0.2 [Codebook file]. Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, 2015
U.S. Census Bureau. SIPP Synthetic Beta: Version 6.0.2 [Computer file]. Washington DC; Cornell University, Synthetic Data Server [distributor], Ithaca, NY, 2015
The SIPP Synthetic Beta (SSB) is a Census Bureau product that integrates person-level micro-data from a household survey with administrative tax and benefit data. These data link respondents from the Survey of Income and Program Participation (SIPP) to Social Security Administration (SSA)/Internal Revenue Service (IRS) Form W-2 records and SSA records of retirement and disability benefit receipt, and were produced by Census Bureau staff economists and statisticians in collaboration with researchers at Cornell University, the SSA and the IRS. The purpose of the SSB is to provide access to linked data that are usually not publicly available due to confidentiality concerns. To overcome these concerns, Census has synthesized, or modeled, all the variables in a way that changes the record of each individual in a manner designed to preserve the underlying covariate relationships between the variables. The only variables that were not altered by the synthesis process and still contain their original values are gender and a link to the first reported marital partner in the survey. Eight SIPP panels (1990, 1991, 1992, 1993, 1996, 2001, 2004, 2008) form the basis for the SSB, with a large subset of variables available across all the panels selected for inclusion and harmonization across the years. Administrative data were added and some editing was done to correct for logical inconsistencies in the IRS/SSA earnings and benefits data.
Users should be aware that time-varying variable arrays are collapsed in this codebook to a placeholder variable. Thus,
SIPP Panels (1990, 1991, 1992, 1993, 1996, 2001, 2004, and 2008)
SSA records of retirement and disability benefit receipt (1984-2012)
IRS Form W-2 records on respondents (1951-2011)
Creation of the SSB:
The SSB is created from several data sources. The survey data are drawn from multiple panels of the Survey of Income and Program Participation (SIPP): the 1990, 1991, 1992, 1993, 1996, and 2004 panels. The administrative data are drawn from the following SSA files: the Master Earnings File, the Master Beneficiary Records (MBR), the Supplemental Security Records (SSR), the 831 Disability File (F831), and the Payment History Update System (PHUS).
The creation of the SSB begins with the construction of the Gold Standard File (GSF). To construct the GSF, a set of variables from the 1990-2004 SIPP panels is standardized to produce consistent measures across panels. The SIPP respondent identifiers are mapped to Social Security Numbers (SSNs) using the Census Bureau's Person Identification Validation System (PVS). Using the list of SSNs for the sample of SIPP respondents, SSA creates Summary Earnings Record (SER) and Detailed Earnings Record (DER) extracts from the Master Earnings File. SSA also creates extracts from the four benefit files (MBR, SSR, F831, and PHUS) from the corresponding master files. Using the mapping between the SIPP identifiers and SSNs, Census then links these extracts to the SIPP data. The GSF consists of person-level research variables created from these linked data.
The next step in the creation of the SSB is to impute missing values in the GSF multiple times. This process results in four files (implicates) referred to as the Completed Data implicates. Each of these implicates contains original GSF values where non-missing and imputed values where the original value is missing. The imputations across Completed Data implicates are independent of each other.
The Completed Data implicates form the basis of the data synthesis that
produces the SSB files. From each Completed Data file, four synthetic
datasets are created by synthesizing variables conditional on the values in
the Completed Data file. Thus, the SSB consists of sixteen files (implicates).
All but the following data are synthesized in the SSB implicates: gender,
OASDI benefit type, and spouse link (specific variables described in the
data items section below). Detailed documentation of the process of data
synthesis is available in the publication "
The Completed Data and SSB implicates need not all have the same number of records. In order to be included in a Completed Data or SSB implicate, an individual's (possibly imputed or synthesized) age must be at least fifteen years as of January 1 in the first year of his or her SIPP panel. The interaction between this restriction and the variation in imputed and synthesized ages across implicates causes the exclusion of a slightly different set of individuals from each Completed Data and SSB implicate.
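The implicate structure described above can be sketched in a few lines of Python. This is only an illustration: the index names `m` and `r` and both function names are hypothetical, and they do not reflect the file naming used on the Synthetic Data Server.

```python
from itertools import product

N_COMPLETED = 4  # Completed Data implicates (independent imputations of the GSF)
N_SYNTH_PER = 4  # synthetic datasets created from each Completed Data implicate


def ssb_implicates():
    """Return (m, r) pairs: m indexes the Completed Data implicate,
    r indexes the synthetic dataset drawn from it (hypothetical labels)."""
    return list(product(range(1, N_COMPLETED + 1), range(1, N_SYNTH_PER + 1)))


def meets_age_restriction(age_on_jan1_first_panel_year):
    """Inclusion rule: the (possibly imputed or synthesized) age must be
    at least fifteen as of January 1 of the first year of the SIPP panel."""
    return age_on_jan1_first_panel_year >= 15


print(len(ssb_implicates()))  # 16
```

Because the age is itself imputed or synthesized, applying `meets_age_restriction` within each implicate can keep a slightly different set of individuals, which is why record counts may differ across implicates.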
The data can only be used on the VirtualRDC Synthetic Data Server at Cornell University. While no SSB data downloads are permitted at this time, users do not have to operate behind the Census Bureau firewall to access this server.
We request that researchers who publish results from analyses done using these data cite the SSB as their data source and acknowledge the use of the SDS server at Cornell and the support of Census staff in running any validation programs. These citations will help ensure continued funding for the SDS server and the creation of the Gold Standard File and the SSB.
Suggested acknowledgement:
This analysis was first performed using the SIPP Synthetic Beta (SSB) on the Synthetic Data Server housed at Cornell University which is funded by NSF Grants SES-1042181 and BCS-0941226, and through a grant from the Alfred P. Sloan Foundation. These data are public use and may be accessed by researchers outside secure Census facilities. For more information, visit http://www.census.gov/sipp/synth_data.html. Final results for this paper were obtained from a validation analysis conducted by Census Bureau staff using the SIPP Completed Gold Standard Files and the programs written by this author and originally run on the SSB. The validation analysis does not imply endorsement by the Census Bureau of any methods, results, opinions, or views presented in this paper.
You will need to use an NX client to log on to the Synthetic Data Server. Information about how to set up your account and use the Synthetic Data Server will come to you directly from the staff that maintains this server, after approval of your access by Census staff.
Using SSB:
The GSF and Completed Data implicates contain personally identifiable information protected by Titles 13, 26, and 42 and cannot be accessed without Census Bureau Special Sworn Status or outside of Census Bureau facilities. The SSB files, however, have been cleared by the Census Bureau Disclosure Review Board, SSA, and IRS for use by individuals without Census Bureau Special Sworn Status and outside of Census Bureau facilities.
Researchers interested in using the SSB can submit an application to the Census
Bureau. The application form and instructions can be downloaded from
The SSB is designed to be analytically valid in the sense that point estimates should be unbiased and estimated variances should lead to inferences similar to those that would be drawn from an identical analysis on the Completed Data implicates. Initial tests of analytic validity of the SSB have been promising. All SSB users are invited to help further test the analytic validity of the SSB by submitting programs used to analyze the SSB to be run on the Completed Data and/or Gold Standard files. Users need only inform Census Bureau staff of the location on the server of such programs and work with Census Bureau staff to ensure that the programs run without error. Census Bureau staff will run the programs on the confidential data and release to the user the resulting output that is cleared for release by the Census Bureau Disclosure Review Board. To evaluate the effects of the data synthesis separately from the effects of imputing missing data, comparisons should be made between results from the SSB and the Completed Data. To evaluate the effects of missing data imputation, comparisons should be made between results from the Completed Data and the Gold Standard.
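As a rough illustration of working with multiple implicates, the sketch below summarizes one point estimate that has been computed separately on each implicate. It is a minimal example under stated assumptions: the function name and input values are hypothetical, and it reports only the average and the between-implicate variance. It does not implement the full variance combining rules needed for formal inference.

```python
from statistics import mean, variance


def summarize_across_implicates(estimates):
    """Summarize one point estimate computed separately on each implicate.

    estimates: list of floats, one per implicate (hypothetical input).
    Returns the average estimate and the between-implicate sample variance.
    """
    if len(estimates) < 2:
        raise ValueError("need estimates from at least two implicates")
    return {"mean": mean(estimates), "between_var": variance(estimates)}


# Made-up values standing in for the same estimate from three implicates.
print(summarize_across_implicates([1.0, 2.0, 3.0]))
# {'mean': 2.0, 'between_var': 1.0}
```

The same summary can be produced from the SSB output and from validated Completed Data output to compare the two sets of results side by side.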
Protocol for Validation of Results:
Census will validate results obtained from the SSB on the internal, confidential version of these data (Completed Gold Standard Files). Users who wish to obtain validated results should follow the protocol outlined here.
The restricted access site will provide SAS and Stata analysis software and a computing environment similar to the one used to analyze the confidential Completed Gold Standard data on Census Bureau internal computers. Researchers should follow the Census Bureau programming requirements described in SSB Validation Request Guidelines to ensure that the programs will successfully transfer to internal Census computers for validation. Researchers should plan to share their results and programs from the synthetic data analysis with Census, ORES/SSA and SOI/IRS.
After programs have successfully run without error on the synthetic data, researchers may request that Census run these programs on the Completed Gold Standard Files. Only programs successfully run without error on the SDS will be eligible to be run on the confidential data by Census staff. Any programs that produce errors on the Completed Gold Standard Files will be returned to users for correction.
Once an analysis has been repeated on the Completed Gold Standard File, the results will be reviewed by Census staff for disclosure concerns. Researchers should familiarize themselves with standard Census disclosure rules for outside projects (See the
L. B. Reeder, M. Stinson, K. E. Trageser, and L. Vilhuber, "Codebook for the SIPP Synthetic Beta v5.1 [Codebook file]," Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, USA, DDI-C document, 2014. Available at
U.S. Census Bureau, "Disclosure Review Board Memo: Second Request for Release of SIPP Synthetic Beta Version 6.0," U.S. Census Bureau, 2015. Available at
J. M. Abowd, M. Stinson, and G. Benedetto, "Final Report to the Social Security Administration on the SIPP/SSA/IRS Public Use File Project," U.S. Census Bureau, 2006. Available at
Variables in this section are taken from the 831 Disability File (F831) that tracks all Title II (SSDI) and Title XVI (SSI) applications for disability payments from 1990 onwards. Using the F831 data, we created one record per person that reports number of filings, plus filing and decision dates, and the result of determination for the first, second, and most recent applications listed on the file.
We created separate variables for different application types (Title II-SSDI/Title XVI-SSI), so in total there is information on a maximum of 6 applications. We retain records for primary applicants only. Differences may exist between the disability variables on the MBR and the F831 due to differing lengths of the time series, as well as to the fact that the research version of the MBR used to create the Gold Standard recorded a maximum of two events (e.g., events occurring between the initial and most recent entitlement may be censored).
The Master Beneficiary Record (MBR) is SSA's main file to track who is receiving Old-Age, Survivors, and Disability Insurance (OASDI) benefits, the reason for receipt, and the monthly benefit amounts payable to the individual. The MBR also contains records detailing applications to Social Security Disability Insurance (SSDI), including the date of disability adjudication, the date of disability entitlement, and the date of disability onset, as well as the date an individual ceased receiving a benefit (if applicable).
The Payment History Update System (PHUS) contains actual payments delivered to OASDI beneficiaries. The data from the PHUS may differ from what is contained in the MBR due to discrepancies between the timing of SSA awarded amounts and the actual payments made to participants. This situation would be expected to affect disability cases more than aged cases because it takes more time to establish eligibility to receive disability benefits.
Individuals are eligible to receive benefits due to their own earnings history and age, as well as due to a spouse's earnings history and age. In this section retirement and disability are "own" benefits while aged spouse, widowed spouse, and other are "spouse" benefits. The age requirements for receiving each type of benefit are as follows:
- Retirement - minimum age 62 (reduced benefit), full retirement age (full benefit)
- Disability - under age 65 or full retirement age, whichever is greater; at full retirement age, these benefits convert to retirement.
- Aged Spouse - minimum age 62 (reduced benefit), full retirement age (full benefit), spouse must be retired or disabled
- Widowed Spouse - minimum age 60 (reduced benefit), full retirement age (full benefit), spouse must be deceased
- Other - no age requirements
Until the year 2000, the full retirement age was 65. From 2000 to 2022, the full retirement age is increasing by 2 months each year so that by 2022 the full retirement age will be 67.
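The age rules listed above can be expressed as a small lookup. This is a sketch only: the function name, category labels, and return values are hypothetical, the full retirement age is passed in by the caller rather than computed, and only the age conditions are checked (not, for example, whether a spouse is retired, disabled, or deceased).

```python
# Minimum ages for a reduced benefit, taken from the list above.
MIN_AGE = {"retirement": 62, "aged_spouse": 62, "widowed_spouse": 60}


def age_status(benefit_type, age, full_retirement_age=65.0):
    """Classify a person's age against the rule for one benefit type.

    full_retirement_age is an input supplied by the caller; 65 is the
    value the text gives for years before 2000.
    """
    if benefit_type == "other":
        return "eligible"  # no age requirement
    if benefit_type == "disability":
        # Payable under age 65 or full retirement age, whichever is greater;
        # at that age the benefit converts to retirement.
        limit = max(65.0, full_retirement_age)
        return "eligible" if age < limit else "converts_to_retirement"
    if age < MIN_AGE[benefit_type]:
        return "too_young"
    return "full" if age >= full_retirement_age else "reduced"


print(age_status("retirement", 63))      # reduced
print(age_status("widowed_spouse", 59))  # too_young
```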
The benefits reported in this section are total benefits received at a point in time. The MBR research extract provided by SSA to create the Gold Standard contains information about different reasons for receiving benefits but does not always allow the amount due to each reason to be accurately separated from the total. Hence we have elected to report total benefits at a point in time and researchers should be careful to note that when an individual is receiving both own retirement and aged spouse benefits, the amounts listed for each benefit type will be redundant, i.e. there is really only one total amount and two reasons for receiving it.
SSA calculates benefits based on an individual's lifetime earnings history, following rules that it publishes in the "Annual Statistical Supplement to the Social Security Bulletin," available for each tax year on the Social Security website, www.ssa.gov.
These variables pertain to benefits awarded to aged spouses (Initial Date of Entitlement Type of Benefit=3).
These variables pertain to Social Security Disability Insurance (SSDI) benefits awarded due to own disability. These variables provide details for up to 4 applications, including whether there was an application, whether that application led to a decision that the individual was entitled to a benefit, the date of disability entitlement, the date of disability onset, the disability adjudication date of the application, the monthly amount of SSDI benefits received, whether and when an individual ceased receiving a benefit, as well as diagnosis group. The aforementioned variables are derived from the Master Beneficiary Record (MBR).
The total monthly benefit and the date when that benefit was first received, both recorded in the Payment History Update System (PHUS), are also included. The PHUS contains actual payments delivered to SSDI beneficiaries. The data from the PHUS may differ from what is contained in the MBR due to discrepancies between the timing of SSA awarded amounts and the actual payments made to participants.
Identifiers for respondent, respondent's spouse, panel, rotation group, start and end date of SIPP panel, and the first full year in a panel.
These variables include SIPP survey-reported income.
SIPP survey variables related to educational attainment, enrollment, and field of study.
SIPP survey variables detailing whether a respondent has a work-limiting or work-preventing disability.
This is the first year in the panel for which every rotation group is in scope to have all the monthly SIPP variables from January to December.
This variable indicates whether a respondent successfully linked to a valid Social Security Number.
This variable indicates that a person had a work-preventing disability. This information comes from the disability topical module in the 1984 and 1990-1993 panels. In the 1996-2008 panels, this variable is created from a combination of reports in the core and the disability topical module. We look across all waves of the panel and at the topical module, and if there is ever a report of a work-preventing disability, we set this indicator to 1. The universe for this variable is all individuals who were at least age 15 and no older than age 70 by the end of the panel. The following disability variables from the core were used: 1996, 2001, 2004, Wave 6 in 2008 (EJOBCANT).
Field of bachelor's degree as reported in the education history topical module. Universe is individuals who were age 15 by the beginning of their SIPP panel and who had a bachelor's degree. This topical module was asked in the following waves, by panel: Wave 3 in the 1984 panel (TM8038); Wave 2 in the 1990-1993 panels (TM8428, TM8436); Wave 2 in the 1996-2008 panels (EBACHFLD). Categories differ between the 1990-1993 panels and the 1996-2008 panels.
This set of variables (mbr_ssdi_applied_K, where K=1 to 4, i.e. mbr_ssdi_applied_1 through mbr_ssdi_applied_4) indicates whether there is a corresponding record of a Social Security Disability Insurance (SSDI) application in the Master Beneficiary Record (MBR). Details for up to four SSDI applications are maintained. If the individual applied more than four times, then details for only a subset of the applications are recorded in these data, with priority given to approved and more recent applications. The first recorded application, the last recorded application, the first recorded application during the SIPP interview period, and the last recorded application during the SIPP interview period are always kept. For example, if a person applied for SSDI three times, this will be reflected in the following way: mbr_ssdi_applied_1 = 1, mbr_ssdi_applied_2 = 1, mbr_ssdi_applied_3 = 1, mbr_ssdi_applied_4 = 0.
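The three-application example above can be checked with a short Python sketch. The function name and the dictionary representation of a person record are hypothetical; only the four variable names come from the codebook.

```python
def recorded_ssdi_applications(record):
    """Count SSDI applications recorded on the MBR for one person.

    record: mapping that holds the four 0/1 indicators
    mbr_ssdi_applied_1 through mbr_ssdi_applied_4.
    The count is capped at four because details for at most four
    applications are maintained.
    """
    return sum(int(record[f"mbr_ssdi_applied_{k}"]) for k in range(1, 5))


# The example from the text: a person who applied for SSDI three times.
person = {
    "mbr_ssdi_applied_1": 1,
    "mbr_ssdi_applied_2": 1,
    "mbr_ssdi_applied_3": 1,
    "mbr_ssdi_applied_4": 0,
}
print(recorded_ssdi_applications(person))  # 3
```

A count of four should be read as "four or more", since additional applications beyond the retained subset are not recorded.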
This variable contains the diagnostic group for the SSDI recipient's primary code for mental or physical disability used in the medical determination of the individual's eligibility for disability benefits.
Indicates panel of source record
This start date variable tells when the first aged spouse benefit payment was recorded in the Payment History Update System, the administrative database maintained by SSA to track actual payments made to beneficiaries. This start date can differ from the MBR start date which records only eligibility and not actual payments. The PHUS began in 1984 and hence the earliest possible start date is January 1984. The latest possible start date is December 2012.
Total monthly benefit payment as recorded in the Payment History Update System, the administrative database maintained by SSA to track actual payments made to beneficiaries. This benefit amount can differ from the MBR total benefit which records only eligibility and not actual payments. This benefit amount is the total amount paid in the first month of aged spouse benefits receipt. If the respondent was dually entitled in this month, this benefit amount reflects the total payment made (i.e. the sum of the amounts due to each type of benefit). For example if a person received own and aged spouse retirement benefits, this benefit amount would be the sum of those two benefits.
This start date variable tells when the first SSDI benefit payment was recorded in the Payment History Update System, the administrative database maintained by SSA to track actual payments made to beneficiaries. This start date can differ from the MBR start date which records only eligibility and not actual payments. The PHUS began in 1984 and hence the earliest possible start date is January 1984. The latest possible start date is December 2012.
Total monthly benefit payment as recorded in the Payment History Update System, the administrative database maintained by SSA to track actual payments made to beneficiaries. This benefit amount can differ from the MBR total benefit which records only eligibility and not actual payments. This benefit amount is the total amount paid in the month k of own SSDI benefits receipt.
This start date variable tells when the first widowed spouse benefit payment was recorded in the Payment History Update System, the administrative database maintained by SSA to track actual payments made to beneficiaries. This start date can differ from the MBR start date which records only eligibility and not actual payments. The PHUS began in 1984 and hence the earliest possible start date is January 1984. The latest possible start date is December 2012.
Indicator that the PHUS recorded a positive retirement benefit amount (and consequently had a non-missing PHUS start date) at some point after the MBR eligibility start date.
Indicator that the PHUS recorded a positive SSDI benefit amount (and consequently had a non-missing PHUS start date) at some point after the MBR eligibility start date.
SSI benefits recorded as ceased on the SSR; see SSR: last payment date.
This variable contains the diagnostic group for the SSI recipient's primary code for mental or physical disability used in the medical determination of the individual's eligibility for disability benefits.
The date of last recorded payment of SSI benefits. This variable is saved as a SAS date.
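For users who read the data outside SAS, a SAS date value is a count of days since January 1, 1960. The sketch below converts such a value to a calendar date in Python; the function name is hypothetical, and treating a missing value as `None` is an assumption about how the data were loaded.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date value 0


def sas_date_to_date(sas_days):
    """Convert a SAS date value (days since 1960-01-01) to a datetime.date.

    Missing values are assumed to arrive as None and are passed through.
    """
    if sas_days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))


print(sas_date_to_date(0))    # 1960-01-01
print(sas_date_to_date(366))  # 1961-01-01
```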
Type of SSI benefit applied for or received by the individual.
Total personal income summed from all sources in month M. M represents the month number since the start of panel_1stfullyear. Only the first 24 months of the individual's SIPP panel are in universe for panels 1984, 1990, 1991, 1992, 1993, 2001. Panels 1996 and 2008 go up to month 48 and panel 2004 goes up to month 36. A later release of these data will add months after 24 for longer SIPP panels. Months that are outside the time frame covered by an individual's SIPP panel will always be missing and out of universe.
Total number of weeks that the respondent held a job in month M. M represents the month number since the start of panel_1stfullyear. Only the first 24 months of the individual's SIPP panel are in universe for panels 1984, 1990, 1991, 1992, 1993, 2001. Panels 1996 and 2008 go up to month 48 and panel 2004 goes up to month 36. A later release of these data will add months after 24 for longer SIPP panels. Months that are outside the time frame covered by an individual's SIPP panel will always be missing and out of universe.
Total number of weeks worked with pay in month M. M represents the month number since the start of panel_1stfullyear. Only the first 24 months of the individual's SIPP panel are in universe for panels 1984, 1990, 1991, 1992, 1993, 2001. Panels 1996 and 2008 go up to month 48 and panel 2004 goes up to month 36. A later release of these data will add months after 24 for longer SIPP panels. Months that are outside the time frame covered by an individual's SIPP panel will always be missing and out of universe.
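The month index M used by these monthly variables can be mapped to a calendar month as sketched below. This assumes that M=1 corresponds to January of panel_1stfullyear, which is an interpretation of the description above rather than a documented guarantee. The function names are hypothetical, and the month limits are the in-universe maxima stated in the text.

```python
# Highest in-universe month number by SIPP panel, as stated in the text.
MAX_MONTH_BY_PANEL = {
    1984: 24, 1990: 24, 1991: 24, 1992: 24, 1993: 24, 2001: 24,
    1996: 48, 2008: 48,
    2004: 36,
}


def month_to_calendar(panel_1stfullyear, m):
    """Map month number m to (year, month), assuming m=1 is January
    of panel_1stfullyear."""
    if m < 1:
        raise ValueError("month number must be 1 or greater")
    return panel_1stfullyear + (m - 1) // 12, (m - 1) % 12 + 1


def month_in_universe(panel, m):
    """Upper-bound check only: months outside the time frame covered by an
    individual's own panel are still missing even when this returns True."""
    return 1 <= m <= MAX_MONTH_BY_PANEL[panel]


print(month_to_calendar(1997, 13))  # (1998, 1)
print(month_in_universe(2004, 37))  # False
```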