Synthetic Longitudinal Business Database
Virtual RDC
Lars Vilhuber
Cornell NCRN Project
January 6th, 2014
Cornell Institute for Social and Economic Research (CISER), Cornell University, Ithaca NY
CED2AR, Version 1.0
The Comprehensive Extensible Data Documentation and Access Repository 2.5
National Science Foundation (NSF)
1131848
NSF-Census Research Network - Cornell node
Cornell Institute for Social and Economic Research
Labor Dynamics Institute
January 6, 2014
Revision: 459
Comprehensive Extensible Data Documentation and Access Repository. Codebook for the Synthetic LBD Version 2.0 [Codebook file]. Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, 2013
SynLBD
Synthetic Longitudinal Business Database
United States Department of Commerce. Bureau of the Census.
Internal Revenue Service.
Cornell University. Labor Dynamics Institute.
United States Department of Commerce, Bureau of the Census
Duke University
Cornell University, Labor Dynamics Institute
none
Washington, DC, USA
National Science Foundation
0427889 and 1042181
Cornell University
U.S. Census Bureau
16 October 2013
2.0.2
U.S. Census Bureau. Synthetic Longitudinal Business Database: Version 2.0 [Computer file]. Washington DC; Cornell University, Synthetic Data Server [distributor], Ithaca, NY, 2013
establishment dynamics
In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata. The Synthetic Longitudinal Business Database (SynLBD) is the synthetic data version of the Longitudinal Business Database (LBD), an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. More information is available at https://www.census.gov/ces/dataproducts/synlbd/index.html.
In this codebook, variables are noted as "blanked" if they are available on the confidential version but have been removed from the synthetic version, and as "synthetic" if the confidential values have been synthesized and released on the synthetic version.
United States of America
National
Establishment
All employer establishments
Sampling from posterior predictive distribution
In order to access the Synthetic LBD, users should apply for a free account on the Synthetic Data Server (SDS) housed at the VirtualRDC at Cornell University. Application forms can be found at https://www.census.gov/ces/dataproducts/synlbd/accesslbd.html. Application decisions are based solely on feasibility, determined by evaluating whether the data necessary to conduct the analysis are included on the SynLBD Beta file. Decisions generally occur within 10 business days.
The SynLBD files have been cleared by the Census Bureau Disclosure Review Board and IRS for use by individuals without Census Bureau Special Sworn Status and outside of Census Bureau facilities. Establishments in the SynLBD are fully synthesized using statistical models, and the SynLBD contains no data from actual establishments. Comparison at the establishment level shows SynLBD data differ substantially from the actual data. Modeling preserves variable relationships while protecting establishment identity.
The data can only be used on the VirtualRDC Synthetic Data Server http://www.vrdc.cornell.edu/sds/ at Cornell University. While no SynLBD data downloads are permitted at this time, users do not have to operate behind the Census Bureau firewall to access this server.
ces.synthetic.data.use@census.gov
Please use the following language in published work that makes use of this dataset: "The creation of the Synthetic LBD was made possible through NSF Grant #0427889. Access to the Synthetic LBD was made possible through NSF Grant #1042181." Please also cite Kinney et al. (2011) and use the bibliographic citation for the dataset provided in this document.
Establishments in the SynLBD are fully synthesized using statistical models, and the SynLBD contains no data from actual establishments. Comparison at the establishment level shows SynLBD data differ substantially from the actual data. Modeling preserves variable relationships while protecting establishment identity. Because the SynLBD has not been fully validated, relationships between SynLBD variables may not correspond to the relationships in the underlying confidential microdata. Unless validated, there is no guarantee results from the SynLBD reflect results from the underlying confidential data. Researchers are strongly encouraged to request result validation prior to publishing results based on the SynLBD. Validation occurs as part of an internal Census Bureau process to improve current beta data products, and is free, as resources permit. (See https://www.census.gov/ces/dataproducts/synlbd/validatingresults.html)
https://www.census.gov/ces/pdf/SynLBD_Codebook.pdf
Kinney, Satkartar K., Jerome P. Reiter, Arnold P. Reznek, Javier Miranda, Ron S. Jarmin and John M. Abowd. 2011.
CES WP-11-04
In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata. In this article, we describe an application of this strategy to create a public use file for the Longitudinal Business Database, an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. The U.S. Bureau of the Census and the Internal Revenue Service recently approved the release of these synthetic microdata for public use, making the synthetic Longitudinal Business Database the first-ever business microdata set publicly released in the United States. We describe how we created the synthetic data, evaluated analytical validity, and assessed disclosure risk.
synlbd1997c.dta
all
28
Stata
Blanked Variables
Variables that are available on the confidential LBD but not on the Synthetic LBD are noted as "blanked". The data files on the Synthetic Data Server do contain these variables, but all values have been removed, so users can functionally test programs that reference these variables.
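Because blanked variables are present but empty, a program can confirm which columns carry no values before relying on them. The following is a minimal sketch, assuming the records have already been read into a list of dictionaries with missing values represented as None; the column names (lbdnum, emp, cfn) are hypothetical placeholders, not names taken from this codebook.

```python
# Hypothetical records; column names are illustrative, not taken from the codebook.
records = [
    {"lbdnum": 1, "emp": 12, "cfn": None},
    {"lbdnum": 2, "emp": 40, "cfn": None},
]


def blanked_columns(rows):
    """Return the sorted names of columns whose value is missing in every row."""
    if not rows:
        return []
    columns = rows[0].keys()
    return sorted(c for c in columns if all(r.get(c) is None for r in rows))


blanked = blanked_columns(records)
print(blanked)
```

A check of this kind only shows that a column is empty in the file at hand; the codebook entry remains the authoritative statement of whether a variable is blanked.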
Synthetic Variables
Synthetic variables have had confidential values synthesized. On the Synthetic LBD, as the name suggests, the synthetic values have been released.
Identifiers
Identifiers allow for unique identification of records, and/or linkage to other datasets (foreign keys).
(synthetic) LBD Number
Longitudinal establishment identifier. Can be used to track establishment units over time.
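Tracking an establishment over time amounts to grouping its yearly records by the longitudinal identifier. The sketch below assumes one record per establishment per year and uses the hypothetical field names lbdnum, year, and emp; the actual variable names on the file may differ.

```python
from collections import defaultdict


def build_histories(rows):
    """Group yearly records by establishment identifier, sorted by year.

    Field names (lbdnum, year, emp) are assumptions made for illustration.
    """
    histories = defaultdict(list)
    for row in rows:
        histories[row["lbdnum"]].append((row["year"], row["emp"]))
    # Sort each establishment's (year, employment) pairs chronologically.
    return {key: sorted(values) for key, values in histories.items()}


# Hypothetical records stacked from several yearly files.
rows = [
    {"lbdnum": "A", "year": 1998, "emp": 14},
    {"lbdnum": "B", "year": 1997, "emp": 3},
    {"lbdnum": "A", "year": 1997, "emp": 12},
]
histories = build_histories(rows)
print(histories["A"])
```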
(synthetic) March 12 Employment
Paid employment consists of full and part-time employees, including salaried officers and executives of corporations, who were on the payroll in the pay period including March 12. Included are employees on sick leave, holidays, and vacations; not included are proprietors and partners of unincorporated businesses. Employment refers to paid employment at the establishment where business is conducted. Note, this will correspond to firm employment only for single-unit establishment firms. Reported to the Internal Revenue Service (IRS) on Form 941. In some cases this value is imputed due to missing or invalid data.
(synthetic) Reported Annual Payroll (in $1,000)
Total annual payroll includes all forms of compensation, such as salaries, wages, commissions, bonuses, vacation allowances, sick-leave pay, and the value of payments in kind (e.g., free meals and lodgings) paid during the year to all employees. Sum of quarterly IRS Form 941 payroll for the year. Missing or invalid 941 data is replaced with imputed values.
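Since payroll is recorded in thousands of dollars while employment is a head count, a per-employee figure needs a unit conversion. The following is a rough sketch, not a method prescribed by the codebook: the function and argument names are hypothetical, and the ratio is only approximate because employment is measured in the March 12 pay period while payroll covers the whole year.

```python
def payroll_per_employee(pay_thousands, emp):
    """Approximate annual payroll per employee, in dollars.

    pay_thousands: annual payroll in $1,000 (as recorded on the file).
    emp: March 12 employment head count.
    Returns None when either input is missing or employment is not positive.
    """
    if pay_thousands is None or emp is None or emp <= 0:
        return None
    return pay_thousands * 1000 / emp


# Hypothetical establishment: $500 thousand payroll, 10 employees.
print(payroll_per_employee(500, 10))
```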
(observed, computed) SIC3 code
all
0
Three digit Standard Industrial Classification code.
Standard Industrial Classification
(synthetic) Single-Multi Identifier
all
0
Indicator for whether the establishment belongs to a firm composed of two or more establishments. A value of 1 indicates the establishment is a member of a firm composed of two or more establishments. A value of 0 indicates the establishment is the only member of the firm.
0
Establishment is the only member of the firm
8000000
1
Establishment is a member of a firm composed of two or more establishments
1000000
(synthetic) First Year Establishment is Observed
all
0
1975
2000
Indicator for the first year the establishment is observed in the data (birth year). This variable is left censored at 1976. In conjunction with LASTYEAR, allows users to quickly determine the tenure of an establishment from any point in the data series.
(synthetic) Last Year Establishment is Observed
Indicator for the last year the establishment is observed (death year). This variable is right censored at the last year of the data. In conjunction with FIRSTYEAR, allows users to quickly determine the tenure of an establishment from any point in the data series.
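The tenure calculation mentioned above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the 1976 censoring year is taken from the FIRSTYEAR description, and the last year of the data is passed in as an argument because this codebook does not state it.

```python
def years_observed(firstyear, lastyear):
    """Number of calendar years the establishment appears, inclusive."""
    return lastyear - firstyear + 1


def age_in_year(firstyear, lastyear, year):
    """Years since first observation, or None if not observed in `year`."""
    if year < firstyear or year > lastyear:
        return None
    return year - firstyear


def is_left_censored(firstyear, first_data_year=1976):
    """True when the birth year may precede the start of the data."""
    return firstyear <= first_data_year


def is_right_censored(lastyear, last_data_year):
    """True when the establishment may still be alive after the data end."""
    return lastyear >= last_data_year


# Hypothetical establishment observed from 1980 through 1985.
print(years_observed(1980, 1985), age_in_year(1980, 1985, 1983))
```

For censored establishments, the computed tenure is a lower bound on the true tenure rather than an exact value.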
(blanked) Activity Code
Variable not present on Synthetic LBD, only available on confidential LBD
(blanked) Best SIC code
Variable not present on Synthetic LBD, only available on confidential LBD
(blanked) Best NAICS code
Variable not present on Synthetic LBD, only available on confidential LBD
(blanked) Indicator: used in County Business Patterns
Variable not present on Synthetic LBD, only available on confidential LBD. Indicates that an observation was used in the tabulation of the County Business Patterns.
Sysmiss
all
(blanked) Census File Number
Variable not present on Synthetic LBD, only available on confidential LBD. Links to other Economic microdata.
0123456789
1234567890
(blanked) County FIPS codes
Variable not present on Synthetic LBD, only available on confidential LBD. County FIPS code. It is not possible to compute statistics at the state or county level.
FIPS code
(blanked) First Link Flag
Indicator for first link. Only available on confidential LBD.
(blanked) Type of Link Flag
Identifies the type of link. Only available on confidential LBD.
(blanked) Birth-Death-Continuer Link Flag
Identifies if the link is for birth/death/continuing establishment. Only available on confidential LBD.
(blanked) Last Link Flag
Only available on confidential LBD.
(blanked) Legal Form of Organization
Identifies the legal form of the organization. Only available on confidential LBD.
(blanked) LFO1
Processing variable. Only available on confidential LBD.
(blanked) Most Frequent SIC 1
Variable not present on Synthetic LBD, only available on confidential LBD.
(blanked) Processing (Economic) Division Code
Variable not present on Synthetic LBD, only available on confidential LBD.
(blanked) Reported Annual Payroll Flag
Only available on confidential LBD.
(blanked) SSEL Record Number
Links to the Business Register (BR). Variable not present on Synthetic LBD, only available on confidential LBD.
Sysmiss
all
(blanked) Standard Industrial Classification Code
Variable not present on Synthetic LBD, only available on confidential LBD. Detailed SIC code
Standard Industrial Classification
processing variable (to be dropped in future version)
(blanked) State FIPS codes
Variable not present on Synthetic LBD, only available on confidential LBD. It is not possible to compute statistics at the state or county level.
all
FIPS state code
(blanked) Type of Operation Code
Variable not present on Synthetic LBD, only available on confidential LBD.
(computed) Year
Implicit in the file name; added in SynLBD 2.0.2.
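For files that lack the explicit variable, the year can be recovered from the file name. The sketch below assumes that all yearly files follow the pattern of the one file name listed in this codebook (synlbd1997c.dta), with a four-digit year directly after the "synlbd" prefix; that naming pattern and the function name are assumptions.

```python
import re


def year_from_filename(name):
    """Extract the four-digit year following 'synlbd' in a file name.

    Returns None when the name does not match the assumed pattern.
    """
    match = re.search(r"synlbd(\d{4})", name.lower())
    return int(match.group(1)) if match else None


print(year_from_filename("synlbd1997c.dta"))
```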