Synthetic Longitudinal Business Database Virtual RDC
Lars Vilhuber
Cornell NCRN Project
January 6th, 2014
Cornell Institute for Social and Economic Research (CISER), Cornell University, Ithaca NY
CED2AR, Version 1.0
The Comprehensive Extensible Data Documentation and Access Repository 2.5
National Science Foundation (NSF) 1131848
NSF-Census Research Network - Cornell node
Cornell Institute for Social and Economic Research Labor Dynamics Institute
January 6, 2014
Revision: 459
Comprehensive Extensible Data Documentation and Access Repository. Codebook for the Synthetic LBD Version 2.0 [Codebook file]. Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, 2013

SynLBD
Synthetic Longitudinal Business Database
United States Department of Commerce. Bureau of the Census. Internal Revenue Service. Cornell University. Labor Dynamics Institute.
United States Department of Commerce, Bureau of the Census
Duke University
Cornell University, Labor Dynamics Institute
none
Washington, DC, USA
National Science Foundation 0427889 and 1042181
Cornell University U.S. Census Bureau
16 October 2013
2.0.2
U.S. Census Bureau. Synthetic Longitudinal Business Database: Version 2.0 [Computer file]. Washington DC; Cornell University, Synthetic Data Server [distributor], Ithaca, NY, 2013
establishments dynamics

In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata.
The Synthetic Longitudinal Business Database (SynLBD) is the synthetic data version of the Longitudinal Business Database (LBD), an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. More information is available at https://www.census.gov/ces/dataproducts/synlbd/index.html. In this codebook, variables are noted as "blanked" if they are available on the confidential version but have been removed from the synthetic version, and as "synthetic" if the confidential values have been synthesized and released on the synthetic version.

United States of America National Establishment All employer establishments Sampling from posterior predictive distribution

In order to access the Synthetic LBD, users should apply for a free account on the Synthetic Data Server (SDS) housed at the VirtualRDC at Cornell University. Application forms can be found at https://www.census.gov/ces/dataproducts/synlbd/accesslbd.html. Application decisions are based solely on feasibility, determined by evaluating whether the data necessary to conduct the analysis are included on the SynLBD Beta file. Decisions generally occur within 10 business days.

The SynLBD files have been cleared by the Census Bureau Disclosure Review Board and IRS for use by individuals without Census Bureau Special Sworn Status and outside of Census Bureau facilities. Establishments in the SynLBD are fully synthesized using statistical models, and the SynLBD contains no data from actual establishments. Comparison at the establishment level shows SynLBD data differ substantially from the actual data. Modeling preserves variable relationships while protecting establishment identity.

The data can only be used on the VirtualRDC Synthetic Data Server http://www.vrdc.cornell.edu/sds/ at Cornell University. While no SynLBD data downloads are permitted at this time, users do not have to operate behind the Census Bureau firewall to access this server.
ces.synthetic.data.use@census.gov

Please use the following language in published work that makes use of this dataset: "The creation of the Synthetic LBD was made possible through NSF Grant #0427889. Access to the Synthetic LBD was made possible through NSF Grant #1042181." Please also cite Kinney et al. (2011) and use the bibliographic citation for the dataset provided in this document.

Because the SynLBD has not been fully validated, relationships between SynLBD variables may not correspond to the relationships in the underlying confidential microdata. Unless validated, there is no guarantee results from the SynLBD reflect results from the underlying confidential data. Researchers are strongly encouraged to request result validation prior to publishing results based on the SynLBD. Validation occurs as part of an internal Census Bureau process to improve current beta data products, and is free, as resources permit. (See https://www.census.gov/ces/dataproducts/synlbd/validatingresults.html)

https://www.census.gov/ces/pdf/SynLBD_Codebook.pdf

Kinney, Satkartar K., Jerome P. Reiter, Arnold P. Reznek, Javier Miranda, Ron S. Jarmin and John M. Abowd. 2011. CES WP-11-04

In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata.
In this article, we describe an application of this strategy to create a public use file for the Longitudinal Business Database, an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. The U.S. Bureau of the Census and the Internal Revenue Service recently approved the release of these synthetic microdata for public use, making the synthetic Longitudinal Business Database the first-ever business microdata set publicly released in the United States. We describe how we created the synthetic data, evaluated analytical validity, and assessed disclosure risk.

synlbd1997c.dta all 28 Stata

Blanked Variables
Variables that are not available on the Synthetic LBD, but are available on the confidential LBD, are noted as "blanked". The data files on the Synthetic Data Server do contain these variables, but all values have been removed. This lets users test that programs which reference these variables still run.

Synthetic Variables
Synthetic variables have had confidential values synthesized. On the Synthetic LBD, as the name suggests, the synthetic values have been released.

Identifiers
Identifiers allow for unique identification of records, and/or linkage to other datasets (foreign keys).

(synthetic) LBD Number
Longitudinal establishment identifier. Can be used to track establishment units over time.

(synthetic) March 12 Employment
Paid employment consists of full and part-time employees, including salaried officers and executives of corporations, who were on the payroll in the pay period including March 12. Included are employees on sick leave, holidays, and vacations; not included are proprietors and partners of unincorporated businesses. Employment refers to paid employment at the establishment where business is conducted. Note that this corresponds to firm employment only for single-unit establishment firms. Reported to the Internal Revenue Service (IRS) on Form 941.
In some cases this value is imputed due to missing or invalid data.

(synthetic) Reported Annual Payroll (in $1,000)
Total annual payroll includes all forms of compensation, such as salaries, wages, commissions, bonuses, vacation allowances, sick-leave pay, and the value of payments in kind (e.g., free meals and lodgings) paid during the year to all employees. Sum of quarterly IRS Form 941 payroll for the year. Missing or invalid 941 data is replaced with imputed values.

(observed, computed) SIC3 code
all 0
Three-digit Standard Industrial Classification code.
Standard Industrial Classification Neat

(synthetic) Single-Multi Identifier
all 0
Indicator for whether the establishment belongs to a firm composed of two or more establishments. A value of 1 indicates the establishment is a member of a firm composed of two or more establishments. A value of 0 indicates the establishment is the only member of the firm.
0    Establishment is the only member of the firm    8000000
1    Establishment is a member of a firm composed of two or more establishments    1000000

(synthetic) First Year Establishment is Observed
all 0
1975 2000
Indicator for the first year the establishment is observed in the data (birth year). This variable is left censored at 1976. In conjunction with LASTYEAR, allows users to quickly determine the tenure of an establishment from any point in the data series.

(synthetic) Last Year Establishment is Observed
Indicator for the last year the establishment is observed (death year). This variable is right censored at the last year of the data. In conjunction with FIRSTYEAR, allows users to quickly determine the tenure of an establishment from any point in the data series.

(blanked) Activity Code.


Variable not present on Synthetic LBD, only available on confidential LBD
(blanked) Best SIC code
Variable not present on Synthetic LBD, only available on confidential LBD.

(blanked) Best NAICS code
Variable not present on Synthetic LBD, only available on confidential LBD.

(blanked) Indicator: used in County Business Patterns
Variable not present on Synthetic LBD, only available on confidential LBD. Indicates that an observation was used in the tabulation of the County Business Patterns.
Sysmiss all

(blanked) Census File Number
Variable not present on Synthetic LBD, only available on confidential LBD. Links to other Economic microdata.
0123456789 1234567890

(blanked) County FIPS codes
Variable not present on Synthetic LBD, only available on confidential LBD. County FIPS code. It is not possible to compute statistics at the state or county level.
FIPS code

(blanked) First Link Flag
Indicator for first link. Only available on confidential LBD.

(blanked) Type of Link Flag
Identifies the type of link. Only available on confidential LBD.

(blanked) Birth-Death-Continuer Link Flag
Identifies if the link is for birth/death/continuing establishment. Only available on confidential LBD.

(blanked) Last Link Flag
Only available on confidential LBD.

(blanked) Legal Form of Organization
Identifies the legal form of the organization. Only available on confidential LBD.

(blanked) LFO1
Processing variable. Only available on confidential LBD.

(blanked) Most Frequent SIC 1
Variable not present on Synthetic LBD, only available on confidential LBD.

(blanked) Processing (Economic) Division Code
Variable not present on Synthetic LBD, only available on confidential LBD.

(blanked) Reported Annual Payroll Flag
Only available on confidential LBD.

(blanked) SSEL Record Number
Links to BR. Only available on confidential LBD; variable not present on Synthetic LBD.
Sysmiss all

(blanked) Standard Industrial Classification Code
Variable not present on Synthetic LBD, only available on confidential LBD.
Detailed SIC code Standard Industrial Classification processing variable (to be dropped in future version)

(blanked) State FIPS codes
Variable not present on Synthetic LBD, only available on confidential LBD. It is not possible to compute statistics at the state or county level.
all
FIPS state code

(blanked) Type of Operation Code
Variable not present on Synthetic LBD, only available on confidential LBD.

(computed) Year
Implicit in the file name; added in SynLBD 2.0.2.
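
The Year entry notes that the year is implicit in the file name, and the First Year and Last Year entries describe how the two variables together give an establishment's tenure at any point in the series. The Python sketch below combines the two ideas. It is illustrative only and rests on several assumptions: the file-name pattern is inferred from the single file name listed in this codebook (synlbd1997c.dta), the keys "firstyear" and "lastyear" are hypothetical stand-ins for the actual variable names, the last year of the data must be supplied by the user, and the record values are made up.

```python
import re
from typing import Optional

# Assumed pattern, inferred from the one file name in this codebook
# ("synlbd1997c.dta"); other files may be named differently.
_FILE_RE = re.compile(r"synlbd(\d{4})c\.dta$")

# The codebook states that the first-year variable is left censored at 1976.
LEFT_CENSOR_YEAR = 1976


def year_from_filename(name: str) -> Optional[int]:
    """Return the four-digit year embedded in a file name, or None."""
    match = _FILE_RE.search(name)
    return int(match.group(1)) if match else None


def tenure(record: dict, year: int, last_data_year: int) -> dict:
    """Age and remaining life of one establishment as of `year`.

    `record` uses the hypothetical keys "firstyear" and "lastyear".
    `last_data_year` is supplied by the caller; it is not taken
    from the codebook.
    """
    first, last = record["firstyear"], record["lastyear"]
    return {
        "age": year - first,
        "years_remaining": last - year,
        # Left censoring: the true birth year may precede the recorded one.
        "age_is_lower_bound": first <= LEFT_CENSOR_YEAR,
        # Right censoring: the establishment may outlive the data.
        "remaining_is_lower_bound": last >= last_data_year,
    }


year = year_from_filename("synlbd1997c.dta")
example = {"firstyear": 1976, "lastyear": 1999}  # made-up values
result = tenure(example, year, last_data_year=2000)  # 2000 is illustrative
```

With these made-up values the computed age is 21 years, flagged as a lower bound because a first year of 1976 is left censored. Keeping the censoring flags next to the computed durations is one way to avoid treating censored tenures as exact.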