Enhancing the Quality of Wage Records for Analysis through Imputation: Part Two
by: Tony Glover, Research Analyst
"By creating a separate category for workers who demonstrate an attachment to Wyoming's labor market of three or fewer quarters, we lower the error associated with imputation."
The article "Enhancing the Quality of Wage Records for Analysis through Imputation: Part One," in the April 2001 issue of Wyoming Labor Force Trends, introduced a method of imputing demographic characteristics for individuals based on people with known demographic characteristics. This article introduces and justifies additional restrictions on the use of imputed data, and demonstrates that the impact of imputed demographic data on analysis of the Wage Records database depends on the level of detail desired. Specifically, we intend to lower the risks of using imputed data for any individuals whose work history reflects a low level of attachment to the Wyoming labor market. Lastly, we make suggestions for revised imputation techniques and future analysis.
The imputation models introduced in the previous article are based on an individual’s work history. Three elements were combined to define that work history: the individual’s interaction with an employer, the number of quarters with the employer and the average quarterly wage from the employer. Further, the imputed demographic characteristics were determined by aggregating all of an individual’s employer interactions. For example, individuals with three employers had three associated probabilities of being male, and all three were averaged to determine the imputed gender of the individual.
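The averaging step described above can be sketched in a few lines of Python. This is only an illustration: the function name `impute_gender`, the sample probabilities and the 0.5 cutoff are assumptions, since the article does not state how the averaged probability is converted into a gender.

```python
from statistics import mean


def impute_gender(male_probabilities, threshold=0.5):
    """Average the per-employer probabilities of being male and pick a gender.

    The 0.5 threshold is an assumption made for illustration; the article
    does not specify the cutoff that was used.
    """
    if not male_probabilities:
        raise ValueError("at least one employer interaction is required")
    average = mean(male_probabilities)
    gender = "M" if average >= threshold else "F"
    return gender, average


# Hypothetical individual with three employers, one probability per employer.
gender, average = impute_gender([0.8, 0.6, 0.7])
print(gender, round(average, 2))
```

Averaging gives every employer interaction equal weight, regardless of how many quarters the individual spent with each employer.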
The imputation model used for our analysis relies heavily on previous work history. We determined at the beginning of this project that any imputation based on three or fewer quarters of work history would be re-coded as not available (N/A). Little demographic information is known for workers with low attachment to the Wyoming labor market, in part because these workers often do not hold Wyoming driver’s licenses. We have therefore chosen to treat them as a separate category of worker and thereby reduce the error otherwise associated with including them in demographic analyses.
Table 1 shows the number of unique Social Security Numbers (SSNs) by quarters worked and demographic data availability for Wage Records from 1992 to 2000. Imputed demographics were used only in the absence of known characteristics. Table 1 reveals that of the 697,613 unique SSNs appearing in Wage Records during this eight-year period, 321,491 (46.1%) had unknown and subsequently imputed demographic data. Incorporating the three-quarter rule, we re-coded 73.1 percent of the imputed data as N/A. The final result is that of the 697,613 unique SSNs, 376,122 (53.9%) have known demographics, 86,567 (12.4%) have imputed demographics and 234,924 (33.7%) remain N/A. Subsequent tables and figures are offered to support the decision to use the three-quarter rule and to demonstrate the impact of imputations on analysis of the 1998 Wage Records data.
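The percentages quoted from Table 1 can be reproduced from the counts given in the text. The short Python check below is a sketch for readers who want to verify the arithmetic; the variable names are illustrative and only the counts come from the article.

```python
# Counts of unique SSNs as reported in the text for Table 1.
total = 697_613
known = 376_122
imputed_kept = 86_567   # imputed, four or more quarters of work history
recoded_na = 234_924    # imputed, then re-coded N/A under the three-quarter rule


def pct(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


originally_imputed = imputed_kept + recoded_na   # 321,491 in the text

# The three final categories account for every unique SSN.
assert known + imputed_kept + recoded_na == total

print("unknown, then imputed:", pct(originally_imputed, total))       # 46.1
print("known:", pct(known, total))                                    # 53.9
print("imputed and kept:", pct(imputed_kept, total))                  # 12.4
print("re-coded N/A:", pct(recoded_na, total))                        # 33.7
print("share of imputed data re-coded:", pct(recoded_na, originally_imputed))  # 73.1
```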
Table 2 shows the demographic characteristics (i.e., gender and age) by imputation status. The first category of imputation status is "Known," which occurs when the Wage Records SSN matches another administrative database that includes demographics (i.e., Driver’s License, Employment Services). The second category in Table 2 is "Imputed," which is based on the work history model discussed in Part One of this article and requires at least four quarters of attachment to Wyoming’s labor force. Lastly, "Imputed but Re-coded N/A" records are those for which the demographic data were imputed but then re-coded N/A for all subsequent research using Wage Records, because the SSN appeared in fewer than four quarters of Wage Records during the period 1992 to 2000. Table 3 shows the distribution of gender across all three categories of imputation status, and Table 4 shows the distribution by age group.
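The three status categories and the three-quarter rule amount to a simple decision rule, sketched below in Python. The function and constant names are hypothetical; the logic follows the definitions above, with known demographics always taking precedence over imputed ones.

```python
# "At least four quarters of attachment" is the threshold stated in the article.
MIN_QUARTERS_FOR_IMPUTATION = 4


def imputation_status(has_known_demographics: bool, quarters_worked: int) -> str:
    """Assign one of the three imputation status categories to an SSN.

    has_known_demographics: True when the SSN matched an administrative
        database that carries demographics.
    quarters_worked: number of quarters the SSN appears in Wage Records.
    """
    if has_known_demographics:
        return "Known"
    if quarters_worked >= MIN_QUARTERS_FOR_IMPUTATION:
        return "Imputed"
    # Three or fewer quarters: the imputed values are re-coded as N/A.
    return "Imputed but Re-coded N/A"


# Hypothetical records: (matched a demographic database?, quarters worked)
for record in [(True, 2), (False, 12), (False, 3)]:
    print(record, "->", imputation_status(*record))
```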
Figure 1 shows the age and gender distribution for 1998 Wage Records broken out by category of imputation status. Because imputed data are based on the known data, the distributions of those two groups are by definition similar. The data in Table 2 and Figure 1 were arranged in this fashion to demonstrate that, among records that were "Imputed but Re-coded N/A," males 20 to 34 years old are overrepresented and females 35 to 54 years old are underrepresented when compared to the "Known" and "Imputed" categories.
Table 5 demonstrates that the effects of the "Imputed but Re-coded N/A" category are not equally distributed among major industries. Government and Finance, Insurance, & Real Estate (FIRE) have the lowest percentages of records coded "Imputed but Re-coded N/A," at 1.3 percent and 1.9 percent, respectively. These two industries also have the lowest turnover rates and a strong employee/employer attachment.¹ Construction, Services and Retail Trade have the highest percentages of re-coded values at 8.5, 6.9 and 4.4 percent, respectively. This comparison excludes Unemployment Insurance-covered Agriculture due to its underrepresentation in Wage Records and low total employment.
Unpublished research using interstate Wage Records data suggests that many of the construction workers who appear in Wage Records for a brief time but never appear in our demographic databases are working for Colorado construction companies contracting in Wyoming. Retail Trade and Services have high seasonal employment variation due to tourism. A large number of employees working in Construction, Retail Trade and Services have a low attachment to Wyoming’s labor market, often working here during the summer months and then returning home. Research & Planning is collecting other states’ Wage Records databases in hopes of pursuing research in this area in the near future.
A comparison of Table 5 and Table 6 serves the main goal of this analysis: to demonstrate the differential impact of using imputed data at the detailed industry level. Table 6 breaks the Services industry into sub-industries and demonstrates that the more detail we desire, the more unevenly the "Known," "Imputed" and "Imputed but Re-coded N/A" data are distributed. Referring to Table 5, we know that there are 33,636 total imputations ("Imputed" plus "Imputed but Re-coded N/A"). Of these, 4,469 (13.3%) occur in one sub-industry, hotels and other lodging places (see Table 6). Further, of the total imputations re-coded N/A across all industries (9,325), 19.9 percent (1,854) occur in the same sub-industry.
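The concentration in hotels and other lodging places can be checked directly from the counts quoted above. The Python sketch below reproduces the two published shares and also derives one comparison that the article does not state: the re-coded share within the sub-industry's own imputations versus the same share across all industries. Variable names are illustrative.

```python
# Counts quoted in the text from Tables 5 and 6 (1998 Wage Records).
all_imputations = 33_636     # "Imputed" plus re-coded N/A, all industries
all_recoded = 9_325          # re-coded N/A, all industries
hotel_imputations = 4_469    # hotels and other lodging places
hotel_recoded = 1_854        # re-coded N/A in the same sub-industry


def pct(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Shares reported in the article.
hotel_share_of_imputations = pct(hotel_imputations, all_imputations)  # 13.3
hotel_share_of_recoded = pct(hotel_recoded, all_recoded)              # 19.9

# Derived comparison (not stated in the article): how much of each group's
# imputed data ends up re-coded N/A.
recoded_rate_hotels = pct(hotel_recoded, hotel_imputations)
recoded_rate_overall = pct(all_recoded, all_imputations)

print(hotel_share_of_imputations, hotel_share_of_recoded)
print(recoded_rate_hotels, recoded_rate_overall)
```

A sub-industry whose share of re-coded records exceeds its share of all imputations is one where the three-quarter rule removes proportionally more data.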
By incorporating the three-quarter rule, we are attempting to lower the possible error associated with imputations based on low attachment to Wyoming’s labor force. We assume that individuals who come to Wyoming and work for only three or fewer quarters differ from Wyoming’s labor force in general. Given that assumption and the pattern shown in Figure 1, the individuals re-coded as N/A may be predominantly males aged 20 to 34 and less likely to be females aged 35 to 54. We suggest that any future use of demographic data include all three categories of data and validation of each.
Research & Planning will continue to explore avenues for the application and interpretation of imputation methods. As for imputation of demographics, future research will take a step back and use a detailed industry approach rather than an employer-based model. We would like other states to test this model for themselves, but few states have access to demographic data on which to build the statistical models. Our earlier attempts based on industry rather than employer yielded a higher likelihood of error. However, by building our models on industry rather than employer, we may be able to supply other states with the associated probabilities and give them a tool with which to assign demographics to their own Wage Records data.
Other explorations of imputation conducted by Research & Planning include methods by which North American Industry Classification System (NAICS) codes have been imputed based on historical Standard Industrial Classification (SIC) / NAICS combinations. Future research will attempt to assess the validity of imputing occupations to the Wage Records database. It is clear that increased computing capabilities are opening the doors to a diverse set of new research questions.
¹ Wyoming Department of Employment, Research & Planning, Outlook 2000: Detailed Occupational Projections and Labor Supply, 2000, Chapter 2.
These pages designed by Gayle C. Edlin.
Last modified on 06/22/01 by Julie Barnish.