Enhancing the Quality of Wage Records for Analysis through Imputation: Part One
by: Tony Glover, Research Analyst
"A complete and comprehensive set of demographic data enables a better understanding of the diverse nature of labor force churning."
Much of Research & Planning’s (R&P) analysis of labor market dynamics is based on Unemployment Insurance Wage Records[1] collected by the Employment Tax Division of the Wyoming Department of Employment. One limitation of using Wage Records to address labor market issues is that, while the database captures the detailed work behavior of individuals working for employers required to pay Unemployment Insurance tax, it lacks demographic information about those individuals. To compensate, R&P combines Wage Records with several other administrative databases (e.g., Driver’s License, Employment Services, Unemployment Insurance Claims) that collect gender, date of birth, and race. While the combined database approach works for the majority of the records in Wage Records (88.2% from 1992Q1 to 2000Q2), it leaves 1,007,723 records without demographic data attached. This article discusses the method R&P used to impute gender and age to records without demographic data.
Two areas of research conducted by R&P are particularly sensitive to missing demographic data. The first is our research on labor market dynamics in the form of turnover. A complete and comprehensive set of demographic data enables a better understanding of the diverse nature of labor force churning.[2] The second is our evaluation research,[3] where it is necessary to select large control groups from the Wage Records data. The demographic characteristics of age and gender are the two most important matching characteristics for control group selection.
There are several methods of handling records with missing data and, consequently, an extensive body of literature on the subject. However, for the sake of brevity, only a few are discussed here.[4]
1) Procedures based on completely recorded units – This method deletes records for which data are missing, and only complete records are used for analysis. This is the easiest approach to handling missing data, but it often leads to biased results. For example, a large number of individuals for whom R&P has no gender and date of birth information all worked in SIC 70 “Hotels, Rooming Houses, Camps and Other Lodging Places.” Deleting the records from SIC 70 with missing data would lead to an inappropriate representation of this industry.
2) Imputation-based procedures – The missing values are filled in and the resultant completed data are analyzed by standard methods.
3) Model-based procedures – The characteristics of the individuals with complete records are used to calculate the likelihood that an individual with missing data falls in a specific age group. For example, if 85 percent of the individuals working in SIC 70 for four quarters with a quarterly wage between $3,000 and $3,500 are 16 to 19 years old, then it is inferred that the age of an individual (for whom we are missing data) with the same characteristic work behavior is 16 to 19 years old.
Both “regression imputation” and “model-based procedures” were explored as methods to calculate the missing demographic data in Wyoming. The model-based procedure had a higher degree of reliability and was used in the subsequent analysis. Gender was imputed first and the result was used in the subsequent age group imputation.
Gender Imputations
Several combinations of variables were tested to determine the best model, in other words, the model with the highest percentage of actual gender matching imputed gender. The combination of 4-digit SIC, quarters worked, and average quarterly wage assigned the correct gender approximately 80 percent of the time. Analysis of the data where the incorrect gender was assigned revealed that using the employer account number in place of SIC increased the overall percent correct. For example, a full-service dining establishment may predominantly hire females. On the other hand, a fast food restaurant may be more likely to have an equal distribution of males and females. Both of these establishments are assigned to SIC 5812 “Eating Places.” By using the firm-specific model, rather than the SIC-based model, 95.3 percent of the records with a known gender were assigned the correct imputed gender.
All 7,494,215 records in the Wage Records database with a known gender were used. The data were aggregated on employer account number, quarters worked for employer, and average quarterly wage group for all occurrences in the Wage Records database. Within each resulting group, the numbers of males and females were counted. The probability that an individual (working for the same employer, for the same number of quarters, in the same average wage group) was a male was then added to the record. A typical result for an employer account appears in Table 1.
A review of the first row of data in Table 1 shows there were 5 males and 25 females who worked for Employer 1 earning an average quarterly wage from $2,501 to $3,000. The probability of being a male in this case would be 16.7 percent (5 males divided by 30 employees).
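The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not R&P's actual program: the function name, the tuple layout, and the wage group label are assumptions, and the quarters-worked value is made up because the text does not give it for this row of Table 1.

```python
from collections import defaultdict

def male_probability_by_cell(records):
    """Return P(male) for each (employer, quarters worked, wage group) cell.

    `records` is an iterable of (employer, quarters, wage_group, gender)
    tuples with gender "M" or "F"; only known-gender records belong here.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [males, total]
    for employer, quarters, wage_group, gender in records:
        cell = (employer, quarters, wage_group)
        counts[cell][0] += gender == "M"  # True counts as 1
        counts[cell][1] += 1
    return {cell: males / total for cell, (males, total) in counts.items()}

# Toy data mirroring the first row of Table 1: 5 males and 25 females at
# Employer 1 in the $2,501 to $3,000 wage group. The quarters value (4) is
# hypothetical.
cell = ("Employer 1", 4, "2501-3000")
known = [cell + ("M",)] * 5 + [cell + ("F",)] * 25
p_male = male_probability_by_cell(known)
print(round(p_male[cell], 3))  # 0.167
```

The resulting dictionary plays the role of the tabular data that is later merged back onto every Wage Records row in the same cell.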
Tabular data representing the probability of being a male were merged with the Wage Records database. Because 47 percent of the individuals appearing in Wage Records have worked for more than one employer, the probability of being a male can appear more than once for the same individual. Averaging these probabilities is the last step in imputing gender; an example for an individual is given in Table 2. A quick review of Table 2 shows that the individual’s Social Security Number (SSN) was reported by three employers in Wage Records. The individual worked for Employer 1 for four quarters and had an average quarterly wage of $501 to $1,000. The probability of being male for an individual working for Employer 1 for four quarters and earning an average quarterly wage between $501 and $1,000 is .85 (in other words, 85% of the individuals falling in this group are males). The same data appear in Table 2 for the other two employers. The average probability (in this case, of the three individual/employer interactions) was used to determine gender. An average probability of .50 or above was assigned the gender of male, while an average probability less than .50 was assigned the gender of female.
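The averaging and cutoff rule can be written as a short Python sketch. The function name is hypothetical, and only the .85 value for Employer 1 comes from the text; the other two probabilities are invented for illustration.

```python
from statistics import fmean

def impute_gender(cell_probabilities, threshold=0.50):
    """Average an individual's P(male) values and apply the .50 cutoff.

    `cell_probabilities` holds one P(male) per individual/employer
    interaction. An average at or above the threshold is imputed as male.
    """
    average = fmean(cell_probabilities)
    gender = "M" if average >= threshold else "F"
    return gender, average

# .85 is the Employer 1 value from the text; .40 and .55 are made up.
gender, average = impute_gender([0.85, 0.40, 0.55])
print(gender, round(average, 2))  # M 0.6
```

Note that an average of exactly .50 falls on the male side of the rule, as stated in the text.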
As shown in Table 3, of the records with a known gender, 94.9 percent of the males, 95.7 percent of the females, and 95.3 percent of all records were assigned the correct imputed gender. To assess the overall impact on future calculations using Wage Records, it is important to remember that the imputed gender is only used in the absence of a known gender. There are 8,501,938 records in Wage Records; of these, 7,494,215 have a known gender. For the remaining 1,007,723 records, with no gender attached, we assume (based on Table 3) that 95.3 percent (960,360) are assigned the correct gender after imputation. Therefore, the overall error associated with future calculations is .6 percent, the number incorrect (1,007,723 minus 960,360) divided by the total number of records, 8,501,938.
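The arithmetic behind the .6 percent figure can be checked with a small Python sketch. The function name is hypothetical, and truncating the correctly imputed count to a whole number of records is an assumption that happens to reproduce the count quoted above.

```python
def overall_error_rate(total_records, known_records, imputation_accuracy):
    """Return (missing, correctly imputed, share of all records in error).

    Records with a known value are treated as error-free, so only the
    incorrectly imputed records contribute to the overall error.
    """
    missing = total_records - known_records
    correctly_imputed = int(missing * imputation_accuracy)  # truncate (assumed)
    incorrect = missing - correctly_imputed
    return missing, correctly_imputed, incorrect / total_records

missing, correct, error = overall_error_rate(8_501_938, 7_494_215, 0.953)
print(missing, correct, f"{error:.1%}")  # 1007723 960360 0.6%
```

The key design point is the denominator: the error is spread over all 8,501,938 records rather than only the imputed ones, which is why a 4.7 percent miss rate among imputed records shrinks to roughly .6 percent overall.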
Age Group Imputation
Gender is dichotomous (male or female), but age changes across time and is measured as a continuous variable. A few major differences between the gender and age imputations are discussed below, and the results are presented in a table similar to Table 3.
The first major difference is that instead of grouping across all records (1992 to 1999) on the employer account number, quarters worked, wages, and gender (added for age imputations), the data were first grouped by year. The second major difference is that the age group with the highest probability of being correct for a given year was used to determine the individual’s age group for all years. For example, if an individual had an imputed age of 25 to 34 years in 1992 and the average age of this age group is 29 years, then for each subsequent year in Wage Records one year was added to the average age. In this example, the individual would make a transition from the 25 to 34 age group to the 35 to 44 age group in 1998.
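The roll-forward of an imputed age group can be sketched in Python as follows. The function names are hypothetical. The 16-19, 20-24, 25-34, and 35-44 groups are named in this article; the older groups in the list are assumptions, since the full set of groups is not spelled out in the text.

```python
# Age groups: the first four appear in the article; the rest are assumed.
AGE_GROUPS = [(16, 19), (20, 24), (25, 34), (35, 44), (45, 54), (55, 64), (65, 120)]

def age_group(age):
    """Return the label of the group containing `age`, or None if outside all groups."""
    for low, high in AGE_GROUPS:
        if low <= age <= high:
            return f"{low}-{high}"
    return None

def project_age_groups(base_year, average_age, years):
    """Add one year of age per calendar year and re-bin into age groups."""
    return {year: age_group(average_age + (year - base_year)) for year in years}

# Example from the text: imputed as 25 to 34 in 1992, group average age 29.
groups = project_age_groups(1992, 29, range(1992, 2000))
print(groups[1997], groups[1998])  # 25-34 35-44
```

With an average age of 29 in 1992, the individual reaches 35 in 1998, which matches the transition year given in the example.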
Table 4 shows the age groups and the associated percentage of known records that were correctly imputed. Note that the age group that was most difficult to impute correctly was the 20–24 year olds, and the age group with the most known records correctly imputed was the 35–44 year olds. To assess the overall impact on future calculations using Wage Records, it is important to remember that, as with the imputed gender, the imputed age group is only used in the absence of a known age. There are 8,501,938 records in Wage Records; of these, 7,420,459 have a known age. For the remaining 1,081,479 records, with no age information attached, we assume (based on Table 4) that 68.6 percent (741,894) are assigned to the correct age group. Therefore, the overall error associated with future calculations is 4.0 percent, the number incorrect (1,081,479 minus 741,894) divided by the total number of records, 8,501,938.
Conclusions
As discussed in the introduction to this article, there are many imputation methods. The method chosen for this article appears to work best with the data available. R&P currently downloads several data sets with demographic information attached to individuals, and as data become available, the imputed values will be replaced with the actual data. For the time being, R&P will impute values for missing data on an annual basis, as it is a time-consuming process. In the future, R&P intends to use the knowledge gained from this process to apply imputation techniques to other areas of interest (e.g., imputing occupations to Wage Records data). Next month’s Wyoming Labor Force Trends will present tables and figures demonstrating the impact of the imputed demographic data on future analysis of labor market activity using Wage Records.
[1] Wyoming Department of Employment, Research & Planning, Wyoming Wage Records 1992-1998: A Baseline Study, 1999.
[2] Labor market churning was the subject of a five-part Wyoming Labor Force Trends feature series: G. Lee Saathoff, "Separation from the Wyoming Labor Market," Wyoming Labor Force Trends, March 1999, pp. 1-5. Krista R. Shinkle, "Wyoming-Attached Workers: Living and Working in Wyoming," Trends, April 1999, pp. 1-6. Gregg Detweiler, "Industry Variations in Wyoming's Steady Workers," Trends, May 1999, pp. 1-6. Mike Evans, "Job Turnover and Hire Rates in Wyoming," Trends, June 1999, pp. 1-5. Valerie A. Davis, "Who Are Wyoming's New Hires?," Trends, July 1999, pp. 1-6.
[3] Tony Glover, "The Flow of Labor in Wyoming: Department of Family Services, Division of Vocational Rehabilitation and Job Training Partnership Act Clients," Wyoming Labor Force Trends, March 2000, pp. 1-8. Tony Glover, "Performance Accountability in the Workforce Investment Act: An Application with Division of Vocational Rehabilitation Data Part One," Trends, November 1999, pp. 1-7. Tony Glover, "Performance Accountability in the Workforce Investment Act: An Application with Division of Vocational Rehabilitation Data Part Two," Trends, December 1999, pp. 1-7.
[4] R. J. Little & D. B. Rubin, Statistical Analysis with Missing Data, 1987. D. B. Rubin, Multiple Imputation for Nonresponse in Surveys, 1987.
a) Hot deck imputation – Missing data are replaced with data from a complete record that is characteristically similar on the known values. For example, if we were analyzing data about the characteristics of individuals working in SIC 70 and we had a record missing age, the unknown age value would be replaced with the age of the last record for which we had data.
b) Mean imputation – The missing value is replaced with the mean value of similar records. This method is similar to hot deck imputation, except that instead of using the age of a single known record, the mean age of all known records replaces the missing value.
c) Regression imputation – The missing values are estimated based on a predicted value from a regression model of the known data. Mean imputation takes one characteristic (e.g., working in SIC 70), calculates the mean age of the records with data, and substitutes the missing data with the mean age. Regression models take into account several variables (e.g., working in SIC 70, wages received, county of residence, quarters worked in the industry) to predict the age of an individual using a mathematical formula.
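The difference between hot deck and mean imputation is easy to see on a toy list of ages. The Python sketch below is illustrative only: the function names and the ages are made up, and the hot deck variant shown is the simple "last known record" form described in (a).

```python
from statistics import fmean

def hot_deck(ages):
    """Replace each missing age (None) with the last known age before it."""
    filled, last_known = [], None
    for age in ages:
        if age is not None:
            last_known = age
        filled.append(last_known)  # stays None if nothing known yet
    return filled

def mean_impute(ages):
    """Replace each missing age (None) with the mean of all known ages."""
    mean_age = fmean(a for a in ages if a is not None)
    return [mean_age if a is None else a for a in ages]

# Hypothetical ages for workers in one industry; None marks missing data.
sic70_ages = [19, None, 42, 17, None]
print(hot_deck(sic70_ages))     # [19, 19, 42, 17, 17]
print(mean_impute(sic70_ages))  # [19, 26.0, 42, 17, 26.0]
```

Hot deck output depends on record order, while mean imputation gives every missing record the same value regardless of order.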