We work with a lot of data. How often do we question whether it is good data?

Big idea: if you don’t start with the right data, you can get the wrong conclusions.

Let’s illustrate with a story.

Abraham Wald was a mathematician born in present-day Romania. He was educated at the University of Vienna, but because of the persecution of Jews in Austria in the late 1930s, Wald chose to immigrate to the United States to pursue a teaching and research career.

During World War II, Wald was a member of the Statistical Research Group, an elite group of thinkers who used data to solve problems arising in the war. One such problem was how to minimize bomber losses, specifically through changes in armor protection. Data was collected to understand where bombers were being hit. This data is loosely represented by the picture below.

There was a natural reaction to reinforce the bombers where significant damage was occurring. Wald looked at the data from the returning bombers and declared that this recommendation was the opposite of what they should do.

Wald (correctly) theorized that damage should have been more evenly distributed. The patterns observed in the damage data, therefore, were important in what they did not show. The data only represents the damage on planes that were able to return. There is, of course, no data on planes that were too damaged to return. Wald suggested that the parts of the plane with less damage data are likely more vulnerable, since damage sustained in those areas of the plane resulted in the planes not returning to base.

The data was biased. Wald observed this, and was able to interpret it more correctly as a result.
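Wald's reasoning can be sketched with a small simulation. The plane sections and loss probabilities below are invented for illustration and are not historical figures. The sketch assumes every section is equally likely to be hit, then compares where hits land across all planes with where they show up on the planes that make it back.

```python
import random

# Hypothetical chance that a single hit in each section downs the plane.
# These numbers are assumptions made up for this illustration.
LOSS_PROBABILITY = {"engine": 0.6, "cockpit": 0.5, "fuselage": 0.1, "wings": 0.1}


def simulate(n_planes=10_000, hits_per_plane=3, seed=0):
    """Return (hits on all planes, hits on returning planes) by section."""
    rng = random.Random(seed)
    sections = list(LOSS_PROBABILITY)
    all_hits = dict.fromkeys(sections, 0)
    observed_hits = dict.fromkeys(sections, 0)
    for _ in range(n_planes):
        # Every section is equally likely to be hit.
        hits = [rng.choice(sections) for _ in range(hits_per_plane)]
        # The plane returns only if it survives every hit it took.
        survived = all(rng.random() > LOSS_PROBABILITY[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                observed_hits[s] += 1
    return all_hits, observed_hits


all_hits, observed_hits = simulate()
for section in LOSS_PROBABILITY:
    share_all = all_hits[section] / sum(all_hits.values())
    share_seen = observed_hits[section] / sum(observed_hits.values())
    print(f"{section:9s} all planes: {share_all:.0%}   returning planes: {share_seen:.0%}")
```

Under these assumptions, hits are spread roughly evenly across all planes, but the returning planes show far fewer engine and cockpit hits. The sections that look least damaged in the observed data are the ones where a hit is most likely to be fatal.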

This concept has been termed “survivor bias” since the data was biased based on the planes that survived. Bias comes in various forms. The table below identifies different types of data biases (survivor bias is a form of “non-response” bias).

| Bias | Definition | Example |
| --- | --- | --- |
| Sample Bias | The sample does not accurately represent the population. | A volunteer focus group of headquarters employees is used to represent all employees. Volunteers may give different feedback, headquarters employees may not represent everyone, and the headquarters location may have unique factors. |
| Non-Response Bias | A particular group is not represented or is under-represented. | "Start-ups seem to be paying quite differently than my survey data." Start-ups don't typically participate in compensation surveys. |
| Response Bias | Participants give false or exaggerated data. | 68% of sales reps report that they are in the top 25% of performers. They may feel this way, but it's not mathematically possible and can't be considered accurate. |
| Measurement Bias | The data collection method gives incorrect results. | "To what extent do you love this article: (A) a lot, (B) a ton, or (C) more than life itself?" The scale is not balanced, so it is unlikely to give a true reflection of sentiment. |

The battle against bias begins when collecting data. Consider its source, and be particularly skeptical of survey-based data. The least biased data sources are typically designed deliberately, using sampling techniques such as random sampling to control for bias.
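To make the sampling point concrete, here is a minimal sketch that compares a headquarters-only sample with a simple random sample. The workforce, the 20% headquarters share, and the satisfaction scores are all hypothetical, generated only to show the mechanics.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical workforce: 20% work at headquarters, where satisfaction
# is assumed (for illustration only) to run higher than elsewhere.
population = []
for i in range(5_000):
    at_hq = i < 1_000
    score = rng.gauss(7.5 if at_hq else 6.0, 1.0)
    population.append({"hq": at_hq, "score": score})

true_mean = statistics.mean(p["score"] for p in population)

# Biased sample: only headquarters employees are asked.
hq_sample = rng.sample([p for p in population if p["hq"]], 200)

# Simple random sample: every employee has the same chance of selection.
random_sample = rng.sample(population, 200)

hq_mean = statistics.mean(p["score"] for p in hq_sample)
srs_mean = statistics.mean(p["score"] for p in random_sample)

print(f"All employees:        {true_mean:.2f}")
print(f"Headquarters sample:  {hq_mean:.2f}")
print(f"Simple random sample: {srs_mean:.2f}")
```

With these made-up numbers, the headquarters sample overstates satisfaction by more than a full point, while the random sample of the same size lands close to the company-wide average.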

Some of our favorite HR data is – get ready for it – biased. As you see from the examples in the table above, our beloved data sources can have embedded bias. Even a full census engagement survey likely has forms of bias.

Not all is lost when working with biased data. The key is to interpret the data for what it is. If the data is biased, read it with that bias in mind. HBR has reported the following data from Payscale:

Key conclusion: people who say they are paid above/below market are typically wrong. Does this mean you can never use employee survey data to gauge competitiveness? Nope – you just need to be clear that what people say represents how they feel, not necessarily what is fully true. You have to be clear about what the data represents and not draw broader conclusions.

Biased data is a danger. It can lead to faulty conclusions and expensive solutions that make no impact. Before crunching the numbers, think hard about the data you are using and what bias it may contain.