Creating gender bias without statistical significance


There have been a number of online discussions about the statistical significance of the low recruitment of women in this year's Royal Society fellowships. Here I take a slightly different perspective. I show that gender bias in academia can easily be created even when each career stage appears to be fair according to a statistical test.


Imagine a career in science that consists of a number of stages (Undergraduate, PhD, Postdoc, Lecturer and Professor). At each stage we recruit 50 people from the pool of those who made it through the stage before. Let’s imagine that at the first stage 50% of the applicants are women and 50% are men. We demand that this recruitment is fair, in the sense that we don’t recruit statistically significantly more men than women, and we set the significance level of that statistical test at P=0.1.


But here is the hitch. Imagine a scenario where the institution doing the recruitment wants to bend the rules. It wants to recruit as many men as possible without the imbalance showing up as statistically significant at the P=0.1 level. To do this it needs to know a bit about the binomial distribution and find the smallest number of women k such that
\sum_{i=0}^{k} \binom{50}{i} \, 0.5^{i} \, (1-0.5)^{50-i} > 0.1
allowing the institution to pass the statistical test. This turns out to be k=20 which gives P=0.101 in a sign test: safely statistically insignificant. 20 women are recruited and 30 men. This sounds reasonable enough and no one can use statistics to argue that the recruiter wasn’t actually fair.
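As a rough check on this threshold, here is a minimal Python sketch that sums the binomial probabilities directly. The function names `binom_cdf` and `smallest_passing_k` are labels invented for this illustration, and the test is taken to be one-sided (too few women), as in the formula above.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def smallest_passing_k(n, p, alpha):
    """Smallest k whose lower-tail probability exceeds alpha (hypothetical helper)."""
    for k in range(n + 1):
        if binom_cdf(k, n, p) > alpha:
            return k
    return n

k = smallest_passing_k(50, 0.5, 0.1)
print(k, round(binom_cdf(k, 50, 0.5), 3))  # 20 0.101
```

Recruiting one woman fewer (k=19) would push the tail probability below 0.1, so 20 is the lowest number that still passes.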


Then comes the second stage. Now 40% of applicants are female. Again a requirement is made for fairness, but now rather than asking that recruitment doesn’t differ significantly from 50% the requirement is that it doesn’t differ from the 40%. This is ‘fair’ since it gives each person an equal probability of being hired. Unfortunately, in my imagined scenario, this stage 2 institution has the same idea of rule bending as stage 1. It looks at the applicants, applies the formula above but with 0.4 instead of 0.5 and happily recruits 16 women. And so it goes on. On the next stage, 16 becomes 12, then 8, and by the fifth stage we only have 5 women and 45 men. The model's prediction is shown below.
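The whole cascade can be reproduced with a short Python sketch. It assumes, as in the scenario above, that each stage hires 50 people, that the applicant pool at each stage has the same proportion of women as the previous cohort, and that every recruiter picks the fewest women that still pass a one-sided test at P=0.1. The names `women_hired` and `cascade` are invented for this illustration.

```python
from math import comb

def women_hired(n, p, alpha):
    """Fewest women out of n hires that keeps P(X <= k) above alpha
    when a fraction p of the applicants are women."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += comb(n, k) * p**k * (1 - p)**(n - k)
        if cumulative > alpha:
            return k
    return n

def cascade(stages=5, n=50, p_start=0.5, alpha=0.1):
    """Number of women recruited at each successive career stage."""
    hires, p = [], p_start
    for _ in range(stages):
        k = women_hired(n, p, alpha)
        hires.append(k)
        p = k / n  # the next applicant pool mirrors this cohort
    return hires

print(cascade())  # [20, 16, 12, 8, 5]
```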


At no particular stage was there any statistically significant discrimination, but by the time we get to stage 5, which I defined at the start to be Professor, only 10% are women. Sound familiar to those working in science? It should: in science and technology a level of around 10% female professors is not unusual.


I want to be clear that I am not proposing that this is exactly what happens in science. There isn’t a single bigot who manipulates things, and the reasons that science is male dominated are complex and many. The last article I wrote comes closer to describing some of the mechanisms behind male dominance, but even that is just part of a web of interactions and cultural reasons.


The reason I wrote this piece is in response to an argument made by Edward Hinds of the Royal Society, that it is statistically justified to appoint only 2 female fellows because 9% of highly qualified professors are female. Doing a statistical test at one stage might show that that particular stage was fair, but it is more or less meaningless when done in isolation. Hinds admits this at the end of his article, but does not spell out that each stage can appear ‘fair’ while producing an unfair outcome. Here, I hope, I have made this clear.


If you are paying attention you will notice that if we did a statistical test over all five stages then we would detect the unfairness of the system as a whole. This I obviously agree with. But this is again an argument for why doing a test at any one stage is not appropriate. In practice it is very difficult to accuse a whole system of being unfair, since many different events have to be unravelled.
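One way to make this concrete, as a sketch only, is to compare the final cohort of 5 women out of 50 against two baselines: the 50% share of women at the very first stage, and the 16% share in the pool that the last recruiter actually saw. Treating the whole-system test as a single binomial test against the original 50% is an assumption made for illustration; `lower_tail` is an invented helper name.

```python
from math import comb

def lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Final cohort judged against the original 50/50 applicant pool.
p_whole_system = lower_tail(5, 50, 0.5)

# Same cohort judged only against the 16% pool at the last stage.
p_last_stage = lower_tail(5, 50, 0.16)

print(f"whole system: {p_whole_system:.1e}")
print(f"last stage only: {p_last_stage:.2f}")
```

Against the original pool the probability is vanishingly small, while against the last-stage pool it stays above 0.1, which is exactly why the single-stage test sees nothing wrong.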


And that brings me to the Royal Society’s appointment of 2 women out of 43 research fellows. Given the 21% female applicants each year, this was obviously statistically significant, and I believe it was a poor decision. As Richard Mann argues in his blog post, it might reasonably be considered a statistical blip. Ben Sheldon provided statistics over the last 6 years giving the proportion of female appointments as 24%, 30%, 17.5%, 19.4%, 17.1% and 4.7%. The average is 18.78%, which given 21% female applicants gives a P value of about 0.4. Not biased against women, but hardly evidence that the Royal Society is helping women at this critical time in their career. More worrying is if we only take the last four years. Now the average is 14.68%. And the P value for this? About 0.2. Just around the level we might expect in my model for an institution trying to balance itself above statistical significance. While I am sure the Royal Society take this issue extremely seriously, I am afraid that these types of statistics do not look good.
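The arithmetic for this year and for the averages can be checked with a few lines of Python. This is a sketch that assumes a one-sided binomial test of 2 women out of 43 against a 21% applicant share. The multi-year P values are not recomputed here, because the number of appointments in each earlier year is not given above.

```python
from math import comb

def lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Yearly percentages of female appointments quoted above.
rates = [24, 30, 17.5, 19.4, 17.1, 4.7]
mean_six = sum(rates) / len(rates)
mean_last_four = sum(rates[-4:]) / 4

# Chance of appointing 2 or fewer women out of 43 if each
# appointment were a woman with probability 0.21.
p_this_year = lower_tail(2, 43, 0.21)

print(f"six-year mean: {mean_six:.2f}%")
print(f"four-year mean: {mean_last_four:.2f}%")
print(f"P(2 or fewer of 43): {p_this_year:.4f}")
```

The single-year probability comes out well below 0.01, consistent with calling this year's result statistically significant.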


Acknowledgements: Thanks to Richard Mann for comments on this post and the various interesting discussions on Twitter for inspiring it.