As part of my data science journey, I have made it a point not only to learn the necessary programming skills but also to understand the industry, its applications, and, more importantly, its consequences. One such consequence is the way that data can perpetuate bias within society, and I want to use my platform to talk about this issue.
Over the past year, I have read books including “Invisible Women” and “Data Feminism,” which opened my eyes to the ways that data perpetuates gender bias. The ongoing pandemic has uncovered how bias in data leads to inequitable healthcare. And the Black Lives Matter movement has shone a light on how society continues to hold back people of color through systems built on data.
Data perpetuates all kinds of bias in society, and it will continue to do so as long as biased people work with data, as long as data scientists use biased data, and as long as data scientists leave important populations out of datasets, a problem we call “missing data.”
While data science is an evolving field, it has always been positioned towards computer science and, in turn, towards men, and white men in particular. White men are a dominant group that is considered to be the “norm” and that benefits from a level of privilege in society that minority groups don’t have. Those in positions of privilege are often blinded by their advantage and assume that the world is no different from the one that they live in, which means that they do not see or understand the problems that minority groups face.
Just last month, MIT took down an AI training dataset that labeled people in images with misogynistic, racist, and derogatory terms. If a place like MIT was using biased training data up until July 2020, what are other institutions and organizations using? How many places and people out there are blind to the bias?
When a data scientist does not understand the real issues at hand, how can they possibly account for them in their work? This makes the work that they do biased. Not to mention the fact that data science typically serves the goals of those with the money and data resources, not the goals of those being affected by it. In some cases, bias may be perpetuated on purpose.
To solve the issue of bias within data science, we first need to fix the lack of diversity that exists within the field. Good data science corrects for bias, but this can only happen when there is diversity and inclusion, and when we have people who can find and correct bias. Therefore, we need people in the field who represent different minority groups and backgrounds. Only then can data scientists build better data and models, because a model that leaves out different perspectives is a bad model.
So, what needs to be done? Companies need to broaden their hiring field to look for candidates from uncommon and unexpected places. At the same time, the field of data science needs to be rebranded as a field for all people to work in. We need to stop questioning why women or other minorities are present within a field and stop expecting them not to know things. Instead, we need to bring in their perspectives, not only to correct biased work but also to teach others how to be unbiased in their own work.
Beyond the people who work with data, how data is collected and measured can also be biased towards different groups of people. When used to build machine learning algorithms and AI, this bias will only be amplified and lead to biased outcomes. This is part of the root cause of biased data science that needs to be addressed. Data scientists need to make sure we have equitable data to work with, or else we will never have an equitable society. The reality is that data has the power to hold back people everywhere.
In 2014, Amazon built a hiring algorithm that reviewed job applicants’ resumes with the hope of automating the search for top talent. This sounds great when you have hundreds of resumes to sift through, but just one year later, Amazon realized that the algorithm was biased against women. The dataset that the hiring algorithm was trained on consisted of past resumes submitted to the company, and those resumes came overwhelmingly from male applicants. As a result, the algorithm taught itself that male candidates were preferred and penalized any resume that mentioned the word “women’s.” The project was later scrapped.
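To make the mechanism concrete, here is a minimal sketch of how a scoring rule can absorb bias from its training history. This is not Amazon’s actual system. The resume snippets, the hiring outcomes, and the functions `keyword_hire_rates` and `score` are all invented for illustration, and the “model” is just a per-word hire rate rather than a real machine learning algorithm.

```python
from collections import defaultdict

# Synthetic, invented history of (resume text, was the applicant hired?) pairs.
# Resumes mentioning "women's" were never hired in this made-up past.
HISTORY = [
    ("python chess captain", True),
    ("java football captain", True),
    ("python football", True),
    ("sql chess", True),
    ("python women's chess captain", False),
    ("java women's coding society", False),
    ("sql women's chess", False),
    ("python debate", False),
]


def keyword_hire_rates(history):
    """Return, for each word, the share of past resumes containing it that were hired."""
    seen = defaultdict(int)
    hired = defaultdict(int)
    for text, was_hired in history:
        for word in set(text.split()):
            seen[word] += 1
            hired[word] += int(was_hired)
    return {word: hired[word] / seen[word] for word in seen}


def score(resume, rates):
    """Average the historical hire rate of every known word in the resume."""
    known = [rates[word] for word in set(resume.split()) if word in rates]
    return sum(known) / len(known) if known else 0.0


rates = keyword_hire_rates(HISTORY)
without_term = score("python chess captain", rates)
with_term = score("python women's chess captain", rates)

# The two resumes differ by a single word, yet the second one scores lower.
print(rates["women's"], without_term > with_term)
```

Nothing in this code mentions gender as a rule. The penalty comes entirely from the history it was given, which is why checking the training data matters as much as checking the code.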
Biased data has failed people of color as well. Controversial “risk assessments” are used to determine who can be set free at different stages of the criminal justice system, but they flag black defendants as high risk to “re-offend” more often because of the data collected about them. Unfortunately, there is not enough evidence to determine the validity of the risk scores in deciding such important matters, yet they continue to be used. Even today, during the COVID-19 pandemic, we’ve seen a lack of COVID-19 testing centers in diverse and predominantly black neighborhoods, along with delays in getting testing equipment to them. Not only do we use existing data to disadvantage people of color, but we also continue to fail them by not collecting and reporting COVID-19 data related to race.
This just scratches the surface of the MANY examples in which people face prejudice and are held back in society because of the biased data being used to make decisions that impact them. One way to move forward from here is to try to correct for bias within existing data, but starting over and collecting unbiased data from the beginning may be the best solution.
Not only do we need to think about the bias in the data being collected, but we also need to think about the bias created by the paucity of data. When data is left out, the resulting analysis becomes unrepresentative of the population. Just think about selection bias, and how excluding some people from the data can leave them out of the solution to a problem, a problem that may very well affect them.
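A tiny sketch shows how this plays out in numbers. Everything here is made up for illustration: the regions, the travel times, and the function `average_travel_time` are assumptions, not real survey data. The point is only that an average computed after a group goes missing describes the people who were counted, not the whole population.

```python
from statistics import mean

# Synthetic, invented survey rows: (region, travel time to the nearest clinic in minutes).
POPULATION = [
    ("urban", 20), ("urban", 25), ("urban", 30), ("urban", 25),
    ("rural", 60), ("rural", 70), ("rural", 65), ("rural", 65),
]


def average_travel_time(rows):
    """Mean travel time across whichever rows were actually collected."""
    return mean(minutes for _, minutes in rows)


# Average over everyone in the (made-up) population.
full = average_travel_time(POPULATION)

# Average when the rural respondents never make it into the dataset.
collected = [row for row in POPULATION if row[0] != "rural"]
partial = average_travel_time(collected)

print(full, partial)
```

In this toy case, a decision maker looking only at the collected rows would conclude that clinics are close by, and the group with the longest journeys would be invisible in the very analysis meant to help them.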
One example is the collection of gender data. When a person is asked for their gender, the options are typically “male” and “female.” This poses an issue for people who identify with non-binary genders and are forced to pick one of the binary options. The lack of information on non-binary populations means that fewer studies are being done to cater to this specific group of people and that they’re not being considered when it comes to making decisions about the population at large; the data is biased against them. Granted, some people may not be comfortable identifying themselves because of potential treatment by others, but there are still ways to consider this information, as outlined by the Human Rights Campaign.
The COVID-19 pandemic has also shown the deleterious effect of missing data. The past few months have been a vital time to distribute healthcare supplies to people around the world, but when you’re “not on the map,” how can anyone get you the help you need? The fact is that the people who need this information just don’t have it, and this creates a bias against the people who are left out. Organizations like the Humanitarian OpenStreetMap Team (HOT) are working to fill this gap by “mapping” houses, buildings, and entire communities and cities in high-risk areas so that people in these areas are identified and can receive proper supplies. I have been able to volunteer virtually on one of HOT’s initiatives, mapping homes and buildings in high-risk countries.
Another HOT project that I have been working on is mapping villages in rural India so that girls can get access to proper education that they would otherwise be denied because of their gender. Data is power, and it is amazing how simply putting people on the map, and making them visible, can give them so much power.
As someone still learning the field of data science, volunteering with HOT is the most that I can do right now to help, but I do hope to do more in the future. There are more organizations out there fighting the bias within data and using data for social good, from DataKind to DSSG Solve. If you are interested in playing a part in eliminating data bias, I would highly recommend researching what you can do.
If you are a data scientist, work in the field of data science, or just work with data that can impact people, pay attention to the biases that you might create and work to eliminate the bias that currently exists. Further, hire diverse teams to work with data and educate others when possible.
Biased data, at every level, can impact people negatively. When you think about how much of what happens in society is based on data and information, then you realize just how much work there is to be done. This is only the beginning.