AN AI-POWERED GENDER DIVERSITY CLASSIFIER


By Yousef Sabir, 30194560
For The University of Calgary
Scroll down to see how this was created!


Step 1: AUTOMATED WEB SCRAPING

In this step, I used the IMPORTXML formula in Google Sheets to automatically scrape the names and phone numbers of faculty members in the University of Calgary's Art Department from the department's website. This required a thorough understanding of how the parameters of the IMPORTXML formula work. The first parameter is the URL of the name directory on the university website (fairly simple). The second parameter, "xpath_query", was harder: I had to inspect the name and phone number elements on the directory page to view their underlying HTML, and I needed to understand the XPath format that this parameter follows in order to write the argument correctly. After many trials, I successfully scraped the names and phone numbers into two adjacent columns in Google Sheets! (Scroll down to view the next step)
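For readers who prefer code, the idea behind the "xpath_query" parameter can be sketched in Python. This is not the Sheets formula used above. The HTML snippet, the class names "name" and "phone", and the helper scrape_column are all hypothetical, since the real directory page's markup is not shown here. The sketch parses a small in-memory page instead of fetching a live URL.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified directory markup (an assumption for illustration).
SAMPLE_PAGE = """
<html><body>
  <div class="profile">
    <span class="name">Jane Doe</span>
    <span class="phone">403-555-0101</span>
  </div>
  <div class="profile">
    <span class="name">John Smith</span>
    <span class="phone">403-555-0102</span>
  </div>
</body></html>
"""


def scrape_column(page: str, xpath_query: str) -> list[str]:
    """Return the text of every element matched by the XPath query."""
    root = ET.fromstring(page.strip())
    return [(el.text or "").strip() for el in root.findall(xpath_query)]


# ElementTree needs a relative path (".//"); in IMPORTXML the analogous
# query would start with "//".
names = scrape_column(SAMPLE_PAGE, ".//span[@class='name']")
phones = scrape_column(SAMPLE_PAGE, ".//span[@class='phone']")

# Two parallel lists, like two adjacent columns in a sheet.
for name, phone in zip(names, phones):
    print(name, phone)
```

Each query selects one kind of element, which is why the names and the phone numbers need separate queries and end up in separate columns.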


Step 2: DATA CLEANING (NAMES)

In the next step, I use an AI-based application programming interface (API) to guess the gender of each person in my contact list. The API requires first names only, so I first had to clean the data before feeding it to the API. I went back to Google Sheets to do this. In many applications, data cleaning is a huge part of the overall work, but in this case it was relatively simple. I created a new sheet and copied over the name column from my original sheet, with one big difference: I pasted it using the "Paste special > Values only" option, which made the data static. This uncoupled the copied names from the formula in my original sheet, so if the site directory was updated, the copied names would remain untouched. Next, I needed to understand and use the SPLIT function to divide each full name at the space between the first and last names, isolating the first name in a new column. Once that worked for one row, I replicated the formula by dragging it down to every row. With my new first name column in place, I created another sheet and performed the same copy (Paste special > Values only) to uncouple the first names from the formula in case the original data changed in any way. My new sheet now had the cleaned data and was ready to be fed into the AI! (Scroll to view the next step)
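The first-name isolation described above can also be sketched in Python. This is only an analogue of the SPLIT step, not the formula itself. The function name first_name and the sample names are made up for illustration.

```python
def first_name(full_name: str) -> str:
    """Return the part of a full name before the first space."""
    parts = full_name.split()  # splits on any run of whitespace
    return parts[0] if parts else ""


# Hypothetical sample rows standing in for the copied name column.
full_names = ["Jane Doe", "John Smith", "  Maria  Garcia Lopez "]
first_names = [first_name(n) for n in full_names]
print(first_names)  # ['Jane', 'John', 'Maria']
```

Like the spreadsheet version, this assumes the first name comes first and is a single word, so names with a title or a two-word first name would need extra handling.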


Step 3: USING AN API TO ACCESS AN AI ENGINE TO CLASSIFY NAMES INTO GENDERS

In this step, I wanted to automatically guess the gender of every first name using machine learning. Instead of building the machine learning component myself, I used an Application Programming Interface (API) to connect to a web-based Software-as-a-Service (SaaS) system that does the machine learning part for me. APIs are ways in which pieces of software talk to each other. In this case, the software I am building talks to a machine-learning classifier called "Genderize.io". A company in Denmark created this classifier, which uses a huge database of names and genders to “learn” and then applies that learning to guess the gender of any name we give it. For example, if I ask it to guess the gender for the name “Yousef”, it checks its database to see what it has learned. It sees that it has about 28,997 cases of Yousef, of which only about 1% were female and the rest were male. So it guesses that, with 99% probability, Yousef is male. This API was easy to access, as it only needs a URL with the name as the final argument to classify the gender. I should note that not all APIs are this easy to use. I then went to the sheet in my Google Sheets file that held all the first names copied in the earlier steps. I used the CONCATENATE function to join the API URL (without a name) to each entry in the first names column, producing the full request URL for each name. Afterward, I used the IMPORTDATA function to fetch each URL and confirm that it returned the desired API output, i.e. the gender with its probability. Although this output correctly classified the names into genders, its format was not yet usable, so I used the SPLIT function to isolate the gender into its own column. After this, a few cells had not rendered correctly, so I manually entered Null in them to make sure the data set was clean and complete!
Finally, I copied the names column and the clean gender column to a new sheet to begin visualizing! (Scroll to view the next step)
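The build-the-URL, fetch, then isolate-the-gender sequence from this step can be sketched in Python as follows. It is a rough analogue of the CONCATENATE, IMPORTDATA and SPLIT formulas, not a copy of them. The helper names are hypothetical, and the sample response is an assumption: it is built from the "Yousef" numbers quoted above and assumes the API returns JSON with name, gender, probability and count fields. The sketch defines the live call but does not run it, so it works offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.genderize.io"


def build_url(first_name: str) -> str:
    """Join the base URL and the name (the CONCATENATE step)."""
    return f"{API_BASE}?{urlencode({'name': first_name})}"


def parse_gender(response_text: str) -> tuple[str, float]:
    """Pull the gender and probability out of the response (the SPLIT step).

    A missing or null gender is recorded as "Null", as in the sheet.
    """
    data = json.loads(response_text)
    gender = data.get("gender") or "Null"
    probability = float(data.get("probability") or 0.0)
    return gender, probability


def classify(first_name: str) -> tuple[str, float]:
    """Fetch and parse one name (the IMPORTDATA step). Needs network access."""
    with urlopen(build_url(first_name), timeout=10) as resp:
        return parse_gender(resp.read().decode("utf-8"))


# Illustrative response only, assembled from the figures in the text above.
SAMPLE = '{"count": 28997, "name": "Yousef", "gender": "male", "probability": 0.99}'

print(build_url("Yousef"))   # https://api.genderize.io?name=Yousef
print(parse_gender(SAMPLE))  # ('male', 0.99)
```

Keeping the parsing separate from the network call makes it easy to check the logic on a saved response before sending real requests.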

Pie Chart Visualization of our data


The final step was to create a pie chart to visualize my data. I did this by highlighting the range in Google Sheets that contained the name and gender columns and inserting a chart in an empty area to the right of my data. Google Sheets automatically picks up the right data and shows what percentage of the rows are male vs. female vs. Null. With this, I was done creating my visualization! The data showed that 47.2% of the names crawled from the University of Calgary's Arts Staff list were female and 48.6% were male. This method has limitations: not all names are necessarily classified correctly, and rendering issues sometimes prevent any classification from being generated, leaving a "Null" result. These are weaknesses of this AI classification tool, and the affected names can then be classified manually.
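The percentages behind a pie chart like this one are simple category counts divided by the total number of rows. A minimal Python sketch of that arithmetic is below. The function name gender_shares and the eight sample labels are hypothetical and are not the project's data.

```python
from collections import Counter


def gender_shares(genders: list[str]) -> dict[str, float]:
    """Return each label's share of the rows, as a percentage."""
    counts = Counter(genders)
    total = len(genders)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Hypothetical gender column with eight rows, including one "Null".
sample = ["female", "male", "male", "Null", "female", "male", "female", "female"]
shares = gender_shares(sample)
print(shares)  # {'female': 50.0, 'male': 37.5, 'Null': 12.5}
```

Because "Null" rows count toward the total, the male and female shares will not add up to 100% whenever some names could not be classified.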

Worried about using an automated AI like Genderize? Read Me!

I understand that some people may have concerns about using an automated AI tool like Genderize.io, but I can assure you that such tools have been developed with a lot of care and attention to detail. The purpose of Genderize.io is to help users identify the likely gender of a given name based on statistical analysis of large datasets. This can be useful in a variety of contexts, from marketing research to data analysis.
One of the main strengths of Genderize.io is that it is based on a large and constantly updated database of names and their associated genders. This database is constantly being refined and improved based on user feedback, which means that the tool is getting more accurate over time. However, it's important to note that no automated tool is perfect, and there will always be cases where the tool is not able to accurately identify the gender of a name. It's also important to note that Genderize.io, like any other tool, is just one piece of the puzzle when it comes to understanding the demographics of a given population. While it can be useful for making educated guesses about the gender distribution of a particular dataset, it should always be used in conjunction with other sources of data and analysis. Finally, it's important to remember that the ultimate responsibility for interpreting and using the data generated by Genderize.io (or any other tool) lies with the user. As with any automated tool, it's important to critically evaluate the results generated by Genderize.io and to consider any potential biases or limitations that may be present.
In summary, while there are some limitations and concerns associated with using automated AI tools like Genderize.io, they can be a useful and valuable resource when used responsibly and in conjunction with other sources of data and analysis.

