Rewriting Herstory
Using text analysis to explore powerful female counterparts of well-known men in history.
What would a "woman's world" look like?
History education around the world is overwhelmingly male-centric. Across Time Magazine's analysis of the most-googled people in history, the Smithsonian's list of the Most Significant Americans of All Time, Wikipedia's most-viewed pages about people, the list of S&P 500 company leaders, and the list of Nobel Prize laureates, female representation never surpassed 20%. In fact, there were more men named "John" than there were women on the boards of S&P 500 companies.
Using the R packages rvest, tm, and slam, together with Wikipedia's lists of women, I scraped each of these sources to compile the biographies of the top 50 influential men in history and of more than 7,000 women. To achieve diversity across fields, I matched on the following occupations:
Astronauts | Astronomers | Business People | Composers | Explorers | Political Leaders | Inventors | Mathematicians | Philosophers | Scientists | Writers | Film Directors | Civil Rights Leaders | Artists | Computer Scientists
By computing the cosine similarity between each man's row in the document-term matrix and the rows of the women in his field, I collected the top-matching counterpart for each man. The [unfinished] results are below.
Famous Men | Description | Famous Women | Description | Match Quality |
---|---|---|---|---|
William Shakespeare | Poet, Playwright | Delia Bacon | Playwright, known for her work on the attribution of authorship of Shakespeare's plays | Unsure |
Aristotle | Philosopher | Mary Louise Gill | Professor of Philosophy, focuses on Aristotle | Unsure |
Charles Darwin | Biologist | Mary Anne Whitby | Introduced silkworm cultivation to the UK with Darwin | Good |
Christopher Columbus | Explorer | Carol Beckwith | Photojournalist who documented indigenous tribes of Africa | Good |
Wolfgang Amadeus Mozart | Composer | Jitka Snizkova | Czech composer, President of Mozart Society | Unsure |
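The cosine-similarity matching behind these results can be illustrated on a much smaller scale. The following is a minimal sketch in base R, assuming a made-up term-count matrix with three biographies and four terms; the row names, terms, and counts are all invented for illustration and are not taken from the scraped data.

```r
# Toy document-term matrix: rows are biographies, columns are term counts.
# All names and counts here are hypothetical.
dtm <- matrix(
  c(3, 0, 2, 1,   # woman_a
    0, 4, 0, 1,   # woman_b
    2, 0, 3, 0),  # man
  nrow = 3, byrow = TRUE,
  dimnames = list(c("woman_a", "woman_b", "man"),
                  c("moon", "piano", "orbit", "born"))
)

# Cosine similarity: dot products divided by the product of the row norms.
norms <- sqrt(rowSums(dtm^2))
cosine_sim <- tcrossprod(dtm) / (norms %o% norms)
diag(cosine_sim) <- 0  # ignore each document's match with itself

# The man's best match is the largest value in his row.
best_match <- names(which.max(cosine_sim["man", ]))
print(round(cosine_sim["man", ], 3))
print(best_match)
```

Because cosine similarity normalizes by document length, a short biography can still match a long one as long as the two use terms in similar proportions.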
Initial takeaways
The quality of the matches varies widely. High-quality matches included Neil Armstrong vs. Peggy Whitson and Marco Polo vs. Freya Stark. Many of the weaker matches paired a famous man with a woman who researched him rather than with a woman whose achievements were comparable.
The most obviously and amusingly incorrect match was J.P. Morgan vs. Zoe Cruz, the former Co-President of Morgan Stanley. The algorithm pulled the common Wharton freshman mistake of confusing these two banks. Can't blame it though - I didn't know the difference until just last year!
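One plausible reason a researcher can outrank a peer is that her biography repeats the man's own name, which then dominates the shared vocabulary. The sketch below uses invented term counts to show the effect and one possible mitigation, dropping the name term before scoring. This is an assumption about what might help, not something the analysis above does, and the helper `best_match` is hypothetical.

```r
# Made-up term counts; "darwin" stands in for the famous man's own name.
dtm <- matrix(
  c(6, 1, 0, 0,   # scholar: biography mostly about studying Darwin
    0, 3, 4, 2,   # peer: biography about her own fieldwork
    5, 2, 3, 2),  # man: Darwin's own biography
  nrow = 3, byrow = TRUE,
  dimnames = list(c("scholar", "peer", "man"),
                  c("darwin", "species", "voyage", "theory"))
)

# Hypothetical helper: best cosine match for the "man" row among the others.
best_match <- function(m) {
  norms <- sqrt(rowSums(m^2))
  sims <- as.vector(m %*% m["man", ]) / (norms * norms["man"])
  names(sims) <- rownames(m)
  sims <- sims[names(sims) != "man"]
  names(which.max(sims))
}

with_name <- best_match(dtm)
without_name <- best_match(dtm[, colnames(dtm) != "darwin"])
print(c(with_name = with_name, without_name = without_name))
```

In this toy case the scholar wins while the name term is present, and the peer wins once it is removed. Whether the same holds on real biographies would need to be checked against the actual matches.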
Moving forward
The final product will be a searchable and ever-growing database of counterparts throughout history to promote the education and appreciation of female achievements.
R code
    library(rvest)  # read_html(), html_nodes(), html_attr(), html_text(), %>%
    library(tm)     # corpus cleaning and DocumentTermMatrix (stemDocument needs SnowballC)
    library(slam)   # row_sums() and tcrossprod_simple_triplet_matrix()

    # Reading Women Wiki Pages: collect every link on the category page
    womencs3 = read_html("https://en.wikipedia.org/w/index.php?title=Category:Women_computer_scientists&pagefrom=Zegura%2C+Ellen+W.%0AEllen+W.+Zegura#mw-pages")
    womencs3 = html_attr(html_nodes(womencs3, css = "a"), "href")
    womencs3full = paste0("https://en.wikipedia.org", womencs3)

    # Download each linked page and keep its paragraph text; failed pages stay empty
    womencs3text = character(length(womencs3full))
    for (i in seq_along(womencs3full)) {
      tryCatch({
        z = read_html(womencs3full[i])
        z = html_text(html_nodes(z, css = "p"))
        womencs3text[i] <- paste(z, collapse = " ")
      }, error = function(e) { cat("ERROR :", conditionMessage(e), "\n") })
    }

    # Read Man's Wiki Link
    url = "https://en.wikipedia.org/wiki/Neil_Armstrong"
    mantext = read_html(url) %>% html_nodes("p") %>% html_text()

    # Using Cosine
    # womenastronauttext and womenastronautfull are assumed to be built the same way
    # as womencs3text and womencs3full above, from the women astronauts category page.
    mantext = paste(mantext, collapse = " ")
    mantext.df = data.frame("Name", mantext, stringsAsFactors = FALSE)
    womentext.df = data.frame("Name", womenastronauttext, stringsAsFactors = FALSE)
    colnames(mantext.df) = c("Person", "Text")
    colnames(womentext.df) = c("Person", "Text")
    alltext = rbind(womentext.df, mantext.df)  # the man is the last row

    # Turn to Corpus
    corp = VCorpus(VectorSource(alltext$Text))
    corp = tm_map(corp, removePunctuation)
    corp = tm_map(corp, removeNumbers)
    corp = tm_map(corp, content_transformer(tolower))
    corp = tm_map(corp, removeWords, stopwords("english"))
    corp = tm_map(corp, stemDocument)
    corp = tm_map(corp, stripWhitespace)
    dtm <- DocumentTermMatrix(corp)

    # Find Highest % Match
    cosine_sim <- tcrossprod_simple_triplet_matrix(dtm, dtm) / sqrt(row_sums(dtm^2) %*% t(row_sums(dtm^2)))
    diag(cosine_sim) = 0  # ignore self-matches
    dim(cosine_sim)
    matchedvalues = cosine_sim[length(womenastronauttext) + 1, ]  # the man's row
    top = sort(matchedvalues, decreasing = TRUE)[1]
    womenastronautfull[which(matchedvalues == top)]
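The scraping step above collects every link on a category page, which includes navigation, category, and external links alongside the biography pages. A minimal sketch of one way to keep only article links follows. The sample hrefs are illustrative, and the pattern is a heuristic I am assuming here rather than part of the original code: it also drops the rare article whose title contains a colon and keeps generic pages such as the main page.

```r
# Example hrefs of the kind a category page returns (illustrative values).
hrefs <- c(
  "/wiki/Ada_Lovelace",
  "/wiki/Category:Women_computer_scientists",
  "/wiki/Special:Random",
  "#mw-head",
  "https://www.wikimedia.org/",
  NA,
  "/wiki/Grace_Hopper"
)

# Keep only paths that start with /wiki/ and contain no colon or fragment,
# which excludes namespaced pages such as Category:, Special:, and Help:.
is_article <- !is.na(hrefs) & grepl("^/wiki/[^:#]+$", hrefs)
article_urls <- paste0("https://en.wikipedia.org", unique(hrefs[is_article]))
print(article_urls)
```

Filtering before the download loop would also cut the number of requests and keep non-biography pages out of the document-term matrix.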