In “Gender identity and lexical variation in social media,” David Bamman, Jacob Eisenstein, and Tyler Schnoebelen analyze the relationship between gender, lexical choice, and social networks using a corpus of about 14,000 individuals on Twitter (135). The authors begin with background on the study of gender in sociolinguistics, presenting the theoretical view of gender as “constructed, maintained, and disrupted by linguistic practices” (136). Computational studies have examined data sets quantitatively to identify which words best predict “attributes such as age, gender, and regional origin” (136-137). The authors observe that the “computational literature often distinguishes men and women on pragmatic dimensions of informativeness and involvement”; however, the complex role of gender in identity poses problems for quantitative analyses that treat gender only as an independent variable (137).
In the following section, the authors describe their data, which they collected from Twitter because of its shared nature and the simplicity of collection through the streaming API. After filtering users by social network, gender, and supplied name, they arrived at a data set containing “14,464 users and 9,212,118 tweets” (140). The first analysis concerned lexical indicators of gender and followed a typical computational approach: separating the tweets by the author's gender and then training a logistic regression classifier to tell the two groups apart. This classifier achieved “88 percent accuracy,” and the analysis identified the words most strongly associated with each gender (141). “Pronouns,” “emotion terms,” “kinship terms,” “abbreviations,” and some negation terms were associated more with female authors (142-143). “Swears and taboo words” were associated more with male authors (143). The authors report that “these findings were generally in concert with previous research”; nevertheless, this analysis distinguished eight categories of words (143).
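To make the classification step more concrete, the following is a minimal sketch of a bag-of-words logistic regression classifier, written in plain Python and trained with stochastic gradient descent. It is not the authors' code: the four toy messages, the 0/1 labels, the function names, and the learning rate and regularization values are all assumptions invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it on whitespace."""
    return text.lower().split()


def predict_proba(counts, weights, bias):
    """Probability that a message with these word counts belongs to class 1."""
    z = bias + sum(weights.get(tok, 0.0) * c for tok, c in counts.items())
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train(docs, labels, epochs=200, lr=0.5, l2=0.01):
    """Fit word weights by stochastic gradient descent on the log loss."""
    features = [Counter(tokenize(doc)) for doc in docs]
    weights = {tok: 0.0 for counts in features for tok in counts}
    bias = 0.0
    for _ in range(epochs):
        for counts, y in zip(features, labels):
            err = predict_proba(counts, weights, bias) - y
            for tok, c in counts.items():
                # Gradient step plus a small L2 penalty on the weight.
                weights[tok] -= lr * (err * c + l2 * weights[tok])
            bias -= lr * err
    return weights, bias


def top_words(weights, k=3):
    """Return the k words pulling hardest toward class 1 and toward class 0."""
    ranked = sorted(weights, key=weights.get)
    return ranked[-k:][::-1], ranked[:k]


# Hypothetical toy data (not from the article); labels are arbitrary 0/1 groups.
docs = [
    "omg so happy love my sis",
    "love u mom so excited",
    "game tonight damn refs",
    "damn that game was wild",
]
labels = [1, 1, 0, 0]

weights, bias = train(docs, labels)
toward_one, toward_zero = top_words(weights)
print("class 1 words:", toward_one)
print("class 0 words:", toward_zero)
```

Reading off the largest positive and negative weights is one simple way to obtain a list of the words most associated with each group, which is the kind of output the summary above describes.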
Toward the end of the article, the authors approach the topic from another direction, quantitatively clustering users according to similarities in the words they use. “Probabilistic clustering” grouped “linguistically similar” writers together without considering the writers’ gender, “resulting in seventeen clusters” (146). These clusters are collections of lexical features that express social positions in multifaceted ways (147). Numerous differences between clusters contradicted the earlier computational results, suggesting that “social categories such as gender cannot be separated from other aspects of identity” (148). The data in this analysis also displayed gender homophily, a principle that “applies to social media, where it is possible to make accurate predictions about a range of personal attributes based on the attributes of nearby individuals in the social network” (149). Moreover, the authors uncovered a strong relationship between “individuals with a greater proportion of same-gender ties” and the “use of more gendered language” (149-150). They also related a gender classifier to the gender skew of each user’s social network, finding that the “higher the gender skew of the social network, the more confident the classifier is in its prediction” (151). The article concludes by discussing the construction of a data-driven model, the ways in which gender is divided, and the need for further investigation of data from other types of technology.
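The idea of a “proportion of same-gender ties” can be illustrated with a short calculation. The sketch below is only an assumption about how such a measure might be computed: it treats ties as undirected, and the user names, gender labels, and edge list are hypothetical toy data, not figures from the article.

```python
from collections import defaultdict


def same_gender_share(edges, gender):
    """For each labeled user, return the fraction of ties with the same label."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-ties
            neighbors[a].add(b)
            neighbors[b].add(a)

    shares = {}
    for user, friends in neighbors.items():
        labeled = [f for f in friends if f in gender]
        if user in gender and labeled:
            same = sum(1 for f in labeled if gender[f] == gender[user])
            shares[user] = same / len(labeled)
    return shares


# Hypothetical toy network (not from the article).
gender = {"ana": "F", "bea": "F", "cat": "F", "dan": "M", "eli": "M"}
edges = [
    ("ana", "bea"),
    ("ana", "cat"),
    ("ana", "dan"),
    ("dan", "eli"),
    ("bea", "cat"),
]

shares = same_gender_share(edges, gender)
for user in sorted(shares):
    print(user, round(shares[user], 2))
```

A share near 1.0 marks a user whose network is strongly skewed toward the same gender, while a share near 0.5 marks a mixed network. A score of this kind could then be compared with how gendered a user's language is, or with how confident a classifier is about that user.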