June 11, 2021

Computational Linguistics To Uncover Ghostwriters And Unfair Contract Cheating – And Its Limitations

We conducted an experiment with a corpus of student writing collected from a single classroom outside the US and UK. The selected papers were all response papers to the same prompt. Into this set of documents, we introduced a professionally ghostwritten document composed by ghostwriter R in response to the same prompt. R was also given the same academic articles to read before writing and the same writing specifications as the students. We then uploaded the student papers and the document written by R to Ouriginal for analysis with the Metrics feature.

Recognizing a ghostwriter through computational linguistics

According to previous theoretical discussions and experimental results in stylometry, genre and topic interference can make it harder to differentiate between authors. The shared word and phrase usage that accompanies writing on the same topic, together with the writing conventions of a particular genre, makes it difficult to identify a strong authorial signature for author verification. Even so, we observed in our experiment that R was a high outlier on practically every measure that Ouriginal Metrics looks at.

Because we controlled for topic and genre, what varied in this experiment was the educational and socialization background of the professional ghostwriter. When R’s assignment was removed from the set of student documents and Ouriginal Metrics was run again, no student submission was flagged red.

Lexical Originality as one of the measures supporting the Peer Group Similarity Hypothesis

To illustrate the system, we will look at one measure in Ouriginal Metrics. Lexical Originality compares the number of unique words in a document to that of other documents which are uploaded to Ouriginal and selected for comparison. Following the ‘Peer Group Similarity Hypothesis’ discussed in part one of this blog series on contract cheating and ghostwriter detection, we expected a group of students in the same classroom, given the same educational materials and writing on the same topic, to show low variance with regard to Lexical Originality.
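As a rough illustration of what a unique-word measure can look like, here is a minimal Python sketch. It is not Ouriginal's implementation: the function names, the simple regular-expression tokenizer, and the type-token ratio are all our own assumptions, chosen only to make the idea concrete.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def unique_word_count(text: str) -> int:
    """Number of distinct word forms in the text."""
    return len(set(tokenize(text)))


def type_token_ratio(text: str) -> float:
    """Distinct words divided by total words; 0.0 for an empty text."""
    words = tokenize(text)
    return len(set(words)) / len(words) if words else 0.0
```

A raw unique-word count grows with document length, so a ratio such as the one above, or a count taken over a fixed-size sample of each paper, is a common way to make papers of different lengths comparable.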

Without a ghostwritten document in the set, this is indeed what we observed. With the ghostwritten document included, we noticed that R used very different language to express their thoughts. Even though R was given the same resources, the diction is one giveaway that R is either quite prodigious or not an academic peer of the other students. Of course, this measure alone is not enough, so Ouriginal Metrics considers an ensemble of measures. Across that ensemble, R’s writing looked completely different from that of the other students on a quantitative basis, which raised a red flag within Ouriginal Metrics and called for further inspection.
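The idea of flagging a document that stands apart from its peer group on several measures at once can be sketched as follows. This is an assumption-laden illustration, not how Ouriginal Metrics works internally: we swap in a median-based modified z-score as the outlier test, and the threshold of 3.5, the measure names, and all numbers in the example are invented for demonstration.

```python
from statistics import median


def modified_z_scores(scores: dict[str, float]) -> dict[str, float]:
    """Modified z-score per document, based on the median and the
    median absolute deviation (MAD) of the group."""
    center = median(scores.values())
    mad = median(abs(v - center) for v in scores.values())
    if mad == 0:
        # No measurable spread in the group, so nothing can be ranked.
        return {name: 0.0 for name in scores}
    return {name: 0.6745 * (v - center) / mad for name, v in scores.items()}


def flag_outliers(measures: dict[str, dict[str, float]],
                  threshold: float = 3.5,
                  min_measures: int = 2) -> list[str]:
    """Return documents that are outliers on at least `min_measures` measures."""
    counts: dict[str, int] = {}
    for scores in measures.values():
        for name, z in modified_z_scores(scores).items():
            if abs(z) > threshold:
                counts[name] = counts.get(name, 0) + 1
    return sorted(name for name, n in counts.items() if n >= min_measures)


# Invented example data: five students plus one ghostwritten paper "R".
example_measures = {
    "unique_words": {"s1": 100, "s2": 104, "s3": 98, "s4": 102, "s5": 101, "R": 160},
    "avg_sentence_length": {"s1": 18, "s2": 20, "s3": 19, "s4": 21, "s5": 20, "R": 31},
}

print(flag_outliers(example_measures))
```

Requiring agreement across several measures, rather than relying on a single one, reduces the chance that a student with one unusual habit, such as a rich vocabulary, is flagged on that basis alone.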

The limitations of the hypothesis: Bilingualism, socioeconomic factors and less homogeneous classrooms

We have performed similar experiments which yielded comparable results, and this research at Ouriginal supports the Peer Group Similarity Hypothesis. We are, however, fully aware that further experiments which account for linguistically diverse education (multicultural classrooms) may refine or refute our hypothesis. For example, cultural or socioeconomic differences within the classroom may influence student performance across various metrics. Scores on linguistic metrics aside, these social differences call for new pedagogical approaches that recognize that classrooms are increasingly less homogeneous.

Already, multicultural classrooms challenge existing pedagogies on how to effectively educate students about the negative effects of plagiarism. Meeting this challenge requires instructors to be conscious of the various backgrounds at play in the classroom and to incorporate them into the learning experience. No doubt, the multicultural perspective is the future of education and one that Ouriginal supports as a safeguard of Academic Integrity and a thought leader in Educational Technology.

We would love to hear your opinion on identifying ghostwriters through computational linguistics and our software! Discuss the topics with us on Twitter.
