January 20, 2021

Ghostwriting Detection and Writing Style Analysis: Interesting Basics

In 2020, educators worldwide witnessed a huge increase in submitted assignments that were not written by the students who claimed to have written them. The actual authors may have been relatives, friends, freelancers or professional agencies such as essay mills. Often enough, professional ghostwriters wrote the content.

Can it be detected if a text was not written by a specific person but by a ghostwriter?

This question prompted Ouriginal to start developing a feature that supports anyone who needs to evaluate the originality of a document and verify its authorship. During the research phase of the development process, our R&D team found experimentally that the ‘Peer Group Similarity Hypothesis’ can be applied to the Authorship Verification Problem.

In contrast to the traditional plagiarism detection solution for safeguarding Academic Integrity, the Ouriginal ‘Metrics’ feature takes a bird’s-eye view of a group of documents. When the input is a set of documents written by students in the same class working on the same assignment, our preliminary data suggests a Gaussian distribution for the metrics we compute on the documents. Most students cluster around the middle of the value range, typically within about one standard deviation of the mean, with a few high- and low-scoring students at either end.

Outliers in the low-scoring area are considered to be low-performance students, but we do not flag them. High outliers, however, are flagged for further inspection. High outlier performance across a series of measures can indicate either a prodigious student or a potential ghostwriter. Determining which of the two categories the student falls into is the responsibility of the instructor. Our flag simply draws attention to student submissions that look unusual. We are fully aware that the way we consider scores high or low for these metrics may reflect culturally specific pedagogical assumptions, which will be addressed at a later date. For now, to mitigate these assumptions, we base our score on eight metrics in total and not just on a single one.
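To make the idea concrete, here is a minimal sketch of how high outliers across several metrics could be flagged. It is not Ouriginal’s implementation: the function name, the z-score threshold of 2.0 and the rule “high on at least 5 of the metrics” are assumptions chosen for illustration, and the metrics themselves are left as plain numbers.

```python
from statistics import mean, stdev


def flag_high_outliers(scores, z_threshold=2.0, min_metrics=5):
    """Return students who score unusually high on several metrics.

    scores: dict mapping a student id to a list of metric values,
            with the same metrics in the same order for every student.
    z_threshold and min_metrics are illustrative assumptions,
    not values taken from the article.
    """
    students = list(scores)
    n_metrics = len(next(iter(scores.values())))
    high_counts = {student: 0 for student in students}

    for m in range(n_metrics):
        column = [scores[student][m] for student in students]
        mu = mean(column)
        sigma = stdev(column)  # sample standard deviation of the class
        if sigma == 0:
            continue  # every student has the same value: nothing to flag
        for student in students:
            z = (scores[student][m] - mu) / sigma
            # Only the high side is flagged; low outliers are ignored.
            if z > z_threshold:
                high_counts[student] += 1

    return [s for s in students if high_counts[s] >= min_metrics]
```

A flagged student id is only a prompt for the instructor to take a closer look. Requiring several metrics to agree mirrors the idea of not relying on a single measure.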

Writing style analysis: comparing assignments of one single author

In the long run, the plan for Ouriginal’s ‘Metrics’ is to perform the validation by comparing one assignment with the other assignments the student has previously uploaded. Participants in the yearly PAN competitions have already researched this problem heavily, so the task of Authorship Verification is somewhat old news. However, our method is more robust: it is less prone to fluctuations caused by the different backgrounds of each student, and less sensitive to genre or topical interference. For this to work, though, baseline data for each student must first be established. This is why Ouriginal is currently doing cross-classroom comparisons to collect data and cautiously flag documents.

The concept of ‘speech communities’

To situate our work in what has been done before, we turn to two fields. Dialectology takes a quantitative approach to identifying ‘prototypical’ speakers, alongside geographically mapping the prevalence of linguistic variants and the language features that form dialect groups. Linguistic anthropology is concerned with the language variation that exists within communities and the social meanings that are constructed through different forms of communication. Both of these exciting fields share the research concept of ‘speech communities’, in which groups of speakers who regularly interact share patterns of language use that identify them as members of their respective communities.

Detecting ‘ghostwritten’ works within groups of students

For our preliminary dataset, we introduced works by a professional ghostwriter from an online contract cheating forum into a set of documents written by groups of students in the same classroom, all writing on the same topic. Using this dataset, we were able to identify the ghostwriter because they scored as a significantly different high outlier for almost all measures currently implemented in Ouriginal’s ‘Metrics’. Pedagogical research into linguistically diverse education, also known as the multicultural classroom, may add important nuance to our understanding of what constitutes speech communities and peer groups, as we must recognize the reality of increasingly globalized and heterogeneous classrooms.

What is your opinion on using advanced technologies and stylometry to evaluate the originality of a text? Which methods do you use to verify the authorship of a document? We’re happy to hear from you by email, or you can discuss these topics with us on Twitter.

 
