June 22, 2021

False positives: Why are they important and why must software algorithms learn and adapt

Plagiarism is an ever-present threat to academic integrity and original thinking. If you do not quote and reference correctly, the research, papers, and other material you draw on cannot be attributed properly. Text-matching software can help to identify such mistakes. To deliver reliable results, the algorithms it uses must be able to detect false positives.

Plagiarism detection systems are therefore helpful and crucial additions to all institutions creating knowledge. They also help us save time by automatically flagging suspected cases of plagiarism instead of us having to manually search for similarities online or elsewhere.

However, finding potential plagiarism in texts, essays or even doctoral theses can be tricky. A common pitfall in the fight against plagiarism is failing to recognize so-called “false positives” and underestimating their importance. But first, let us understand what false positives are and why they are a big deal.

How does text-matching software detect false positives?

A false positive in a plagiarism detection system is text that has been marked as matching or similar to content in the system’s database, but where the match is not strictly a true one, because it may be out of context.

For example, have a look at the following pairs of phrases. The matching text, marked in red in the original post, is the wording that both phrases share:

“Salt and pepper” <> “Cats and dogs” – 33%
“Three men in a boat” <> “Life in a Medieval City” – 40%
“The Adventures of Tom Sawyer” <> “The Adventures of Sherlock Holmes” – 60%

The matches flagged by the plagiarism checker here are mostly common words and phrases, such as “and” or “in a”. They should therefore not be included in the analysis report, which determines the overall percentage of matching content found in a submitted assignment. Counting these findings means that while the overall similarity percentage rises, its relevance drops. Common words like these, when reported as potential text matches, are what we refer to as false positives.
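To make the arithmetic behind these percentages concrete, here is a minimal sketch in Python. It assumes the scores are simply the share of words in one phrase that also occur in the other, which reproduces the three figures above but is not necessarily how any real plagiarism checker computes similarity. The function names and the tiny `STOP_WORDS` list are hypothetical and chosen only for illustration.

```python
# Hypothetical stop-word list for illustration; real systems would use
# much larger, language-specific lists (or a learned model).
STOP_WORDS = frozenset({"a", "an", "and", "in", "of", "the"})


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def overlap_percent(submitted, source, ignore=frozenset()):
    """Percentage of words in `submitted` that also occur in `source`.

    Words listed in `ignore` are dropped from both texts before counting.
    """
    words = [w for w in tokenize(submitted) if w not in ignore]
    source_words = {w for w in tokenize(source) if w not in ignore}
    if not words:
        return 0
    matches = sum(1 for w in words if w in source_words)
    return round(100 * matches / len(words))


pairs = [
    ("Salt and pepper", "Cats and dogs"),
    ("Three men in a boat", "Life in a Medieval City"),
    ("The Adventures of Tom Sawyer", "The Adventures of Sherlock Holmes"),
]

for submitted, source in pairs:
    naive = overlap_percent(submitted, source)
    filtered = overlap_percent(submitted, source, ignore=STOP_WORDS)
    print(f"{submitted} <> {source}: naive {naive}%, filtered {filtered}%")
```

With the common words ignored, the first two pairs drop to 0%, while the third still scores 33% because “Adventures” is a genuine shared content word. A fixed word list removes only the most obvious false positives, which is one reason a static filter may not be enough on its own.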

Often, false positives are words that are extremely common in the specific language rather than complicated conjunctions and appositions. The truth is that once we leave the 100% similarity mark, the lines become blurred, because how do you calculate the relevance of the different words that make up a text and translate it into a percentage?
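One widely used way to approach this question, offered here only as an illustration and not as a description of any particular product, is to weight each word by its inverse document frequency (IDF): the more documents a word appears in, the less a match on it counts. The sketch below is in Python; the toy corpus, the function names, and the choice of weighting are all assumptions made for the example.

```python
import math


def idf_weights(corpus):
    """Map each word to log(N / df), where df is the number of documents
    containing it. Words found in many documents get a low weight."""
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for word in set(doc.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def weighted_similarity(submitted, source, weights, default_weight):
    """Weighted share (in percent) of `submitted` that also occurs in `source`.

    Words missing from `weights` are treated as rare and get `default_weight`.
    """
    words = submitted.lower().split()
    source_words = set(source.lower().split())
    total = sum(weights.get(w, default_weight) for w in words)
    if total == 0:
        return 0.0
    matched = sum(weights.get(w, default_weight) for w in words if w in source_words)
    return 100 * matched / total


# Toy corpus for illustration only: the six phrases used earlier in this post.
corpus = [
    "Salt and pepper",
    "Cats and dogs",
    "Three men in a boat",
    "Life in a Medieval City",
    "The Adventures of Tom Sawyer",
    "The Adventures of Sherlock Holmes",
]
weights = idf_weights(corpus)
rare = math.log(len(corpus))  # weight assumed for words unseen in the corpus

score = weighted_similarity(corpus[0], corpus[1], weights, rare)
print(f"Weighted similarity: {score:.1f}%")
```

Even on this tiny corpus, the shared words carry less weight than the words unique to each phrase, so the weighted scores come out lower than the plain word-count percentages. With a realistically large corpus, words like “and” would appear in nearly every document and their weight would approach zero.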

Similarity results can be overwhelming

Showing all matching texts and findings can cause clutter and confusion, not to mention the risk of taking the attention away from actual cases of plagiarism. It’s basically like Googling the phrase “I don’t know” where you’ll simply be swamped with results. Try it. (It will give you around 7 billion search results!)

Or think about the phrase “This page is intentionally left blank”. Would it even make sense to see this match in a plagiarism report? False positives also cause you to spend more time than necessary, since you have to go through each finding to ascertain whether it is actually a match or not. What is probably worse is that they undermine the use of, and trust in, plagiarism detection software altogether. If you end up spending endless time sifting through false positives, the frustration will most probably make you give up on your efforts to assess the level of real plagiarism in the text you are reviewing.

How can we address the challenge of false positives?

False positives are a big threat to originality, and we need to address them properly. One way to minimize false positives is to use plagiarism detection software like Ouriginal, which uses machine learning algorithms that improve over time. Our software is designed to keep learning what a relevant text match is and what isn’t. Ouriginal’s technology helps you make more informed decisions by limiting cluttered and irrelevant data, thereby enhancing the accuracy of the findings.


Namrata Nanda
