PHP Detect Duplicate Text

I have a site where users can put in a description about themselves.

Most users write something appropriate but some just copy/paste the same text a number of times (to create the appearance of a fair amount of text).

For example: "Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace"

Is there a good method to detect repetitive text with PHP?

The only idea I have so far is to split the text into separate words (delimited by spaces) and then check whether any word is repeated more than a set limit. Note: I'm not 100% sure how I would code this solution.

Thoughts on the best way to detect duplicate text? Or how to code the above idea?

asked Jul 27 '15 00:07

Adam

1 Answer

This is a basic text classification problem. There are lots of articles out there on how to determine if some text is spam/not spam which I'd recommend digging into if you really want to get into the details. A lot of it is probably overkill for what you need to do here.

Granted, one approach would be to reconsider why you're requiring people to enter longer bios, but I'll assume you've already decided that forcing people to enter more text is the way to go.

Here's an outline of what I would do:

1. Build a histogram of word occurrences for the input string
2. Study the histograms of some valid and invalid text
3. Come up with a formula for classifying a histogram as valid or not

This approach requires you to figure out what differs between the two sets. Intuitively, I'd expect spam to contain fewer unique words and, if you plot the histogram values, to have most of the area under the curve concentrated in the top few words.
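As a rough sketch of what such a formula could start from, the two intuitions above can be turned into numbers: the share of words that are distinct, and the share of all words taken up by the most frequent few. The function name `histogramFeatures` and the `$topN` default of 3 are assumptions made for illustration, not part of any library, and any cutoff you apply to the results would have to come from studying your own valid and invalid samples.

```php
<?php
// Hypothetical helper: summarise a word histogram with two numbers.
// The name and the $topN default are illustrative assumptions.
function histogramFeatures(string $text, int $topN = 3): array
{
    // Lowercase, then split on whitespace, skipping empty tokens
    $words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);

    if ($total === 0) {
        return ['unique_ratio' => 0.0, 'top_share' => 0.0];
    }

    // Occurrence counts per distinct word, sorted max to min
    $counts = array_values(array_count_values($words));
    rsort($counts);

    return [
        // Distinct words divided by total words
        'unique_ratio' => count($counts) / $total,
        // Share of all words covered by the $topN most frequent words
        'top_share' => array_sum(array_slice($counts, 0, $topN)) / $total,
    ];
}
```

For the example string from the question (24 words, 4 of them distinct, each used 6 times) this gives a unique ratio of about 0.17 and a top-3 share of 0.75. Ordinary prose should normally score a higher unique ratio and a lower top share, but where to draw the line is something only your own data can tell you.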

Here's some sample code to get you going:

$str = 'Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace';

// Build a histogram mapping words to occurrence counts
$hist = array();

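// Split on any number of consecutive whitespace characters,
// skipping the empty tokens that leading/trailing whitespace would produce
foreach (preg_split('/\s+/', $str, -1, PREG_SPLIT_NO_EMPTY) as $word)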
{
  // Force all words lowercase to ignore capitalization differences
  $word = strtolower($word);

  // Count occurrences of the word
  if (isset($hist[$word]))
  {
    $hist[$word]++;
  }
  else
  {
    $hist[$word] = 1;
  }
}

// Once you're done, extract only the counts
$vals = array_values($hist);
rsort($vals); // Sort max to min

// Now that you have the counts, analyze and decide valid/invalid
var_dump($vals);

When you run this code on some repetitive strings, you'll see the difference. Here's a plot of the $vals array from the example string you gave:

[Plot: sorted word counts for the repetitive example string]

Compare that with the first two paragraphs of Martin Luther King Jr.'s bio from Wikipedia:

[Plot: sorted word counts for the Martin Luther King Jr. bio excerpt]

A long tail indicates lots of unique words. There's still some repetition, but the general shape shows some variation.
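One simple way to put a number on that tail, offered only as a sketch, is the fraction of distinct words that occur exactly once. The helper name `singletonFraction` is made up for this example; it expects an array of counts like `$vals` from the code above.

```php
<?php
// Hypothetical helper: fraction of distinct words that occur exactly once.
// $counts is an array of per-word occurrence counts (e.g. $vals above).
function singletonFraction(array $counts): float
{
    if (count($counts) === 0) {
        return 0.0;
    }

    // Keep only the words that were seen a single time
    $singletons = array_filter($counts, function ($c) {
        return $c === 1;
    });

    return count($singletons) / count($counts);
}
```

For the repetitive example every word appears six times, so the result is 0. Text with a long tail of one-off words gives a value closer to 1.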

FYI, PHP has a stats extension (available through PECL) you can install if you're going to be doing lots of math like standard deviation, distribution modeling, etc.
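If you only need something small like a standard deviation, a few lines of plain PHP will do without installing anything. This is a minimal sketch of the population standard deviation over the word counts; the function name `countStdDev` is an assumption for illustration.

```php
<?php
// Hypothetical helper: population standard deviation of an array of counts.
function countStdDev(array $counts): float
{
    $n = count($counts);
    if ($n === 0) {
        return 0.0;
    }

    $mean = array_sum($counts) / $n;

    // Sum of squared deviations from the mean
    $sumSq = 0.0;
    foreach ($counts as $c) {
        $sumSq += ($c - $mean) ** 2;
    }

    return sqrt($sumSq / $n);
}
```

Note that a perfectly repetitive string, where every word has the same count, has a standard deviation of 0, so this number is probably best read alongside other features of the histogram rather than on its own.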

answered Oct 14 '22 11:10

Zach Rattner
