Stefan Koopmanschap has written about a hidden gem he discovered in PHP for spotting blocks of text, from one or more sources, that are nearly identical: similar_text.
I am working on a hobby project where I aggregate feeds from several different sources. With the blogs I work with right now, it often happens that an author posts the same post to a few different sites. However, because of site formats and sometimes also quick edits an author makes on one site but not on the other, the article contents are usually not identical strings. So I needed something that would help me figure out whether or not two strings are nearly identical.
After Googling around and finding things like the xdiff extension and soundex, he discovered the two functions he needed - levenshtein and similar_text.
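As a rough illustration of how the two functions differ, here is a minimal sketch that compares the same pair of short strings with both. The sample strings are made up for this example and do not come from Stefan's post.

```php
<?php
// Sample inputs chosen only for illustration.
$a = 'World';
$b = 'Word';

// levenshtein() returns the edit distance: the minimum number of
// single-character insertions, deletions or replacements needed
// to turn $a into $b. Lower means more alike.
$distance = levenshtein($a, $b);

// similar_text() returns the number of matching characters and,
// through its optional third argument (passed by reference),
// the similarity as a percentage. Higher means more alike.
$matching = similar_text($a, $b, $percent);

printf("distance=%d matching=%d percent=%.2f\n", $distance, $matching, $percent);
```

One practical difference worth keeping in mind: before PHP 8.0, levenshtein() only handled strings of up to 255 characters, so for whole article bodies similar_text is likely the easier fit on older versions.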
I am still trying to figure out which percentage will catch the duplicates but not catch too many posts which are only similar but not actually duplicates, but with the above 75% I seem to catch quite a few duplicates so far.
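A threshold check along those lines might look like the sketch below. The function name isProbableDuplicate, the tag stripping and whitespace normalization, and the sample strings are all assumptions made for this example, not code from Stefan's post; only the 75% cutoff comes from the text above.

```php
<?php
/**
 * Hypothetical helper: treats two article bodies as duplicates when
 * similar_text() rates them above the given percentage.
 */
function isProbableDuplicate(string $first, string $second, float $threshold = 75.0): bool
{
    // Assumption: strip markup and collapse whitespace first, so that
    // differing site templates do not drag the score down.
    $normalize = static function (string $text): string {
        $text = strip_tags($text);
        return trim(preg_replace('/\s+/', ' ', $text) ?? '');
    };

    // The percentage is written into $percent by reference.
    similar_text($normalize($first), $normalize($second), $percent);

    return $percent > $threshold;
}

// Made-up sample data: the same sentence wrapped in different markup,
// with a one-character edit, plus an unrelated string.
$siteA = '<p>PHP has a hidden gem for comparing strings.</p>';
$siteB = '<div>PHP has a hidden gem for comparing strings!</div>';
$other = 'Weekly gardening tips.';

var_dump(isProbableDuplicate($siteA, $siteB));
var_dump(isProbableDuplicate($siteA, $other));
```

Keeping the threshold as a parameter makes it easy to experiment with different percentages while looking for the value that catches duplicates without flagging posts that are merely similar.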