I have some HTML/CSS/JavaScript with painfully long class, id, variable and function names and other, combined strings that get used over and over. I could probably rename or restructure a few of them and cut the text in half.
So I'm looking for a simple algorithm that reports on the longest repeated strings in text. Ideally, it would reverse sort by length times instances, so as to highlight to strings that, if renamed globally, would yield the most savings.
This feels like something I could do painfully in 100 lines of code, for which there's some elegant, 10-line recursive regex. It also sounds like a homework problem, but I assure you it's not.
I work in PHP, but would enjoy seeing something in any language.
NOTE: I'm not looking for HTML/CSS/JavaScript minification per se. I like meaningful text, so I want to do it by hand, and weigh legibility against bloat.
This will find all repeated strings:
(?=((.+)(?:.*?\2)+))
Use that with preg_match_all and select the longest one.
function len_cmp($match1,$match2) {
return $match2[0] - $match1[0];
}
preg_match_all('/(?=((.+)(?:.*?\2)+))/s', $text, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$match[0] = substr_count($match[1], $match[2]) * strlen($match[2]);
}
usort($matches, "len_cmp");
foreach ($matches as $match) {
echo "($matches[2]) $matches[1]\n";
}
This method could be quite slow though, as there could be a LOT of strings repeating. You could reduce it somewhat by specifying a minimum length, and a minimum number of repetitions in the pattern.
(?=((.{3,})(?:.*?\2){2,}))
This will limit the number of characters repeating to at least three, and the number of repetitions to three (first + 2).
Edit: Changed to allow characters between the repetitions.
Edit: Changed sorting order to reflect best match.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With