I'm looking for an easy way to test whether a string contains Markdown. Currently I'm thinking of converting the string to HTML and then testing whether the result contains HTML with a simple regex, but I wonder if there is a more succinct way to do it.
Here's what I've got so far:
/<[a-z][\s\S]*>/i.test( markdownToHtml(string) )
I think you have to accept that it's impossible to know with certainty. Markdown borrows its syntax from existing customs. For example, underscores for italics were popular on Usenet (though single asterisks meant bold there, rather than also meaning italics as they do in Markdown). And of course, people were using dashes as obvious plaintext bullet points long before Markdown.
Having decided it's subjective though, we may now embark on the task of determining degrees of likelihood that a piece of text contains Markdown. Here are some things I'd consider evidence for Markdown, in order of decreasing strength:
Consecutive lines beginning with "1.", e.g. (^|[\n\r])\s*1\.\s.*\s+1\.\s (see the Markdown behind this answer, for example). I'd consider this a dead giveaway, because there's even that joke:
There are only two kinds of people in this world.
1. Those who understand Markdown.
1. And those who don't.
Link markdown, e.g. \[[^\]]+\]\(https?:\/\/\S+\)
Double underscores or asterisks where a left-right pair can be found (left or right being indicated by which side the whitespace is on), e.g. \s(__|\*\*)(?!\s)(.(?!\1))+(?!\s(?=\1)) (let me know if you want me to explain this one).
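To try these three signals out, here is a minimal sketch that wraps them as JavaScript regex literals. The names signals and matchedSignals are made up for illustration, and the "]" inside the link pattern's character class is escaped because JavaScript requires that.

```javascript
// Sketch only: the three signals above as JavaScript regexes.
// The object and function names are hypothetical, not part of any library.
const signals = {
  // Two consecutive list lines that both start with "1."
  repeatedOne: /(^|[\n\r])\s*1\.\s.*\s+1\.\s/,
  // Inline link such as [text](https://example.com)
  link: /\[[^\]]+\]\(https?:\/\/\S+\)/,
  // Opening ** or __ preceded by whitespace and followed by non-whitespace
  strongOpen: /\s(__|\*\*)(?!\s)(.(?!\1))+(?!\s(?=\1))/,
};

// Return the names of the signals that fire for the given text.
function matchedSignals(text) {
  return Object.keys(signals).filter((name) => signals[name].test(text));
}

console.log(matchedSignals("See [the docs](https://example.com) and some **bold** text."));
console.log(matchedSignals("Just a plain sentence."));
```

Note that the emphasis pattern requires whitespace before the opening marker, so emphasis at the very start of the string will not be caught by it.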
And so on. Ultimately, you'll have to come up with your own "scoring" system to determine the weight of each of these things. A good way to go about this would be to gather some sample inputs (real ones, if you have them), classify them manually as containing Markdown or not, and then run your regexes and scoring system against them to see which weights sort them most accurately.
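A scoring system along those lines might look roughly like the sketch below. The weights, the threshold values, the sample strings and the function names are all invented for illustration, so treat them as placeholders to tune against your own labelled data. The "]" in the link pattern's character class is escaped because JavaScript requires it.

```javascript
// Hypothetical weights, ordered by how strong each signal seems.
const weightedSignals = [
  { name: "repeatedOne", weight: 5, pattern: /(^|[\n\r])\s*1\.\s.*\s+1\.\s/ },
  { name: "link", weight: 3, pattern: /\[[^\]]+\]\(https?:\/\/\S+\)/ },
  { name: "strongOpen", weight: 2, pattern: /\s(__|\*\*)(?!\s)(.(?!\1))+(?!\s(?=\1))/ },
];

// Sum the weights of every signal found in the text.
function markdownScore(text) {
  return weightedSignals.reduce(
    (total, s) => (s.pattern.test(text) ? total + s.weight : total),
    0
  );
}

// Fraction of labelled samples classified correctly at a given threshold.
function accuracy(samples, threshold) {
  const correct = samples.filter(
    (s) => (markdownScore(s.text) >= threshold) === s.isMarkdown
  ).length;
  return correct / samples.length;
}

// Made-up, manually labelled samples for illustration only.
const samples = [
  { text: "1. First\n1. Second", isMarkdown: true },
  { text: "Read [this](https://example.com) now", isMarkdown: true },
  { text: "Some **bold** claim", isMarkdown: true },
  { text: "Meet me at 5 - bring snacks", isMarkdown: false },
];

// Compare a few candidate thresholds on the labelled set.
for (const threshold of [2, 4]) {
  console.log(threshold, accuracy(samples, threshold));
}
```

With real data you would vary the weights as well as the threshold and keep whichever combination classifies the most samples correctly.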