I need to take a chunk of raw HTML from a third party, which may contain any number of tags/attributes and potentially dirty or harmful code, strip it right back, and transform it into clean, safe Markdown.
A 'Markdownifier' if you will, much like heckyesmarkdown.com, but running inside my server-side .NET (C#) application rather than on the client. I am happy to use a third-party library (free or paid) to do this, but not a third-party hosted REST API or similar, for performance, security and reliability reasons.
There are many .NET libraries that convert Markdown to HTML, but I need to do the reverse and can't seem to find a .NET tool that has already solved this problem (unless I'm being a bit dim and looking in the wrong place!).
I have found this library on GitHub:
https://github.com/baynezy/Html2Markdown
Looks promising for your problem! I have not tried it myself yet though.
There is also a NuGet package:
Install-Package Html2Markdown
Usage is as follows (html variable is a string):
var markdown = new Converter().Convert(html);
You can try Pandoc (http://pandoc.org/). On Windows it is a command-line tool, but it works pretty well. This is how I have interfaced with it before:
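    // Requires: using System.Diagnostics; using System.IO;

    private const string processName = @"c:\program files (x86)\pandoc\pandoc.exe";
    // {0} = output file, {1} = input file. "html" is pandoc's HTML reader.
    private const string args = @"-f html -t markdown -o ""{0}"" ""{1}""";

    public void Convert(Stream inputStream, Stream outputStream)
    {
        var inputFilename = Path.GetTempFileName();
        var outputFilename = Path.GetTempFileName();
        try
        {
            // Write the incoming HTML to a temp file for pandoc to read.
            using (var fileStream = File.Create(inputFilename))
            {
                inputStream.CopyTo(fileStream);
            }

            // No stream redirection: pandoc reads and writes files here, and a
            // redirected stream that is never read can block the child process.
            var psi = new ProcessStartInfo(processName, string.Format(args, outputFilename, inputFilename))
            {
                UseShellExecute = false,
                CreateNoWindow = true
            };

            using (var process = Process.Start(psi))
            {
                process.WaitForExit();
            }

            // Copy the generated Markdown to the caller's stream.
            var bytes = File.ReadAllBytes(outputFilename);
            outputStream.Write(bytes, 0, bytes.Length);
        }
        finally
        {
            // Always clean up the temp files.
            File.Delete(inputFilename);
            File.Delete(outputFilename);
        }
    }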
EDIT
It should probably be noted that I have not used it for converting to Markdown before, but I have used it for converting other formats to and from HTML. It does a fairly reasonable job, and it doesn't just blow up when it can't handle something, as some other tools do. The arguments I used were sourced from http://pandoc.org/README.html, in particular this:
pandoc -f html -t markdown http://www.fsf.org
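If you would rather avoid temp files, pandoc also reads from standard input when no input file is given and writes to standard output when no -o option is given. Below is a minimal sketch of that approach, not something I have battle-tested. The class name PandocPipe, the method HtmlToMarkdown and its pandocPath parameter are made up for illustration, and it assumes pandoc is on the PATH unless you pass the full path. It writes stdin with the runtime's default encoding, so HTML with non-ASCII characters may need extra encoding setup.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Threading.Tasks;

public static class PandocPipe
{
    // Hypothetical helper: pipes HTML through pandoc and returns the Markdown.
    // Assumption: "pandoc" resolves on the PATH; otherwise pass the full path.
    public static string HtmlToMarkdown(string html, string pandocPath = "pandoc")
    {
        var psi = new ProcessStartInfo(pandocPath, "-f html -t markdown")
        {
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            UseShellExecute = false,
            CreateNoWindow = true,
            // pandoc emits UTF-8 on stdout.
            StandardOutputEncoding = Encoding.UTF8
        };

        using (var process = Process.Start(psi))
        {
            // Start draining both output pipes before writing any input, so
            // the child never blocks on a full stdout or stderr buffer.
            Task<string> stdout = process.StandardOutput.ReadToEndAsync();
            Task<string> stderr = process.StandardError.ReadToEndAsync();

            // Send the HTML, then close stdin so pandoc sees end of input.
            process.StandardInput.Write(html);
            process.StandardInput.Close();

            process.WaitForExit();

            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException("pandoc failed: " + stderr.Result);
            }

            return stdout.Result;
        }
    }
}
```

Usage would be something like `var markdown = PandocPipe.HtmlToMarkdown(html);`. Note that neither this nor the temp-file version sanitises anything by itself. Pandoc's Markdown output can still carry raw HTML through, so untrusted input should still be cleaned separately.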