I would like to convert an HTML page to Markdown. In Python we can use the markdownify package, as described here. Here is some reproducible Python code taken from that source:
import markdownify
html = """
<h1> <strong>Geeks</strong>
for<br>
Geeks
</h1>
"""
h = markdownify.markdownify(html, heading_style="ATX")
print(h)
Output:
# **Geeks**
for
Geeks
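To make concrete what such a converter does, here is a minimal sketch that uses only Python's standard-library `html.parser`. It is not how markdownify is implemented. The names `TinyMarkdown` and `to_markdown` are hypothetical, and the sketch only handles headings, bold, italics, and `<br>`.

```python
import re
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class TinyMarkdown(HTMLParser):
    """Toy HTML-to-Markdown converter (hypothetical, for illustration only)."""

    def __init__(self):
        super().__init__()
        self.parts = []  # Markdown fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            # ATX heading style: one '#' per heading level
            self.parts.append("#" * int(tag[1]) + " ")
        elif tag in ("strong", "b"):
            self.parts.append("**")
        elif tag in ("em", "i"):
            self.parts.append("*")
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("strong", "b"):
            self.parts.append("**")
        elif tag in ("em", "i"):
            self.parts.append("*")
        elif tag in HEADINGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Collapse whitespace runs to a single space, as a browser would
        self.parts.append(re.sub(r"\s+", " ", data))


def to_markdown(html):
    parser = TinyMarkdown()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim spaces around newlines
    text = re.sub(r" {2,}", " ", text)            # squeeze repeated spaces
    return text.strip()
```

Calling `to_markdown(html)` on the snippet above gives `# **Geeks** for` followed by `Geeks` on the next line. The line breaks differ slightly from the markdownify output because this sketch treats whitespace more naively.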
I would like to achieve the same using R. I was only able to find this question, but it converts HTML to an R Markdown file rather than to plain Markdown like the output above. So I was wondering whether it is possible to convert HTML to plain Markdown using R?
It seems that we can use the pandoc_convert function from the rmarkdown package to convert an HTML file (input.html) to Markdown like this:
library(rmarkdown)
pandoc_convert("input.html", "markdown")
# **Geeks** for Geeks
Created on 2024-03-11 with reprex v2.0.2
Microsoft open-sourced MarkItDown, a utility that converts many common formats, such as PDF and Excel as well as HTML, into Markdown for use with LLMs.
It is a Python package.
pip install markitdown
Using reticulate, you could use it from within R:
# install.packages("reticulate")  # if not installed
library(reticulate)
# Replace with the path to your Python binary, e.g. if you installed
# markitdown into a conda environment, give the path to that
# environment's Python (find it with `which python` on Linux/macOS).
# On Windows, `which` becomes available after installing Scoop and gow
# (see the end of this answer).
use_python("path/to/python")
# import markitdown Python package into R:
md <- import("markitdown")
# function for converting html to md:
convert_html_to_md <- function(html_file) {
  markitdown <- md$MarkItDown()            # instance of MarkItDown
  result <- markitdown$convert(html_file)  # conversion
  return(result$text_content)              # return the Markdown
}
# convert the html to markdown:
html_file <- "path/to/your/file.html"
markdown_content <- convert_html_to_md(html_file)
# inspect for yourself
cat(markdown_content)
# write the Markdown to a file
output_file <- "output.md"
writeLines(markdown_content, output_file)
In this way, you can convert your HTML into Markdown from R using MarkItDown.
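For comparison, the R function above is a thin wrapper around a few Python calls. A minimal sketch of the same steps in plain Python follows. The `MarkItDown().convert(...).text_content` chain is taken from the R code above. The function name `html_file_to_markdown` and the optional `converter` parameter are assumptions added here so the function can be exercised without the package installed.

```python
def html_file_to_markdown(path, converter=None):
    """Convert an HTML file to Markdown text (hypothetical helper)."""
    if converter is None:
        # Imported lazily; requires `pip install markitdown`.
        from markitdown import MarkItDown
        converter = MarkItDown()
    result = converter.convert(path)  # same call as markitdown$convert() in R
    return result.text_content        # the Markdown as a string


# Usage, with the placeholder path from the R example:
# markdown = html_file_to_markdown("path/to/your/file.html")
# with open("output.md", "w", encoding="utf-8") as fh:
#     fh.write(markdown)
```

Passing the converter in as a parameter is only a convenience for testing. In normal use you would call the function with just the file path.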
Install Scoop following this.
scoop install gow
This brings typical Linux commands such as which into your PowerShell.