
Convert html page to markdown using R

Tags: markdown, r

I would like to convert an HTML page to Markdown. In Python, this can be done with the markdownify package, as described here. Here is some reproducible Python code from that source:

import markdownify 
  
html = """   
         <h1> <strong>Geeks</strong> 
         for<br> 
         Geeks 
         </h1> 
        """
  
h = markdownify.markdownify(html, heading_style="ATX") 

print(h)

Output:

#  **Geeks** 
 for 
 Geeks

I would like to achieve the same in R. I was only able to find this question, but it converts HTML to an R Markdown file rather than to plain Markdown like the output above. Does anyone know whether it is possible to convert HTML to plain Markdown using R?

Quinten asked Apr 14 '26 19:04


2 Answers

It seems that the pandoc_convert function from the rmarkdown package can convert an HTML file (input.html) to Markdown like this:

library(rmarkdown)
pandoc_convert("input.html", "markdown")
# **Geeks** for Geeks

Created on 2024-03-11 with reprex v2.0.2
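
For intuition about what such a converter does with the snippet from the question, here is a minimal sketch in Python (the language of the question's example) using only the standard library. It is not how pandoc or markdownify work internally: the class and function names are made up for illustration, only headings, bold, italics, paragraphs, and line breaks are handled, and <br> is simplified to a plain space.

```python
import re
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
INLINE_MARKS = {"strong": "**", "b": "**", "em": "*", "i": "*"}


class TinyHtmlToMarkdown(HTMLParser):
    """Hypothetical helper: handles headings, bold, italics, <p> and <br> only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            # ATX style: one '#' per heading level, e.g. <h2> -> "## "
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag in INLINE_MARKS:
            self.parts.append(INLINE_MARKS[tag])
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag == "br":
            # Simplification: a line break becomes a plain space
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in INLINE_MARKS:
            self.parts.append(INLINE_MARKS[tag])
        elif tag in HEADINGS or tag == "p":
            self.parts.append("\n\n")

    def handle_data(self, data):
        # HTML treats any run of whitespace as a single space
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    parser = TinyHtmlToMarkdown()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r" {2,}", " ", text)      # collapse repeated spaces
    text = re.sub(r" *\n *", "\n", text)    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()


if __name__ == "__main__":
    html = "<h1> <strong>Geeks</strong>\n for<br>\n Geeks\n </h1>"
    print(html_to_markdown(html))  # prints: # **Geeks** for Geeks
```

For real pages (links, lists, tables, nested markup, escaping), a full converter such as pandoc remains the better choice; the sketch only shows the tag-to-marker mapping.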

Quinten answered Apr 16 '26 11:04


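Microsoft open-sourced MarkItDown, a utility that converts many common formats, such as PDF and Excel files as well as HTML, into Markdown. It is not itself an LLM; it is a conversion tool aimed at preparing documents for LLMs and similar text pipelines.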

It is a Python package.

pip install markitdown

With reticulate, you can call it from within R:

# install.packages("reticulate")  # if not installed
library(reticulate)

# Replace with the path to your Python binary. If you installed markitdown
# into a conda environment, use that environment's Python.
# On Linux/macOS, `which python` prints the path. On Windows, `which`
# becomes available after installing Unix commands via Scoop
# (see the end of this answer).
use_python("path/to/python")

# import markitdown Python package into R:
md <- import("markitdown")

# function for converting html to md:
convert_html_to_md <- function(html_file) {
  markitdown <- md$MarkItDown() # instance of markitdown
  result <- markitdown$convert(html_file) # conversion
  return(result$text_content) # return the markdown
}

# convert the html to markdown:
html_file <- "path/to/your/file.html"
markdown_content <- convert_html_to_md(html_file)

# inspect the result
cat(markdown_content)

# write the Markdown to a file
output_file <- "output.md"
writeLines(markdown_content, output_file)

In this way, you can convert your HTML into Markdown from R via MarkItDown.

On Windows, install Scoop following this. Then

scoop install gow

gives you typical Linux commands such as which in PowerShell.
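
As an alternative to `which`, the Python interpreter can report its own location, and this works the same way on Windows, Linux, and macOS. A minimal sketch (the function name `interpreter_path` is only illustrative): run it with the Python installation that has markitdown installed, then pass the printed path to use_python() on the R side.

```python
import sys


def interpreter_path() -> str:
    """Return the path of the Python interpreter running this code."""
    return sys.executable


if __name__ == "__main__":
    # Paste the printed path into use_python("...") in R.
    print(interpreter_path())
```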

Gwang-Jin Kim answered Apr 16 '26 10:04


