The knitr
and rmarkdown
packages are used in conjunction with pandoc to convert R code and figures to a variety of formats including PDF, and word. Here, I’m exploring how to convert HTML back to markdown format. This post came about when I was searching how to convert XML to markdown, which I still haven’t found an easy way to do. Pandoc is not the only way to convert HTML to markdown (see turndown, html2text)
Pandoc is packaged within RStudio and on Windows, the executables are located within Program Files/RStudio/bin/pandoc
. The rmarkdown
package contains wrapper functions for using pandoc within RStudio.
Here, I am trying to convert this example HTML page back to markdown using the function pandoc_convert
. First, pandoc_convert
requires an actual file which means it does not accept a quoted string of HTML code in its input
argument.
The example html:
1 |
|
I saved the HTML example here as example.html
.
1 |
|
We can print the object in R.
1 |
|
1 |
|
pandoc
can convert between many different formats, and for markdown, it has multiple variants including the github flavored variant (for Github), and php markdown extra (the variant used by WordPress sites).
The safest variant to pick is markdown_strict
which is the original markdown
variant.
Pandoc requires the file path which in my case, is located in a different directory rather than my working directory.
1 |
|
1 |
|
Notice that heading 1 is formatted with ====
rather than the #
that RMarkdown seems to favor. We can require pandoc to use the #
during the conversion by adding an argument.
1 |
|
1 |
|
Right now, the output is being piped to the console. A file can be created instead with:
1 |
|
Pandoc has a multitude of styling extensions for markdown variants, all listed on the manual page.
Pandoc ignores everything enclosed in ``. When converting from markdown to HTML, these comments are usually directly placed as is in the HTML document but the opposite does not seem to be true.
Lastly, this was tested using pandoc version 1.19.2.1. Pandoc 2.5 was released last month.
Related