Migrating a blog to Quarto: Reverse engineering HTML to markdown

I migrated my personal webpage to Quarto in july 2022. The only thing I did not do was my blog; for two reasons: 1) Quarto was still very new and many of todays features were not available. I already needed to hack some things together for my personal page (which are still in place although Quarto now has the features…). For my blog, I wanted to wait until Quarto is more mature. 2) I feared that it will not be straightforward to migrate all my blog entries from blogdown to Quarto.

Time went by and I kept thinking about the migration but 2) kept me away from it. Over time though, I realized (as apparently many others) how broken my hugo/blogdown/theme setup has become. Every update introduced new issues. Lately I barely managed to put a post together without some weird hotfixes in the background. My blog reached the state of FUBAR. So it was time to migrate. Or simply start over.

Migrate or start over?

Pondering about the migration, I reached a point where I considered starting over with my blog. Why go through the hassle of migrating all posts? Obviously I could just leave the blog as is to not break anyones bookmarks (if they even exist) and simply start a new quarto blog. This lazy solution seemed compelling for many other reasons. When I started this iteration of my blog in 2017, I didnt know (or care) about “reproducibilty”. So, many of my early posts cannot be rerun because the data is lost, hidden on some old harddrive, and have paths that do not resolve anymore. So, without trying, I felt it is highly unlikely that I can rerender all posts in Quarto without considerable effort.

But there was one thing I was willing to try, purely out of technical interest:

Is it possible to convert the html posts back to a raw markdown file?

That way, I would not need to to rerun all analysis and only need to render to html with my new Quarto theme (and probably some yaml patching).

Preparatory steps

I created a csv file with all existing posts. This was done semiautomatically by scraping my own blog. the only manual work was the categories. I was very random on my old blog and I recategorized everything to be more consistent.

Next, I downloaded the html files. This was surprisingly challenging. I tried some automatic approaches (download.file(), the rvest package) but the best (for my approach) was to save the page via CTRL+S in Firefox. This created the html file and a folder containing all asset files. In later steps, this turned out to be very beneficial.¹

Html to markdown

To convert the html files to markdown, I used pandoc, a powerful converter for markup languages.²

A little bit of searching gave me this command to convert html to markdown.

pandoc post.html -t gfm-raw_html -o index.md

-t gfm-raw_html supposedly removes all html tags from the file and really only returns raw markdown. It didn’t do so for me. I wrote a quick lua filter to help with the remaining tags.

remove-tags.lua

function Header (elem)
  elem.identifier = ""
  return elem
end

function Div (elem)
  return elem.content
end

So the final pandoc command is

pandoc post.html --lua-filter remove-tags.lua  -t gfm-raw_html -o index.md

For me, this produced a clean markdown file of the post that I can now essentially be used to rerender the posts with quarto 🥳!

Clean up

While I did get a raw markdown file out of the html file, there was still some cleaning up to do. For instance, there where some lines at the beginning and the end that needed to be eliminated. Fortunately, it was the same pattern for all posts (first 16 lines and all lines after the line starting with “Tagged”), so sed does the trick.

sed -i '/^Tagged/,$d' index.md
sed -i '1,16d' index.md

Next up, a yaml header needed to be added. With the csv file, this was also quite straightforward.³ I also injected a line into the post warning about the automatic convertion and a link to the archived original post.

for (i in seq_len(nrow(posts))) {
    file_md <- fs::path("posts", posts$new_folder[i], "index.md")
    post <- readLines(file_md)
    addendum <- c(
        "",
        paste0("*This post was semi automatically converted from blogdown to
        Quarto and may contain errors. The original can be found in the 
        [archive](", 
        str_replace(posts$old_link[i], "blog.", "archive."), ").*"),
        ""
    )
    post <- c(post[1], addendum, post[-1])
    header <- tibble(
        author = list(name = "David Schoch", orcid = "0000-0003-2952-4812")
    ) |>
        yaml::as.yaml() |>
        paste0("title: \"", posts$post_title[i], "\"\n", ... = _) |>
        paste0(... = _, "date: ", posts$pub_date[i], "\n") |>
        paste0(... = _, "categories: [", posts$category[i], "]\n") |>
        paste0("---\n", ... = _, "---\n") |>
        str_replace("name:", "- name:") |>
        str_replace("orcid:", "  orcid:") |>
        str_split("\n")
    writeLines(c(header[[1]], post), file_md)
}

The last step was to fix the images. This is where the post_files folder from Firefox became very handy. In each markdown files, images where included as ![](post_files/image.png). So all that needed to be done was copy the image.png out of the post_files folder and adjust the path to ![](image.png).

And those were all the required steps to convert my blogdown blog to a quarto blog by reverse engineering html to markdown. I am pretty sure that there are still lots of tiny errors⁴, but the main work is done.

Addendum

A full migration script is available on GitHub. This is not going to work out of the box for other blogs, because it does depend on the theme used of the blog. I just shared it in case it helps as a reference point for others who are insne enough to migrate their blog like this.

Footnotes

There is for sure a better (and automatic) way to do this. But I gave this project one afternoon/evening, so I was not willing to dig deeper.↩︎
Actually, quarto is based around it.↩︎
Although the code is probably unnecessarily complicated.↩︎
I swear i will fix them all! Until then http://archive.schochastics.net has the original posts.↩︎

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{schoch2024,
  author = {Schoch, David},
  title = {Migrating a Blog to {Quarto:} {Reverse} Engineering {HTML} to
    Markdown},
  date = {2024-01-02},
  url = {http://blog.schochastics.net/posts/2024-01-02_migrating-to-quarto/},
  langid = {en}
}

For attribution, please cite this work as:

Schoch, David. 2024. “Migrating a Blog to Quarto: Reverse Engineering HTML to Markdown.” January 2. http://blog.schochastics.net/posts/2024-01-02_migrating-to-quarto/.