How to translate WordPress blog posts to Leanpub Markdown

Unix tools to the rescue

I am a big fan of Unix utilities because they help me out on a daily basis. When the Leanpub WordPress export tool failed me, I knew I had to write my own import script. Without an automated script, I would have had to spend more than half an hour per article fixing broken source code listings and tables and migrating image references to the book repository folder structure.

But the High-Performance Java Persistence book draws on information scattered across hundreds of blog posts, which would have taken me months to import manually.

With this little script, I managed to cut the import time to a couple of seconds per article.

#!/bin/bash

url=$1

#Generate the file name from URL
file=`echo $url | sed -r 's/.*\..*?\/(.*?)[\/$]/\1.md/' | sed 's/\//-/'g`

#Generate a temporary file
tmpfile=`uuidgen`

#Create missing files
touch $tmpfile
touch $file

#Download the WordPress post to the temporary file
wget $url --no-cache --cache=off -O $tmpfile

#Extract the relevant post content  
perl -0777 -n -e 'print "<h1>$1<\/h1>\n$2" while (m/<h1\s*class="entry-title">(.*?)<\/h1>.*?<div\s*class="entry-content">(.*?)<\/div>/sgmp)' $tmpfile > $file

#Remove the temporary file
rm $tmpfile

echo 'Importing resource from ' $url 'to' $file

#Convert the HTML document to Markdown
#Note: newer pandoc releases deprecate --atx-headers in favor of --markdown-headings=atx
pandoc -s -r html $file -t markdown-backtick_code_blocks-simple_tables+pipe_tables --atx-headers -o $file

#Adjust code blocks according to Leanpub style
perl -0777 -i -pe 's/(\~+)\s+.*?\.(\w+);.*?\}/{lang="$2",line-numbers=off}\n$1/ig' $file

#Remove unnecessary footer notes
perl -0777 -i -pe 's/Code\s*available\s*(on|for).*$//igs' $file
perl -0777 -i -pe 's/\*\*If\syou\shave\senjoyed.*$//igs' $file

#Migrate image locations from WP to relative image folder
sed -i -r 's_\[\!\[(.*?)\]\(.*?\)\]\(http.*\/([a-zA-Z0-9\-]+\.(gif|png|jpg))\)_![\1]\(images\/\1\.\3\)_g' $file

#First line header is set to ##
sed -i '1s/^#/##/g' $file

#Next lines headers upper limit is ###
sed -i '2,$s/^#/###/g' $file

#Remove any leftover backup file (perl -i without a suffix creates none, so -f avoids an error)
rm -f $file.bak

So this little script takes a blog post URL and then does the following:

  1. It first generates a file name from the blog post URL
  2. It creates a temporary file
  3. It downloads the blog HTML content to the temporary file
  4. Using Perl, it extracts the article content from the enveloping HTML markup
  5. Using Pandoc, it transforms the extracted HTML content to Markdown
  6. With Perl, it then formats all code blocks to the Leanpub-supported format
  7. It also removes unnecessary blocks (e.g. follow-me notes or GitHub references)
  8. All images are changed to reference a relative book repository folder
  9. Headers are migrated to Leanpub chapters format
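
Steps 1 and 9 are easy to try in isolation. The sketch below is not the script's own code: `post_file_name` and `shift_headers` are hypothetical helper names. The first uses shell parameter expansion instead of the original regular expression and assumes the desired file name is the URL path with slashes turned into dashes. The second applies the same two sed expressions as the script, but to standard input rather than in place.

```bash
# Hypothetical helper: derive a Markdown file name from a post URL.
# Assumption: the file name is the URL path with "/" replaced by "-".
# The body runs in a subshell, so "path" does not leak to the caller.
post_file_name() (
  path=${1%/}          # drop one trailing slash
  path=${path#*://}    # drop the scheme (e.g. https://)
  path=${path#*/}      # drop the host name
  printf '%s.md\n' "$path" | tr '/' '-'
)

# Hypothetical helper: the header adjustment from the script as a filter.
# Line 1: the leading "#" becomes "##".
# Other lines: the leading "#" becomes "###".
shift_headers() {
  sed -e '1s/^#/##/' -e '2,$s/^#/###/'
}
```

For instance, `post_file_name https://example.com/2015/01/my-post/` yields `2015-01-my-post.md`. Note that `shift_headers` replaces only the first `#` of each line, so a `# Part` after the first line becomes `### Part`, while a `## Sub` becomes `#### Sub`.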

All in all, I managed to import all the articles in a timely manner, so I can concentrate on writing instead.
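
With hundreds of posts to migrate, a small driver loop saves retyping the command. This is only a sketch under a few assumptions: `import_all` is a hypothetical function name, the importer is any command that takes a single URL argument (such as the script above saved under a name of your choosing), and the URL list is a text file with one URL per line, where blank lines and lines starting with `#` are skipped.

```bash
# Hypothetical driver: run an importer command once per URL in a list file.
#   $1 = importer command taking one URL argument
#   $2 = text file with one URL per line
import_all() {
  importer=$1
  list=$2
  while IFS= read -r url; do
    case $url in
      ''|'#'*) continue ;;   # skip blank lines and comments
    esac
    # Redirect stdin so the importer cannot consume the rest of the list
    "$importer" "$url" < /dev/null || echo "Import failed: $url" >&2
  done < "$list"
}

# Example invocation (both names are placeholders):
#   import_all ./import-post.sh urls.txt
```

A failed import is reported on standard error and the loop moves on to the next URL, so one broken post does not stop the whole batch.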
