How to translate WordPress blog posts to Leanpub Markdown

(Last Updated On: January 3, 2018)

Unix tools to the rescue

I am a big fan of Unix utilities, as they help me out on a daily basis. When the Leanpub WordPress export tool failed me, I knew I had to write my own import script. Without an automated script, I would have had to spend more than half an hour per article fixing broken source code and tables and migrating image references to the book repository folder structure.

But the High-Performance Java Persistence book uses information that is scattered across hundreds of blog posts, which would have taken me months to import manually.

With this little script, I managed to cut the import time to a couple of seconds per article.

#!/bin/bash

url=$1

#Generate the file name from URL
file=`echo $url | sed -r 's/.*\..*?\/(.*?)[\/$]/\1.md/' | sed 's/\//-/'g`

#Generate a temporary file
tmpfile=`uuidgen`

#Create missing files
touch $tmpfile
touch $file

#Download the WordPress post to the temporary file
wget $url --no-cache --cache=off -O $tmpfile

#Extract the relevant post content  
perl -0777 -n -e 'print "<h1>$1<\/h1>\n$2" while (m/<h1\s*class="entry-title">(.*?)<\/h1>.*?<div\s*class="entry-content">(.*?)<\/div>/sgmp)' $tmpfile > $file

#Remove the temporary file
rm $tmpfile

echo 'Importing resource from ' $url 'to' $file

#Convert the HTML document to Markdown
pandoc -s -r html $file -t markdown-backtick_code_blocks-simple_tables+pipe_tables --atx-headers -o $file

#Adjust code blocks according to Leanpub style
perl -0777 -i -pe 's/(\~+)\s+.*?\.(\w+);.*?\}/{lang="$2",line-numbers=off}\n$1/ig' $file

#Remove unnecessary footer notes
perl -0777 -i -pe 's/Code\s*available\s*(on|for).*$//igs' $file
perl -0777 -i -pe 's/\*\*If\syou\shave\senjoyed.*$//igs' $file

#Migrate image locations from WP to relative image folder
sed -i -r 's_\[\!\[(.*?)\]\(.*?\)\]\(http.*\/([a-zA-Z0-9\-]+\.(gif|png|jpg))\)_![\1]\(images\/\1\.\3\)_g' $file

#First line header is set to ##
sed -i '1s/^#/##/g' $file

#Next lines headers upper limit is ###
sed -i '2,$s/^#/###/g' $file

#Remove the backup file, if any (perl -i and sed -i without a suffix do not create one)
rm -f $file.bak
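
The two header rules near the end of the script are easy to misread, so here is a small offline sketch of what they do to Markdown headers. The helper name `demote_headers` and the sample input are assumptions made for illustration; the `sed` expressions are the same ones the script uses, only applied to standard input instead of in place.

```shell
#!/bin/bash

# Hypothetical helper: apply the script's two header rules to stdin.
demote_headers() {
  # Line 1: a leading "#" becomes "##" (the post title).
  # Lines 2..end: a leading "#" becomes "###", so each header gains two levels.
  sed -e '1s/^#/##/' -e '2,$s/^#/###/'
}

# Sample input in the ATX header style that pandoc is asked to produce.
printf '%s\n' '# Post title' 'Some text.' '# Section' '## Subsection' | demote_headers
# prints:
# ## Post title
# Some text.
# ### Section
# #### Subsection
```

The rules are purely line-based, so any line that starts with `#` is rewritten, including one that happens to sit inside a code block.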

The import script takes a blog post URL and then performs the following steps:

  1. It first generates a file name from the blog post URL
  2. It creates a temporary file
  3. It downloads the blog HTML content to the temporary file
  4. Using Perl, it extracts the article content from the enveloping HTML markup
  5. Using Pandoc, it transforms the extracted HTML content to Markdown
  6. With Perl, it then formats all code blocks to the format Leanpub supports
  7. It also removes unnecessary blocks (e.g. follow me notes or GitHub references)
  8. All image references are changed to point to a relative folder in the book repository
  9. Headers are migrated to the Leanpub chapter format
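
Step 1 can also be sketched with plain bash parameter expansion. This is an alternative to the `sed` expression in the script, not a copy of it, and it assumes that the last path segment of the URL is the post slug. The function name `slug_from_url` and the `example.com` URL are hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: derive a Markdown file name from a post URL.
# Assumption: the last path segment of the URL is the post slug.
slug_from_url() {
  local url=$1
  url=${url%%\?*}   # drop a query string, if any
  url=${url%%\#*}   # drop a fragment, if any
  url=${url%/}      # drop one trailing slash, if present
  printf '%s.md\n' "${url##*/}"   # keep only the text after the last slash
}

slug_from_url 'https://example.com/my-post/'
# prints: my-post.md
```

Parameter expansion avoids spawning `sed` twice and does not depend on how a particular `sed` implementation treats `.*?`.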

All in all, I managed to import all the articles in a timely manner, so I can concentrate on writing instead.
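
Because the script handles one URL per run, importing many posts needs a small driver loop. The following is a minimal sketch under stated assumptions: the function name `import_all` is hypothetical, and so are the file names `urls.txt` (one post URL per line) and `import-post.sh` (the import script saved to disk) used in the usage note.

```shell
#!/bin/bash

# Hypothetical batch driver: run an importer command once per URL.
# Usage: import_all <url-list-file> <importer-command> [extra args...]
import_all() {
  local list=$1
  shift
  local url
  # The "|| [ -n ... ]" part also handles a last line without a newline.
  while IFS= read -r url || [ -n "$url" ]; do
    # Skip blank lines and lines starting with '#'.
    case $url in
      ''|'#'*) continue ;;
    esac
    # Redirect stdin so the importer cannot consume the rest of the list.
    "$@" "$url" < /dev/null || echo "Import failed for $url" >&2
  done < "$list"
}
```

With those assumed names, the call would be `import_all urls.txt ./import-post.sh`. A failed import is reported on standard error and the loop moves on to the next URL.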
