How to translate WordPress blog posts to Leanpub Markdown

Unix tools to the rescue

I am a big fan of Unix utilities, as they help me out on a daily basis. When the Leanpub WordPress export tool failed me, I knew I had to write my own import script. Without an automated script, I’d have had to spend more than half an hour per article fixing broken source code or tables and migrating image references to the book repository folder structure.

But the High-Performance Java Persistence book uses information that’s scattered across more than 60 blog posts, which would have taken me months to import manually.

With this little script, I managed to cut the import time to a couple of seconds per article.



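#!/bin/bash

#The blog post URL is passed as the first argument
url=$1

#Generate the file name from the URL path: drop the scheme and host,
#trim trailing slashes and turn the remaining slashes into dashes
file=$(echo "$url" | sed -r 's_^[a-z]+://[^/]+/__; s_/+$__; s_/_-_g')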

#Generate a temporary file (mktemp is assumed to be available)
tmpfile=$(mktemp)

#Create missing files
touch $tmpfile
touch $file

#Download the WordPress post to the temporary file
wget --no-cache -O "$tmpfile" "$url"

#Extract the relevant post content  
perl -0777 -n -e 'print "<h1>$1<\/h1>\n$2" while (m/<h1\s*class="entry-title">(.*?)<\/h1>.*?<div\s*class="entry-content">(.*?)<\/div>/sgmp)' $tmpfile > $file

#Remove the temporary file
rm $tmpfile

echo "Importing resource from $url to $file"

#Convert the HTML document to Markdown
pandoc -s -r html $file -t markdown-backtick_code_blocks-simple_tables+pipe_tables --atx-headers -o $file

#Adjust code blocks according to Leanpub style
perl -0777 -i -pe 's/(\~+)\s+.*?\.(\w+);.*?\}/{lang="$2",line-numbers=off}\n$1/ig' $file

#Remove unnecessary footer notes
perl -0777 -i -pe 's/Code\s*available\s*(on|for).*$//igs' $file
perl -0777 -i -pe 's/\*\*If\syou\shave\senjoyed.*$//igs' $file

#Migrate image locations from WP to relative image folder
sed -i -r 's_\[\!\[(.*?)\]\(.*?\)\]\(http.*\/([a-zA-Z0-9\-]+\.(gif|png|jpg))\)_![\1]\(images\/\1\.\3\)_g' $file

#First line header is set to ##
sed -i '1s/^#/##/g' $file

#Next lines headers upper limit is ###
sed -i '2,$s/^#/###/g' $file

#Remove the backup file, if perl -i generated one on this platform
rm -f "$file.bak"

So this little script takes a blog post URL and then performs the following steps:

  1. It first generates a file name from the blog post URL
  2. It creates a temporary file
  3. It downloads the blog HTML content to the temporary file
  4. Using Perl, it extracts the article content from the enveloping HTML markup
  5. Using Pandoc, it transforms the extracted HTML content to Markdown
  6. With Perl, it then formats all code blocks to the Leanpub-supported format
  7. It also removes unnecessary blocks (e.g. follow-me notes or GitHub references)
  8. All images are changed to reference a relative book repository folder
  9. Headers are migrated to the Leanpub chapter format
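
Two of these steps (1 and 9) are plain text transformations, so they can be tried offline on sample input. The sketch below is only an illustration: `url_to_filename` and `shift_headers` are hypothetical helper names, `example.com` is a placeholder, and the file-name expression is a simplified stand-in rather than the exact pattern from the script. The header expressions are the same two used in the script, applied to standard input instead of a file.

```shell
# Hypothetical helper: derive a file name from the URL path
# (drop scheme and host, trim trailing slashes, turn slashes into dashes).
url_to_filename() {
  printf '%s\n' "$1" | sed -E -e 's_^[a-z]+://[^/]+/__' -e 's_/+$__' -e 's_/_-_g'
}

# Hypothetical helper: the two header expressions from the script,
# reading standard input instead of editing a file in place.
shift_headers() {
  sed -e '1s/^#/##/' -e '2,$s/^#/###/'
}

url_to_filename 'https://example.com/2015/01/05/my-post/'
# prints: 2015-01-05-my-post

printf '%s\n' '# Post title' 'Some text' '# Section' '## Subsection' | shift_headers
# prints:
# ## Post title
# Some text
# ### Section
# #### Subsection
```

Note that the second header expression prepends two extra `#` characters to every header after the first line, so a `##` header coming out of Pandoc ends up as `####`.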

All in all, I managed to import all the articles in a timely manner, so I can concentrate on writing instead.
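
Since the script handles one URL at a time, a small driver loop can run it over a whole list of posts. This is a minimal sketch, assuming the script above is saved as `import.sh` and the post URLs are kept one per line in `urls.txt`; both file names and the `import_all` helper are made up for illustration.

```shell
# Hypothetical batch driver: runs the given command once per URL in a list file.
# Blank lines and lines starting with '#' are skipped.
import_all() {
  local list=$1
  shift
  local url
  while IFS= read -r url || [ -n "$url" ]; do
    case $url in
      ''|'#'*) continue ;;
    esac
    # Keep the command from consuming the rest of the URL list on stdin
    "$@" "$url" < /dev/null || echo "Import failed for $url" >&2
  done < "$list"
}

# Example usage (placeholder names):
# import_all urls.txt ./import.sh
```

A failed import is reported on standard error and the loop moves on to the next URL, so one broken post does not stop the whole batch.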
