Blog Import - Lessons Learned

So I have been blogging (very irregularly) since 2003. If I remember correctly, I dabbled with Movable Type before eventually settling on WordPress. When I recently transitioned to Pelican as a static blogging engine, I didn't bother to import posts from my WordPress installation (for one thing, the Pelican import tool only operates on a WordPress XML export, and I just had a database backup).

Over the holidays, I decided to make an attempt at converting all that content to markdown. It was a challenge, but I was able to pull over most of it.

First, I needed a quick script to pull posts from the wp_posts table in a WordPress database. I came up with the following:

import os
import sys
import traceback

import MySQLdb as sql

def connect():
    return sql.connect(host='127.0.0.1', user='root', passwd='root', db='blog')

def write_post(row):
    try:
        year = row['post_date'].year

        if not os.path.exists('content/%s' % year):
            os.makedirs('content/%s' % year)

        slug = row['post_name'].replace('-', '_')

        with open('content/%s/%s_%s.md' % (year, row['ID'], slug), 'w') as fp:
            # Pelican markdown metadata header
            fp.write('Title: %s\n' % row['post_title'])
            fp.write('Date: %s\n' % row['post_date'])
            fp.write('Tags: imported\n')
            fp.write('Category:\n')
            fp.write('Slug: %s\n' % slug)
            fp.write('\n')
            fp.write(row['post_content'])
            fp.write('\n')
    except Exception as e:
        print(e)
        traceback.print_exc()
        sys.exit(1)

if __name__ == '__main__':
    cnx = connect()

    with cnx:
        # Server-side cursor, so the whole result set isn't pulled into memory at once
        cursor = cnx.cursor(sql.cursors.SSDictCursor)

        cursor.execute("""
            select ID, post_date, post_title, post_name, post_content
            from wp_posts
            where post_status = 'publish';
        """)

        rows = cursor.fetchmany(size=10)

        while len(rows) > 0:
            for row in rows:
                print(row['post_name'])
                write_post(row)

            rows = cursor.fetchmany(size=10)
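For reference, each file the script writes begins with a Pelican-style markdown metadata block followed by the raw post body. A generated file looks roughly like this (the values here are illustrative, not a real post):

```markdown
Title: An Old Post
Date: 2005-03-14 09:26:53
Tags: imported
Category:
Slug: an_old_post

The original post body, still containing whatever HTML WordPress stored...
```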

There was more post-processing I should have done in this script: removing some legacy HTML tags, and joining with the tags tables to populate the Tags field. Instead, I came up with a few regex patterns and used Atom to do a search and replace within the content path, and I ended up going through all 300+ posts by hand to re-tag and re-categorize them.
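As a sketch of what that post-processing could have looked like inside the script itself: stripping legacy presentational tags with a regex, and pulling per-post tags through WordPress's standard taxonomy tables. The specific tag names in the pattern are illustrative assumptions, not the actual patterns I used.

```python
import re

# Representative legacy tags to strip -- the real offenders in my
# content were found by inspection, these names are assumptions.
LEGACY_TAG_RE = re.compile(r'</?(?:font|center|big)[^>]*>', re.IGNORECASE)

def strip_legacy_tags(text):
    """Remove legacy presentational HTML left over from old posts."""
    return LEGACY_TAG_RE.sub('', text)

# Query to fetch a post's tags via WordPress's taxonomy tables,
# parameterized on the post ID (would run against the same cursor).
TAGS_SQL = """
    select t.name
    from wp_terms t
    join wp_term_taxonomy tt on tt.term_id = t.term_id
    join wp_term_relationships tr on tr.term_taxonomy_id = tt.term_taxonomy_id
    where tr.object_id = %s and tt.taxonomy = 'post_tag'
"""
```

With something like this in place, write_post could have cleaned post_content before writing it and filled the Tags field from the query results instead of hardcoding "imported".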

Below are the biggest takeaways from this process: