Curse of the Redesign: Handling Existing Content

As I spoke about in the previous post about redesigns my work generally focuses on clients who have existing sites that need to be moved to a new Content Management System and have a new design applied to them.

Whilst this can have it’s advantages for design work by allowing the front end developers far more freedom with markup it has some challenges for the backend.

Discovery

Every database I have received have been from a different CMS, some have been MySQL and some have been Microsoft SQL database but all have required some discovery.

We do a lot of work in Drupal so I always import the old database to MySQL to make working between the two databases as smooth as possible.

Once I have the old database imported I then set about piecing together the structure. Most database are actually fairly simple when you start to look into them and follow the same patterns.

A base table with master ID
Associated tables with extra data tied to the masterID
A table joining master IDs together e.g. blog posts and authors

Once you have found the patterns it is usually the same for all content types that you will be importing.

Tidy Up

Running an import of data gives us a great chance to tidy up both content and data.

Tidying up the content can be simple clauses such as only import items that have been created in the last 5 years.

Or it can be more complicated. We worked on a project where the tagging had got out of hand so we set a limit of a maximum of 5 tags in the new site. This meant a decision had to be made on how to handle the import of a subset of tags.

We made a list of priority tags, if these were linked to from the item then they were included. If they were not then we had a manual review process. The manual job actually ended up being a lot smaller than anticipated and could be worked through very quickly.

Old data from a CMS is often littered with poor HTML that needs to be cleaned and during a pragmatic import of data is a perfect time to do it. A tool such as HTML Purifier can help you clean up that old mess.

Store IDs

Probably my top tip for importing data is to keep a mapping between the old IDs and the new IDs. This allows repeated updating of the data and also a useful way to compare the two data sets.

It also allows the import of join tables as mentioned above. If you import authors and store their old IDs when you import the blog posts you have a way of joining the two together. The basic logic would be

Find old author ID for this blog post
Look up mapping between old author id and new author ID
Set the blog post to be connected to the new author ID

Choose Candidates

If the dataset is large enough to warrant migration scripts then it is also too large to check every single imported item. Work with the client in selecting some candidate items for data comparison. This will mean you have useful reference points for both you and the client to cross check the imported data.

Use Your CMS Tools

Most Content Management Systems offer an API for data creation and Drupal is no exception. You should always use these APIs for content creation rather than inserting the data straight into the database.

Run It and Run It Again

The great thing about computers is you can get them to do things over and over again very quickly in exactly the same way.

We always have to run scripts many times. This can be due to new fields being added or data needing to be formatted in a different way.

It is also highly likely that lots of new content will be added between the time you got a copy of the old site database and the new site going live. All these items can also be imported saving the client manually adding them.

Get Stuck In

The first time I was faced with an unknown database it was quite overwhelming but many years on and many different databases in I’m no longer phased by it. My advice is to logically work through the data and always think about repeatedly running the import scripts.

Drupal 6 num_rows

I’ve been working on a site powered by Drupal and needed to find the number of rows a database query returned. Simple enough and quite common procedure. But, the Drupal database API no longer offers the db_num_rows() function.

I had a quick Google around and the answered seemed to be to run the query twice but the second time using a count query. Two calls to the database when only one is necessary? Not on my watch. The piece of code is to appear on the home page so it will be called upon frequently.

My solution was too loop through the results set and add the formatted HTML to an array. Then check if there are any items in the array, if so loop through it an output them if not display the message that there are currently no results.

I know it’s still doubling up but I think it is a slightly more elegant solution than the double database call.