A while ago I found out some of my older posts were cut-off in the middle of a sentence or even in the middle of a word. Turns out that somewhere along the way of performing WordPress upgrades the MySQL tables’ collation had changed from a latin1 variant to utf8. Turns out that this conversion doesn’t work without it’s flaws and ends up cutting off your posts where “special characters” are used in the text. So if you use a word like “trés” your post suddenly ends with “tr”…
I wasn’t very happy with that as you can imagine. What’s even worse is that this kind of data corruption is hard to detect as well. I ran into it by accident while dishing up an old post to link to. How many more of my posts were also mangled like this? So it was time to dish up some SQL magic to help and figure out how big the damage was.
I ended up writing this simple statement which detects any posts not having their last character in a list of characters which typically end a post. A dot for example as a last character indicates a full sentence and without a space after it chances are big it was the last sentence as well. The query also displays the last 50 characters of the post which allows you to judge if a post really has been cut off or if it’s a false positive.
SELECT id, right(post_content, 50), post_content
WHERE right(post_content, 1) NOT IN (".", "!", ")", "?", ">", "]") AND post_status = "publish"
All I needed after this was a backup of my database that still contained an intact copy of the affected posts. Luckily I have been using the WP backup tool to email me a database backup every day since 2007 so that wasn’t a problem. To restore the content of the posts I transformed the SQL insert statement from the backup to an update statement looking like the following to update the post back into it’s original state:
SET post_content = '<Old post content goes here>'
WHERE id = <id>
The special characters in the post were in most cases transformed into blocky question mark symbols when reloading the post in the WP editor. So after some manual labour to reset those into the proper characters the post was ready to be republished and finally restored. Yay!
So this proves once again that there’s no such thing as too many backups and that collation issues are a bitch for all developers out there. Some tables are still in latin1 so I think I’ll have to convert those at some point to utf8 to be safe. But that’s something for later.
Photo by daveknapik, cc-licensed.