Finding Multi-byte Characters in MySQL Fields

This is an old post!

This post is over 2 years old. Solutions referenced in this article may no longer be valid. Please consider this when utilizing any information referenced here.

So I was recently helping a client with an issue in MySQL where a migration failed to transfer the full contents of some fields. This amounted to a little over 1% of the total messages transferred. In doing some research, we discovered that the one thing every message had in common was the presence of multi-byte (high unicode) characters. In many cases, this was due to a user pasting some text from Microsoft Word.

Remember, there is a correct way to dump and restore data from MySQL, and it doesn’t involve piping output. That was the root cause of this problem.

So, how to clean up the mess? Well, fortunately, we still had the old database up. So we needed to figure out what messages might be suspicious and need repair. That means finding fields that contain unicode characters.

There are several other articles on the Internet, but they are mostly wrong. They all tell you to do things like this:

SELECT * FROM TABLE WHERE NOT HEX(COLUMN) REGEXP '^([0-7][0-9A-F])*$';
SELECT id FROM posts WHERE LENGTH(body) != CHAR_LENGTH(body);

THIS IS WRONG. DO NOT DO THIS. User eggyal in this Stack Overflow thread :explains why

Suppose, for example, that one has a UTF-16 column containing ‘ā’ (encoded by the byte sequence 0x0101) - it would be deemed “ASCII” using this test: a false negative; indeed, some character sets do not encode ASCII characters within 0x00 to 0x7f whereupon this solution would yield a false positive.

Many character sets are fixed-length and so LENGTH(column) will be a constant multiple of CHAR_LENGTH(column) irrespective of the value.

This is the correct way:

SELECT id FROM posts WHERE body <> CONVERT(body USING ASCII)

Ollie Jones explains why:

The CONVERT(col USING charset) function will turns the unconvertable characters into replacement characters. Then, the converted and unconverted text will be unequal.

Armed with this, I was able to write a program to query all the suspicious messages and compare them with the data in the old database. Where they were not the same, we were able to restore the old data.

Comments (0)

Interested in why you can't leave comments on my blog? Read the article about why comments are uniquely terrible and need to die. If you are still interested in commenting on this article, feel free to reach out to me directly and/or share it on social media.

Contact Me
Share It
MySQL
Discovered something neat with the new version of MySQL and thought it warranted a mention. Storing tree structures in a relational database is a common use case across many different areas of tech. The problem comes when you need to construct a query based on a subset of that tree. But MySQL 8 has some nice new features that makes doing this a breeze.
Read More
Linux
Here’s a little quickie for you. Say you have a small MySQL server floating around your house that you want to have regular backups of. You do want regular backups right? In my case, the biggest motivation was wanting a regular way to grab a recent MySQL dump of an internal tool I use at home to develop against. After poking around the Internet a bit, I was surprised that, other than mysqldump itself, there doesn’t seem to be a simple tool out there that you can slam into a cronjob and let it do it’s thing. So, like any good hacker, I decided to brew my own. After all, when you have 256,428 different solutions, why not make solution 256,429? :)
Read More
MySQL
So I came across an interesting quirk in MySQL the other day. Let’s say you have a table schema and some values that look like this: +-------------------+------------------+------+-----+---------+-------+ | Field             | Type             | Null | Key | Default | Extra | +-------------------+------------------+------+-----+---------+-------+ | page_id       | varchar(30)      | YES  |     | NULL    |       | | clicks            | int(10) unsigned | YES  |     | NULL    |       | +-------------------+------------------+------+-----+---------+-------+ +---------+--------+ | page_id | clicks | +---------+--------+ | 1 | NULL | +---------+--------+ And then let’s say you pass the following SQL statement to MySQL: update page_click_count set clicks = clicks + 1 where page_id=1; If you come from a loosely-typed language such as PHP, you would probably expect clicks for page_id 1 to now be 1. But that’s not the case in MySQL. After the query is run, the table will still look like this: +---------+--------+ | page_id | clicks | +---------+--------+ | 1 | NULL | +---------+--------+ Not only does the query fail, but it fails with no warnings given. It appears that mathematical operations on null values silently fail. There are a couple of ways around this. The first and most obvious is to set NOT NULL and a default value on the column. In the example above, this would work. The NULL value in that field becomes a 0 and you can to normal mathematical operations on it. But what happens if, for whatever reason, you can’t do that? We actually have this situation in a few places at dealnews, where NULL represents a distinct value of that field that is different from 0. In this case, you can use COALESCE() to fill in the appropriate value for the field. update page_click_count set clicks = coalesce(clicks, 0) + 1 where page_id=1; Edit: Brian Moon informs me that this is actually part of the SQL specification. So hooray for specifications. Still, it’s kind of arcane; in working with MySQL (and PHP) for a decade now, this is the first time I’ve ever actually encountered this. Hopefully this helps someone who was as confused as I was.
Read More