Originally posted by StapleButter
It's also worth noting that the board's pages don't specify an encoding.
Here, Firefox defaults to Windows-1252
Not all the time, which is part of the problem. By the way, I can quote and preview Joe’s original post. As usual, different people have different behavior in their browsers, and that leads to a mess.
The way to go here would be:
- Dump every field of the database to a separate file.
- Use something like uchardet to figure out exactly how each field is encoded.
- Use iconv to convert to UTF-8.
- (Maybe not necessary: strip out any UTF-8 byte‐order marks, the sequence of three bytes 0xEF,0xBB,0xBF. This shouldn’t happen but Windows likes to add them to UTF-8 files; PHP and MySQL do not play well with them.)
- Reconstruct the database from scratch using MySQL’s utf8mb4 encoding (not utf8, because in true MySQL form that doesn’t actually support the full range of Unicode).
- Have the board emit the following HTTP header:
Content-Type: text/html;charset=utf-8
- And the following meta tag:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
- Keep a read‐only copy of the old board/database, just in case some posts got truncated.
After that, as long as nobody manually sets their browser otherwise, everything and everyone will be using UTF-8 and there will be no more problems.
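For illustration, here is a rough Python sketch of the "strip the BOM and convert to UTF-8" steps for a single dumped field. It is not a replacement for uchardet and iconv. It assumes a field is either already valid UTF-8 or else Windows-1252, which is a simplification, and the function name `to_utf8` and the `fallback` parameter are made up for this example.

```python
import codecs


def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return `raw` re-encoded as UTF-8 without a byte-order mark.

    Assumption: the field is either valid UTF-8 or in `fallback`.
    Real data may need proper detection (e.g. uchardet) instead.
    """
    # Drop a leading UTF-8 BOM (0xEF,0xBB,0xBF) if one is present.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    try:
        # Strict decoding: bytes that are not valid UTF-8 raise an error.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to the assumed legacy encoding; bytes undefined in
        # that encoding become U+FFFD rather than aborting the run.
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")


if __name__ == "__main__":
    # 0xE9 alone is not valid UTF-8, so it is treated as CP-1252.
    print(to_utf8(b"caf\xe9"))
```

Trying strict UTF-8 first works reasonably well because legacy single-byte text with non-ASCII characters rarely happens to be valid UTF-8, but it is still a heuristic, so keeping the read-only copy of the old database remains a good idea.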
Originally posted by Joe
To be fair, I turned off Firefox's autodetect to make sure my posts get HTML-escaped when I post Japanese text, so it might happen less often to you than to me.
This helps for the moment but is not perfect—it only converts characters that aren’t representable in your native encoding. Since the board usually shows up for you as CP-1252, characters like × and — (which are valid in CP-1252) don’t get escaped when you post, and are inserted raw in the database.
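To make that concrete, here is a small Python sketch that imitates the behavior. It assumes the browser acts like Python's `xmlcharrefreplace` error handler when submitting a form in CP-1252, which matches the escaping described above but is only an approximation of what any particular browser does. The name `submit_as_cp1252` is made up for this example.

```python
def submit_as_cp1252(text: str) -> bytes:
    """Encode form text as CP-1252, escaping what does not fit.

    Characters without a CP-1252 code point become numeric character
    references; everything else is sent as a raw CP-1252 byte.
    """
    return text.encode("cp1252", errors="xmlcharrefreplace")


if __name__ == "__main__":
    # Japanese text, then U+00D7 (multiplication sign) and U+2014 (em dash).
    post = "\u65e5\u672c\u8a9e \u00d7 \u2014"
    print(submit_as_cp1252(post))
    # b'&#26085;&#26412;&#35486; \xd7 \x97'
```

The Japanese characters survive as escapes, while the multiplication sign and the dash end up in the database as the raw bytes 0xD7 and 0x97, which are not valid UTF-8 on their own.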
Originally posted by Joe
Or does PHP detect the meta tag and become even more dumb than usual?
Thankfully not. That would be a nightmare…