Jul - Meta - Quoting and previewing are broken (sometimes)
Joe
Common spammer
🗿
Level: 104


Posts: 3141/3287
EXP: 11518864
For next: 343262

Since: 08-02-07
From: Pororoca

Since last post: 20 days
Last activity: 15 hours

Posted on 07-14-14 02:37:11 PM Link | Quote
You can't quote this post.

—×

If you include one of those characters in your post and click on preview, the textarea will be empty.
Xkeeper






Posted on 07-15-14 12:00:08 AM Link | Quote
I have no idea why this happens
Xkeeper






Posted on 07-15-14 12:05:56 AM Link | Quote
Originally posted by Joe
You can't quote this post.

—×

If you include one of those characters in your post and click on preview, the textarea will be empty.

inu pointed it out, php apparently defaults to utf8 now and we're still rocking good ol' ISO-8859-1.
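A rough sketch of the failure, in Python rather than PHP: the two characters from the first post, sent as Windows-1252 bytes, are not valid UTF-8, so a strict decoder rejects the whole string. The helper name `decode_or_empty` is made up for illustration; it only imitates an escaping step that returns nothing on malformed input (PHP's htmlspecialchars behaves this way by default, though whether that is the exact call involved on the board is an assumption).

```python
# The two characters from the first post, as a Windows-1252 browser would send them.
raw = "—×".encode("cp1252")  # two single bytes: 0x97 and 0xD7


def decode_or_empty(data: bytes) -> str:
    """Hypothetical stand-in for an escaping step that gives up on bad UTF-8."""
    try:
        return data.decode("utf-8")  # strict decoding, no error recovery
    except UnicodeDecodeError:
        # One bad byte is enough to lose the entire text, not just that byte.
        return ""


if __name__ == "__main__":
    print(repr(decode_or_empty(b"plain ascii")))
    print(repr(decode_or_empty(b"You can't quote this post. " + raw)))
```

This would match the symptom of the textarea coming back empty instead of just showing a couple of broken characters.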


"if php runs into something unexpected, catch fire and throw away everything without even printing a warning". thx
StapleButter
Member
Level: 40


Posts: 93/461
EXP: 408400
For next: 32909

Since: 02-24-13
From: your dreams

Since last post: 3 days
Last activity: 1 day

Posted on 07-19-14 12:21:11 PM Link | Quote
It's also worth noting that the board's pages don't specify an encoding.

Here, Firefox defaults to Windows-1252, so it works, but it might break if the browser chooses to use something else as a default. Hell, I remember there's a related security issue with IE.
Joe
Common spammer
🗿
Level: 104


Posts: 3143/3287
EXP: 11518864
For next: 343262

Since: 08-02-07
From: Pororoca

Since last post: 20 days
Last activity: 15 hours

Posted on 07-19-14 12:45:16 PM Link | Quote
Throwing everything away is actually a feature. It's impossible to have any UTF-8 exploits if the parser immediately gives up when it hits something malformed.

It'd be nice if the board at least had a <meta> tag or something, at least until the eventual conversion to UTF-8.
StapleButter
Member
Level: 40


Posts: 95/461
EXP: 408400
For next: 32909

Since: 02-24-13
From: your dreams

Since last post: 3 days
Last activity: 1 day

Posted on 07-19-14 12:48:52 PM Link | Quote
The trickiest part in the conversion is that the database uses Latin1. Every text field in there would have to be reencoded to UTF8.

If you want to change the database to use UTF8, you have to change the encoding used by every table but also every text column, because MySQL is silly like that. This isn't necessary, though. Database encoding doesn't matter to PHP's MySQL backend which just returns the raw data. However, other applications may care about the encoding and not work properly.
Xkeeper






Posted on 07-24-14 01:16:31 AM Link | Quote
the problem with adding a meta tag is (like stated) everything previously posted ends up half-corrupt. there actually used to be a meta tag for a little while before we found that little feature out
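One way the half-corruption could come about, sketched in Python under the assumption that the database ends up holding a mix of Windows-1252 and UTF-8 bytes: whichever charset the page declares, the posts stored in the other one render wrong. The names `old_post`, `new_post` and `render` are illustrative only and not taken from the board's code.

```python
old_post = "—×".encode("cp1252")  # stored by a browser that assumed Windows-1252
new_post = "—×".encode("utf-8")   # stored by a browser that was told UTF-8


def render(stored: bytes, declared: str) -> str:
    """Decode stored bytes the way a browser would for a declared charset."""
    return stored.decode(declared, errors="replace")


if __name__ == "__main__":
    for declared in ("cp1252", "utf-8"):
        print(declared, repr(render(old_post, declared)), repr(render(new_post, declared)))
```

Declaring cp1252 turns the UTF-8 post into mojibake, and declaring UTF-8 turns the cp1252 post into replacement characters, so no single meta tag fixes both.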
Joe
Common spammer
🗿
Level: 104


Posts: 3147/3287
EXP: 11518864
For next: 343262

Since: 08-02-07
From: Pororoca

Since last post: 20 days
Last activity: 15 hours

Posted on 07-24-14 12:59:27 PM Link | Quote
Posts already show up corrupt all the time.

Or does PHP detect the meta tag and become even more dumb than usual?
Xkeeper






Posted on 07-24-14 02:39:44 PM Link | Quote
I forget the specifics.
IIMarckus
Member
Level: 23


Posts: 102/107
EXP: 63576
For next: 4147

Since: 10-11-08


Since last post: 172 days
Last activity: 2.0 years

Posted on 07-25-14 01:52:00 PM Link | Quote
Originally posted by StapleButter
It's also worth noting that the board's pages don't specify an encoding.

Here, Firefox defaults to Windows-1252

Not all the time, which is part of the problem. By the way, I can quote and preview Joe’s original post. As usual, different people have different behavior in their browsers, and that leads to a mess.

The way to go here would be:

  1. Dump every field of the database to a separate file.
  2. Use something like uchardet to figure out exactly how each field is encoded.
  3. Use iconv to convert to UTF-8.
  4. (Maybe not necessary: strip out any UTF-8 byte‐order marks, the sequence of three bytes 0xEF,0xBB,0xBF. This shouldn’t happen but Windows likes to add them to UTF-8 files; PHP and MySQL do not play well with them.)
  5. Reconstruct the database from scratch using MySQL’s utf8mb4 encoding (not utf8, because in true MySQL form that doesn’t actually support the full range of Unicode).
  6. Have the board emit the following HTTP header: Content-Type: text/html;charset=utf-8
  7. And the following meta tag: <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
  8. Keep a read‐only copy of the old board/database, just in case some posts got truncated.

After that, as long as nobody manually sets their browser otherwise, everything and everyone will be using UTF-8 and there will be no more problems.
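Steps 2 to 4 above could look roughly like this for a single field, as a Python sketch rather than the uchardet and iconv tools named in the list. The function `to_utf8` and its `fallback` parameter are hypothetical. The "detection" here is a crude assumption: anything that is not already valid UTF-8 is treated as Windows-1252.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Re-encode one database field as UTF-8 (hypothetical helper)."""
    try:
        text = raw.decode("utf-8")  # already valid UTF-8: keep it as it is
    except UnicodeDecodeError:
        # Assumption: not valid UTF-8 means Windows-1252.
        # A real run would consult a detector such as uchardet first.
        # Bytes undefined in the fallback become U+FFFD instead of aborting.
        text = raw.decode(fallback, errors="replace")
    if text.startswith("\ufeff"):
        text = text[1:]  # drop a leading byte-order mark (EF BB BF on disk)
    return text.encode("utf-8")


if __name__ == "__main__":
    print(to_utf8(b"\x97\xd7"))          # Windows-1252 input
    print(to_utf8(b"\xef\xbb\xbfabc"))   # UTF-8 input with a BOM
```

Trying UTF-8 first is reasonably safe because Windows-1252 text with accented or punctuation bytes is rarely valid UTF-8 by accident, but it is a heuristic and not a guarantee.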

Originally posted by Joe
To be fair, I turned off Firefox's autodetect to make sure my posts get HTML-escaped when I post Japanese text, so it might happen less often to you than to me.

This helps for the moment but is not perfect—it only converts characters that aren’t representable in your native encoding. Since the board usually shows up for you as CP-1252, characters like × and — (which are valid in CP-1252) don’t get escaped when you post, and are inserted raw in the database.
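The escaping behavior can be imitated in Python with the `xmlcharrefreplace` error handler, which, like a browser submitting a form, only escapes what the target encoding cannot represent. The function name `submit_as` is made up for this sketch, and it is only an approximation of what a browser does.

```python
def submit_as(text: str, page_encoding: str) -> bytes:
    """Encode form text; unrepresentable characters become &#NNNN; references."""
    return text.encode(page_encoding, errors="xmlcharrefreplace")


if __name__ == "__main__":
    # Japanese text gets escaped, but the dash and multiplication sign
    # exist in Windows-1252 and go through as raw single bytes.
    print(submit_as("日本 —×", "cp1252"))
```

The Japanese characters arrive as harmless ASCII references, while the two punctuation characters arrive as the raw bytes 0x97 and 0xD7.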

Originally posted by Joe
Or does PHP detect the meta tag and become even more dumb than usual?

Thankfully not. That would be a nightmare…

Joe
Common spammer
🗿
Level: 104


Posts: 3148/3287
EXP: 11518864
For next: 343262

Since: 08-02-07
From: Pororoca

Since last post: 20 days
Last activity: 15 hours

Posted on 07-25-14 03:58:12 PM Link | Quote
Originally posted by IIMarckus
Originally posted by Joe
To be fair, I turned off Firefox's autodetect to make sure my posts get HTML-escaped when I post Japanese text, so it might happen less often to you than to me.

This helps for the moment but is not perfect—it only converts characters that aren’t representable in your native encoding. Since the board usually shows up for you as CP-1252, characters like × and — (which are valid in CP-1252) don’t get escaped when you post, and are inserted raw in the database.


Many posts already contain punctuation encoded as Windows-1252, not just mine. I made this thread because I came across a post I couldn't quote.

It would probably be best to assume Windows-1252 when the code page detection is inconclusive.
IIMarckus
Member
Level: 23


Posts: 103/107
EXP: 63576
For next: 4147

Since: 10-11-08


Since last post: 172 days
Last activity: 2.0 years

Posted on 07-25-14 04:43:50 PM Link | Quote
Originally posted by Joe
Originally posted by IIMarckus
Originally posted by Joe
To be fair, I turned off Firefox's autodetect to make sure my posts get HTML-escaped when I post Japanese text, so it might happen less often to you than to me.

This helps for the moment but is not perfect—it only converts characters that aren’t representable in your native encoding. Since the board usually shows up for you as CP-1252, characters like × and — (which are valid in CP-1252) don’t get escaped when you post, and are inserted raw in the database.


Many posts already contain punctuation encoded as Windows-1252, not just mine. I made this thread because I came across a post I couldn't quote.

Sure, I wasn’t trying to imply otherwise.

Originally posted by Joe
It would probably be best to assume Windows-1252 when the code page detection is inconclusive.

uchardet seems to be pretty good at detecting that things are 1252. But I haven’t tried it on anything as large as this database, just a few individual pages saved from the board.

usr_share
70
Level: 19


Posts: 77/79
EXP: 31805
For next: 3972

Since: 03-12-12


Since last post: 2.0 years
Last activity: 171 days

Posted on 07-26-14 05:12:49 AM Link | Quote
I thought a good idea would be to impose a charset only on new threads (with a larger ID or created after a certain date), so that older threads aren't broken, and newer ones work better.
Lyskar
12210
-The Chaos within trumps the Chaos without-
Level: 182


Posts: 12161/12211
EXP: 82805543
For next: 99092

Since: 07-03-07
From: 52-2-88-7

Since last post: 2.0 years
Last activity: 2.0 years

Posted on 07-27-14 01:23:37 AM Link | Quote
Eh. The better answer is to know that the codebase is hopelessly ancient and unlikely to really be updated unless it's a spot fix.

This isn't a spot fix.

It could be fixed 'some day' but this sort of issue's been known for a good while... and well, surprise, nothing can be done about it when everyone has jobs and is busy.
Xkeeper






Posted on 07-19-17 01:31:21 AM Link | Quote
Originally posted by Joe
You can't quote this post.

—×

If you include one of those characters in your post and click on preview, the textarea will be empty.


So not only is this fixed, but all of the tables are internally utf8mb4 now (a post about this in a certain other thread is coming soon)
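For anyone wondering why utf8mb4 rather than MySQL's older utf8: the older charset stores at most three bytes per character, which covers only code points up to U+FFFF. A small Python check, with the hypothetical helper name `needs_utf8mb4`, shows which text would not fit:

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character lies above U+FFFF, i.e. needs 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


if __name__ == "__main__":
    print(needs_utf8mb4("—×"))   # both fit in three bytes or fewer
    print(needs_utf8mb4("🗿"))   # U+1F5FF takes four bytes in UTF-8
```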

I see I probably fixed it before, but I'm still fixing some issues.

Originally posted by Lyskar
Eh. The better answer is to know that the codebase is hopelessly ancient and unlikely to really be updated unless it's a spot fix.

This isn't a spot fix.

It could be fixed 'some day' but this sort of issue's been known for a good while... and well, surprise, nothing can be done about it when everyone has jobs and is busy.

Quoting this just for posterity.

It's not wrong... yet.
Rusted Logic

Acmlmboard - commit 2f1bc75 [2017-08-27]
©2000-2017 Acmlm, Xkeeper, Inuyasha, et al.