
Jul - Meta - Quoting and previewing are broken (sometimes)
  
Thread history
User | Post
Xkeeper
Posts: 22216/22605
Originally posted by Joe
You can't quote this post.

—×

If you include one of those characters in your post and click on preview, the textarea will be empty.


So not only is this fixed, but all of the tables are internally utf8mb4 now (a post about this in a certain other thread is coming soon)

I see I probably fixed it before, but I'm still fixing some issues.

Originally posted by Lyskar
Eh. The better answer is to know that the codebase is hopelessly ancient and unlikely to really be updated unless it's a spot fix.

This isn't a spot fix.

It could be fixed 'some day' but this sort of issue's been known for a good while... and well, surprise, nothing can be done about it when everyone has jobs and is busy.

Quoting this just for posterity.

It's not wrong... yet.
Lyskar
Posts: 12161/12211
Eh. The better answer is to know that the codebase is hopelessly ancient and unlikely to really be updated unless it's a spot fix.

This isn't a spot fix.

It could be fixed 'some day' but this sort of issue's been known for a good while... and well, surprise, nothing can be done about it when everyone has jobs and is busy.
usr_share
Posts: 77/79
I thought a good idea would be to impose a charset only on new threads (with a larger ID or created after a certain date), so that older threads aren't broken, and newer ones work better.
IIMarckus
Posts: 103/107
Originally posted by Joe
Originally posted by IIMarckus
Originally posted by Joe
To be fair, I turned off Firefox's autodetect to make sure my posts get HTML-escaped when I post Japanese text, so it might happen less often to you than to me.

This helps for the moment but is not perfect—it only converts characters that aren’t representable in your native encoding. Since the board usually shows up for you as CP-1252, characters like × and — (which are valid in CP-1252) don’t get escaped when you post, and are inserted raw in the database.


Many posts already contain punctuation encoded as Windows-1252, not just mine. I made this thread because I came across a post I couldn't quote.

Sure, I wasn’t trying to imply otherwise.

Originally posted by Joe
It would probably be best to assume Windows-1252 when the code page detection is inconclusive.

uchardet seems to be pretty good at detecting that things are 1252. But I haven’t tried it on anything as large as this database, just a few individual pages saved from the board.

Joe
Posts: 3148/3289
Originally posted by IIMarckus
Originally posted by Joe
To be fair, I turned off Firefox's autodetect to make sure my posts get HTML-escaped when I post Japanese text, so it might happen less often to you than to me.

This helps for the moment but is not perfect—it only converts characters that aren’t representable in your native encoding. Since the board usually shows up for you as CP-1252, characters like × and — (which are valid in CP-1252) don’t get escaped when you post, and are inserted raw in the database.


Many posts already contain punctuation encoded as Windows-1252, not just mine. I made this thread because I came across a post I couldn't quote.

It would probably be best to assume Windows-1252 when the code page detection is inconclusive.
IIMarckus
Posts: 102/107
Originally posted by StapleButter
It's also worth noting that the board's pages don't specify an encoding.

Here, Firefox defaults to Windows-1252

Not all the time, which is part of the problem. By the way, I can quote and preview Joe’s original post. As usual, different people have different behavior in their browsers, and that leads to a mess.

The way to go here would be:

  1. Dump every field of the database to a separate file.
  2. Use something like uchardet to figure out exactly how each field is encoded.
  3. Use iconv to convert to UTF-8.
  4. (Maybe not necessary.) Strip out any UTF-8 byte‐order marks, i.e. the three‐byte sequence 0xEF 0xBB 0xBF. This shouldn’t happen, but Windows likes to add them to UTF-8 files, and PHP and MySQL do not play well with them.
  5. Reconstruct the database from scratch using MySQL’s utf8mb4 encoding (not utf8, because in true MySQL form that doesn’t actually support the full range of Unicode).
  6. Have the board emit the following HTTP header: Content-Type: text/html;charset=utf-8
  7. And the following meta tag: <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
  8. Keep a read‐only copy of the old board/database, just in case some posts got truncated.

After that, as long as nobody manually sets their browser otherwise, everything and everyone will be using UTF-8 and there will be no more problems.
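
Steps 2 to 4 of the list above could look roughly like this for a single field. This is only a sketch in Python rather than the uchardet/iconv command-line tools named in the post: the function name `to_utf8` and the `detected` parameter (standing in for whatever encoding name a detector reports) are made up for illustration, and falling back to Windows-1252 when detection is inconclusive is an assumption taken from Joe's suggestion, not something the board is known to do.

```python
from typing import Optional

FALLBACK = "cp1252"  # assumed default when detection is inconclusive


def to_utf8(raw: bytes, detected: Optional[str] = None) -> bytes:
    """Re-encode one dumped field as UTF-8 without a byte-order mark."""
    encoding = detected or FALLBACK
    try:
        text = raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # latin-1 maps every byte to a code point, so this last resort
        # never fails; it may still produce the wrong characters.
        text = raw.decode("latin-1")
    # Step 4: drop a leading byte-order mark if one was decoded.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text.encode("utf-8")


# The two characters from the first post, as stored in Windows-1252.
converted = to_utf8(b"\x97\xd7", "cp1252")
```

Keeping the original bytes around (step 8) still matters, because a wrong guess in `detected` silently yields valid but incorrect UTF-8.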

Originally posted by Joe
To be fair, I turned off Firefox's autodetect to make sure my posts get HTML-escaped when I post Japanese text, so it might happen less often to you than to me.

This helps for the moment but is not perfect—it only converts characters that aren’t representable in your native encoding. Since the board usually shows up for you as CP-1252, characters like × and — (which are valid in CP-1252) don’t get escaped when you post, and are inserted raw in the database.
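
A rough illustration of the point above, using Python's `xmlcharrefreplace` error handler as a stand-in for what a browser does when it submits a form in the page's encoding. The helper name `submit_as` is hypothetical, and real browsers may differ in detail.

```python
def submit_as(text: str, page_encoding: str = "cp1252") -> bytes:
    """Encode form text the way the page encoding allows.

    Characters the encoding cannot represent become numeric character
    references; everything else is sent as raw bytes.
    """
    return text.encode(page_encoding, errors="xmlcharrefreplace")


# Japanese text has no Windows-1252 representation, so it is escaped.
japanese = submit_as("\u65e5\u672c\u8a9e")

# An em dash and a multiplication sign exist in Windows-1252, so they
# go out as the raw bytes 0x97 and 0xD7 with no escaping at all.
punctuation = submit_as("\u2014\u00d7")
```

The escaped form is plain ASCII and survives any later reinterpretation, while the raw bytes only mean the right thing to software that also assumes Windows-1252.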

Originally posted by Joe
Or does PHP detect the meta tag and become even more dumb than usual?

Thankfully not. That would be a nightmare…

Xkeeper
Posts: 21331/22605
I forget the specifics.
Joe
Posts: 3147/3289
Posts already show up corrupt all the time.

Or does PHP detect the meta tag and become even more dumb than usual?
Xkeeper
Posts: 21330/22605
the problem with adding a meta tag is (like stated) everything previously posted ends up half-corrupt. there actually used to be a meta tag for a little while before we found that little feature out
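
A small sketch of why declaring UTF-8 on the existing pages corrupts older posts. It assumes the old posts were stored as Windows-1252 bytes, as discussed elsewhere in the thread; the sample text and the helper name `render_as_utf8` are invented for illustration.

```python
def render_as_utf8(raw: bytes) -> str:
    """Decode stored bytes the way a browser told 'charset=utf-8' would,
    substituting U+FFFD for anything that is not valid UTF-8."""
    return raw.decode("utf-8", errors="replace")


sample = "caf\u00e9 \u2014 ok"

old_post = sample.encode("cp1252")  # stored before any conversion
new_post = sample.encode("utf-8")   # stored after switching to UTF-8

old_rendered = render_as_utf8(old_post)  # accented letter and dash are lost
new_rendered = render_as_utf8(new_post)  # comes back intact
```

Plain ASCII in the old post still displays correctly, which is why the result looks half-corrupt rather than completely broken.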
StapleButter
Posts: 95/461
The trickiest part in the conversion is that the database uses Latin1. Every text field in there would have to be reencoded to UTF8.

If you want to change the database to use UTF8, you have to change the encoding used by every table but also every text column, because MySQL is silly like that. This isn't necessary, though. Database encoding doesn't matter to PHP's MySQL backend which just returns the raw data. However, other applications may care about the encoding and not work properly.
Joe
Posts: 3143/3289
Throwing everything away is actually a feature. It's impossible to have any UTF-8 exploits if the parser immediately gives up when it hits something malformed.

It'd be nice if the board at least had a <meta> tag or something, at least until the eventual conversion to UTF-8.
StapleButter
Posts: 93/461
It's also worth noting that the board's pages don't specify an encoding.

Here, Firefox defaults to Windows-1252, so it works, but it might break if the browser chooses to use something else as a default. Hell, I remember there's a related security issue with IE.
Xkeeper
Posts: 21324/22605
Originally posted by Joe
You can't quote this post.

—×

If you include one of those characters in your post and click on preview, the textarea will be empty.

inu pointed it out, php apparently defaults to utf8 now and we're still rocking good ol' ISO-8859-1.


"if php runs into something unexpected, catch fire and throw away everything without even printing a warning". thx
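
The behaviour described here can be imitated in a few lines of Python. This is not the board's code or PHP's implementation, only a sketch of a strict UTF-8 consumer that returns nothing on malformed input; the function name `strict_utf8_or_empty` is hypothetical.

```python
def strict_utf8_or_empty(raw: bytes) -> str:
    """Return the decoded text, or an empty string if the bytes are not
    valid UTF-8 (no warning, no partial result)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return ""


# The original post as it would sit in a Windows-1252 database:
# 0x97 is the em dash and 0xD7 is the multiplication sign.
stored = b"You can't quote this post. \x97\xd7"

quoted = strict_utf8_or_empty(stored)                     # empty textarea
plain = strict_utf8_or_empty(b"Nothing unusual in here")  # unchanged
```

Under that assumption a single stray byte is enough to blank the whole quote or preview, which matches the empty textarea reported in the first post.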
Xkeeper
Posts: 21323/22605
I have no idea why this happens
Joe
Posts: 3141/3289
You can't quote this post.

—×

If you include one of those characters in your post and click on preview, the textarea will be empty.