Jul - Computers and Technology - Browser Automatic Encoding Detection
paulguy
Posted on 07-13-11 01:20:27 PM
Why can it suck sometimes? I wonder how it works... I had a surprising situation with Firefox not detecting some UTF-8 text. Does this board have any ISO-8859-encoded text around that would throw off autodetection? (I'm using the Kafuka theme, by the way.)

I know this seems like it might be a bug report/problem thread, but I kind of like this kind of discussion. Like, text/data encodings and stuff.

I honestly wish browsers could just default to UTF-8 and be done with it. Trouble is, there are all kinds of legacy sites lying around from before UTF-8 became particularly popular, and of course the Japanese (and probably others) who like to use their own shit encodings for their rather crap reasons. There's almost no reason other than dealing with legacy systems to use anything other than UTF-8. The slight decoding/encoding overhead is worth the benefits on a modern system.

Anyway, I'm just tired and ranting at an annoyance. :p

____________________
Liliana
Posted on 07-13-11 01:33:36 PM
Ahh, encoding detection. A lovely topic.

Internet Explorer seems especially bad at it (consider the well-known UTF-7 bug, for one). I noticed it when browsing Russian websites, which often use legacy encodings. To any sane human, it should be pretty obvious that Russian words do not have random capital letters in the middle, but for some reason IE doesn't grasp that and shows the text that way anyway.

A problem with this board is that it declares no character set in the page's header, so it's left to the browser to choose. As it turns out, browsers then assume the system encoding, which is a problem because that depends on your localization. For example, when Sanky posts text here, his browser assumes the system encoding, which is Windows-1250. When I view his post, Opera assumes the system encoding too, but in my case that is Windows-1252, and the result is a mess. There is no reliable way to tell these encodings apart since they're very alike, so the only fix would be to declare some kind of encoding in the page's header (even plain US-ASCII would help browsers a lot).
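
To make the Windows-1250 vs. Windows-1252 mix-up concrete, here is a small Python sketch. The sample word is an arbitrary pick for illustration (not something anyone actually posted); the point is only that the same bytes read differently under the two code pages.

```python
# Bytes that a browser assuming Windows-1250 (Central European) would send
# for the word "Łódź". The word is just an illustrative choice.
raw = "Łódź".encode("cp1250")

as_cp1250 = raw.decode("cp1250")  # what the poster meant
as_cp1252 = raw.decode("cp1252")  # what a reader on a Western European system sees

print(raw.hex(" "))   # the bytes on the wire
print(as_cp1250)
print(as_cp1252)
```

Only the letters that differ between the two code pages get mangled; anything in the ASCII range or shared between both comes through fine, which is why the two are so hard to tell apart automatically.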

____________________
paulguy
Posted on 07-13-11 02:01:54 PM
I don't have issues with Sanky's posts. His seem to fit plain ASCII just fine. Also, my system locale is set to UTF-8 as well, so there must be something on this site throwing it off and making it default to ISO-8859-1. What I don't get is why its priorities are so off that some random extended ASCII character sitting around would make it choose ISO-8859-1 even with a bunch of UTF-8 in a post.

How many languages do you know, anyway?

____________________
Liliana
Posted on 07-13-11 02:15:35 PM
It's pretty easy to explain. When the browser goes through the page and encounters, for example, the sequence "E9 20" (which would be "é " in ISO 8859-1), it checks to see if it's a valid UTF-8 sequence, and sees that it isn't, so it will assume that the entire page must use some 8-bit encoding. Now when it goes further and encounters, say, "C3 A9 20" (which is the very same text, just in UTF-8), it checks to see if it's a valid ISO 8859-1 sequence (and it is), so it will assume that the page is in ISO 8859-1, which creates a mess.

To me, it seems to be some sort of hypercorrection; I remember that older browsers used to opt for UTF-8 in this case, which would create those question mark boxes we all know and love. It seems that quite a few people got annoyed by that, so now the detection routine goes right the other way. The question mark boxes eat the next, normal ASCII character too, so a word like "écran" would turn into "�ran". This is of course a huge problem, as it can render entire texts unreadable.
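
The all-or-nothing logic described above can be sketched roughly like this in Python. The function name `guess_encoding` is made up for illustration, and real browser detectors are more elaborate than this; it only mirrors the "one invalid UTF-8 sequence means the whole page is 8-bit" behaviour.

```python
def guess_encoding(data: bytes) -> str:
    """Return "utf-8" only if every byte sequence is valid UTF-8."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # A single stray 8-bit byte is enough to fall back for the whole page.
        return "iso-8859-1"

# One "é " as ISO 8859-1 (E9 20) followed by one "é " as UTF-8 (C3 A9 20).
page = b"caf\xe9 " + "café ".encode("utf-8")

enc = guess_encoding(page)
text = page.decode(enc)

print(enc)
print(text)  # the UTF-8 "é" comes out as two Latin-1 characters
```

Since every byte value is a valid ISO 8859-1 character, the fallback never fails, it just quietly produces garbage for the UTF-8 parts.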
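
For comparison, here is what forcing UTF-8 onto ISO 8859-1 bytes looks like with Python's decoder. This is only a sketch of the "question mark box" case; it does not reproduce the older browsers, and Python's current decoder keeps the following ASCII character instead of eating it.

```python
latin1_bytes = "écran".encode("iso-8859-1")  # E9 63 72 61 6E

# E9 announces a three-byte UTF-8 sequence, but "c" is not a continuation
# byte, so the decoder emits U+FFFD for E9 and resumes at "c".
decoded = latin1_bytes.decode("utf-8", errors="replace")

print(decoded)
```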

Originally posted by paulguy
How many languages do you know, anyway?

Not enough yet. But on a more serious matter, I speak German, mostly good English, and have a little grasp of French and Russian.

____________________
Joe
Posted on 07-13-11 04:31:24 PM
Oh boy. Encoding.

Jul is encoded as Windows-1252, but lacks headers, so browsers have to guess.

Older Windows programs use different encodings. This is particularly amusing when the program is in a multi-byte encoding, but Windows is not: even with AppLocale, Wine often runs the program better than Windows.

CJK characters overlap in Unicode (Han unification): the same code point can appear very different depending on the language. This is often related to encoding, because using an encoding that only covers characters in one language implies that you want the glyphs for that language.

____________________
Xkeeper
Posted on 07-13-11 06:34:37 PM
The problem is that the board stores the actual internal stuff in a UTF-8 database but barfs it out as Windows-12bullshit.

Setting a header field has been disastrous whether it was set to utf-8 or windows-bs, so I've left it alone as a project for another day.
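
For what it's worth, a declared charset plus a matching transcode step could look something like the Python sketch below. This is not the board's code; `render_page` and its parameters are hypothetical, and it assumes the stored text has already been read out of the database as a proper Unicode string.

```python
def render_page(body: str, charset: str = "windows-1252") -> tuple[str, bytes]:
    """Build a Content-Type header and a body encoded to match it."""
    header = f"Content-Type: text/html; charset={charset}"
    # Characters the target charset cannot represent become numeric
    # character references, so nothing is silently dropped.
    payload = body.encode(charset, errors="xmlcharrefreplace")
    return header, payload

header, payload = render_page("café 日本")
print(header)
print(payload)
```

The key point is that the declared charset and the bytes actually sent agree with each other, whichever charset ends up being chosen.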

____________________
Liliana
Posted on 07-14-11 01:26:03 PM
Reading this topic again reminded me of an old Outlook Express bug. Starting any line with the word begin and two spaces afterwards would cause Outlook Express to treat the following text as a binary attachment.

Back when Usenet was popular, this was a common way to mock Outlook Express users. To this day, I have no idea how Microsoft managed to mess that up. (They have since released a hotfix which corrects it.)
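
A guess at the kind of check that would cause this, next to a stricter one, as a Python sketch. What Outlook Express actually did is not something I know; both function names are made up, and the strict pattern just follows the usual uuencode header shape of "begin", an octal mode, and a file name.

```python
import re

# Usual uuencode header: "begin <octal mode> <filename>".
UU_HEADER = re.compile(r"^begin [0-7]{3,4} \S.*$")

def naive_is_attachment_start(line: str) -> bool:
    # Hypothetical sloppy check: "begin" plus two spaces is enough.
    return line.startswith("begin  ")

def strict_is_attachment_start(line: str) -> bool:
    # Requires a plausible mode and a file name before treating it as uuencode.
    return UU_HEADER.match(line) is not None

prank = "begin  with the basics and work up from there."
print(naive_is_attachment_start(prank))    # True: the text would be swallowed
print(strict_is_attachment_start(prank))   # False: ordinary prose
print(strict_is_attachment_start("begin 644 notes.txt"))  # True: real header
```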

____________________
Nicole
Posted on 07-15-11 02:47:54 AM (last edited by Imajin at 07-14-11 11:48 PM)
My browser tends to detect pages that use, say, accents or proper dashes as Shift-JIS instead.
Originally posted by Liliana
Reading this topic again reminded me of an old Outlook Express bug. Starting any line with the word begin and two spaces afterwards would cause Outlook Express to treat the following text as a binary attachment.

Back when Usenet was popular, this was a common way to mock Outlook Express users. Up to this date, I have no idea how Microsoft managed to mess that up. (Since then, they released a hotfix which corrects it.)

Bush hid the facts
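
For anyone who hasn't run into it: that sentence is the classic Notepad test case, where a short plain-ASCII file reopened as CJK characters because it was guessed to be UTF-16LE. The Python sketch below only shows the misdecoding itself, not Notepad's detection heuristic.

```python
ascii_bytes = b"Bush hid the facts"         # 18 bytes of plain ASCII
misread = ascii_bytes.decode("utf-16-le")   # pair the bytes up as UTF-16LE units

print(len(ascii_bytes), len(misread))
print(misread)
# Every resulting code point lands in the CJK Unified Ideographs block.
print(all(0x4E00 <= ord(ch) <= 0x9FFF for ch in misread))
```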

____________________
Joe
Posted on 07-15-11 03:23:29 AM
Originally posted by Imajin
My browser tends to detect pages that use, say, accents or proper dashes as Shift-JIS instead.

Your browser might be set to Japanese autodetect. Shift-JIS is (mostly) ASCII-compatible, and multi-byte sequences start with a byte >= 0x80 and end with a byte >= 0x40. It so happens that in Windows-1252 the special characters you mention are all within the range to be start bytes, and most printable characters are in range to be end bytes.
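
A rough Python sketch of that overlap. The byte ranges are the commonly documented Shift-JIS lead and trail ranges, which are a bit narrower than the shorthand above; the function names and the sample text are made up for illustration, and a real detector would do far more than count pairs.

```python
def is_sjis_lead(b: int) -> bool:
    # First byte of a double-byte Shift-JIS character.
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC

def is_sjis_trail(b: int) -> bool:
    # Second byte of a double-byte Shift-JIS character.
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC

def plausible_sjis_pairs(data: bytes) -> int:
    """Count byte pairs that would pass as Shift-JIS double-byte characters."""
    count = 0
    i = 0
    while i < len(data) - 1:
        if is_sjis_lead(data[i]) and is_sjis_trail(data[i + 1]):
            count += 1
            i += 2  # both bytes are consumed by the pair
        else:
            i += 1
    return count

# Accents and an en dash, encoded as Windows-1252.
sample = "déjà vu – naïveté".encode("cp1252")
print(plausible_sjis_pairs(sample))
```

An accented letter followed by an ordinary lowercase letter is enough to look like a kanji, while one followed by a space is not, since 0x20 is below the trail-byte range.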

____________________