
Talk:UTF-8

From Wikipedia, the free encyclopedia

Table should not only use color to encode information (but formatting like bold and underline)


As noted in a previous comment (https://en.wikipedia.org/wiki/Talk:UTF-8/Archive_1#Colour_in_example_table?), this has been done before, and it is *better* because everyone can clearly see the different parts of the code. Relying on color alone is not good, because of color vision deficiencies and varying color rendition across devices. -- Preceding unsigned comment added by 88.219.179.109 (talk • contribs) 02:26, 17 April 2020 (UTC)

   and Microsoft has a script for Windows 10, to enable it by default for its program Microsoft Notepad
   "Script How to set default encoding to UTF-8 for notepad by PowerShell". gallery.technet.microsoft.com. Retrieved 2018-01-30.
   https://gallery.technet.microsoft.com/scriptcenter/How-to-set-default-2d9669ae?ranMID=24542&ranEAID=TnL5HPStwNw&ranSiteID=TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w&tduid=(1f29517b2ebdfe80772bf649d4c144b1)(256380)(2459594)(TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w)()

This link is dead. How can it be fixed? -- Preceding unsigned comment added by Un1Gfn (talk • contribs) 02:58, 5 April 2021 (UTC)

That text, and that link, appear to have been removed, so there's no longer anything to fix. Guy Harris (talk) 23:43, 21 December 2023 (UTC)

The article contains "{{efn", which looks like a mistake.

I would've fixed it myself, but I don't know how to rework the remaining sentence so that it makes sense. 2A01:C23:8D8D:BF00:C070:85C1:B1B8:4094 (talk) 16:17, 2 April 2024 (UTC)

I fixed it, I think. I'm not 100% sure it's what the previous editors intended. I invite them to review and confirm. Indefatigable (talk) 19:03, 2 April 2024 (UTC)

Should "The Manifesto" be mentioned somewhere?


More specifically, this one: https://utf8everywhere.org -- Preceding unsigned comment added by Rudxain (talk • contribs) 21:52, 12 July 2024 (UTC)

Only if it's got significant coverage in reliable sources. Remsense 22:10, 12 July 2024 (UTC)
It's kind of ahistorical, since the Microsoft decisions that they deplore were made while developing Windows NT 3.1, and UTF-8 wasn't even a standard until Windows NT 3.1 was close to being released. There was more money to be made from East Asian customized computer systems than Unicode computer systems in 1993, so Unicode was probably not their main focus at that time... AnonMoos (talk) 20:30, 15 July 2024 (UTC)

The number of 3 byte encodings is incorrect


This sentence is incorrect:

Three bytes are needed for the remaining 61,440 codepoints...

FFFF - 0800 + 1 = F800 = 63,488 three byte codepoints.

The other calculations for 1, 2, and 4 byte encodings are correct. Bantling66 (talk) 02:56, 23 August 2024 (UTC)

You forgot to subtract the 2,048 surrogates in the D800–DFFF range. – MwGamera (talk) 08:58, 23 August 2024 (UTC)
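To make the arithmetic in this thread easy to check, here is a minimal Python sketch. The variable and function names are my own, not taken from the article, and the brute-force cross-check assumes Python's built-in strict UTF-8 encoder, which refuses to encode surrogates.

```python
# Three-byte UTF-8 sequences cover U+0800..U+FFFF, but the surrogate
# code points U+D800..U+DFFF cannot be encoded and must be subtracted.
three_byte_range = 0xFFFF - 0x0800 + 1   # every value in the range
surrogates = 0xDFFF - 0xD800 + 1         # the excluded surrogate block
three_byte_code_points = three_byte_range - surrogates

print(three_byte_range, surrogates, three_byte_code_points)


def encoded_length(cp):
    """Return the UTF-8 length of a code point, or None if not encodable."""
    try:
        return len(chr(cp).encode("utf-8"))
    except UnicodeEncodeError:
        # Python's strict encoder rejects surrogates.
        return None


# Cross-check by brute force over the whole Unicode code space.
brute_force = sum(1 for cp in range(0x110000) if encoded_length(cp) == 3)
print(brute_force)
```

Both the closed-form count and the brute-force count should agree with the 61,440 figure quoted from the article.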

Multi-point flags


I'm struggling to assume good faith here with this edit. A flag which consists of five code points is already sufficiently illustrative of the issue being discussed. That an editor saw fit to first remove that example without discussion, and then to swap it out for the other example when it was pared down to one flag, invites discussion of why that particular flag was removed, and the obvious answer isn't a charitable one. Chris Cunningham (user:thumperward) (talk) 12:35, 17 September 2024 (UTC)

Yes, it was restored to the pride flag for precisely the reasons you state. Spitzak (talk) 20:48, 17 September 2024 (UTC)
Better, more in-depth explanations of the flags can be found in the articles regional indicator symbol and Tags_(Unicode_block)#Current_use (the mechanism for these specific flags). I don't think this belongs in articles about specific character encodings like UTF-8 at all.
The fact that one code point does not necessarily produce one grapheme has nothing to do with a specific character encoding like UTF-8. It is a more fundamental property of the text itself: any encoding that can represent a given string of characters yields the same characters when decoded back from its binary representation. Although very popular, UTF-8 is just one of numerous ways to encode text to binary and back.
I wrote more about this below at Other issues in the article and sadly only then noticed this was already being somewhat discussed here. Mossymountain (talk) 10:45, 20 September 2024 (UTC)

Why was the "heart" of the article, almost the whole section of UTF-8#Encoding (Old revision) removed instead of adding a note?


NOTE: The section seems to have been renamed to UTF-8#Description in this edit.

I don't understand why such a large part of UTF-8#Encoding (old revision) was suddenly removed in this edit (edit A), and then in this edit (edit B) (diff after both edits), instead of one of the following:

  • Adding a note about parts of it being written poorly.
  • Rewriting some of it. (the best and the most difficult option)
  • Carefully considering removing parts that were definitely redundant (such as arguably the latter part of UTF-8#Examples (old revision)).

The first edit (edit A)

→‎Encoding: this entire section is almost completely opaque and its inclusion stymies the addition of some clear prose describing how unicode is decoded
— user:Thumperward, (edit A)

To me, this reads as if UTF-8 was accidentally conflated with Unicode, which led to the parts being removed from the wrong article by mistake.

I am strongly of the mind that the deleted parts included the two most important parts of the whole article, and they absolutely should be included because of that:

  1. The UTF-8#Codepage layout (old revision), in my opinion the most important part of any article about a binary character encoding. This part was also, in my opinion, written exemplarily well; I see no problems with it at all.
    - Precedents/Examples in other articles about a specific character encoding:
  2. The first list (numbered 1..7) of UTF-8#Examples (old revision), which clearly demonstrates, by simple example, how UTF-8 works. (I agree it could be rewritten; the language used is quite verbose.)
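For readers who have not seen the removed list, the kind of step-by-step demonstration meant in point 2 can be sketched in a few lines of Python. This is my own illustration, not the text of the old revision; `encode_utf8` is a hypothetical helper name, and the bit layout follows the standard UTF-8 byte patterns.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode scalar value by hand (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# The euro sign U+20AC needs three bytes.
print(encode_utf8(0x20AC).hex(" "))
```

Comparing the result against Python's built-in `str.encode("utf-8")` for a few code points is a quick way to check the sketch.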

The second edit (edit B)

→Encoding: this now refers to removed text and contradicts repeated assertions elsewhere that overlong encodings are unnecessary
— user:Thumperward, (edit B)

It removed the whole section UTF-8#Overlong encodings (old revision). I disagree with its removal.

  1. The example removed in this edit was a clear and easy-to-understand way of explaining what an overlong encoding means. Yes, you could explain it without using an example, but in my opinion an example is the easiest way for someone unfamiliar with the concept to understand it. I see it as the difference between teaching with your hands and drawing relevant things on a whiteboard, and not having those options available.
  2. I don't understand what the deleted text contradicted, unless the edit summary refers, for example, to the mention in UTF-8#Implementations and adoption of Java's "Modified UTF-8", which uses an overlong encoding for the null character.
    • The removed text also seems to have lacked a citation, which probably should have been RFC 3629 § 3.
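To show what an overlong encoding looks like in practice, here is a small Python sketch. The euro sign is my own choice of example, `naive_decode` and `strict_rejects` are hypothetical helper names, and the lax decoder is deliberately wrong: it skips the shortest-form check that a conforming decoder must perform.

```python
def naive_decode(data: bytes) -> int:
    """Decode ONE sequence without rejecting overlong forms (deliberately lax)."""
    first = data[0]
    if first < 0x80:
        return first
    if first >> 5 == 0b110:
        cp = first & 0x1F      # 2-byte lead: 5 payload bits
    elif first >> 4 == 0b1110:
        cp = first & 0x0F      # 3-byte lead: 4 payload bits
    elif first >> 3 == 0b11110:
        cp = first & 0x07      # 4-byte lead: 3 payload bits
    else:
        raise ValueError("invalid lead byte")
    for b in data[1:]:
        cp = (cp << 6) | (b & 0x3F)   # each continuation byte adds 6 bits
    return cp


def strict_rejects(data: bytes) -> bool:
    """Report whether Python's built-in strict UTF-8 decoder refuses the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


shortest = "\u20ac".encode("utf-8")          # E2 82 AC, the only valid form
overlong = bytes([0xF0, 0x82, 0x82, 0xAC])   # same bits padded out to 4 bytes
overlong_nul = bytes([0xC0, 0x80])           # 2-byte form of U+0000

print(hex(naive_decode(shortest)), hex(naive_decode(overlong)))
print(strict_rejects(shortest), strict_rejects(overlong), strict_rejects(overlong_nul))
```

The lax decoder maps both byte strings to the same code point, which is exactly why a conforming decoder has to reject the longer one.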

Other issues in the article


The UTF-8 article covers general Unicode topics quite a bit more than I think it should, which may add to the likelihood of misunderstandings like the one I think ultimately led to edit A.
Such things include explaining how some "graphical characters can be more than 4 bytes in UTF-8". That happens because Unicode (and, by extension, UTF-8) does not deal in graphemes in the first place, but in code points (essentially just numbers that index into Unicode). A code point can correspond to a valid Unicode character, which in turn can correspond directly to a grapheme. Some characters don't correspond to a grapheme at all (control characters), some combine or join with other characters to produce a combined grapheme (combining/joining characters), and some are akin to the latter two, such as the formatting tag characters used in the flag example, which don't seem to produce graphemes on their own or when used incorrectly.

The possibility of needing multiple code points for one grapheme is a direct consequence of these types of characters in general. It isn't caused by UTF-8 or any other encoding, and it can happen with ANY encoding capable of encoding such code points, not just UTF-8.
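A short Python sketch can illustrate the point. The two sample strings are my own picks (a letter plus a combining accent, and a pair of regional indicator symbols), not the flag used in the article. The code point count is a property of the text, while only the byte count depends on the encoding.

```python
# Each sample is normally displayed as ONE grapheme but is made of TWO code points.
samples = {
    "e + combining acute accent": "e\u0301",
    "regional indicators F + R": "\U0001F1EB\U0001F1F7",
}

for label, text in samples.items():
    for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
        data = text.encode(encoding)
        # Every encoding round-trips to the very same code points.
        assert data.decode(encoding) == text
        print(f"{label}: {len(text)} code points, "
              f"{len(data)} bytes in {encoding}")
```

Whichever encoding is chosen, the string still contains two code points; UTF-8 changes only how many bytes they occupy.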

Maybe some of those explanations should be moved to the Unicode article or other appropriate articles instead and/or drastically shortened and replaced with links to the longer explanations?
Mossymountain (talk) 05:09, 20 September 2024 (UTC)

Because the editor was offended that that section used color. Akeosnhaoe (talk) 08:56, 20 September 2024 (UTC)
It's pretty important that we not communicate information solely through color, but I wonder how we could better do something like that. Remsense ‥  09:02, 20 September 2024 (UTC)
Most of the information wasn't in the color, it was in the text readable without formatting in monochrome. The color was there just to make it easier to quickly identify which is which.
If what Akeosnhaoe said is the case (which I don't think it is; I think this was an honest misunderstanding by someone who means well), then obviously the colors should be changed to meet the intended visibility standard, not the information removed. Mossymountain (talk) 10:17, 20 September 2024 (UTC)