Talk:UTF-8

This is the talk page for discussing improvements to the UTF-8 article.
This is not a forum for general discussion of the article's subject.

Put new text under old text. Click here to start a new topic.
New to Wikipedia? Welcome! Learn to edit; get help.

Article policies

Find sources: Google (books · news · scholar · free images · WP refs) · FENS · JSTOR · TWL

Computing Mid‑importance

	This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.ComputingWikipedia:WikiProject ComputingTemplate:WikiProject ComputingComputing articles
Mid	This article has been rated as Mid-importance on the project's importance scale.

Computer science Mid‑importance

This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.Computer scienceWikipedia:WikiProject Computer scienceTemplate:WikiProject Computer scienceComputer science articles

Mid

This article has been rated as Mid-importance on the project's importance scale.

Things you can help WikiProject Computer science with:

Here are some tasks awaiting attention:

Article requests :
- Requested articles/Applied arts and sciences/Computer science, computing, and Internet
Cleanup :
- Computer science articles needing attention
- Computer science articles needing expert attention
Copyedit :
- Computing
Expand :
- Computer science
Infobox :
- Computer science articles without infoboxes
Maintain :
- Timeline of computing 2020–present
Photo :
- Find pictures for the biographies of computer scientists (see List of computer scientists)
- Computing articles needing images
Stubs :
- Computer science stubs
Unreferenced :
- WikiProject Computer science/Unreferenced BLPs
Project-related :
- Tag all relevant articles in Category:Computer science and sub-categories with {{WikiProject Computer science}}

Typography Mid‑importance

	This article is within the scope of WikiProject Typography, a collaborative effort to improve the coverage of articles related to Typography on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.TypographyWikipedia:WikiProject TypographyTemplate:WikiProject TypographyTypography articles
Mid	This article has been rated as Mid-importance on the importance scale.

Archives

Index

Archive 1	Archive 2	Archive 3
Archive 4	Archive 5

This page has archives. Sections older than 730 days may be automatically archived by .

Table should not only use color to encode information (but formatting like bold and underline)

As in a previous comment https://en.wikipedia.org/wiki/Talk:UTF-8/Archive_1#Colour_in_example_table? this has been done before, and is *better* so that everyone can clearly see the different part of the code. Relying on color alone is not good, due to color vision deficiencies and varying color rendition on devices. — Preceding unsigned comment added by 88.219.179.109 (talk • contribs) 02:26, 17 April 2020‎ (UTC)[reply]

Microsoft script dead link

   and Microsoft has a script for Windows 10, to enable it by default for its program Microsoft Notepad

   "Script How to set default encoding to UTF-8 for notepad by PowerShell". gallery.technet.microsoft.com. Retrieved 2018-01-30.

   https://gallery.technet.microsoft.com/scriptcenter/How-to-set-default-2d9669ae?ranMID=24542&ranEAID=TnL5HPStwNw&ranSiteID=TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w&tduid=(1f29517b2ebdfe80772bf649d4c144b1)(256380)(2459594)(TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w)()

This link is dead. How to fix it? — Preceding unsigned comment added by Un1Gfn (talk • contribs) 02:58, 5 April 2021 (UTC)[reply]

That text, and that link, appears to have been removed, so there's no longer anything to fix. Guy Harris (talk) 23:43, 21 December 2023 (UTC)[reply]

The article contains "{{efn", which looks like a mistake.

I would've fixed it myself but I don't know how to transform the remaining sentence to make sense. 2A01:C23:8D8D:BF00:C070:85C1:B1B8:4094 (talk) 16:17, 2 April 2024 (UTC)[reply]

I fixed it, I think. I'm not 100% sure it's how the previous editors intended. I invite them to review and confirm. Indefatigable (talk) 19:03, 2 April 2024 (UTC)[reply]

Should "The Manifesto" be mentioned somewhere?

More specifically, this one: https://utf8everywhere.org -- Preceding unsigned comment added by Rudxain (talk o contribs) 21:52, 12 July 2024 (UTC)[reply]

Only if it's got significant coverage in reliable sources. Remsense 22:10, 12 July 2024 (UTC)[reply]

It's kind of ahistorical, since the Microsoft decisions that they deplore were made while developing Windows NT 3.1, and UTF-8 wasn't even a standard until Windows NT 3.1 was close to being released. There was more money to be made from East Asian customized computer systems than Unicode computer systems in 1993, so Unicode was probably not their main focus at that time... AnonMoos (talk) 20:30, 15 July 2024 (UTC)[reply]

The number of 3 byte encodings is incorrect

This sentence is incorrect:

Three bytes are needed for the remaining 61,440 codepoints...

FFFF - 0800 + 1 = F800 = 63,488 three byte codepoints.

The other calculations for 1, 2, and 4 byte encodings are correct. Bantling66 (talk) 02:56, 23 August 2024 (UTC)[reply]

You forgot to subtract 2048 surrogates in the D800–DFFF range. – MwGamera (talk) 08:58, 23 August 2024 (UTC)[reply]

Multi-point flags

I'm struggling to assume good faith here with this edit. A flag which consists of five code points is already sufficiently illustrative of the issue being discussed. That an editor saw fit to first remove that example without discussion, and then to swap it out for the other example when it was pared down to one flag, invites discussion of why that particular flag was removed, and the obvious answer isn't a charitable one. Chris Cunningham (user:thumperward) (talk) 12:35, 17 September 2024 (UTC)[reply]

Yes it was restored to the pride flag for precisely the reasons you state. Spitzak (talk) 20:48, 17 September 2024 (UTC)[reply]

A better, more in-depth explanations of the flags can be found on the articles regional indicator symbol and Tags_(Unicode_block)#Current_use (the mechanism for these specific flags). I don't think it belongs in articles of specific character encodings like UTF-8 at all.

The fact that one code point does not necessarily produce one grapheme has nothing to do with a specific character encoding like UTF-8. It's a more fundamental property of the text itself and any encoding that can be used to encode some string of characters decodes back to the same characters when decoded back from the binary representation. Although very popular, UTF-8 is just one of the numerous ways to encode text to binary and back.

I wrote more about this below at Other issues in the article and sadly only then noticed this was already being somewhat discussed here. Mossymountain (talk) 10:45, 20 September 2024 (UTC)[reply]

Why was the "heart" of the article, almost the whole section of UTF-8#Encoding (Old revision) removed instead of adding a note?

NOTE: The section seems to have been renamed to UTF-8#Description in this edit.

I don't understand why such a large part of UTF-8#Encoding (old revision) was suddenly removed in this edit (edit A), and then this edit (edit B) (diff after both edits) instead of either:

Adding a note about parts of it being written poorly.
Rewriting some of it. (the best and the most difficult option)
Carefully considering removing parts that were definitely redundant (such as arguably the latter part of UTF-8#Examples (old revision)).

The first edit (edit A)

→‎Encoding: this entire section is almost completely opaque and its inclusion stymies the addition of some clear prose describing how unicode is decoded
— user:Thumperward, (edit A)

To me, this reads as if UTF-8 was accidentally conflated with Unicode, causing a mistake to remove the parts from the wrong article.

I am strongly of the mind that the deleted parts included the two most important parts of the whole article, and they absolutely should be included because of that:

The UTF-8#Codepage layout (old revision), in my opinion the most important part of any article about a binary character encoding. This part was also in my opinion written exemplarily well here, I see no problems with it at all.
- Precedents/Examples in other articles about a specific character encoding:
- Variable multi-byte (like UTF-8):
  - Shift_JIS#Shift_JIS_byte_map
  - GBK_(character_encoding)#Encoding
- single byte:
  - ASCII#Character_set (a strict subset of UTF-8)
  - Code_page_437#Character_set
  - ISO/IEC_8859-1#Code_page_layout
The first list (numbered 1..7) of UTF-8#Examples (old revision) that clearly, by simple example demonstrates how UTF-8 works. (I agree it could be rewritten, the language used is quite verbose)

The second edit (edit B)

→Encoding: this now refers to removed text and contradicts repeated assertions elsewhere that overlong encodings are unnecessary
— user:Thumperward, (edit B)

It removed the whole section UTF-8#Overlong encodings (old revision). I disagree with its removal.

The example removed in this edit was a clear and easy to understand way of explaining what an overlong encoding means. Yes, you could explain it without using an example, but in my opinion an example is the easiest way for someone unfamiliar to understand the concept. I see it as teaching with your hands and drawing relevant things on a whiteboard versus not having those options available.
I don't understand what the deleted text contradicted, unless this for example refers to the mention in UTF-8#Implementations and adoption of Java's "Modified UTF-8" that uses an overlong encoding for the null character.
- Also seems to have lacked a citation, which probably should have been rfc3629 § 3.