• NeatNit@discuss.tchncs.de
    ·
    edit-2
    4 months ago

    I'm assuming Unicode anyway, and UTF-8 is by far the most natural because most files will be in ASCII. A "normal form" (see link above), you might think of it as a canonical form, is a way to check if two strings are equivalent, even if they encoded the text differently. Like the example mentioned on Wikipedia:

    For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").