"The sound written with ss, which goes back to an inherited Germanic /s/, differed from the sound written with sz; the ss was pronounced as a voiceless alveolo-palatal fricative [ɕ], whereas the sz was pronounced as a voiceless alveolar fricative [s]. Even when these two sounds merged, both spellings were retained. However, they were confused because no one knew anymore where an sz had originally been and where an ss had been."
I hate you both.
U+01C5: Dž (lower dž, upper DŽ)
U+01C8: Lj (lower lj, upper LJ)
U+01CB: Nj (lower nj, upper NJ)
U+01F2: Dz (lower dz, upper DZ)
U+1F88: ᾈ (lower ᾀ, upper ἈΙ)
U+1F89: ᾉ (lower ᾁ, upper ἉΙ)
U+1F8A: ᾊ (lower ᾂ, upper ἊΙ)
U+1F8B: ᾋ (lower ᾃ, upper ἋΙ)
U+1F8C: ᾌ (lower ᾄ, upper ἌΙ)
U+1F8D: ᾍ (lower ᾅ, upper ἍΙ)
U+1F8E: ᾎ (lower ᾆ, upper ἎΙ)
U+1F8F: ᾏ (lower ᾇ, upper ἏΙ)
U+1F98: ᾘ (lower ᾐ, upper ἨΙ)
U+1F99: ᾙ (lower ᾑ, upper ἩΙ)
U+1F9A: ᾚ (lower ᾒ, upper ἪΙ)
U+1F9B: ᾛ (lower ᾓ, upper ἫΙ)
U+1F9C: ᾜ (lower ᾔ, upper ἬΙ)
U+1F9D: ᾝ (lower ᾕ, upper ἭΙ)
U+1F9E: ᾞ (lower ᾖ, upper ἮΙ)
U+1F9F: ᾟ (lower ᾗ, upper ἯΙ)
U+1FA8: ᾨ (lower ᾠ, upper ὨΙ)
U+1FA9: ᾩ (lower ᾡ, upper ὩΙ)
U+1FAA: ᾪ (lower ᾢ, upper ὪΙ)
U+1FAB: ᾫ (lower ᾣ, upper ὫΙ)
U+1FAC: ᾬ (lower ᾤ, upper ὬΙ)
U+1FAD: ᾭ (lower ᾥ, upper ὭΙ)
U+1FAE: ᾮ (lower ᾦ, upper ὮΙ)
U+1FAF: ᾯ (lower ᾧ, upper ὯΙ)
U+1FBC: ᾼ (lower ᾳ, upper ΑΙ)
U+1FCC: ῌ (lower ῃ, upper ΗΙ)
U+1FFC: ῼ (lower ῳ, upper ΩΙ)
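A list like the one above can be regenerated locally. This is a minimal sketch, assuming the characters were selected by General_Category `Lt` (Titlecase_Letter); the helper names `titlecase_letters` and `describe` are made up for illustration, and the exact results depend on the Unicode version bundled with your Python.

```python
import sys
import unicodedata


def titlecase_letters():
    """Return every code point whose General_Category is Lt (Titlecase_Letter)."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Lt"
    ]


def describe(ch):
    """Format one entry in the same shape as the list in the thread."""
    return f"U+{ord(ch):04X}: {ch} (lower {ch.lower()}, upper {ch.upper()})"


if __name__ == "__main__":
    for ch in titlecase_letters():
        print(describe(ch))
```

Note that the Greek entries uppercase to two code points (for example U+1F88 becomes U+1F08 followed by U+0399), which is why the "upper" column shows a trailing Ι.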
- Unicode codepoints that expand or contract when case is changed in UTF-8: https://gist.github.com/rendello/d37552507a389656e248f3255a6...
- Unicode roundtrip-unsafe characters: https://gist.github.com/rendello/4d8266b7c52bf0e98eab2073b38...
For example, if we do uppercase→lower→upper, some characters don't survive the roundtrip:
Ω ω Ω
İ i̇ İ
K k K
Å å Å
ẞ ß SS
ϴ θ Θ
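A rough Python sketch of that roundtrip, assuming the first column holds OHM SIGN, LATIN CAPITAL LETTER I WITH DOT ABOVE, KELVIN SIGN, ANGSTROM SIGN, LATIN CAPITAL LETTER SHARP S and GREEK CAPITAL THETA SYMBOL (the lookalikes are hard to tell apart by eye). The helper names `roundtrip` and `codepoints` are hypothetical.

```python
def roundtrip(ch):
    """Lowercase a string, then uppercase the result; return both steps."""
    lowered = ch.lower()
    return lowered, lowered.upper()


def codepoints(s):
    """Render a string as U+XXXX code points so lookalikes stay visible."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


SAMPLES = ["\u2126", "\u0130", "\u212a", "\u212b", "\u1e9e", "\u03f4"]

if __name__ == "__main__":
    for original in SAMPLES:
        lowered, back = roundtrip(original)
        survived = "ok" if back == original else "changed"
        print(codepoints(original), "->", codepoints(lowered),
              "->", codepoints(back), f"[{survived}]")
```

None of the six samples comes back as the original code point. The UTF-8 length can change along the way too: the Kelvin sign is three bytes, while the `k` it lowercases to is one.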
I'm using the scripts to build out a little automated-testing generator library, something like "Tricky Unicode/UTF-8 case-change characters". Any other weird case quirks anyone can think of to put in the generators?
I do feel it is an error that unit/math symbols get changed; IMHO they should stay as-is through case conversions.
Someone pointed out the canonical source, which I'll have to look at more closely:
The others are much worse in this regard, since they actually lose meaningful information.
Polytonic orthography (from Ancient Greek πολύς (polýs) 'much, many' and τόνος (tónos) 'accent') is the standard system for Ancient Greek and Medieval Greek and includes:
- acute accent (´)
- circumflex accent (ˆ)
- grave accent (`); these 3 accents indicate different kinds of pitch accent
- rough breathing (῾) indicates the presence of the /h/ sound before a letter
- smooth breathing (᾿) indicates the absence of /h/.
Since in Modern Greek the pitch accent has been replaced by a dynamic accent (stress), and /h/ was lost, most polytonic diacritics have no phonetic significance, and merely reveal the underlying Ancient Greek etymology.
https://en.wikipedia.org/wiki/Vietnamese_phonology#Tone?wpro...
In upper case, ῳ can be written as ῼ, as Ω with the subscript, or as ΩΙ, with the distinction between the first two often made as a matter of font design (in fact, the appearance of ῼ differs depending on whether it’s in the edit box or in text on this site).
Playing with this, I was thinking that I could enable use of Silvio Levy’s old 7-bit ASCII input for Greek, and realized that you would need different mappings of characters depending on where the character mapping happened relative to case folding. Text is messier than most people realize.
Here's a list of Unicode digraphs: DZ, Dz, dz, DŽ, Dž, dž, IJ, ij, LJ, Lj, lj, NJ, Nj, nj, ᵺ
https://en.wikipedia.org/wiki/Digraph_(orthography)#In_Unico...
Digraphs ⟨dž⟩, ⟨lj⟩ and ⟨nj⟩ in their upper case, title case and lower case forms have dedicated Unicode code points as shown in the table below. However, these are included chiefly for backwards compatibility with legacy encodings which kept a one-to-one correspondence with Cyrillic; modern texts use a sequence of characters.
[1] https://en.wikipedia.org/wiki/Gaj%27s_Latin_alphabet#Computi...
LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
LATIN CAPITAL LETTER {_} WITH SMALL LETTER {_}
L,J
N,J
D,Z
GREEK CAPITAL LETTER {ALPHA,ETA,OMEGA} WITH PROSGEGRAMMENI
GREEK CAPITAL LETTER {ALPHA,ETA,OMEGA} WITH {PSILI,DASIA} AND {_}
PROSGEGRAMMENI
VARIA AND PROSGEGRAMMENI
OXIA AND PROSGEGRAMMENI
PERISPOMENI AND PROSGEGRAMMENI
[[:Changes_When_Lowercased:]&[:Changes_When_Uppercased:]]
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%...
It’s a handy way of finding all kinds of things along these lines. Look at the properties of some characters you care about, and see how you can add, subtract and intersect them.
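For anyone without the UnicodeSet tool at hand, the intersection above can be approximated in Python. This is only a sketch: the real Changes_When_Lowercased and Changes_When_Uppercased properties are defined on the NFD form of each character, while this version just compares `str.lower()` and `str.upper()` with the original, so the two sets may not match exactly. The function name `changes_both_ways` is made up.

```python
import sys


def changes_both_ways():
    """Code points that differ from both their str.lower() and str.upper() results.

    Approximates [[:Changes_When_Lowercased:]&[:Changes_When_Uppercased:]];
    it is not an exact implementation of those Unicode properties.
    """
    result = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip lone surrogates, they have no case mappings
        ch = chr(cp)
        if ch.lower() != ch and ch.upper() != ch:
            result.append(ch)
    return result


if __name__ == "__main__":
    for ch in changes_both_ways():
        print(f"U+{ord(ch):04X}")
```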
[0] Lowestcase and uppestcase letters: Advances in derp learning, Sigbovik 1st April 2021
Each nominal syllable sound in Japanese can be written using a character in one of these two scripts:
Roman transcription: a i u e o ka ki ku ke ko
Hiragana: あ い う え お か き く け こ
Katakana: ア イ ウ エ オ カ キ ク ケ コ
There are some rough parallels between upper case and katakana.
- Katakana is used less than hiragana; "katakana heavy" text will be something that is loaded with foreign words (like a software manual) or terms from zoology and botany.
- It is sometimes used to denote SHOUTING, like in quoted speech such as cartoon bubbles.
- Some early computing displays in the west could only produce upper case characters; in Japan, some early displays only featured katakana. It needs less resolution for clarity.
data:text/html,<div lang="nl" style="text-transform:capitalize">ijmuiden
In Firefox, this displays correctly as "IJmuiden" (thanks to the lang attribute; without that, it would show "Ijmuiden").
Actually it kinda makes sense to have two Latin letters form a digraph if they are used to represent a single Cyrillic letter, while it makes less sense for Hungarian, which (AFAIK) has always been written with Latin letters? I mean, of course you could do it, but then I want an extra Unicode code point for the German "sch" too!
If you look at the whole Hungarian alphabet (https://learnhungarianfromhome.com/wp-content/uploads/2020/0...), you get a total of 8 digraphs and 1 trigraph (plus 9 letters with diacritics), but "Lj" and "Nj" are not among them...
> These digraphs owe their existence in Unicode not to Hungarian but to Serbo-Croatian. Serbo-Croatian is written in both Latin script (Croatian) and Cyrillic script (Serbian), and these digraphs permit one-to-one transliteration between them.¹
Of course, at the time it made sense to someone.
> These digraphs owe their existence in Unicode ... to Serbo-Croatian. Serbo-Croatian is written in both Latin script (Croatian) and Cyrillic script (Serbian), and these digraphs permit one-to-one transliteration between them.
I really wish the Unicode consortium would learn to say "No". If you added a three-letter letter to your alphabet, you can probably make do with three letters in your text files.
There's so many characters with little to no utility and weird properties that seem to exist just to trip up programs attempting to commit the unforgivable sin of basic text manipulation.
There's a note in the docs [0],
Changed in version 3.8: The first character is now put into titlecase rather than uppercase. This means that characters like digraphs will only have their first letter capitalized, instead of the full character.
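A small sketch of what that note means in practice, assuming Python 3.8 or later. The sample word is "džep" spelled with the single code point U+01C6; the variable names are only for illustration.

```python
# "džep" written with the single digraph code point U+01C6
word = "\u01c6ep"

# Python 3.8+: the first character is titlecased, giving U+01C5 (Dž)
capitalized = word.capitalize()

# For comparison: fully uppercasing the first character gives U+01C4 (DŽ),
# which is what the docs note says capitalize() did before 3.8
uppercased_first = word[0].upper() + word[1:]

if __name__ == "__main__":
    for label, value in [("capitalize()", capitalized),
                         ("upper() on first char", uppercased_first)]:
        print(label, " ".join(f"U+{ord(c):04X}" for c in value))
```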
[0] https://docs.python.org/3/library/stdtypes.html#str.capitali...
EDIT: As the other reply says, ".title()" is probably a better answer to your question. Warning: as the docs show [1], this splits things on sequences of letters, not whitespace!
>>> "they're bill's friends from the UK".title()
"They'Re Bill'S Friends From The Uk"
[1] https://docs.python.org/3/library/stdtypes.html#str.title
My native language also happens to use a Cyrillic alphabet and has letters that would translate to multiple ones in the Latin alphabet:
ш -> sh
щ -> sht
я -> ya
Somehow we manage to get by without special sh, sht, and ya unicode characters, weird.
I should also note that it's not like Cyrillic doesn't have its share of digraphs - that's what combinations like нь effectively are, since they signify a single phoneme. And, conversely, it's pretty obvious that you can have a Latin-based orthography with no digraphs at all, just diacritics.
This whole situation has to do with legacy encodings and not much else.
That's a bit of an exaggeration; the Glagolitic script was only ever used by scholars, and the earliest Cyrillic writings are not even 50 years older than the Glagolitic.
You're right that the Cyrillic is indeed way closer to the Greek alphabet than the Glagolitic, despite being named after Cyril. I'm not complaining about the "forsaking of culture", I just found it interesting that I was being "mono-cultural" for disagreeing with the existence of a few weird Unicode code-points that themselves are a direct result of someone's attempt to move towards a "mono-culture".
What I'm complaining against, if anything, are overly complex standards. This is just one of what's probably 100 different quirks that you should be aware of when working with Unicode text, and this one could've been easily avoided by just not including a few useless characters.
Correctly supporting the entirety of Unicode faithfully in this sense has been unreachable for your average app for a very long time now, IMO, so it's fine to just do the best you can (i.e. usually, the most you can defer to the libraries) for the audience that you actually have or anticipate for convoluted stuff like this. I don't think that correctly handling casing for legacy digraph codepoints is something that many people need in practice, not even speakers of languages whence those Unicode digraphs originate.
It's still a massive improvement for interop because at least you can be sure that any two apps that need the symbol will use the same encoding for it and will be able to exchange that data, even if nobody truly supports the whole thing.
Incidentally I believe that this is kinda also the approach HN takes, there is at least some Unicode filtering going on here.
In Croatian it doesn't matter, literally nobody uses the digraph Unicode characters because they do not appear on the keyboard. Instead you just write these digraphs as two regular Latin characters: nj, lj and dž.
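If text containing the legacy digraph code points does turn up, one way to fold it into the ordinary two-letter spelling is compatibility normalization. A minimal sketch, with a hypothetical helper name `expand_digraphs`; note that NFKC also rewrites many unrelated compatibility characters (ligatures, full-width forms and so on), so it may be too aggressive for some uses.

```python
import unicodedata


def expand_digraphs(text):
    """Fold compatibility characters, including the legacy digraph code
    points such as U+01C6, into their ordinary multi-letter spelling."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    legacy = "\u01c6ep"  # one digraph code point followed by "ep"
    plain = expand_digraphs(legacy)
    print(len(legacy), len(plain))
```

After normalization the word is four code points long (d, ž, e, p) instead of three, and ordinary case conversion applies to each letter separately.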
Well of course not, it’s double u, not double v… so maybe “lau” should match “law”!
(That’s one thing French got right. Dooblah vay, double v. (Is there a proper French spelling for that pronunciation? Like how h is aitch in English.))
In Belgium it can be pronounced or heard as "way" (wé), usually for:
- BMW as "bay-hem-way" (bé-m-wé)
- www as "way-way-way" (wé-wé-wé)
- WC as "way-say" (wé-c)
It's one thing I keep using from Belgian French despite having lived in Switzerland for over a decade, because it's objectively better.
(Swiss French has the objectively better names for 70-80-90, though. No quatre-vingt-dix BS like in France. :-p)
No, French doesn't have spelling for the name of letters.
(I'm a native French speaker.)
I think it's a double v in German. Since French doesn't really use it, they could import any of the names. Portuguese is in the same boat. It imported the double u name, but still has plenty of words where it's a double v... you can't make it all correct.
I suddenly realized the code must only be removing one part of some of the surrogate emojis, leaving behind an invisible non-printing part of an emoji in the string.
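A guess at how that kind of bug can happen, as a sketch; we don't know what the original code did, and `strip_astral` is a hypothetical stand-in for it. A family emoji is several emoji code points glued together with ZERO WIDTH JOINER (U+200D), so a filter that only drops code points outside the Basic Multilingual Plane leaves the invisible joiners behind.

```python
# MAN + ZWJ + WOMAN + ZWJ + GIRL, rendered as one family emoji
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F467"


def strip_astral(text):
    """Naive emoji filter: drop every code point above U+FFFF."""
    return "".join(c for c in text if ord(c) <= 0xFFFF)


if __name__ == "__main__":
    leftover = strip_astral(FAMILY)
    # The result looks empty on screen but still holds two joiners
    print(len(FAMILY), len(leftover), [f"U+{ord(c):04X}" for c in leftover])
```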
Some emojis got so complex they literally scrapped them this year. The family emojis seemed cool in someone's head, but then someone tried to make a family with mixed-ethnic parents and the children are locked to one skin color; the only solution presented was to add 7,000 more emojis to Unicode.
https://www.mobiletechjournal.com/the-family-emojis-are-now-...
Is the answer a switch statement?
Edit: ah no, we're actually talking about human language and characters.
"Dž" ~~ /<:Lt>/ #「Dž」 (matches)
"Dž" ~~ /<:Lu>/ #Nil (doesn't match)
Lt = Titlecase_Letter
Lu = Uppercase_Letter
raku regex are a step improvement over the original perl5 regex which is used in most current languages (both regex engines were designed by Larry Wall - raku is perl6 with a new name)
deep support for Unicode and Graphemes makes raku almost unique in its support for Unicode properties within this new regex 2.0 (I hear that Swift is also strong in this area)
here is a great blog series by Paweł bbkr Pabian that explains all these unicode things in a very understandable way https://dev.to/bbkr/utf-8-regular-expressions-20h0
This doesn't seem right. If the individual letters "d" and "z" exist, then it should be possible to have them next to each other in a text file without them necessarily collapsing into a single letter -- especially if they're actually represented as separate characters, which they are in the example. Even if the letter "w" wasn't correctly represented and required actually typing "uu", you wouldn't want the word "vacuum" to be interpreted as having a "w"!