How to make tr aware of non-ASCII (Unicode) characters?
I'm trying to remove some characters from a file (UTF-8). I'm using tr for this purpose:
tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat
The file contains some foreign characters (like "Латвийская" or "àé"). tr doesn't seem to understand them: it treats them as non-alpha and removes them too.
I've tried changing some of my locale settings:
LC_CTYPE=C LC_COLLATE=C tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat
LC_CTYPE=ru_RU.UTF-8 LC_COLLATE=C tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat
LC_CTYPE=ru_RU.UTF-8 LC_COLLATE=ru_RU.UTF-8 tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat
Unfortunately, none of these worked.
How can I make tr understand Unicode?
linux text-processing unicode tr
edited Sep 9 '15 at 14:53 by Toby Speight
asked Sep 9 '15 at 12:57 by MatthewRock
1 Answer
That's a known limitation of the GNU implementation of tr.
It's not as much that it doesn't support foreign, non-English or non-ASCII characters, but that it doesn't support multi-byte characters.
Those Cyrillic characters would be treated OK, if written in the iso8859-5 (single-byte per character) character set (and your locale was using that charset), but your problem is that you're using UTF-8 where non-ASCII characters are encoded in 2 or more bytes.
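To see the multi-byte encoding concretely, here is a small sketch that counts the bytes behind a single character. The helper name byte_len is made up for this illustration, and it assumes the script itself is saved as UTF-8.

```shell
# Hypothetical helper: print the number of bytes in each argument.
# wc -c counts bytes, not characters, whatever the locale is.
byte_len() {
    for s in "$@"; do
        # tr -d ' ' only strips the padding some wc implementations add
        printf '%s' "$s" | wc -c | tr -d ' '
    done
}

byte_len a    # ASCII letter: 1 byte
byte_len Л    # Cyrillic letter: 2 bytes in UTF-8
byte_len €    # Euro sign: 3 bytes in UTF-8
```

A tool that works byte by byte sees each of the last two characters as two or three separate values, none of which is a letter in its own right.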
GNU's got a plan to fix that, and work is under way, but it's not there yet.
FreeBSD or Solaris tr don't have the problem.
In the meantime, for most use cases of tr, you can use GNU sed or GNU awk, which do support multi-byte characters.
For instance, your:
tr -cs '[[:alpha:][:space:]]' ' '
could be written:
gsed -E 's/( |[^[:space:][:alpha:]])+/ /g'
or:
gawk -v RS='( |[^[:space:][:alpha:]])+' '{printf "%s", sep $0; sep=" "}'
To convert upper case to lower case (tr '[:upper:]' '[:lower:]'):
gsed 's/[[:upper:]]/\l&/g'
(that l is a lowercase L, not the digit 1).
or:
gawk '{print tolower($0)}'
For portability, perl is another alternative:
perl -Mopen=locale -pe 's/([^[:space:][:alpha:]]| )+/ /g'
perl -Mopen=locale -pe '$_=lc$_'
If you know the data can be represented in a single-byte character set, then you can process it in that charset:
(export LC_ALL=ru_RU.iso88595
iconv -f utf-8 |
tr -cs '[:alpha:][:space:]' ' ' |
iconv -t utf-8) < Russian-file.utf8
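The pipeline above only helps if the single-byte locale is actually installed, and the spelling of the name varies between systems. A minimal check, assuming the standard locale -a command is available, could look like this (have_locale is a hypothetical helper name, and the pattern is just one guess at how the name may be spelled):

```shell
# Hypothetical helper: succeed if an installed locale name matches the
# given case-insensitive basic regular expression.
have_locale() {
    locale -a 2>/dev/null | grep -qi -- "$1"
}

# Accepts spellings such as ru_RU.iso88595 or ru_RU.ISO-8859-5.
if have_locale '^ru_RU\.iso-\{0,1\}8859-\{0,1\}5$'; then
    echo "ru_RU ISO 8859-5 locale found"
else
    echo "no ru_RU ISO 8859-5 locale found"
fi
```

If the locale is missing, another single-byte Cyrillic locale listed by locale -a can be used instead, with the matching charset name passed to iconv.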
I've accepted your answer because of the information about tr. I've solved the problem and removed the question about how to solve it (so people looking for tr will find only information about tr, not some arbitrary problem). If you could please remove the solution too, since it's no longer needed, I'd be thankful.
– MatthewRock, Sep 9 '15 at 14:41
@MatthewRock I've kept it but reworded it and made it more generic, as giving a workaround would be useful to people with the same problem.
– Stéphane Chazelas, Sep 9 '15 at 15:22
Where do you get the idea that Cyrillic is (customarily) encoded in ISO 8859-5? Did you ever see a Russian text in anything but Unicode?
– Incnis Mrsi, Sep 9 '15 at 16:35
@IncnisMrsi, all that matters here is that ISO 8859-5 is one of those single-byte charsets that has those Cyrillic characters. Whether it's in widespread use or not is irrelevant here. If you have a locale with the KOI8-R or windows-1251 charset, by all means, use it instead.
– Stéphane Chazelas, Sep 9 '15 at 16:43
@IncnisMrsi Russian on the web is almost always encoded in UTF-8 (or occasionally in Windows-1251), but only because we’ve felt the pain of many single-byte encodings early on. Here’s an ancient (circa 1998) web page with a (non-functional) encoding switcher: sch57.ru/collect.
– Alex Shpilkin, Apr 18 '18 at 21:06
edited 10 hours ago
answered Sep 9 '15 at 13:47 by Stéphane Chazelas