How to make tr aware of non-ASCII (Unicode) characters?

I'm trying to remove some characters from a UTF-8 file. I'm using tr for this purpose:



tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat 


The file contains some non-ASCII characters (like "Латвийская" or "àé"). tr doesn't seem to understand them: it treats them as non-alphabetic and removes them too.



I've tried changing some of my locale settings:



LC_CTYPE=C LC_COLLATE=C tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat
LC_CTYPE=ru_RU.UTF-8 LC_COLLATE=C tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat
LC_CTYPE=ru_RU.UTF-8 LC_COLLATE=ru_RU.UTF-8 tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat


Unfortunately, none of these worked.



How can I make tr understand Unicode?










linux text-processing unicode tr

edited Sep 9 '15 at 14:53 by Toby Speight
asked Sep 9 '15 at 12:57 by MatthewRock




















          1 Answer














          That's a known (1, 2, 3, 4, 5, 6) limitation of the GNU implementation of tr.



          It's not so much that it doesn't support foreign, non-English or non-ASCII characters as that it doesn't support multi-byte characters.

          Those Cyrillic characters would be handled correctly if they were encoded in ISO 8859-5 (a single-byte character set) and your locale used that charset. Your problem is that you're using UTF-8, where non-ASCII characters are encoded in 2 or more bytes.
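To see the multi-byte point concretely, here is a minimal sketch that counts the bytes behind a few characters. The helper name `byte_count` is made up for illustration, and the characters are written as octal escapes so the result does not depend on how the script file itself is encoded.

```shell
# byte_count is a hypothetical helper: it prints the number of bytes in its argument.
byte_count() {
  printf '%s' "$1" | LC_ALL=C wc -c | tr -d ' '
}

ascii_a=$(printf 'a')           # U+0061, plain ASCII
e_acute=$(printf '\303\251')    # U+00E9 (é) encoded in UTF-8
cyr_el=$(printf '\320\233')     # U+041B (Л) encoded in UTF-8

byte_count "$ascii_a"   # prints 1
byte_count "$e_acute"   # prints 2
byte_count "$cyr_el"    # prints 2
```

A tool that works byte by byte sees each of the last two characters as two unrelated bytes, neither of which is alphabetic on its own.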



          GNU has a plan to fix that, and work is under way, but it's not there yet.

          The FreeBSD and Solaris implementations of tr don't have the problem.

          In the meantime, for most use cases of tr, you can use GNU sed or GNU awk, which do support multi-byte characters.



          For instance, your:



          tr -cs '[[:alpha:][:space:]]' ' '


          could be written:



          gsed -E 's/( |[^[:space:][:alpha:]])+/ /g'


          or:



          gawk -v RS='( |[^[:space:][:alpha:]])+' '{printf "%s", sep $0; sep=" "}'


          To convert between lower and upper case (tr '[:upper:]' '[:lower:]'):

          gsed 's/[[:upper:]]/\l&/g'

          (that l is a lowercase L, not the digit 1).

          or:

          gawk '{print tolower($0)}'


          For portability, perl is another alternative:



          perl -Mopen=locale -pe 's/([^[:space:][:alpha:]]| )+/ /g'
          perl -Mopen=locale -pe '$_=lc$_'


          If you know the data can be represented in a single-byte character set, then you can process it in that charset:



          (export LC_ALL=ru_RU.iso88595
          iconv -f utf-8 |
          tr -cs '[:alpha:][:space:]' ' ' |
          iconv -t utf-8) < Russian-file.utf8
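Before relying on that pipeline, it may be worth checking that the text really fits in the single-byte charset. The sketch below assumes an iconv that exits with a non-zero status when a character cannot be converted (as the GNU libc and GNU libiconv versions do when no //TRANSLIT or //IGNORE suffix is given); `fits_in_charset` is a made-up helper name.

```shell
# fits_in_charset CHARSET: hypothetical helper that reads UTF-8 on stdin and
# succeeds only if iconv can convert all of it to CHARSET.
# Assumption: iconv reports unconvertible characters through its exit status.
fits_in_charset() {
  iconv -f UTF-8 -t "$1" >/dev/null 2>&1
}

# Two Cyrillic letters (Л and а), written as UTF-8 octal escapes.
if printf '\320\233\320\260' | fits_in_charset ISO-8859-5; then
  echo "fits in ISO-8859-5"
else
  echo "needs a multi-byte charset"
fi
```

If the check fails for your data, the sed, awk or perl approaches are the safer route.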





          • 1
            I've accepted your answer because of the information about tr. I've solved the problem, and removed the question about how to solve it (so people looking for tr will find only information about tr, not some arbitrary problem). If you could please remove the solution too, since it's no longer needed, I'd be thankful.
            – MatthewRock, Sep 9 '15 at 14:41

          • 3
            @MatthewRock I've kept it but reworded it and made it more generic, as giving a work-around would be useful to people with the same problem.
            – Stéphane Chazelas, Sep 9 '15 at 15:22

          • Where did you get the idea that Cyrillic is (customarily) encoded in ISO 8859-5? Did you ever see a Russian text in anything but Unicode?
            – Incnis Mrsi, Sep 9 '15 at 16:35

          • 9
            @IncnisMrsi, all that matters here is that ISO 8859-5 is one of those single-byte charsets that has those Cyrillic characters. Whether it's in widespread use or not is irrelevant here. If you have a locale with the KOI8-R or windows-1251 charset, by all means, use it instead.
            – Stéphane Chazelas, Sep 9 '15 at 16:43

          • @IncnisMrsi Russian on the web is almost always encoded in UTF-8 (or occasionally in Windows-1251), but only because we’ve felt the pain of many single-byte encodings early on. Here’s an ancient (circa 1998) web page with a (non-functional) encoding switcher: sch57.ru/collect.
            – Alex Shpilkin, Apr 18 '18 at 21:06












          answered Sep 9 '15 at 13:47 by Stéphane Chazelas (edited 10 hours ago)
