Is there a convenient way to classify files as “binary” or “text”?












33















Standard Unix utilities like grep and diff use some heuristic to classify files as "text" or "binary". (E.g. grep's output may include lines like Binary file frobozz matches.)



Is there a convenient test one can apply in a zsh script to perform a similar "text/binary" classification? (Other than something like grep '' somefile | grep -q Binary.)



(I realize that any such test would necessarily be heuristic, and therefore imperfect.)





























  • 10





    file is a standard utility and can run through the file magic for determining file types to the best of its abilities. It can tell most text formats and does a pretty decent job on binary formats. If all you're trying to do is find out if a file is text or not, that's the command you're interested in.

    – Bratchley
    Apr 10 '16 at 16:37











  • @Bratchley: some versions of file will print, e.g. shell script, for some files I would like classified as "text". Is there a way to get file to print just text or binary?

    – kjo
    Apr 10 '16 at 16:48







  • 1





    @don_crissti That question is about someone trying to get people to debug his bash script. Detecting text is just what the script is supposed to do. They ended up having an issue in one of their cut commands.

    – Bratchley
    Apr 10 '16 at 17:18






  • 1





    @don_crissti The fact that there's an answer on question A that works for question B does not always make A a duplicate of B. Consider someone who is looking for a way to classify files as text or binary. Which is more useful: a “debug my script” question which happens to have a generic answer buried among other answers that are specific to that script, or a generic “how do I classify files as text or binary?”?

    – Gilles
    Apr 10 '16 at 21:05






  • 1





    @Gilles - depends on how you read it. I actually see the question there as a typical case of XY problem: OP there wants to check if a file is a text file - and thinks piping file output to cut is the solution - sure, there's a missing space which makes it fail and that has made most people there address the Y instead of the X but Stéphane's comments and answer show the proper way to determine whether the file is text or not.

    – don_crissti
    Apr 10 '16 at 21:15

























files text














edited Apr 10 '16 at 21:03 by Gilles

asked Apr 10 '16 at 16:16 by kjo


















10 Answers


















26














If you ask file for just the MIME type, you'll get many different ones, such as text/x-shellscript and application/x-executable, but I imagine that checking just the "text" part should give good results. For example (-b suppresses the filename in the output):



file -b --mime-type filename | sed 's|/.*||'
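To use this as a yes/no test in a script, the pipeline can be wrapped in a small function. This is only a sketch: the name is_text_mime is made up for the example, and it assumes a file implementation that supports -b and --mime-type.

```shell
# is_text_mime FILE (hypothetical helper name)
# Succeeds (exit status 0) when file(1) reports a MIME type in the text/ family,
# fails (exit status 1) for everything else.
is_text_mime() {
    case $(file -b --mime-type -- "$1") in
        text/*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

It could then be called as: is_text_mime somefile && echo text || echo binary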

























  • 23





    Just remember, depending on your file, that you might miss some text formats: application/xml (and similar like RSS), application/ecmascript, application/json, image/svg+xml, ... You'd have to whitelist those.

    – Boldewyn
    Apr 11 '16 at 7:38











  • @Boldewyn wow, nice examples! So probably a better answer is just to accept any file that has only printable chars, but somehow also cope with utf-8 and similar encoding problems.

    – meuh
    Apr 11 '16 at 7:49











  • Yes, that's the gist of my answer below. Only problem is, that that solution has to look at the whole file...

    – Boldewyn
    Apr 11 '16 at 8:19






  • 7





    @Boldewyn In principle, application/* types are not intended for human consumption, even when they may be text-based to facilitate development and debugging. That's why there is both a text/xml and an application/xml. So the question whether to consider them as text depends on the OP's needs.

    – Tobia
    Apr 11 '16 at 8:46






  • 3





    Or cut -d/ -f1

    – Stéphane Chazelas
    Apr 11 '16 at 9:07
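Following up on the comments above, a whitelist for text-based formats outside text/* might look like the sketch below. The function name is_texty_mime is invented here, and the whitelist only contains the types named in the comments; whether application/* types should count as text depends on your needs, as Tobia points out.

```shell
# is_texty_mime FILE (hypothetical helper name)
# Succeeds for text/* plus a small whitelist of text-based formats
# that file(1) reports under other top-level MIME types.
is_texty_mime() {
    case $(file -b --mime-type -- "$1") in
        text/*)
            return 0 ;;
        application/xml|application/json|application/ecmascript|image/svg+xml)
            return 0 ;;   # whitelist: extend as needed
        *)
            return 1 ;;
    esac
}
```

Older versions of file may report some of these formats as text/plain, which the first branch already covers.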


















20














Another approach would be to use isutf8 from the moreutils collection.



It exits with 0 if the file is valid UTF-8 or ASCII; otherwise it short-circuits, prints an error message (silence it with -q), and exits with 1.


























  • 5





    Nice suggestion. I just noticed that giving a directory as arg makes it return 0. I would have preferred 1 at least. But then, garbage in, garbage out.

    – meuh
    Apr 11 '16 at 14:07
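A possible wrapper that also sidesteps the directory case mentioned in the comment is sketched below. The name classify_utf8 and the exit status 2 for "cannot tell" are choices made for this example only; it assumes isutf8 from moreutils is on the PATH.

```shell
# classify_utf8 FILE (hypothetical helper name)
# Prints "text" or "binary"; returns 2 without printing anything when FILE
# is not a regular file or when isutf8 is not installed.
classify_utf8() {
    [ -f "$1" ] || return 2                        # directories, missing files, ...
    command -v isutf8 >/dev/null 2>&1 || return 2  # moreutils not available
    if isutf8 -q "$1"; then
        echo text
    else
        echo binary
    fi
}
```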


















12














If you like the heuristic used by GNU grep, you could use it:



isbinary() {
  LC_MESSAGES=C grep -Hm1 '^' < "${1-$REPLY}" | grep -q '^Binary'
}

It searches for NUL bytes in the first buffer read from the file (a few kilobytes for a regular file, but possibly far fewer for a pipe, a socket, or some devices like /dev/random). In UTF-8 locales, it also flags byte sequences that don't form valid UTF-8 characters. Because the test matches grep's English "Binary file ... matches" message, it assumes LC_ALL is not set to a locale whose language is not English.

The ${1-$REPLY} form allows you to use it as a zsh glob qualifier:



ls -ld -- *(.+isbinary)


would list the binary files.
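If you would rather not depend on the wording of grep's message (as far as I know, GNU grep 3.5 and later print the binary-file notice on standard error with different wording), the NUL-byte part of the heuristic can be approximated directly. This is a rough sketch, not what grep does internally: the name has_nul and the 8000-byte window are assumptions, and it relies on head -c, which is common but not specified by POSIX.

```shell
# has_nul FILE (hypothetical helper name)
# Succeeds if a NUL byte occurs in the first 8000 bytes of FILE.
has_nul() {
    # Byte count of the leading window as read.
    total=$(head -c 8000 -- "$1" | wc -c | tr -d ' ')
    # Byte count of the same window after deleting NUL bytes.
    kept=$(head -c 8000 -- "$1" | LC_ALL=C tr -d '\000' | wc -c | tr -d ' ')
    [ "$total" -ne "$kept" ]
}
```

Unlike the grep version, this ignores invalid UTF-8 sequences and only looks for NUL bytes.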






































    7














    You can write a script that calls file, and use a case-statement to check for the cases you are interested in.



    For example



    #!/bin/sh
    # Match "script" anywhere in file's description, but "text" only as a whole word.
    case $(file "$1") in
    (*script*|*\ text|*\ text\ *)
        echo text
        ;;
    (*)
        echo binary
        ;;
    esac


    though of course there may be many special cases which are of interest. Just checking strings on a copy of libmagic, I see about 200 cases, e.g.,



    Konqueror cookie text
    Korn shell script text executable
    LaTeX 2e document text
    LaTeX document text
    Linux Software Map entry text
    Linux Software Map entry text (new format)
    Linux kernel symbol map text
    Lisp/Scheme program text
    Lua script text executable
    LyX document text
    M3U playlist text
    M4 macro processor script text


    Some use the string "text" as part of a different type, e.g.,



    SoftQuad troff Context intermediate 
    SoftQuad troff Context intermediate for AT&T 495 laser printer
    SoftQuad troff Context intermediate for HP LaserJet


    likewise script could be part of a word, but I see no problems in this case. But a script should check for "text" as a word, not a substring.



    As a reminder, file output does not use a precise description which would always have "script" or "text". Special cases are something to consider. A followup commented that the --mime-type works while this approach would not, for .svg files. However, in a test I see these results for svg-files:



    $ ls -l *.svg
    -r--r--r-- 1 tom users 6679 Jul 26 2012 pumpkin_48x48.svg
    -r--r--r-- 1 tom users 17372 Jul 30 2012 sink_48x48.svg
    -r--r--r-- 1 tom users 5929 Jul 25 2012 vile_48x48.svg
    -r--r--r-- 1 tom users 3553 Jul 28 2012 vile-mini.svg
    $ file *.svg
    pumpkin_48x48.svg: SVG Scalable Vector Graphics image
    sink_48x48.svg: SVG Scalable Vector Graphics image
    vile-mini.svg: SVG Scalable Vector Graphics image
    vile_48x48.svg: SVG Scalable Vector Graphics image
    $ file --mime-type *.svg
    pumpkin_48x48.svg: image/svg+xml
    sink_48x48.svg: image/svg+xml
    vile-mini.svg: image/svg+xml
    vile_48x48.svg: image/svg+xml


    which I selected after seeing a thousand files show only 6 with "text" in the mime-type output. Arguably, matching the "xml" on the end of the mime-type output could be more useful, say, than matching "SVG", but using a script to do that takes you back to the suggestion made here.



    The output of file requires some tuning in either scenario, and is not 100% reliable (it is confused by several of my Perl scripts, calling them "data").



    There is more than one implementation of file. The one most commonly used does its work in libmagic, which can be used from different programs (perhaps not directly from zsh, though python can).



    According to File test comparison table for shell, Perl, Ruby, and Python, Perl has a -T option which it can use to provide this information. But it lists no comparable feature for zsh.



    Further reading:



    • zsh glob qualifier to exclude binary files






























    • Unfortunately GNU file's output for svg files: SVG Scalable Vector Graphics image doesn't contain the word text. I thought this approach would be better than the accepted answer of checking the mime-type, but it still misses some types.

      – Peter Cordes
      Apr 11 '16 at 23:34











    • It still misses, with the mime-type; for xterm's svg file I get image/svg+xml. Actually - just checked a 1000-file sample, only 6 came out as "text" according to the mime-type alone. I'll stick with a script, which at least can be made to work as needed.

      – Thomas Dickey
      Apr 11 '16 at 23:39



















    6














    You could try determining if iconv can read the file. This is slower than file (which just reads a couple of bytes from the beginning), but will give you more reliable results:

    ENCODING=utf-8
    # Converting from an encoding to itself changes nothing, but fails on invalid input.
    if iconv --from-code="$ENCODING" --to-code="$ENCODING" your_file.ext > /dev/null 2>&1; then
        echo text
    else
        echo binary
    fi


    This makes iconv basically a no-op, but if it encounters invalid data (invalid UTF-8 in this example), it will barf and exit.


























    • 4





      Using -f and -t instead of the GNU long options would make it more portable. Note that it will call "binary" the files it can't open. It will call empty files "text".

      – Stéphane Chazelas
      Apr 11 '16 at 9:12











    • Agreed. I used the long forms for ad hoc documentation, for people who don't know iconv. But -f and -t are usually better.

      – Boldewyn
      Apr 11 '16 at 10:54
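Taking the comments above into account, a more portable variant with the short options might look like this. It is a sketch only: the name classify_iconv and the exit status 2 for files that cannot be read are choices made for the example, and UTF-8 is hard-coded as the expected encoding.

```shell
# classify_iconv FILE (hypothetical helper name)
# Prints "text" if FILE is valid UTF-8 (which includes plain ASCII),
# "binary" otherwise; returns 2 if FILE is not a readable regular file.
classify_iconv() {
    [ -f "$1" ] && [ -r "$1" ] || return 2
    # -f/-t are the short forms of --from-code/--to-code.
    if iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1; then
        echo text
    else
        echo binary
    fi
}
```

Empty files are still reported as "text", and a NUL byte is valid UTF-8, so this test alone does not catch every file you might want to call binary.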


















    3














    file has an option --mime-encoding that attempts to detect the encoding of a file.



    $ file --mime-encoding Documents/poster2.pdf
    Documents/poster2.pdf: binary
    $ file --mime-encoding projects/linux/history-torvalds/Makefile
    projects/linux/history-torvalds/Makefile: us-ascii
    $ file --mime-encoding graphe.tex
    graphe.tex: us-ascii
    $ file --mime-encoding software.tex
    software.tex: utf-8


    You can use file --mime-encoding | grep binary to detect if a file is a binary file. It works reliably although it can get confused by a single invalid character in a long text file.



    For example, I alias cat to the following shell script to avoid ruining my terminal by inadvertently opening a binary file:



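    #! /bin/sh -

    # When stdout is not a terminal, behave exactly like cat.
    [ ! -t 1 ] && exec /bin/cat "$@"
    for i
    do
        # -b omits the file name, so a name containing "binary" cannot cause a false match.
        if file -b --mime-encoding -- "$i" | grep -q binary
        then
            hexdump -C -- "$i"
        else
            /bin/cat -- "$i"
        fi
    done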





































      3














      Categories are arbitrary. Before answering how to classify, you need a (strict) definition. In order to have a definition, you need a purpose.

      So, what do you want to do with that classification?

      • If you want to select ascii/binary mode in FTP, it's important not to transfer a binary file as ascii (or it will be corrupted). So you should test whether the file is plain text, HTML, RTF, or a similar format. When in doubt, select binary. You may also want to test that the file contains only bytes from a subset such as 0x0A, 0x0D, and 0x20-0x7F.

      • If you want to transfer the file over some protocol (POP3, SMTP), you need a test to choose between base64 encoding and plain transfer. In this case, you should test whether there are unsupported characters.

      • Any other case… may have any other definition.
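As an illustration of the byte-subset test from the FTP case, the check can be done with tr alone. This is a sketch under the assumption that the allowed set is exactly 0x0A, 0x0D, and 0x20-0x7F (octal 012, 015, and 040-177); the name is_plain_ascii is invented for the example.

```shell
# is_plain_ascii FILE (hypothetical helper name)
# Succeeds if FILE contains only LF, CR, and bytes 0x20-0x7F.
# Returns 2 if FILE is not a regular file.
is_plain_ascii() {
    [ -f "$1" ] || return 2
    # Delete every allowed byte; whatever remains is outside the subset.
    leftover=$(LC_ALL=C tr -d '\012\015\040-\177' < "$1" | wc -c | tr -d ' ')
    [ "$leftover" -eq 0 ]
}
```

Note that this particular subset rejects tabs (0x09), so the set would need adjusting for most real text files.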



































        3














        perl -e'chomp(my$f=<>);print "binary$/" if -B $f;print "text$/" if -T _'


        will do it. See documentation for -B and -T (search in that page for the string The -T and -B switches work as follows).





























        • perl -le 'print -B $ARGV[0] ? "binary" : "text"' -- might be clearer. Or even perl -le 'print -B $_ ? "binary" : "text", @ARGV > 1 ? "\t$_" : "" for @ARGV' --

          – jrw32982
          Apr 21 '17 at 12:20


















        1














        I contributed to https://github.com/audreyr/binaryornot
        It does not have a command line wrapper (yet) but this is a simple Python library easy enough to call even from the CLI.
        It uses a fairly efficient heuristic to determine if a file is text or binary.




































          1














          I know this question is a bit old, but I think my friend taught me a great "hack" to do this.



          You use the diff command and check your file against a test text file:



          $ diff filetocheck testfile.txt



          Now if filetocheck is a binary file, the output would be:



          Binary files filetocheck and testfile.txt differ



          This way you could leverage the diff command and e.g. write a function which does the check in a script.
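Such a function might look like the sketch below. Two things here are assumptions rather than part of the answer: /dev/null stands in for the test text file, and LC_ALL=C is set so that diff prints its message in English. The name is_binary_diff is made up.

```shell
# is_binary_diff FILE (hypothetical helper name)
# Succeeds if diff refuses to compare FILE line by line and reports
# "Binary files ... differ" instead.
is_binary_diff() {
    # Ordinary diff output lines start with a line number, "<", ">" or "---",
    # so the anchored pattern only matches diff's own binary-file message.
    LC_ALL=C diff -- "$1" /dev/null 2>/dev/null | grep -q '^Binary files '
}
```

An empty file produces no diff output against /dev/null, so it is not reported as binary.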







































            edited Apr 11 '16 at 2:22 by heemayl
            answered Apr 10 '16 at 17:44 by meuh







            • 23





              Just remember, depending on your file, that you might miss some text formats: application/xml (and similar like RSS), application/ecmascript, application/json, image/svg+xml, ... You'd have to whitelist those.

              – Boldewyn
              Apr 11 '16 at 7:38











            • @Boldewyn wow, nice examples! So probably a better answer is just to accept any file that has only printable chars, but somehow also cope with utf-8 and similar encoding problems.

              – meuh
              Apr 11 '16 at 7:49











            • Yes, that's the gist of my answer below. The only problem is that that solution has to look at the whole file...

              – Boldewyn
              Apr 11 '16 at 8:19






            • 7





              @Boldewyn In principle, application/* types are not intended for human consumption, even when they may be text-based to facilitate development and debugging. That's why there is both a text/xml and an application/xml. So the question whether to consider them as text depends on the OP's needs.

              – Tobia
              Apr 11 '16 at 8:46






            • 3





              Or cut -d/ -f1

              – Stéphane Chazelas
              Apr 11 '16 at 9:07












            20














            Another approach would be to use isutf8 from the moreutils collection.



            It exits with 0 if the file is valid UTF-8 (plain ASCII included). Otherwise it stops at the first invalid sequence, prints an error message (silence it with -q), and exits with 1.
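A small wrapper, sketched under some assumptions, shows how this could be used in a script. The function name is_utf8_file is hypothetical. The regular-file check is added because isutf8 reportedly returns 0 for a directory, and the iconv fallback is my own substitution for systems without moreutils, not something isutf8 provides.

```shell
#!/bin/sh
# Succeeds (exit 0) only for a regular file whose contents are valid UTF-8.
is_utf8_file() {
    # Reject directories and other non-regular files up front.
    [ -f "$1" ] || return 1
    if command -v isutf8 >/dev/null 2>&1; then
        isutf8 -q < "$1"
    else
        # Assumed fallback, not part of moreutils: a UTF-8 to UTF-8
        # conversion fails on invalid byte sequences.
        iconv -f UTF-8 -t UTF-8 < "$1" >/dev/null 2>&1
    fi
}
```

Both tools read the whole file in the worst case, so this is slower than checks that only sample the first block.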






            edited Apr 11 '16 at 10:49 by techraf
            answered Apr 11 '16 at 10:21 by Wander Nauta







            • 5





              Nice suggestion. I just noticed that giving a directory as arg makes it return 0. I would have preferred 1 at least. But then, garbage in, garbage out.

              – meuh
              Apr 11 '16 at 14:07












            12














            If you like the heuristic used by GNU grep, you could use it:



            isbinary() { LC_MESSAGES=C grep -Hm1 '^' < "${1-$REPLY}" | grep -q '^Binary'; }



            It searches for NUL bytes in the first buffer read from the file (a few kilobytes for a regular file, but possibly far fewer for a pipe, a socket, or some devices such as /dev/random). In UTF-8 locales, it also flags byte sequences that do not form valid UTF-8 characters. Because the check matches the English word "Binary" in grep's message, it assumes that LC_ALL is not set to a non-English locale.



            The ${1-$REPLY} form allows you to use it as a zsh glob qualifier:



            ls -ld -- *(.+isbinary)


            would list the binary files.
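A variant that relies on the same grep heuristic without parsing grep's message can be sketched with the -I option (treat binary files as non-matching). This is a swapped-in technique rather than the approach above; it may be useful because, as far as I know, newer GNU grep releases (3.5 and later) reworded the "Binary file" message and send it to standard error. The function name is_text_grep is hypothetical, and counting empty files as text is an assumption.

```shell
#!/bin/sh
# Succeeds if grep's heuristic does not consider the file binary.
# Takes the file from $1, or from $REPLY when called without arguments.
is_text_grep() {
    f=${1-$REPLY}
    [ -f "$f" ] || return 2      # not a regular file
    [ -s "$f" ] || return 0      # assumption: an empty file counts as text
    # -I makes grep treat a binary file as having no matches, so the
    # empty pattern matches only when grep regards the file as text.
    grep -Iq '' < "$f"
}
```

Because it falls back to $REPLY, the same function should also work as a zsh glob qualifier, for example ls -ld -- *(.+is_text_grep) to list the files regarded as text.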






                edited Apr 13 '16 at 13:11
                answered Apr 11 '16 at 11:21 by Stéphane Chazelas





















                    7














                    You can write a script that calls file, and use a case-statement to check for the cases you are interested in.



                    For example:



                    #!/bin/sh
                    # -b keeps the filename out of the description being matched
                    case $(file -b "$1") in
                        (*script*|*" text"|*" text "*)
                            echo text
                            ;;
                        (*)
                            echo binary
                            ;;
                    esac


                    There may, of course, be many special cases of interest. Running strings on a copy of libmagic, I see about 200 cases, e.g.,



                    Konqueror cookie text
                    Korn shell script text executable
                    LaTeX 2e document text
                    LaTeX document text
                    Linux Software Map entry text
                    Linux Software Map entry text (new format)
                    Linux kernel symbol map text
                    Lisp/Scheme program text
                    Lua script text executable
                    LyX document text
                    M3U playlist text
                    M4 macro processor script text


                    Some descriptions contain the string "text" as part of a longer word, e.g.,



                    SoftQuad troff Context intermediate
                    SoftQuad troff Context intermediate for AT&T 495 laser printer
                    SoftQuad troff Context intermediate for HP LaserJet


                    Likewise, "script" could be part of a longer word, though I see no such cases here. Either way, a script should match "text" as a whole word, not as a substring.
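The whole-word check can be sketched as a function that takes a description string, so it can be tried without running file at all. The name is_text_description is hypothetical, and the extra patterns ending in a comma are an assumption to cover descriptions such as "ASCII text, with CRLF line terminators".

```shell
#!/bin/sh
# Succeeds if a file(1) description contains "text" or "script" as a word.
is_text_description() {
    # Pad with spaces so a word at either end still has a boundary.
    case " $1 " in
        *" text "*|*" text,"*|*" script "*|*" script,"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

It would typically be called as is_text_description "$(file -b "$1")". With this matching, "SoftQuad troff Context intermediate" is not treated as text, while "LaTeX document text" is.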



                    As a reminder, file does not use a precise description format that always contains "script" or "text", so special cases need consideration. A follow-up comment said that --mime-type works for .svg files where this approach would not. However, in a test I see these results for SVG files:



                    $ ls -l *.svg
                    -r--r--r-- 1 tom users 6679 Jul 26 2012 pumpkin_48x48.svg
                    -r--r--r-- 1 tom users 17372 Jul 30 2012 sink_48x48.svg
                    -r--r--r-- 1 tom users 5929 Jul 25 2012 vile_48x48.svg
                    -r--r--r-- 1 tom users 3553 Jul 28 2012 vile-mini.svg
                    $ file *.svg
                    pumpkin_48x48.svg: SVG Scalable Vector Graphics image
                    sink_48x48.svg: SVG Scalable Vector Graphics image
                    vile-mini.svg: SVG Scalable Vector Graphics image
                    vile_48x48.svg: SVG Scalable Vector Graphics image
                    $ file --mime-type *.svg
                    pumpkin_48x48.svg: image/svg+xml
                    sink_48x48.svg: image/svg+xml
                    vile-mini.svg: image/svg+xml
                    vile_48x48.svg: image/svg+xml


                    I selected these after seeing that, out of a thousand files, only 6 had "text" in the MIME-type output. Arguably, matching the "xml" at the end of the MIME type could be more useful than, say, matching "SVG", but using a script to do that takes you back to the suggestion made here.



                    The output of file requires some tuning in either scenario, and is not 100% reliable (it is confused by several of my Perl scripts, calling them "data").



                    There is more than one implementation of file. The one most commonly used does its work in libmagic, which can be called from other programs (perhaps not directly from zsh, though Python can use it).



                    According to the File test comparison table for shell, Perl, Ruby, and Python, Perl has a -T file test that provides this information. The table lists no comparable feature for zsh.



                    Further reading:



                    • zsh glob qualifier to exclude binary files





                    edited May 23 '17 at 12:40 by Community
                    answered Apr 10 '16 at 16:59 by Thomas Dickey












                    • Unfortunately GNU file's output for svg files: SVG Scalable Vector Graphics image doesn't contain the word text. I thought this approach would be better than the accepted answer of checking the mime-type, but it still misses some types.

                      – Peter Cordes
                      Apr 11 '16 at 23:34











                    • It still misses with the mime-type; for xterm's svg file I get image/svg+xml. Actually, I just checked a 1000-file sample, and only 6 came out as "text" according to the mime-type alone. I'll stick with a script, which at least can be made to work as needed.

                      – Thomas Dickey
                      Apr 11 '16 at 23:39






























                    6














                    You could try determining whether iconv can read the file. This is slower than file (which only examines the beginning of the file), but will give you more reliable results:



                    ENCODING=utf-8
                    if iconv --from-code="$ENCODING" --to-code="$ENCODING" your_file.ext > /dev/null 2>&1; then
                        echo text
                    else
                        echo binary
                    fi


                    This makes iconv basically a no-op, but if it encounters invalid data (invalid UTF-8 in this example), it will barf and exit.
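Here is a minimal sketch of the same check wrapped in a function. The function name and the extra "unreadable" result are my own additions, not part of the answer. It uses the short -f and -t options, and it reports files it cannot read separately instead of calling them binary.

```shell
# Hypothetical helper: classify "$1" by asking iconv to validate it as UTF-8.
classify_with_iconv() {
    if [ ! -r "$1" ]; then
        # Without this check, an unreadable file would be reported as binary.
        echo unreadable
    elif iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1; then
        echo text
    else
        echo binary
    fi
}
```

Note that an empty file still comes out as "text", and so does any file that happens to be valid UTF-8, even if it contains control characters.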


























                    • 4





                      Using -f and -t instead of the GNU long options would make it more portable. Note that it will call "binary" the files it can't open. It will call empty files "text".

                      – Stéphane Chazelas
                      Apr 11 '16 at 9:12











                    • Agreed. I used the long forms for ad hoc documentation, for people who don't know iconv. But -f and -t are usually better.

                      – Boldewyn
                      Apr 11 '16 at 10:54























                    edited Apr 11 '16 at 9:10









                    Stéphane Chazelas











                    answered Apr 11 '16 at 7:46









                    Boldewyn


















                    3














                    file has an option --mime-encoding that attempts to detect the encoding of a file.



                    $ file --mime-encoding Documents/poster2.pdf
                    Documents/poster2.pdf: binary
                    $ file --mime-encoding projects/linux/history-torvalds/Makefile
                    projects/linux/history-torvalds/Makefile: us-ascii
                    $ file --mime-encoding graphe.tex
                    graphe.tex: us-ascii
                    $ file --mime-encoding software.tex
                    software.tex: utf-8


                    You can use file --mime-encoding FILE | grep binary to detect whether a file is binary. It is generally reliable, although a single invalid character in a long text file can confuse it.



                    For example, I alias cat to the following shell script to avoid ruining my terminal by inadvertently opening a binary file:



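                    #! /bin/sh -

                    # When stdout is not a terminal, behave exactly like cat.
                    [ ! -t 1 ] && exec /bin/cat "$@"
                    for i
                    do
                        # -b omits the file name, so a name containing "binary" cannot match.
                        if file -b --mime-encoding -- "$i" | grep -q binary
                        then
                            hexdump -C -- "$i"
                        else
                            /bin/cat -- "$i"
                        fi
                    done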













                        edited Apr 11 '16 at 9:32









                        Stéphane Chazelas











                        answered Apr 11 '16 at 8:17









                        lgeorget





















                            3














                            Categories are arbitrary. Before answering how to make a classification, you need a (strict) definition. In order to have a definition, you need a purpose.



                            So, what do you want to do with that classification?



                            • If you want to select ASCII or binary mode in FTP, it is important not to transfer a binary file as ASCII (or it will be corrupted). So you should test whether the file is plain text, HTML, RTF, or a similar format. When in doubt, select binary. You may also want to test that the file contains only a subset of bytes, such as 0x0A, 0x0D, and 0x20-0x7F.

                            • If you want to transfer the file over some protocol (POP3, SMTP), you need to decide whether to encode it in base64 or send it as plain text. In this case, you should test whether the file contains unsupported characters.

                            • Any other case… may have any other definition.
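The byte-subset test from the FTP case could be sketched like this. The function name is hypothetical, and the allowed set is exactly the one listed above (0x0A, 0x0D, and 0x20-0x7F), so a tab (0x09) already makes a file fail. Adjust the set to your own definition.

```shell
# Hypothetical helper: succeed if "$1" contains only LF, CR and bytes 0x20-0x7F.
is_ftp_ascii_safe() {
    [ -r "$1" ] || return 2
    # Delete every allowed byte (octal 012, 015, 040-177) and count what is left.
    # The arithmetic expansion strips any padding that wc may print.
    leftover=$(( $(LC_ALL=C tr -d '\012\015\040-\177' < "$1" | wc -c) ))
    [ "$leftover" -eq 0 ]
}
```

It returns 0 when every byte is in the subset, 1 when at least one byte is outside it, and 2 when the file cannot be read.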















                                answered Apr 11 '16 at 16:10









                                ESL





















                                    3














                                    perl -e'chomp(my$f=<>);print "binary$/" if -B $f;print "text$/" if -T _'


                                    will do it: it reads a file name from standard input and prints "binary" or "text". See the Perl documentation for -B and -T (search that page for the string "The -T and -B switches work as follows").





























                                    • perl -le 'print -B $ARGV[0] ? "binary" : "text"' -- might be clearer. Or even perl -le 'print -B $_ ? "binary" : "text", @ARGV > 1 ? "\t$_" : "" for @ARGV' --

                                      – jrw32982
                                      Apr 21 '17 at 12:20

























                                    answered Apr 11 '16 at 19:31









                                    msh210























                                    1














                                    I contributed to https://github.com/audreyr/binaryornot. It does not have a command-line wrapper (yet), but it is a simple Python library that is easy enough to call even from the CLI. It uses a fairly efficient heuristic to determine whether a file is text or binary.
















                                        answered Aug 21 '16 at 22:12









                                        Philippe Ombredanne





















                                            1














                                            I know this question is a bit old, but a friend taught me a neat "hack" for this.



                                            You use the diff command and check your file against a test text file:



                                            $ diff filetocheck testfile.txt



                                            Now if filetocheck is a binary file, the output would be:



                                            Binary files filetocheck and testfile.txt differ



                                            This way you could leverage the diff command and e.g. write a function which does the check in a script.
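Such a function might look like the sketch below. It assumes GNU diff, whose message starts with "Binary files" in the C locale. The function name is hypothetical, and comparing against /dev/null instead of a separate testfile.txt is my own shortcut.

```shell
# Hypothetical helper built on the diff trick; assumes GNU diff's message wording.
is_binary_diff() {
    # Compare against /dev/null as an empty reference file. LC_ALL=C keeps the
    # message untranslated so the pattern below can match it.
    case $(LC_ALL=C diff -- "$1" /dev/null 2>/dev/null) in
        "Binary files "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

An empty file produces no diff output at all, so it is reported as not binary, and so is a file that diff cannot open.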














                                                edited 9 hours ago









                                                Rui F Ribeiro











                                                answered Nov 6 '17 at 16:43









                                                user3019105


































