Is there a convenient way to classify files as “binary” or “text”?

Standard Unix utilities like `grep` and `diff` use some heuristic to classify files as "text" or "binary". (E.g. `grep`'s output may include lines like `Binary file frobozz matches`.)

Is there a convenient test one can apply in a `zsh` script to perform a similar "text/binary" classification? (Other than something like `grep '' somefile | grep -q Binary`.)

(I realize that any such test would necessarily be heuristic, and therefore imperfect.)

Tags: files, text

`file` is a standard utility and can run through the file magic for determining file types to the best of its abilities. It can tell most text formats and does a pretty decent job on binary formats. If all you're trying to do is find out if a file is text or not, that's the command you're interested in. – Bratchley, Apr 10 '16 at 16:37 (score 10)

@Bratchley: some versions of `file` will print, e.g. `shell script`, for some files I would like classified as "text". Is there a way to get `file` to print just `text` or `binary`? – kjo, Apr 10 '16 at 16:48

@don_crissti That question is about someone trying to get people to debug his bash script. Detecting text is just what the script is supposed to do. They ended up having an issue in one of their `cut` commands. – Bratchley, Apr 10 '16 at 17:18 (score 1)

@don_crissti The fact that there's an answer on question A that works for question B does not always make A a duplicate of B. Consider someone who is looking for a way to classify files as text or binary. Which is more useful: a “debug my script” question which happens to have a generic answer buried among other answers that are specific to that script, or a generic “how do I classify files as text or binary?”? – Gilles, Apr 10 '16 at 21:05 (score 1)

@Gilles - depends on how you read it. I actually see the question there as a typical case of an XY problem: the OP there wants to check if a file is a text file and thinks piping `file` output to `cut` is the solution. Sure, there's a missing space which makes it fail, and that has made most people there address the Y instead of the X, but Stéphane's comments and answer show the proper way to determine whether the file is text or not. – don_crissti, Apr 10 '16 at 21:15 (score 1)

Asked Apr 10 '16 at 16:16 by kjo; edited Apr 10 '16 at 21:03 by Gilles.

10 Answers
If you ask `file` for just the mime-type you'll get many different ones, like `text/x-shellscript` and `application/x-executable`, but I imagine if you just check for the "text" part you should get good results. E.g. (`-b` for no filename in output):

    file -b --mime-type filename | sed 's|/.*||'

Just remember, depending on your `file`, that you might miss some text formats: `application/xml` (and similar ones like RSS), `application/ecmascript`, `application/json`, `image/svg+xml`, ... You'd have to whitelist those. – Boldewyn, Apr 11 '16 at 7:38 (score 23)

@Boldewyn wow, nice examples! So probably a better answer is just to accept any file that has only printable chars, but somehow also cope with utf-8 and similar encoding problems. – meuh, Apr 11 '16 at 7:49

Yes, that's the gist of my answer below. The only problem is that that solution has to look at the whole file... – Boldewyn, Apr 11 '16 at 8:19

@Boldewyn In principle, `application/*` types are not intended for human consumption, even when they may be text-based to facilitate development and debugging. That's why there is both a `text/xml` and an `application/xml`. So the question of whether to consider them as text depends on the OP's needs. – Tobia, Apr 11 '16 at 8:46 (score 7)

Or `cut -d/ -f1` – Stéphane Chazelas, Apr 11 '16 at 9:07 (score 3)

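A minimal sketch of the mime-type check with a small whitelist, in case it helps. The function name `classify_by_mime` and the particular whitelist entries are assumptions made for illustration; which `application/*` or `image/*` types count as text depends on your needs and on what your version of `file` reports.

```shell
#!/bin/sh
# classify_by_mime FILE: print "text" or "binary" based on file's MIME type.
# Hypothetical helper; the whitelist below is illustrative, not exhaustive.
classify_by_mime() {
    mime=$(file -b --mime-type -- "$1") || return 2
    case $mime in
        text/*)
            echo text ;;
        application/json|application/xml|application/ecmascript|application/javascript|image/svg+xml)
            # Text-based formats that are not reported under text/*.
            echo text ;;
        *)
            echo binary ;;
    esac
}
```

Called as `classify_by_mime somefile`, it prints a single word, which is easy to test in a script. Note that anything `file` cannot identify, including files it cannot open, ends up as "binary" here.
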
Another approach would be to use `isutf8` from the moreutils collection.

It exits with 0 if the file is valid UTF-8 or ASCII. Otherwise it short-circuits, prints an error message (silence it with `-q`), and exits with 1.

Nice suggestion. I just noticed that giving a directory as arg makes it return 0. I would have preferred 1 at least. But then, garbage in, garbage out. – meuh, Apr 11 '16 at 14:07 (score 5)

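As a rough sketch, `isutf8` can be wrapped so that non-regular files (such as the directory case mentioned above) are rejected up front. This assumes `isutf8` from moreutils is installed; the function name `classify_by_isutf8` is made up for this example.

```shell
#!/bin/sh
# classify_by_isutf8 FILE: print "text" if FILE is valid UTF-8 (or ASCII),
# "binary" otherwise. Returns 2 for anything that is not a regular file.
classify_by_isutf8() {
    if [ ! -f "$1" ]; then
        echo "not a regular file: $1" >&2
        return 2
    fi
    if isutf8 -q -- "$1"; then
        echo text
    else
        echo binary
    fi
}
```
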
If you like the heuristic used by GNU `grep`, you could use it:

    isbinary() {
      LC_MESSAGES=C grep -Hm1 '^' < "${1-$REPLY}" | grep -q '^Binary'
    }

It searches for NUL bytes in the first buffer read from the file (a few kilobytes for a regular file, but possibly a lot less for a pipe, a socket, or some devices like `/dev/random`). In UTF-8 locales, it also flags byte sequences that don't form valid UTF-8 characters. It assumes `LC_ALL` is not set to a locale whose language is not English. (GNU grep 3.5 and later word the "binary file matches" message differently and send it to standard error, so the pattern needs adjusting there.)

The `${1-$REPLY}` form allows you to use it as a `zsh` glob qualifier:

    ls -ld -- *(.+isbinary)

would list the binary files.

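If you only want the NUL-byte part of that heuristic without depending on the wording of `grep`'s message, something along these lines may do. It is a sketch, not grep's actual implementation: the name `has_nul` and the 8000-byte window are arbitrary choices made here, and it does no UTF-8 validation.

```shell
#!/bin/sh
# has_nul [FILE]: succeed (exit 0) if the first 8000 bytes of FILE contain a
# NUL byte. Falls back to $REPLY when no argument is given, so that it can
# also serve as a zsh glob qualifier function.
has_nul() {
    f=${1-$REPLY}
    # Count the bytes in the window, then count them again with NULs removed.
    total=$(dd if="$f" bs=8000 count=1 2>/dev/null | wc -c)
    kept=$(dd if="$f" bs=8000 count=1 2>/dev/null | tr -d '\000' | wc -c)
    [ "$total" -ne "$kept" ]
}
```

In zsh, `ls -ld -- *(.+has_nul)` should then list regular files that have a NUL near the start. A file that cannot be read yields two zero counts and is therefore reported as having no NUL.
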
You can write a script that calls `file`, and use a case statement to check for the cases you are interested in.

For example:

    #!/bin/sh
    # Print "text" if file's description mentions a script or text, else "binary".
    # The spaces in the patterns are escaped so that the shell accepts them.
    case $(file "$1") in
    (*script*|*\ text|*\ text\ *)
        echo text
        ;;
    (*)
        echo binary
        ;;
    esac

though of course there may be many special cases which are of interest. Just checking `strings` on a copy of `libmagic`, I see about 200 cases, e.g.,

    Konqueror cookie text
    Korn shell script text executable
    LaTeX 2e document text
    LaTeX document text
    Linux Software Map entry text
    Linux Software Map entry text (new format)
    Linux kernel symbol map text
    Lisp/Scheme program text
    Lua script text executable
    LyX document text
    M3U playlist text
    M4 macro processor script text

Some use the string "text" as part of a different type, e.g.,

    SoftQuad troff Context intermediate
    SoftQuad troff Context intermediate for AT&T 495 laser printer
    SoftQuad troff Context intermediate for HP LaserJet

Likewise, `script` could be part of a word, but I see no problems in this case. But a script should check for "text" as a word, not a substring.

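One way to match "text" or "script" as a whole word is `grep -w`, which is widely available although not required by POSIX. The sketch below splits the check in two so the matching can be tried without running `file`; both function names are hypothetical.

```shell
#!/bin/sh
# describes_text: read a file(1) description on stdin and succeed if it
# contains "text" or "script" as a whole word ("Context" does not count).
describes_text() {
    grep -Eqw 'text|script'
}

# classify_by_description FILE: print "text" or "binary" based on the
# description that file -b prints for FILE.
classify_by_description() {
    if file -b -- "$1" | describes_text; then
        echo text
    else
        echo binary
    fi
}
```

For instance, `printf 'SoftQuad troff Context intermediate\n' | describes_text` fails, while `printf 'LaTeX document text\n' | describes_text` succeeds.
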
As a reminder, `file` output does not use a precise description which would always have "script" or "text". Special cases are something to consider. A followup commented that `--mime-type` works for `.svg` files while this approach would not. However, in a test I see these results for svg-files:

    $ ls -l *.svg
    -r--r--r-- 1 tom users  6679 Jul 26  2012 pumpkin_48x48.svg
    -r--r--r-- 1 tom users 17372 Jul 30  2012 sink_48x48.svg
    -r--r--r-- 1 tom users  5929 Jul 25  2012 vile_48x48.svg
    -r--r--r-- 1 tom users  3553 Jul 28  2012 vile-mini.svg
    $ file *.svg
    pumpkin_48x48.svg: SVG Scalable Vector Graphics image
    sink_48x48.svg:    SVG Scalable Vector Graphics image
    vile-mini.svg:     SVG Scalable Vector Graphics image
    vile_48x48.svg:    SVG Scalable Vector Graphics image
    $ file --mime-type *.svg
    pumpkin_48x48.svg: image/svg+xml
    sink_48x48.svg:    image/svg+xml
    vile-mini.svg:     image/svg+xml
    vile_48x48.svg:    image/svg+xml

which I selected after seeing a thousand files show only 6 with "text" in the mime-type output. Arguably, matching the "xml" on the end of the mime-type output could be more useful, say, than matching "SVG", but using a script to do that takes you back to the suggestion made here.

The output of `file` requires some tuning in either scenario, and is not 100% reliable (it is confused by several of my Perl scripts, calling them "data").

There is more than one implementation of `file`. The one most commonly used does its work in `libmagic`, which can be used from different programs (perhaps not directly from `zsh`, though `python` can).

According to "File test comparison table for shell, Perl, Ruby, and Python", Perl has a `-T` option which it can use to provide this information. But it lists no comparable feature for `zsh`.

Further reading:
- zsh glob qualifier to exclude binary files

Unfortunately GNU `file`'s output for svg files, `SVG Scalable Vector Graphics image`, doesn't contain the word text. I thought this approach would be better than the accepted answer of checking the mime-type, but it still misses some types. – Peter Cordes, Apr 11 '16 at 23:34

It still misses with the mime-type; for xterm's svg file I get `image/svg+xml`. Actually, I just checked a 1000-file sample, and only 6 came out as "text" according to the mime-type alone. I'll stick with a script, which at least can be made to work as needed. – Thomas Dickey, Apr 11 '16 at 23:39

You could try determining if `iconv` can read the file. This is slower than `file` (which just reads a couple of bytes from the beginning), but will give you more reliable results:

    ENCODING=utf-8
    if iconv --from-code="$ENCODING" --to-code="$ENCODING" your_file.ext > /dev/null 2>&1; then
        echo text
    else
        echo binary
    fi

This makes `iconv` basically a no-op, but if it encounters invalid data (invalid UTF-8 in this example), it will report an error and exit with a non-zero status.

Using `-f` and `-t` instead of the GNU long options would make it more portable. Note that it will call "binary" the files it can't open. It will call empty files "text". – Stéphane Chazelas, Apr 11 '16 at 9:12 (score 4)

Agreed. I used the long forms for ad hoc documentation, for people who don't know `iconv`. But `-f` and `-t` are usually better. – Boldewyn, Apr 11 '16 at 10:54

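A sketch of the `iconv` check using the short options, with unreadable and empty files reported separately instead of being lumped into "binary" or "text". The function name `classify_by_iconv`, the extra "empty" result, and the default of UTF-8 are choices made for this example only.

```shell
#!/bin/sh
# classify_by_iconv FILE [ENCODING]: print "text" if FILE decodes cleanly in
# ENCODING (default UTF-8), "binary" if it does not, "empty" if it has no
# content. Returns 2 if FILE is not a readable regular file.
classify_by_iconv() {
    encoding=${2:-UTF-8}
    if [ ! -f "$1" ] || [ ! -r "$1" ]; then
        echo "cannot read: $1" >&2
        return 2
    fi
    if [ ! -s "$1" ]; then
        echo empty
        return 0
    fi
    # Converting from an encoding to itself only validates the input.
    if iconv -f "$encoding" -t "$encoding" < "$1" > /dev/null 2>&1; then
        echo text
    else
        echo binary
    fi
}
```

Keep in mind that a NUL byte is valid UTF-8, so a file of NULs and ASCII would still come out as "text" with this test.
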
`file` has an option `--mime-encoding` that attempts to detect the encoding of a file.

    $ file --mime-encoding Documents/poster2.pdf
    Documents/poster2.pdf: binary
    $ file --mime-encoding projects/linux/history-torvalds/Makefile
    projects/linux/history-torvalds/Makefile: us-ascii
    $ file --mime-encoding graphe.tex
    graphe.tex: us-ascii
    $ file --mime-encoding software.tex
    software.tex: utf-8

You can use `file --mime-encoding | grep binary` to detect if a file is a binary file. It works reliably, although it can get confused by a single invalid character in a long text file.

For example, I alias `cat` to the following shell script to avoid ruining my terminal by inadvertently opening a binary file:

    #! /bin/sh -
    # When stdout is not a terminal, behave exactly like cat.
    [ ! -t 1 ] && exec /bin/cat "$@"
    for i
    do
      # -b drops the file name, so a name containing "binary" cannot match.
      if file -b --mime-encoding -- "$i" | grep -qx binary
      then
        hexdump -C -- "$i"
      else
        /bin/cat -- "$i"
      fi
    done

Categories are arbitrary. Before answering how to make a classification, you need a (strict) definition. In order to have a definition, you need a purpose.

So, what do you want to do with that classification?

- If you want to select ASCII or binary mode in FTP, it is important not to transfer a binary file as ASCII (or it will be corrupted). So you should test whether the file is plain text, HTML, RTF, or a similar format. When in doubt, select binary. You may also want to test that the file contains only a subset of bytes such as 0x0A, 0x0D, and 0x20-0x7E.
- If you want to transfer the file over some protocol (POP3, SMTP), you need a test to choose between base64 encoding and sending it as plain text. In this case, you should test whether there are unsupported characters.
- Any other case may have any other definition.

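The byte-subset test from the first bullet can be sketched with `tr`: delete every allowed byte and see whether anything is left. The function name `is_plain_ascii` is hypothetical, and the allowed set here is LF, CR, and printable ASCII 0x20-0x7E; DEL (0x7F) and tab are left out, which is a choice of this sketch rather than a rule.

```shell
#!/bin/sh
# is_plain_ascii FILE: succeed if FILE contains only LF, CR, and the
# printable ASCII bytes 0x20-0x7E (octal 040-176).
is_plain_ascii() {
    # LC_ALL=C makes tr work on bytes rather than on locale characters.
    extra=$(LC_ALL=C tr -d '\n\r\040-\176' < "$1" | wc -c)
    [ "$extra" -eq 0 ]
}
```

Because it reads the whole file, this is slower than header-based checks on large files, but the definition it applies is exact.
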
    perl -e'chomp(my$f=<>);print "binary$/" if -B $f;print "text$/" if -T _'

will do it, given the file name on standard input. See the documentation for `-B` and `-T` (search in that page for the string `The -T and -B switches work as follows`).

`perl -le 'print -B $ARGV[0] ? "binary" : "text"' --` might be clearer. Or even `perl -le 'print -B $_ ? "binary" : "text", @ARGV > 1 ? "\t$_" : "" for @ARGV' --` – jrw32982, Apr 21 '17 at 12:20

I contributed to https://github.com/audreyr/binaryornot

It does not have a command-line wrapper (yet), but it is a simple Python library that is easy enough to call even from the CLI. It uses a fairly efficient heuristic to determine if a file is text or binary.

I know this question is a bit old, but I think my friend taught me a great "hack" to do this. You use the `diff` command and check your file against a test text file:

    $ diff filetocheck testfile.txt

Now if `filetocheck` is a binary file, the output would be:

    Binary files filetocheck and testfile.txt differ

This way you could leverage the `diff` command and, e.g., write a function which does the check in a script.

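Such a function might look like the sketch below. It compares against `/dev/null` rather than a separate test file, which is an assumption of this example, and it relies on the English wording of the message, hence `LC_ALL=C`. The name `is_binary_by_diff` is made up.

```shell
#!/bin/sh
# is_binary_by_diff FILE: succeed if diff considers FILE binary when it is
# compared against an empty input.
is_binary_by_diff() {
    LC_ALL=C diff -- /dev/null "$1" 2>/dev/null | grep -q '^Binary files '
}
```

For a text file, `diff` prints ordinary diff output whose lines start with a line range or with `> `, so the pattern does not match. An empty file is identical to `/dev/null` and is therefore not reported as binary.
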
The `file --mime-type` answer above was posted by meuh (answered Apr 10 '16 at 17:44; edited Apr 11 '16 at 2:22 by heemayl).
23
Just remember, depending on yourfile
, that you might miss some text formats:application/xml
(and similar like RSS),application/ecmascript
,application/json
,image/svg+xml
, ... You'd have to whitelist those.
– Boldewyn
Apr 11 '16 at 7:38
@Boldewyn wow, nice examples! So probably a better answer is just to accept any file that has only printable chars, but somehow also cope with utf-8 and similar encoding problems.
– meuh
Apr 11 '16 at 7:49
Yes, that's the gist of my answer below. Only problem is, that that solution has to look at the whole file...
– Boldewyn
Apr 11 '16 at 8:19
7
@Boldewyn In principle,application/*
types are not intended for human consumption, even when they may be text-based to facilitate development and debugging. That's why there is both atext/xml
and anapplication/xml
. So the question whether to consider them as text depends on the OP's needs.
– Tobia
Apr 11 '16 at 8:46
3
Orcut -d/ -f1
– Stéphane Chazelas
Apr 11 '16 at 9:07
add a comment |
23
Just remember, depending on yourfile
, that you might miss some text formats:application/xml
(and similar like RSS),application/ecmascript
,application/json
,image/svg+xml
, ... You'd have to whitelist those.
– Boldewyn
Apr 11 '16 at 7:38
@Boldewyn wow, nice examples! So probably a better answer is just to accept any file that has only printable chars, but somehow also cope with utf-8 and similar encoding problems.
– meuh
Apr 11 '16 at 7:49
Yes, that's the gist of my answer below. Only problem is, that that solution has to look at the whole file...
– Boldewyn
Apr 11 '16 at 8:19
7
@Boldewyn In principle,application/*
types are not intended for human consumption, even when they may be text-based to facilitate development and debugging. That's why there is both atext/xml
and anapplication/xml
. So the question whether to consider them as text depends on the OP's needs.
– Tobia
Apr 11 '16 at 8:46
3
Orcut -d/ -f1
– Stéphane Chazelas
Apr 11 '16 at 9:07
23
23
Just remember, depending on your
file
, that you might miss some text formats: application/xml
(and similar like RSS), application/ecmascript
, application/json
, image/svg+xml
, ... You'd have to whitelist those.– Boldewyn
Apr 11 '16 at 7:38
Just remember, depending on your
file
, that you might miss some text formats: application/xml
(and similar like RSS), application/ecmascript
, application/json
, image/svg+xml
, ... You'd have to whitelist those.– Boldewyn
Apr 11 '16 at 7:38
@Boldewyn wow, nice examples! So probably a better answer is just to accept any file that has only printable chars, but somehow also cope with utf-8 and similar encoding problems.
– meuh
Apr 11 '16 at 7:49
@Boldewyn wow, nice examples! So probably a better answer is just to accept any file that has only printable chars, but somehow also cope with utf-8 and similar encoding problems.
– meuh
Apr 11 '16 at 7:49
Yes, that's the gist of my answer below. Only problem is, that that solution has to look at the whole file...
– Boldewyn
Apr 11 '16 at 8:19
Yes, that's the gist of my answer below. Only problem is, that that solution has to look at the whole file...
– Boldewyn
Apr 11 '16 at 8:19
7
7
@Boldewyn In principle,
application/*
types are not intended for human consumption, even when they may be text-based to facilitate development and debugging. That's why there is both a text/xml
and an application/xml
. So the question whether to consider them as text depends on the OP's needs.– Tobia
Apr 11 '16 at 8:46
@Boldewyn In principle,
application/*
types are not intended for human consumption, even when they may be text-based to facilitate development and debugging. That's why there is both a text/xml
and an application/xml
. So the question whether to consider them as text depends on the OP's needs.– Tobia
Apr 11 '16 at 8:46
3
3
Or
cut -d/ -f1
– Stéphane Chazelas
Apr 11 '16 at 9:07
Another approach would be to use isutf8 from the moreutils collection.
It exits with 0 if the file is valid UTF-8 or ASCII; otherwise it short-circuits, prints an error message (silence it with -q), and exits with 1.
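Since a comment on this answer reports that a directory argument makes isutf8 exit with 0, a script may want to guard the call. A minimal sketch, assuming isutf8 from moreutils is installed; the wrapper name is_utf8_file is invented for this example:

```shell
#!/bin/sh
# Hypothetical wrapper around isutf8 (moreutils).
# Succeeds only for regular files that isutf8 accepts as UTF-8 or ASCII.
is_utf8_file() {
    [ -f "$1" ] || return 1   # reject directories and other non-regular files
    isutf8 -q "$1"            # -q silences the error message
}
```

It could then be used as `if is_utf8_file "$f"; then echo text; else echo binary; fi`.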
edited Apr 11 '16 at 10:49
techraf
answered Apr 11 '16 at 10:21
Wander Nauta
Nice suggestion. I just noticed that giving a directory as arg makes it return 0. I would have preferred 1 at least. But then, garbage in, garbage out.
– meuh
Apr 11 '16 at 14:07
If you like the heuristic used by GNU grep, you could use it:

    isbinary() {
      LC_MESSAGES=C grep -Hm1 '^' < "${1-$REPLY}" | grep -q '^Binary'
    }

It searches for NUL bytes in the first buffer read from the file (a few kilobytes for a regular file, but possibly far fewer for a pipe, a socket, or some devices like /dev/random). In UTF-8 locales, it also flags byte sequences that don't form valid UTF-8 characters. It assumes LC_ALL is not set to something where the language is not English.
The ${1-$REPLY} form allows you to use it as a zsh glob qualifier:

    ls -ld -- *(.+isbinary)

would list the binary files.
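With GNU grep 3.5 and later, the "binary file matches" notice is, as far as I know, written to standard error with different wording, so matching '^Binary' on standard output may no longer work there. A related sketch that uses the same built-in heuristic but relies only on the exit status is grep's -I option, which treats binary files as non-matching. This is a different technique from the function above, and the name istext is made up for the example:

```shell
#!/bin/sh
# Sketch: succeed if grep's own heuristic considers the file to be text.
#   -I  treat binary files as if they contained no matching data
#   -q  quiet, only the exit status matters
# The empty pattern matches every line, so any text file with at least
# one line matches. Empty files therefore count as "not text" here.
istext() {
    [ -f "$1" ] || return 1
    grep -Iq -e '' -- "$1"
}
```

Note the inverted sense: isbinary succeeds for binary files, while this sketch succeeds for text files.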
edited Apr 13 '16 at 13:11
answered Apr 11 '16 at 11:21
Stéphane Chazelas
You can write a script that calls file, and use a case-statement to check for the cases you are interested in.
For example

    #!/bin/sh
    case $(file "$1") in
    (*script*|* text|* text *)
        echo text
        ;;
    (*)
        echo binary
        ;;
    esac

though of course there may be many special cases which are of interest. Just checking strings on a copy of libmagic, I see about 200 cases, e.g.,

    Konqueror cookie text
    Korn shell script text executable
    LaTeX 2e document text
    LaTeX document text
    Linux Software Map entry text
    Linux Software Map entry text (new format)
    Linux kernel symbol map text
    Lisp/Scheme program text
    Lua script text executable
    LyX document text
    M3U playlist text
    M4 macro processor script text

Some use the string "text" as part of a different type, e.g.,

    SoftQuad troff Context intermediate
    SoftQuad troff Context intermediate for AT&T 495 laser printer
    SoftQuad troff Context intermediate for HP LaserJet

likewise script could be part of a word, but I see no problems in this case. But a script should check for "text" as a word, not a substring.
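Checking for "text" as a whole word can be done with case patterns alone. The sketch below classifies a description string rather than calling file itself; the function name classify_description and the idea of feeding it the output of file -b (brief mode, which omits the file name) are assumptions made for the example:

```shell
#!/bin/sh
# Hypothetical helper: classify one description line as printed by `file -b`.
# "text" and "script" must appear as whole words, so "Context" does not
# count. "text," covers descriptions such as "ASCII text, with CRLF ...".
classify_description() {
    case " $1 " in
        *" text "*|*" text,"*|*" script "*) echo text ;;
        *) echo binary ;;
    esac
}

classify_description "Lua script text executable"           # prints: text
classify_description "SoftQuad troff Context intermediate"  # prints: binary
```

In a script this could be called as `classify_description "$(file -b -- "$1")"`; using -b also keeps a file name that happens to contain " text " from influencing the result.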
As a reminder, file output does not use a precise description which would always have "script" or "text". Special cases are something to consider. A followup commented that the --mime-type option works while this approach would not, for .svg files. However, in a test I see these results for svg-files:

    $ ls -l *.svg
    -r--r--r-- 1 tom users 6679 Jul 26 2012 pumpkin_48x48.svg
    -r--r--r-- 1 tom users 17372 Jul 30 2012 sink_48x48.svg
    -r--r--r-- 1 tom users 5929 Jul 25 2012 vile_48x48.svg
    -r--r--r-- 1 tom users 3553 Jul 28 2012 vile-mini.svg
    $ file *.svg
    pumpkin_48x48.svg: SVG Scalable Vector Graphics image
    sink_48x48.svg: SVG Scalable Vector Graphics image
    vile-mini.svg: SVG Scalable Vector Graphics image
    vile_48x48.svg: SVG Scalable Vector Graphics image
    $ file --mime-type *.svg
    pumpkin_48x48.svg: image/svg+xml
    sink_48x48.svg: image/svg+xml
    vile-mini.svg: image/svg+xml
    vile_48x48.svg: image/svg+xml

which I selected after seeing a thousand files show only 6 with "text" in the mime-type output. Arguably, matching the "xml" on the end of the mime-type output could be more useful, say, than matching "SVG", but using a script to do that takes you back to the suggestion made here.
The output of file requires some tuning in either scenario, and is not 100% reliable (it is confused by several of my Perl scripts, calling them "data").
There is more than one implementation of file. The one most commonly used does its work in libmagic, which can be used from different programs (perhaps not directly from zsh, though python can).
According to File test comparison table for shell, Perl, Ruby, and Python, Perl has a -T option which it can use to provide this information. But it lists no comparable feature for zsh.
Further reading:
- zsh glob qualifier to exclude binary files
edited May 23 '17 at 12:40
Community♦
answered Apr 10 '16 at 16:59
Thomas Dickey
Unfortunately GNU file's output for svg files: SVG Scalable Vector Graphics image doesn't contain the word text. I thought this approach would be better than the accepted answer of checking the mime-type, but it still misses some types. – Peter Cordes
Apr 11 '16 at 23:34
It still misses, with the mime-type; for xterm's svg file I get image/svg+xml. Actually - just checked a 1000-file sample, only 6 came out as "text" according to the mime-type alone. I'll stick with a script, which at least can be made to work as needed. – Thomas Dickey
Apr 11 '16 at 23:39
You could try determining if iconv can read the file. This is slower than file (which just reads a couple of bytes from the beginning), but will give you more reliable results:

    ENCODING=utf-8
    if iconv --from-code="$ENCODING" --to-code="$ENCODING" your_file.ext > /dev/null 2>&1; then
        echo text
    else
        echo binary
    fi

This makes iconv basically a no-op, but if it encounters invalid data (invalid UTF-8 in this example), it will barf and exit with a non-zero status.
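Taking into account the points raised in the comments on this answer (short options for portability, unreadable files, empty files), the check could be wrapped as follows. This is a sketch only: the function name is_utf8_text and the choice to report unreadable files with status 2 and empty files as non-text are assumptions, not part of the answer.

```shell
#!/bin/sh
# Hypothetical wrapper: exit 0 if the file is non-empty valid UTF-8,
# 1 if it is empty or not valid UTF-8, 2 if it cannot be read.
is_utf8_text() {
    [ -f "$1" ] && [ -r "$1" ] || return 2   # not a readable regular file
    [ -s "$1" ] || return 1                  # empty: nothing to judge
    # Converting UTF-8 to UTF-8 changes nothing, but iconv fails on the
    # first invalid sequence. Reading from stdin avoids trouble with
    # file names that start with "-".
    iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}
```

Separating "cannot read" from "not text" lets the caller decide how to treat files it has no permission to open.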
edited Apr 11 '16 at 9:10
Stéphane Chazelas
answered Apr 11 '16 at 7:46
Boldewyn
Using -f and -t instead of the GNU long options would make it more portable. Note that it will call "binary" the files it can't open. It will call empty files "text". – Stéphane Chazelas
Apr 11 '16 at 9:12
Agreed. I used the long forms for ad hoc documentation, for people who don't know iconv. But -f and -t are usually better. – Boldewyn
Apr 11 '16 at 10:54
file has an option --mime-encoding that attempts to detect the encoding of a file.

    $ file --mime-encoding Documents/poster2.pdf
    Documents/poster2.pdf: binary
    $ file --mime-encoding projects/linux/history-torvalds/Makefile
    projects/linux/history-torvalds/Makefile: us-ascii
    $ file --mime-encoding graphe.tex
    graphe.tex: us-ascii
    $ file --mime-encoding software.tex
    software.tex: utf-8

You can use file --mime-encoding | grep binary to detect if a file is a binary file. It works reliably, although it can get confused by a single invalid character in a long text file.
For example, I alias cat to the following shell script to avoid ruining my terminal by inadvertently opening a binary file:

    #! /bin/sh -
    # If stdout is not a terminal, behave exactly like cat.
    [ ! -t 1 ] && exec /bin/cat "$@"
    for i
    do
      # Show a hex dump instead of raw bytes for files detected as binary.
      if file --mime-encoding -- "$i" | grep -q binary
      then
        hexdump -C -- "$i"
      else
        /bin/cat -- "$i"
      fi
    done
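One weakness of piping the output through grep is that the file name is part of that output, so a text file whose name contains "binary" would match too. A hedged variant uses file's -b (brief) option to drop the name and compares the whole result; the function name is_binary_file is made up for this sketch:

```shell
#!/bin/sh
# Hypothetical helper: succeed if file(1) reports the encoding "binary".
# -b (brief) suppresses the "name: " prefix, so the comparison is exact
# and cannot be fooled by the file name.
is_binary_file() {
    [ "$(file -b --mime-encoding -- "$1")" = binary ]
}
```

In the wrapper above, the test line would then read `if is_binary_file "$i"`.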
edited Apr 11 '16 at 9:32
Stéphane Chazelas
answered Apr 11 '16 at 8:17
lgeorget
Categories are arbitrary. Before answering how to make a classification, you need a (strict) definition. In order to have a definition, you need a purpose.
So, what do you want to do with that classification?
- If you want to select ASCII or binary mode in FTP, it's important not to transfer a binary file as ASCII (or it will be corrupted). So you should test whether the file is plain text, HTML, RTF, or one of a few other text formats. But when in doubt, select binary. You may also want to test that the file only contains bytes from a subset like 0x0A, 0x0D, and 0x20-0x7F.
- If you want to transfer the file over some protocol (POP3, SMTP), you need to test whether to encode it in base64 or send it as plain text. In this case, you should test whether there are unsupported characters.
- Any other case… may have any other definition.
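The byte-subset test mentioned for FTP can be sketched with tr and wc. The function name is_plain_ascii is invented here, and the allowed set is exactly the one named above (0x0A, 0x0D, and 0x20-0x7F), so a file containing tabs (0x09) would be rejected:

```shell
#!/bin/sh
# Hypothetical helper: succeed if every byte of the file is LF (0x0A),
# CR (0x0D) or in the range 0x20-0x7F.
is_plain_ascii() {
    [ -f "$1" ] && [ -r "$1" ] || return 1
    # Delete all allowed bytes (octal 012, 015, 040-177) and count what
    # is left. LC_ALL=C makes tr work on bytes, not multibyte characters.
    rest=$(( $(LC_ALL=C tr -d '\012\015\040-\177' < "$1" | wc -c) ))
    [ "$rest" -eq 0 ]
}
```

Unlike the heuristics that look only at the first block, this reads the whole file, which fits the "when in doubt, select binary" advice.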
answered Apr 11 '16 at 16:10
ESL
    perl -e'chomp(my$f=<>);print "binary$/" if -B $f;print "text$/" if -T _'

will do it (it reads the file name from standard input). See the documentation for -B and -T (search in that page for the string "The -T and -B switches work as follows").
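For use in a shell script it can be handier to get an exit status than a printed word. A minimal sketch, assuming perl is installed; the function name perl_istext is made up, and it passes the file name as an argument rather than on standard input:

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if Perl's -T heuristic calls the file text.
# The "--" keeps perl from treating a file name that starts with "-"
# as one of its own options.
perl_istext() {
    perl -e 'exit((-T $ARGV[0]) ? 0 : 1)' -- "$1"
}
```

Keep in mind that, per the Perl documentation, both -T and -B are true for an empty file, so this sketch reports empty files as text.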
answered Apr 11 '16 at 19:31
msh210
perl -le 'print -B $ARGV[0] ? "binary" : "text"' -- might be clearer. Or even perl -le 'print -B $_ ? "binary" : "text", @ARGV > 1 ? "\t$_" : "" for @ARGV' -- – jrw32982
Apr 21 '17 at 12:20
perl -le 'print -B $ARGV[0] ? "binary" : "text"' --
might be clearer. Or even perl -le 'print -B $_ ? "binary" : "text", @ARGV > 1 ? "t$_" : "" for @ARGV' --
– jrw32982
Apr 21 '17 at 12:20
add a comment |
I contributed to https://github.com/audreyr/binaryornot
It does not have a command-line wrapper (yet), but it is a simple Python library that is easy enough to call even from the CLI.
It uses a fairly efficient heuristic to determine whether a file is text or binary.
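As a rough sketch of what calling it from the CLI could look like: the wrapper name `binaryornot_classify` is hypothetical, and the snippet assumes that `python3` is available, that the package has been installed (for example with `pip install binaryornot`), and that the library exposes `binaryornot.check.is_binary` as described in its README.

```shell
# binaryornot_classify FILE...
# Hypothetical wrapper (not part of binaryornot): prints "text" or "binary",
# a tab, and the file name for each argument.
# Assumes python3 with the binaryornot package installed.
binaryornot_classify() {
    python3 -c '
import sys
from binaryornot.check import is_binary

for path in sys.argv[1:]:
    print("binary" if is_binary(path) else "text", path, sep="\t")
' "$@"
}
```

It would then be called as `binaryornot_classify somefile otherfile`.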
answered Aug 21 '16 at 22:12 by Philippe Ombredanne
I know this answer is a bit old, but I think my friend taught me a great "hack" to do this.
You use the `diff` command and check your file against a test text file:
$ diff filetocheck testfile.txt
Now if `filetocheck` is a binary file, the output would be:
Binary files filetocheck and testfile.txt differ
This way you could leverage the `diff` command and, e.g., write a function which does the check in a script.
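Such a function might look like the sketch below. The name `is_binary_by_diff` and the temporary reference file are illustrative choices, not part of the original hack. It assumes a `diff` that prints the `Binary files ... differ` message shown above (GNU diff does) and that `mktemp` is available.

```shell
# is_binary_by_diff FILE
# Hypothetical helper: exit status 0 if diff reports FILE as binary,
# 1 if it does not, 2 if FILE is unreadable or no temp file can be made.
is_binary_by_diff() {
    [ -r "$1" ] || return 2
    # Create a small reference text file to compare against.
    _ref=$(mktemp) || return 2
    printf 'reference text\n' > "$_ref"
    # LC_ALL=C keeps the message untranslated so the pattern below matches.
    _out=$(LC_ALL=C diff -- "$1" "$_ref" 2>/dev/null)
    rm -f -- "$_ref"
    case $_out in
        "Binary files "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: `if is_binary_by_diff filetocheck; then echo binary; else echo text; fi`. Since the check relies on the wording of diff's message, it is more fragile than the other approaches on this page.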
edited 9 hours ago by Rui F Ribeiro
answered Nov 6 '17 at 16:43 by user3019105
`file` is a standard utility and can run through the file magic for determining file types to the best of its abilities. It can tell most text formats and does a pretty decent job on binary formats. If all you're trying to do is find out if a file is text or not, that's the command you're interested in. – Bratchley Apr 10 '16 at 16:37
@Bratchley: some versions of `file` will print, e.g., `shell script` for some files I would like classified as "text". Is there a way to get `file` to print just `text` or `binary`? – kjo Apr 10 '16 at 16:48
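One possible way to get that from `file`, offered as a sketch rather than a definitive answer: implementations that support `--mime-encoding` (an assumption here; the commonly shipped `file` on Linux and the BSDs does) print `binary` for data they do not recognise as text and a character set name such as `us-ascii` or `utf-8` otherwise. The helper name `classify_with_file` is made up, and treating every encoding other than `binary` as text is a simplification.

```shell
# classify_with_file FILE...
# Hypothetical helper: prints "text" or "binary", a tab, and the file name.
# Assumes a file(1) that supports -b (brief) and --mime-encoding.
classify_with_file() {
    for _f in "$@"; do
        # file(1) may exit 0 even for missing files, so check explicitly.
        [ -f "$_f" ] || { printf 'not a regular file: %s\n' "$_f" >&2; continue; }
        _enc=$(file -b --mime-encoding -- "$_f") || return 1
        case $_enc in
            binary) printf 'binary\t%s\n' "$_f" ;;
            *)      printf 'text\t%s\n' "$_f" ;;
        esac
    done
}
```

With this, `classify_with_file script.sh` would report `text` where plain `file` might say `shell script`.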
@don_crissti That question is about someone trying to get people to debug his bash script. Detecting text is just what the script is supposed to do. They ended up having an issue in one of their `cut` commands. – Bratchley Apr 10 '16 at 17:18
@don_crissti The fact that there's an answer on question A that works for question B does not always make A a duplicate of B. Consider someone who is looking for a way to classify files as text or binary. Which is more useful: a "debug my script" question which happens to have a generic answer buried among other answers that are specific to that script, or a generic "how do I classify files as text or binary?"? – Gilles Apr 10 '16 at 21:05
@Gilles - depends on how you read it. I actually see the question there as a typical case of an XY problem: the OP there wants to check if a file is a text file and thinks piping `file` output to `cut` is the solution. Sure, there's a missing space which makes it fail, and that has made most people there address the Y instead of the X, but Stéphane's comments and answer show the proper way to determine whether the file is text or not. – don_crissti Apr 10 '16 at 21:15