uniq -c Equivalent for Groups of Lines of Arbitrary Count
I've got a file of ~1-2 million lines that I'm trying to reduce down by counting duplicate groups of lines, preserving order.
uniq -c works okay:

    $ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' | uniq -c
          4 foo
          4 bar
          1 baz
          1 foo
          1 bar
          1 baz
          1 foo
          1 bar
          1 baz
          1 foo
          1 bar
          1 baz
          1 foo
          1 bar
          1 baz

In my use-case (but not in the following foo-bar-baz example), counting pairs of lines is ~20% more efficient, and looks like:

    $ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' \
        | sed 's/^/__STARTOFSTRINGDELIMITER__/' \
        | paste - - \
        | uniq -c \
        | sed -r 's/__STARTOFSTRINGDELIMITER__//; s/__STARTOFSTRINGDELIMITER__/\n\t/;'
          2 foo
            foo
          2 bar
            bar
          1 baz
            foo
          1 bar
            baz
          1 foo
            bar
          1 baz
            foo
          1 bar
            baz
          1 foo
            bar
          1 baz

(That format is acceptable to me.)
How can I reduce duplicate groups of arbitrary numbers of lines (well, keeping a sane buffer count like 2-10 lines) down to a single copy plus a count of them?
Following the above example, I would want output similar to:

          4 foo
          4 bar
          1 baz
          4 foo
            bar
            baz

awk perl uniq
asked Mar 28 at 16:36 by robut; edited Mar 28 at 22:54 by Rui F Ribeiro
That's similar to what some compression algorithms do. Maybe some avenue worth exploring.
– Stéphane Chazelas
Mar 28 at 18:32
The issue seems to be finding the groups of lines. Your output may as well say that the combination of foo followed by bar occurs 5 times.
– Kusalananda♦
Mar 29 at 6:34
@Kusalananda Do you mean foo followed by bar 4 times? (The first two sets of four each.) You would be correct then, yes, and either output would be acceptable for me (either foo x4 then bar x4, or (foo, bar) x4). I assume it would depend on the buffer length: 10 lines of buffer would produce the latter, fewer than 8 lines of buffer would produce the former. It's not really an issue as you say, just a consideration.
– robut
Mar 29 at 12:25
1 Answer
I don't have such a huge dataset for benchmarking. Give this a try:

    $ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' |
        awk 'NR == 1    {word=$0; count=1; next}
             $0 != word {print count,word; word=$0; count=1; next}
                        {count++}
             END        {print count,word}'
    4 foo
    4 bar
    1 baz
    1 foo
    1 bar
    1 baz
    1 foo
    1 bar
    1 baz
    1 foo
    1 bar
    1 baz
    1 foo
    1 bar
    1 baz

Using mawk instead of awk may improve performance.
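
The awk program above counts runs of single identical lines, much like uniq -c. As a rough sketch only (not part of the original answer), the same idea might be extended to repeated blocks of lines. The function name count_groups, its MAX argument (largest block size tried, default 10), the greedy "cover the most lines" rule, and the tab-indented continuation lines are all assumptions made for illustration. It also buffers the whole input in memory, which is assumed to be tolerable for a file of a few million lines.

```shell
# count_groups [MAX]: hypothetical helper, not from the original answer.
# Reads lines on stdin and collapses consecutive repeats of blocks of
# up to MAX lines (default 10) into "count first-line" plus indented rest.
count_groups() {
    awk -v max="${1:-10}" '
    { line[NR] = $0 }                       # buffer the whole input
    END {
        i = 1
        while (i <= NR) {
            best_k = 1; best_reps = 1
            for (k = 1; k <= max && i + k - 1 <= NR; k++) {
                # count consecutive copies of the block line[i .. i+k-1]
                reps = 1
                while (same_block(i, i + reps * k, k)) reps++
                # keep the block size covering the most lines;
                # the strict comparison makes ties go to the smaller block
                if (reps > 1 && reps * k > best_reps * best_k) {
                    best_k = k; best_reps = reps
                }
            }
            print best_reps, line[i]
            for (j = 1; j < best_k; j++) print "\t" line[i + j]
            i += best_k * best_reps
        }
    }
    # true if the k lines starting at a equal the k lines starting at b
    function same_block(a, b, k,    j) {
        if (b + k - 1 > NR) return 0
        for (j = 0; j < k; j++) if (line[a + j] != line[b + j]) return 0
        return 1
    }'
}

# Same sample data as in the question, generated with printf instead of perl
printf '%s\n' foo foo foo foo bar bar bar bar baz \
    foo bar baz foo bar baz foo bar baz foo bar baz | count_groups 10
```

On that sample this greedy rule prints 4 foo, 4 bar, then a count of 4 for the block baz, foo, bar, and a final 1 baz. The tail is grouped one line earlier than in the output the question asks for, but it is equally compact; getting the exact grouping from the question would need some lookahead, which this sketch does not attempt.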
answered Mar 29 at 5:57 by finswimmer; edited Mar 29 at 15:21
Can this be adapted to work with multi-word lines?

    echo -e 'a b c \n a b c \n a b c' | awk 'NR == 1 {word=$1; count=1; next} $1 != word {print count,word; word=$1; count=1; next} {count++} END {print count,word}'

for example only counts a. Sorry that it wasn't clear in my original question that my actual lines are multi-word with non-alphanumeric characters too.
– robut
Mar 29 at 13:02
Just replace the $1 with $0 to compare whole lines. I've edited my answer.
– finswimmer
Mar 29 at 15:22
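
To illustrate the $1 versus $0 point from the comments, here is a minimal sketch that wraps the answer's awk program in a shell function. The name count_runs is hypothetical, and the NR check in END (so that empty input prints nothing) is an addition that is not in the answer.

```shell
# count_runs: hypothetical wrapper around the awk program from the answer.
# Compares whole lines ($0), so multi-word lines are kept intact.
count_runs() {
    awk 'NR == 1    {word=$0; count=1; next}
         $0 != word {print count, word; word=$0; count=1; next}
                    {count++}
         END        {if (NR) print count, word}'   # guard: nothing on empty input
}

# Three identical multi-word lines collapse to one line with a count of 3
printf 'a b c\na b c\na b c\n' | count_runs
```

With $1 in place of $0 the same input would be reported as 3 a, because only the first field would be compared and printed.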