uniq -c Equivalent for Groups of Lines of Arbitrary Count

I've got a file of ~1-2 million lines that I'm trying to reduce down by counting duplicate groups of lines, preserving order.

uniq -c works okay:

$ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' | uniq -c
 4 foo
 4 bar
 1 baz
 1 foo
 1 bar
 1 baz
 1 foo
 1 bar
 1 baz
 1 foo
 1 bar
 1 baz
 1 foo
 1 bar
 1 baz

In my use case (though not in the foo-bar-baz example below), counting pairs of lines is ~20% more efficient, and looks like this:

$ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' \
 | sed 's/^/__STARTOFSTRINGDELIMITER__/' \
 | paste - - \
 | uniq -c \
 | sed -r 's/__STARTOFSTRINGDELIMITER__//; s/__STARTOFSTRINGDELIMITER__/\n\t/;'
 2 foo
 foo
 2 bar
 bar
 1 baz
 foo
 1 bar
 baz
 1 foo
 bar
 1 baz
 foo
 1 bar
 baz
 1 foo
 bar
 1 baz

(That format is acceptable to me.)

How can I reduce duplicate groups of an arbitrary number of lines (well, within a sane buffer size such as 2-10 lines) down to a single copy plus a count?

Following the above example, I would want output similar to:

4 foo
4 bar
1 baz
4 foo
 bar
 baz

asked Mar 28 at 16:36 by robut
edited Mar 28 at 22:54 by Rui F Ribeiro

That's similar to what some compression algorithms do. Maybe some avenue worth exploring.

– Stéphane Chazelas
Mar 28 at 18:32

The issue seems to be finding the groups of lines. Your output may as well say that the combination of foo followed by bar occurs 5 times.

– Kusalananda♦
Mar 29 at 6:34

@Kusalananda Do you mean foo followed by bar 4 times? (The first two sets of four each.) If so, you are correct, and either output would be acceptable to me (either foo x4 then bar x4, or (foo, bar) x4). I assume it would depend on the buffer length: 10 lines of buffer would produce the latter, fewer than 8 lines of buffer would produce the former. It's not really an issue, as you say, just a consideration.

– robut
Mar 29 at 12:25

Tags: awk, perl, uniq

1 Answer

I don't have such a huge dataset for benchmarking. Give this a try:

$ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' | awk 'NR == 1 {word=$0; count=1; next} $0 != word {print count,word; word=$0; count=1; next} {count++} END {print count,word}'
4 foo
4 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz

Using mawk instead of awk may improve performance.

answered Mar 29 at 5:57 by finswimmer
edited Mar 29 at 15:21

Can this be adapted to work with multi-word lines?

echo -e 'a b c \n a b c \n a b c' | awk 'NR == 1 {word=$1; count=1; next} $1 != word {print count,word; word=$1; count=1; next} {count++} END {print count,word}'

for example only counts a. Sorry that it wasn't clear in my original question that my actual lines are multi-word, with non-alphanumeric characters too.

– robut
Mar 29 at 13:02

Just replace the $1 with $0 to compare whole lines. I've edited my answer.

– finswimmer
Mar 29 at 15:22
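
For the grouped case the question asks about, one possible approach is a greedy scan: at each position, try every block length from 1 up to a maximum, count how often that block repeats back to back, and collapse whichever choice covers the most lines. The following is only a minimal sketch under assumptions that are not from the thread: the function name group_uniq and its maximum-block-size argument are hypothetical, the whole input is held in memory, and ties go to the shorter block. Because the scan is greedy from the left, on the foo/bar/baz example it reports the repeating triple as baz, foo, bar (x4) followed by a lone baz, rather than a lone baz followed by foo, bar, baz (x4); the two results are equally compact.

```shell
#!/bin/sh
# group_uniq MAX  (hypothetical helper, not part of any standard tool)
# Collapse back-to-back repeats of blocks of up to MAX lines (default 10).
# Reads stdin and keeps the whole input in memory.
group_uniq() {
    awk -v max="${1:-10}" '
    { line[NR] = $0 "" }    # slurp input; appending "" forces string comparison
    END {
        i = 1
        while (i <= NR) {
            best_k = 1; best_reps = 1
            # try each block length for which at least two blocks still fit
            for (k = 1; k <= max && i + 2*k - 1 <= NR; k++) {
                reps = 1
                # count consecutive copies of the block line[i .. i+k-1]
                while (i + (reps + 1)*k - 1 <= NR && same(i, i + reps*k, k))
                    reps++
                # keep the choice covering the most lines; ties keep the shorter block
                if (reps > 1 && reps * k > best_reps * best_k) {
                    best_k = k; best_reps = reps
                }
            }
            # count plus first line, remaining lines of the block tab-indented
            printf "%d %s\n", best_reps, line[i]
            for (j = 1; j < best_k; j++)
                printf "\t%s\n", line[i + j]
            i += best_reps * best_k
        }
    }
    # same(a, b, k): 1 if the k lines starting at a equal the k lines starting at b
    function same(a, b, k,    j) {
        for (j = 0; j < k; j++)
            if (line[a + j] != line[b + j]) return 0
        return 1
    }'
}

# Usage: the sample input from the question, with blocks of up to 10 lines
printf '%s\n' foo foo foo foo bar bar bar bar baz \
    foo bar baz foo bar baz foo bar baz foo bar baz | group_uniq 10
```

With a maximum of 1 the sketch reduces to what uniq -c does, minus the padding in front of the count. For inputs too large to hold in memory, the same idea would need a sliding buffer, which this sketch does not attempt.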
