uniq -c Equivalent for Groups of Lines of Arbitrary Count


I've got a file of ~1-2 million lines that I'm trying to reduce down by counting duplicate groups of lines, preserving order.



uniq -c works okay:



$ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' | uniq -c
4 foo
4 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz


In my use case (but not in the following foo-bar-baz example), counting pairs of lines is ~20% more efficient, and looks like this:



$ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' \
| sed 's/^/__STARTOFSTRINGDELIMITER__/' \
| paste - - \
| uniq -c \
| sed -r 's/__STARTOFSTRINGDELIMITER__//; s/__STARTOFSTRINGDELIMITER__/\n\t/;'
2 foo
foo
2 bar
bar
1 baz
foo
1 bar
baz
1 foo
bar
1 baz
foo
1 bar
baz
1 foo
bar
1 baz


(That format is acceptable to me.)
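The pairing approach above can be extended to any fixed block size without a delimiter string. The following is only a sketch: `count_fixed_groups` is a hypothetical helper name, not part of the original pipeline, and it assumes a POSIX-style awk. It joins every N lines into one block and counts adjacent identical blocks, printing the count and the first line separated by a tab, with the remaining lines of the block tab-indented.

```shell
# count_fixed_groups N: hypothetical helper, not from the original post.
# Reads stdin, joins every N lines into one block, counts adjacent identical blocks.
count_fixed_groups() {
  awk -v n="${1:-2}" '
    # Print the finished run: repeat count, a tab, then the block.
    function emit_run() {
      if (count > 0) printf "%d\t%s\n", count, prev
    }
    # Compare the completed block with the previous one; start a new run if they differ.
    function close_block() {
      if (count > 0 && block == prev) count++
      else { emit_run(); prev = block; count = 1 }
      filled = 0
    }
    {
      # Appending "" forces string comparison, so "1" and "1.0" stay distinct.
      block = (filled == 0) ? ($0 "") : (block "\n\t" $0)
      if (++filled == n) close_block()
    }
    END {
      if (filled > 0) close_block()   # trailing partial block
      emit_run()
    }
  '
}

# Example: pairs of lines, as in the paste-based pipeline.
printf '%s\n' foo foo foo foo bar bar bar bar baz | count_fixed_groups 2
```

The block size is still fixed per run, so this has the same alignment limitation as `paste - -`: a repeated group is only detected when it starts on a block boundary.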



How can I reduce duplicate groups of an arbitrary number of lines (within a sane buffer size, say 2-10 lines) down to a single copy plus a count?



Following the above example, I would want output similar to:



4 foo
4 bar
1 baz
4 foo
bar
baz









awk perl uniq






edited Mar 28 at 22:54 by Rui F Ribeiro
asked Mar 28 at 16:36 by robut












  • That's similar to what some compression algorithms do. Maybe an avenue worth exploring.
    – Stéphane Chazelas
    Mar 28 at 18:32

  • The issue seems to be finding the groups of lines. Your output may as well say that the combination of foo followed by bar occurs 5 times.
    – Kusalananda
    Mar 29 at 6:34

  • @Kusalananda Do you mean foo followed by bar 4 times? (The first two sets of four each.) You would be correct then, yes, and either output would be acceptable for me (either foo x4 then bar x4, or (foo, bar) x4). I assume it would depend on the buffer length: 10 lines of buffer would produce the latter, fewer than 8 lines of buffer would produce the former. It's not really an issue as you say, just a consideration.
    – robut
    Mar 29 at 12:25

















1 Answer
I don't have such a huge dataset for benchmarking. Give this a try:



$ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' | awk 'NR == 1 {word=$0; count=1; next} $0 != word {print count,word; word=$0; count=1; next} {count++} END {print count,word}'
4 foo
4 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz


Using mawk instead of awk may improve performance.
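The awk program above counts runs of single identical lines, like uniq -c. One possible way to extend the idea to repeated groups of up to N lines is sketched below. This is an illustration under stated assumptions, not a tested solution for the 1-2 million line file: `count_groups` is a hypothetical helper name, the whole input is buffered in memory, and the choice of group is greedy (at each position it takes the block size from 1 to N whose back-to-back repeats cover the most lines, preferring the smaller size on a tie).

```shell
# count_groups [MAX]: hypothetical helper; reads stdin, prints "count<TAB>first line"
# followed by the remaining lines of the block, each tab-indented.
count_groups() {
  awk -v max="${1:-10}" '
    # Buffer the whole input (assumes it fits in memory); "" forces string comparison.
    { line[NR] = $0 "" }

    # Number of back-to-back copies of the g-line block that starts at line i.
    function repeats(i, g,    r, k, same) {
      r = 1
      while (i + (r + 1) * g - 1 <= NR) {
        same = 1
        for (k = 0; k < g; k++)
          if (line[i + k] != line[i + r * g + k]) { same = 0; break }
        if (!same) break
        r++
      }
      return r
    }

    END {
      i = 1
      while (i <= NR) {
        best_g = 1; best_r = 1
        for (g = 1; g <= max && i + 2 * g - 1 <= NR; g++) {
          r = repeats(i, g)
          # Keep the block size covering the most lines; ties keep the smaller size.
          if (r > 1 && r * g > best_r * best_g) { best_g = g; best_r = r }
        }
        printf "%d\t%s\n", best_r, line[i]
        for (k = 1; k < best_g; k++) printf "\t%s\n", line[i + k]
        i += best_g * best_r
      }
    }
  '
}

# Example with the same 21 lines as in the question.
printf '%s\n' foo foo foo foo bar bar bar bar baz \
  foo bar baz foo bar baz foo bar baz foo bar baz | count_groups 10
```

On the question's sample input this prints 4 foo, 4 bar, then a three-line block baz/foo/bar with count 4, then 1 baz. That is as compact as the output asked for, but the repeated block is rotated (baz, foo, bar instead of foo, bar, baz) because the greedy pass takes the first repeat it finds. Matching the exact layout in the question would need some lookahead, which this sketch does not attempt.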






edited Mar 29 at 15:21
answered Mar 29 at 5:57 by finswimmer












  • Can this be adapted to work with multi-word lines? For example, echo -e 'a b c \n a b c \n a b c' | awk 'NR == 1 {word=$1; count=1; next} $1 != word {print count,word; word=$1; count=1; next} {count++} END {print count,word}' only counts a. Sorry that it wasn't clear in my original question that my actual lines are multi-word, with non-alphanumeric characters too.
    – robut
    Mar 29 at 13:02

  • Just replace the $1 with $0 to compare whole lines. I've edited my answer.
    – finswimmer
    Mar 29 at 15:22
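
The difference between $1 and $0 discussed in these comments can be seen with a small comparison. The two function names below are hypothetical and exist only for this illustration; the awk programs are the ones from the answer, once with $1 and once with $0.

```shell
# Both helper names are hypothetical, used only for this comparison.
# count_by_first_field compares only the first whitespace-separated field ($1).
count_by_first_field() {
  awk 'NR == 1 {word=$1; count=1; next} $1 != word {print count,word; word=$1; count=1; next} {count++} END {print count,word}'
}

# count_by_whole_line compares the entire line ($0).
count_by_whole_line() {
  awk 'NR == 1 {word=$0; count=1; next} $0 != word {print count,word; word=$0; count=1; next} {count++} END {print count,word}'
}

printf 'a b c\na b c\na b d\n' | count_by_first_field   # prints "3 a"
printf 'a b c\na b c\na b d\n' | count_by_whole_line    # prints "2 a b c" then "1 a b d"
```

With $1 the third line is merged into the run because only the first field is compared; with $0 it starts a new run.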
















