uniq -c Equivalent for Groups of Lines of Arbitrary Count


I've got a file of ~1-2 million lines that I'm trying to reduce down by counting duplicate groups of lines, preserving order.



uniq -c works okay:

    $ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' | uniq -c
    4 foo
    4 bar
    1 baz
    1 foo
    1 bar
    1 baz
    1 foo
    1 bar
    1 baz
    1 foo
    1 bar
    1 baz
    1 foo
    1 bar
    1 baz


In my use case (but not in the following foo-bar-baz example), counting pairs of lines is ~20% more efficient. It looks like this:

    $ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' \
      | sed 's/^/__STARTOFSTRINGDELIMITER__/' \
      | paste - - \
      | uniq -c \
      | sed -r 's/__STARTOFSTRINGDELIMITER__//; s/__STARTOFSTRINGDELIMITER__/\n\t/;'
    2 foo
      foo
    2 bar
      bar
    1 baz
      foo
    1 bar
      baz
    1 foo
      bar
    1 baz
      foo
    1 bar
      baz
    1 foo
      bar
    1 baz


(That format is acceptable to me.)
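
The pairwise pipeline above could be generalized to any fixed group size without the delimiter trick. The following is only a sketch under that assumption: `count_fixed_groups` is a hypothetical helper name, not something from the original post, and it still cuts the input at fixed block boundaries, so it does not find groups of varying length.

```shell
# count_fixed_groups N: hypothetical helper (not from the original post).
# Reads stdin, cuts it into consecutive blocks of N lines, and prints each
# run of identical blocks once, prefixed by its repeat count.
count_fixed_groups() {
    awk -v n="$1" '
        # Close the current block: extend the running count or start a new run.
        function emit() {
            if (count && (block "") == (prev "")) {   # force string comparison
                count++
            } else {
                if (count) printf "%d %s\n", count, prev
                prev = block
                count = 1
            }
            block = ""
            filled = 0
        }
        {
            # Continuation lines of a block are joined with newline + tab.
            block = filled ? (block "\n\t" $0) : $0
            if (++filled == n) emit()
        }
        END {
            if (filled) emit()      # trailing partial block
            if (count) printf "%d %s\n", count, prev
        }
    '
}

# Example: blocks of 2 lines
printf 'a\na\na\na\nb\n' | count_fixed_groups 2
```

With a group size of 2 this should give the same grouping as the `paste - -` pipeline, with continuation lines indented by a tab.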



How can I reduce duplicate groups of an arbitrary number of lines (keeping a sane buffer size, say 2-10 lines) down to a single copy plus a count?

Following the above example, I would want output similar to:

    4 foo
    4 bar
    1 baz
    4 foo
      bar
      baz






awk perl uniq






edited Mar 28 at 22:54 by Rui F Ribeiro

asked Mar 28 at 16:36 by robut




9818












  • That's similar to what some compression algorithms do. Maybe some avenue worth exploring.

    – Stéphane Chazelas
    Mar 28 at 18:32












  • The issue seems to be finding the groups of lines. Your output may as well say that the combination of foo followed by bar occurs 5 times.

    – Kusalananda
    Mar 29 at 6:34












  • @Kusalananda Do you mean foo followed by bar 4 times (the first two sets of four each)? You would be correct then, yes, and either output would be acceptable for me (either foo x4 then bar x4, or (foo, bar) x4). I assume it would depend on the buffer length: 10 lines of buffer would produce the latter, fewer than 8 lines of buffer would produce the former. It's not really an issue, as you say, just a consideration.

    – robut
    Mar 29 at 12:25
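
One way to approach the variable-length case discussed in the comments above is a greedy scan: at each position, try every group length up to the buffer size, count how often that group repeats back to back, and keep the length that covers the most lines. This is a minimal sketch under assumptions, not a solution from the thread: `collapse_groups` and its `max` parameter are hypothetical names, ties go to the shorter group, and the whole input is held in memory.

```shell
# collapse_groups [MAX]: hypothetical helper (not from the original thread).
# Greedily collapses back-to-back repeats of groups of 1..MAX lines.
collapse_groups() {
    awk -v max="${1:-10}" '
        { line[NR] = $0 }           # assumption: input fits in memory

        # Number of consecutive copies of the k-line group starting at line i.
        function reps(i, k,    r, j, same) {
            r = 1
            while (i + (r + 1) * k - 1 <= NR) {
                same = 1
                for (j = 0; j < k; j++)
                    if ((line[i + j] "") != (line[i + r * k + j] "")) { same = 0; break }
                if (!same) break
                r++
            }
            return r
        }

        END {
            i = 1
            while (i <= NR) {
                best_k = 1; best_r = 1
                # Pick the group length covering the most lines; ties keep the shorter one.
                for (k = 1; k <= max && i + 2 * k - 1 <= NR; k++) {
                    r = reps(i, k)
                    if (r > 1 && r * k > best_r * best_k) { best_k = k; best_r = r }
                }
                printf "%d %s\n", best_r, line[i]
                for (j = 1; j < best_k; j++) printf "\t%s\n", line[i + j]
                i += best_r * best_k
            }
        }
    '
}

# Example: the pair (a, b) occurs twice, then a single c
printf 'a\nb\na\nb\nc\n' | collapse_groups 2
```

On the foo-bar-baz input from the question with a buffer of 10, this greedy choice gives `4 foo`, `4 bar`, then the group baz, foo, bar four times, then `1 baz`. That is the same number of output lines as the desired output, but with the group rotated by one line, because the scan never looks ahead to a later starting position.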

















1 Answer
I don't have such a huge dataset for benchmarking. Give this a try:

    $ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' | awk 'NR == 1 {word=$0; count=1; next} $0 != word {print count,word; word=$0; count=1; next} {count++} END {print count,word}'
    4 foo
    4 bar
    1 baz
    1 foo
    1 bar
    1 baz
    1 foo
    1 bar
    1 baz
    1 foo
    1 bar
    1 baz
    1 foo
    1 bar
    1 baz

Using mawk instead of awk may improve performance.







edited Mar 29 at 15:21

answered Mar 29 at 5:57 by finswimmer




72918












  • Can this be adapted to work with multi-word lines? echo -e 'a b c \n a b c \n a b c' | awk 'NR == 1 {word=$1; count=1; next} $1 != word {print count,word; word=$1; count=1; next} {count++} END {print count,word}' for example only counts a. Sorry that it wasn't clear in my original question that my actual lines are multi-word with non-alphanumeric characters too.

    – robut
    Mar 29 at 13:02












  • Just replace the $1 with $0 to compare whole lines. I've edited my answer.

    – finswimmer
    Mar 29 at 15:22
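
The difference between comparing `$1` and `$0` in the comments above can be seen on a small input. This is an illustrative sketch only: the three sample lines and the variable names `by_first_field` and `by_whole_line` are made up for the demonstration.

```shell
# Sample input (made up): the third line shares only its first word with the others.
sample='a b c
a b c
a x y'

# Keyed on the first field ($1): all three lines start with "a", so they merge.
by_first_field=$(printf '%s\n' "$sample" |
    awk 'NR == 1 {key=$1; count=1; next}
         $1 != key {print count, key; key=$1; count=1; next}
         {count++}
         END {print count, key}')

# Keyed on the whole line ($0): the third line differs, so it starts a new run.
by_whole_line=$(printf '%s\n' "$sample" |
    awk 'NR == 1 {key=$0; count=1; next}
         $0 != key {print count, key; key=$0; count=1; next}
         {count++}
         END {print count, key}')

echo "$by_first_field"    # prints: 3 a
echo "$by_whole_line"     # prints: 2 a b c, then 1 a x y
```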
















