Pick up successive lines containing keywords in order
I have a tab-separated file that looks as follows:
$ cat file
GCF_000015405.1_ASM1540v1.dist_nbr_anntn WP_011558474.1 1159543 1160595 -4330977 polyketide synthase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn WP_011558475.1 1160607 1161116 12 isoprenylcysteine carboxyl methyltransferase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn WP_011558476.1 1161113 1162129 -3 NAD(P)/FAD-dependent oxidoreductase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn WP_011559726.1 2496640 2497560 1334511 polyketide synthase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn WP_011559727.1 2497568 2498122 8 isoprenylcysteine carboxyl methyltransferase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn WP_011562574.1 5526997 5528142 3028875 NAD(P)/FAD-dependent oxidoreductase [Mycobacterium]
I need to pick up successive lines that contain the keywords 'polyketide synthase', 'methyltransferase', and 'oxidoreductase' in that order, and write each of these sets into separate files for further analysis.
In this case, the input file would yield 2 output files which would look as follows:
$ cat file_1
GCF_000015405.1_ASM1540v1.dist_nbr_anntn WP_011558474.1 1159543 1160595 -4330977 polyketide synthase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn WP_011558475.1 1160607 1161116 12 isoprenylcysteine carboxyl methyltransferase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn WP_011558476.1 1161113 1162129 -3 NAD(P)/FAD-dependent oxidoreductase [Mycobacterium]
$ cat file_2
GCF_000015405.1_ASM1540v1.dist_nbr_anntn WP_011559726.1 2496640 2497560 1334511 polyketide synthase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn WP_011559727.1 2497568 2498122 8 isoprenylcysteine carboxyl methyltransferase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn WP_011562574.1 5526997 5528142 3028875 NAD(P)/FAD-dependent oxidoreductase [Mycobacterium]
I am having a hard time doing this using awk. Any suggestions?
P.S. I have other input files that contain a variable number of instances of the keywords in successive lines. This is where I am getting stuck.
text-processing awk
asked yesterday, edited 21 hours ago – BhushanDhamale
What makes those two output files, file_1 and file_2, different from each other? Which other files are you talking about in "other files that contain variable number of instances of the keywords in successive lines"? Please edit your question and make it a little clearer.
– αғsнιη
yesterday
@αғsнιη Sorry if I was unclear in my question. file_1 and file_2 would contain different sets of the keyword instances in successive lines (you could look at the intended output files in the question for further clarification). Also, I have made the requested edit in the question.
– BhushanDhamale
21 hours ago
2 Answers
You can change what you are searching for as the script progresses, and change where you write to each time you cycle through your terms:

    awk 'BEGIN {
        result_file = 1;
        term_id = 1;
        search_terms[1] = "polyketide synthase";
        search_terms[2] = "methyltransferase";
        search_terms[3] = "oxidoreductase"
    }
    # only act on lines that contain the term we are currently looking for
    $0 ~ search_terms[term_id] {
        print $0 >> (FILENAME "_" result_file);
        term_id = term_id + 1;
        # after the last term, move on to the next output file
        if (term_id > 3) {
            result_file = result_file + 1;
            term_id = 1
        }
    }' input_file

This will write to input_file_1, input_file_2, ...
answered yesterday, edited 19 hours ago – Philip Couling
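One thing worth checking with the script above: it waits for the next term and simply skips lines that do not match, so the three lines of a set do not have to be adjacent. If the sets must be strictly consecutive, one possible variant is to buffer the lines and throw the buffer away on a mismatch. The following is only a sketch under that assumption; `sample.txt` and its contents are made up for illustration, and the names `terms`, `want`, `buf` and `out` are not from the answer.

```shell
# Made-up sample: the second set is interrupted by an unrelated line.
printf '%s\n' \
    'a polyketide synthase' \
    'b methyltransferase' \
    'c oxidoreductase' \
    'd polyketide synthase' \
    'e hypothetical protein' \
    'f methyltransferase' \
    'g oxidoreductase' > sample.txt

awk '
BEGIN {
    n = split("polyketide synthase,methyltransferase,oxidoreductase", terms, ",")
    want = 1          # index of the term the current line must contain
    result_file = 1   # suffix of the next output file
}
{
    if ($0 ~ terms[want]) {
        # line continues the sequence: add it to the buffer
        buf = (want == 1 ? $0 : buf ORS $0)
        want++
    } else if ($0 ~ terms[1]) {
        # sequence broken, but this line can start a new one
        buf = $0
        want = 2
    } else {
        # sequence broken: discard the partial set
        buf = ""
        want = 1
    }
    if (want > n) {
        # complete set: write it to its own file and start over
        out = FILENAME "_" result_file++
        print buf > out
        close(out)
        buf = ""
        want = 1
    }
}' sample.txt
```

With this sample only `sample.txt_1` should be created, because the second set is broken by the unrelated line. Using `>` plus `close()` instead of `>>` also means that a rerun overwrites earlier results rather than appending to them.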
You might test the following code, where I split your keywords into an awk array named keys with N elements. Everything starts with a line matching keys[1], at which point we set a flag. While the flag is set, each of the next N-1 lines is checked against the corresponding element of keys (indexes 2 to N). Any mismatch before the last of those lines resets the flag. If the last line also matches, the whole set is written out (we also reset flag=0 there, so a consecutive run of flag==1 never exceeds N-1 lines):

    $ cat t24.awk
    BEGIN {
        FS = OFS = "\t";
        keywords = "polyketide synthase,methyltransferase,oxidoreductase";
        N = split(keywords, keys, ",")
    }

    # flag==1 means we are regex-matching the next N-1 lines
    # against the corresponding array elements keys[2..N]
    # once an unmatched line is found, turn off flag immediately
    # if flag==1 survives all N-1 lines, print the good match
    flag {
        if ($NF ~ keys[NR - start_line + 1]) {
            F = F ORS $0;
            # f is uninitialized at first, so the files are out_0, out_1, ...
            if (NR == start_line+N-1) { print F > ("out_" f++); flag = 0 }
            next
        } else {
            flag = 0;
        }
    }

    # set up the flag/start_line and reset F
    $NF ~ keys[1] { flag = 1; F = $0; start_line = NR; }

Run the above code with awk -f t24.awk file.txt. You can set up keywords (comma delimited) from your shell (instead of hard-coding them in the BEGIN block), and then use -v keywords="..." to make it more flexible.
answered yesterday – jxc
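The -v suggestion above could look roughly like this. It is a sketch, assuming the `keywords = ...` assignment is dropped from the `BEGIN` block so that the value given on the command line is not overwritten. The file names `seq.awk` and `sample.tsv` and the two-column sample data are made up for the example.

```shell
# seq.awk: hypothetical variant of the script that takes keywords from -v
cat > seq.awk <<'EOF'
BEGIN {
    FS = OFS = "\t"
    N = split(keywords, keys, ",")   # keywords is supplied with -v
}
flag {
    if ($NF ~ keys[NR - start_line + 1]) {
        F = F ORS $0
        if (NR == start_line + N - 1) { print F > ("out_" f++); flag = 0 }
        next
    }
    flag = 0
}
$NF ~ keys[1] { flag = 1; F = $0; start_line = NR }
EOF

# Made-up tab-separated sample: one complete set, then a broken one.
printf '%s\t%s\n' \
    id1 'polyketide synthase' \
    id2 'methyltransferase' \
    id3 'oxidoreductase' \
    id4 'polyketide synthase' \
    id5 'oxidoreductase' > sample.tsv

awk -v keywords='polyketide synthase,methyltransferase,oxidoreductase' \
    -f seq.awk sample.tsv
```

Here only `out_0` should appear, holding the first three lines. The second set is dropped because `id5` does not match the second keyword. Changing the keyword list then only needs a different `-v keywords=...` value, not an edit to the script.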