Pattern Syntax

Patterns are used in the extract action to specify which sequences pass to the output and which are filtered out. Capture groups in patterns are also used for barcode extraction. A pattern must always be specified after the --pattern option and must always be enclosed in double quotes. Examples:

minnn extract --pattern "ATTAGACA"
minnn extract --pattern "*\*" --input R1.fastq R2.fastq
minnn extract --pattern "^(UMI:N{3:5})attwwAAA\*" --input-format mif

Basic Syntax Elements

Many syntax elements in patterns are similar to regular expressions, but there are differences. Uppercase and lowercase letters specify the sequence that must be matched; uppercase letters don’t allow indels between them, while lowercase letters do. Indels on the left and right borders of uppercase letters are also not allowed. The score penalty for mismatches can also differ between uppercase and lowercase letters: the --mismatch-score parameter is used for lowercase mismatches and --uppercase-mismatch-score for uppercase mismatches. Standard IUPAC wildcards (N, W, S, M etc.) are allowed in both uppercase and lowercase sequences.
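To make the wildcard rule concrete, here is a minimal Python sketch of how IUPAC letters map to bases. It is only an illustration, not MiNNN's implementation: the function names are hypothetical, and it does exact letter-by-letter comparison, so the mismatch scoring and the indel handling described above are left out.

```python
# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def letter_matches(pattern_letter: str, base: str) -> bool:
    """Return True if a pattern letter (any case) covers the given base."""
    return base.upper() in IUPAC[pattern_letter.upper()]


def exact_match_at(pattern: str, target: str, pos: int) -> bool:
    """Compare pattern to target at pos, letter by letter.

    Assumption for this sketch: no mismatches and no indels are tolerated,
    so uppercase and lowercase pattern letters behave identically here.
    """
    window = target[pos:pos + len(pattern)]
    return len(window) == len(pattern) and all(
        letter_matches(p, b) for p, b in zip(pattern, window)
    )
```

For example, exact_match_at("attwwAAA", "GGATTATAAA", 2) is true because each w covers either A or T.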

The \ character is a very important syntax element: it is used as the read separator. For single-read input, the \ character must not be used. For multi-read input, \ must be used, and the number of reads in the pattern must equal the number of input FASTQ files (or the number of reads in the input MIF file if the --input-format MIF parameter is used). There can be many reads, but the most common case is 2 reads: R1 and R2. By default, the extract action checks input reads in the order they are specified in the --input argument or, if the input file is MIF, in the order they are saved in the MIF file. If the --try-reverse-order argument is specified, it also tries the combination with the last 2 reads swapped (for example, with 3 reads it tries both R1, R2, R3 and R1, R3, R2), and then chooses the match with the better score.
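The read orders tried with and without --try-reverse-order can be sketched in a few lines of Python. This only models the ordering rule stated above; the function name is hypothetical, and the scoring that picks the better match is not shown.

```python
def candidate_read_orders(n_reads: int, try_reverse_order: bool = False) -> list[list[int]]:
    """List the read orders that would be tried, as 1-based read numbers.

    The straight order is always included. With try_reverse_order, one extra
    order is added in which only the last two reads are swapped.
    """
    straight = list(range(1, n_reads + 1))
    orders = [straight]
    if try_reverse_order and n_reads >= 2:
        swapped = straight[:-2] + [straight[-1], straight[-2]]
        orders.append(swapped)
    return orders
```

With 3 reads and the flag set, this yields [1, 2, 3] and [1, 3, 2], matching the example in the text.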

Another important syntax element is the capture group. It looks like (group_name:query), where group_name is any sequence of letters and digits (like UMI or SB1) that you use as the group name. Group names are case sensitive, so UMI and umi are different group names. query is the part of the pattern that will be saved as this capture group. It can contain nested groups and some other syntax elements that are allowed inside a single read (see below).

R1, R2, R3 etc. are built-in group names that contain full matched reads. You can override them by specifying them manually in the query; the overridden values then go to the output instead of the full reads. For example, a query like this

minnn extract --input R1.fastq R2.fastq --pattern "^NNN(R1:(UMI:NNN)ATTAN{*})\^NNN(R2:NNNGACAN{*})"

can be used if you want to strip the first 3 characters and override the built-in R1 and R2 groups, so that output reads are written without the stripped characters. Note that R1, R2, R3 etc., like any ordinary groups, can contain nested groups and can be nested inside other groups.

Important: in matches that come from swapped reads (when the --try-reverse-order argument is specified), if you don’t override the built-in group names, R1 in the input becomes R2 in the output and vice versa (or, for example, R2 and R3 are swapped in the case of 3 reads). If you use the override, R1, R2, R3 etc. in the output come from the place where they matched. If you export the output MIF file from the extract action to FASTQ and want to determine whether a match came from straight or swapped reads, check the comments for the ||~ character sequence: it is added to matches that came from swapped reads. See the mif2fastq section for detailed information.
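As a rough illustration of the check described above, the following Python sketch counts exported FASTQ records whose header line contains the ||~ sequence. It assumes plain 4-line FASTQ records and makes no assumption about where in the comment the marker sits; the helper names are hypothetical, and the mif2fastq section remains the reference for the actual comment layout.

```python
SWAP_MARK = "||~"  # marker added to matches that came from swapped reads


def is_swapped(fastq_header: str) -> bool:
    """Return True if a FASTQ header line carries the swapped-reads marker."""
    return SWAP_MARK in fastq_header


def count_swapped(fastq_lines) -> int:
    """Count records whose header line contains the marker.

    Assumption: each record is exactly 4 lines, so headers are lines 0, 4, 8...
    Only headers are inspected, because '|' and '~' are also valid quality
    characters and could appear by chance in the quality line.
    """
    return sum(
        1 for i, line in enumerate(fastq_lines)
        if i % 4 == 0 and is_swapped(line)
    )
```

A typical use would be count_swapped(open("out_R1.fastq")), where the file name is only a placeholder.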

The * character can be used instead of read contents if any contents must match. It can be enclosed in one or more capture groups, but it can’t be used if there are other query elements in the same read. If there are other query elements, use N{*} instead. For example, the following queries are valid:

minnn extract --input R1.fastq R2.fastq --try-reverse-order --pattern "(G1:ATTA)\(G2:(G3:*))"
minnn extract --input R1.fastq R2.fastq R3.fastq --pattern "*\*\*"
minnn extract --input R1.fastq R2.fastq --pattern "(G1:ATTAN{*})\(G2:*)"

and this is invalid:

minnn extract --input R1.fastq R2.fastq --pattern "(G1:ATTA*)\*"

Curly braces after a nucleotide specify the number of repeats of that nucleotide. Any nucleotide letter (uppercase or lowercase, basic or wildcard) can be followed by curly braces with a quantity specifier. The following constructions are allowed:

a{*} - any number of repeats, from 1 to the entire sequence

a{:} - same as the above

a{14} - fixed number of repeats

a{3:6} - specified interval of allowed repeats, interval borders are inclusive

a{:5} - interval from 1 to specified number, inclusive

a{4:} - interval from specified number (inclusive) to the entire sequence
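The six forms above can be summarized as a small Python sketch that turns a quantity specifier into a (min, max) pair, with None standing for "up to the entire sequence". This is an illustration of the listed rules only: parse_repeats is a hypothetical helper, not part of MiNNN, and the validation it performs (rejecting a lower bound above the upper bound) is an assumption.

```python
import re

# {*}  |  {low:high} with either side optional  |  {count}
_QUANT = re.compile(r"^\{(?:(\*)|(\d*):(\d*)|(\d+))\}$")


def parse_repeats(spec: str):
    """Return (min_repeats, max_repeats); max_repeats is None when unbounded."""
    m = _QUANT.match(spec)
    if m is None:
        raise ValueError(f"not a quantity specifier: {spec!r}")
    star, low, high, fixed = m.groups()
    if star:
        return (1, None)                  # {*}: from 1 to the entire sequence
    if fixed is not None:
        count = int(fixed)
        return (count, count)             # {14}: fixed number of repeats
    lo = int(low) if low else 1           # missing lower border defaults to 1
    hi = int(high) if high else None      # missing upper border is unbounded
    if hi is not None and lo > hi:
        raise ValueError(f"empty interval in {spec!r}")
    return (lo, hi)
```

For instance, parse_repeats("{3:6}") gives (3, 6), and both parse_repeats("{*}") and parse_repeats("{:}") give (1, None).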

Special case: if the n or N nucleotide is used before curly braces, indels and pattern overlaps (see the --max-overlap parameter below) are disabled, so lowercase n and uppercase N are equivalent when used before curly braces.

The symbols ^ and $ restrict the matched sequence to the start or the end of the target sequence. The ^ mark must be at the start of the query for the read, and it means that the query match must start at the beginning of the read sequence. The $ mark must be at the end of the query, and it means that the query match must end at the end of the read. Examples:

minnn extract --pattern "^ATTA"
minnn extract --input R1.fastq R2.fastq --pattern "TCCNNWW$\^(G1:ATTAGACA)N{3:18}(G2:ssttggca)$"

Advanced Syntax Elements

There are operators &, + and || that can be used inside the read query.

The & operator is a logical AND: both sequences must match, in any order and with any gap between them. Examples:

minnn extract --pattern "ATTA & GACA"
minnn extract --input R1.fastq R2.fastq --pattern "AAAA & TTTT & CCCC \ *"
minnn extract --input R1.fastq R2.fastq --pattern "(G1:AAAA) & TTTT & CCCC \ ATTA & (G2:GACA)"

Note that the AAAA, TTTT and CCCC sequences can appear in any order in the target for the entire query to match. The & operator is not allowed within groups, so this example is invalid:

minnn extract --pattern "(G1:ATTA & GACA)"

The + operator is also a logical AND, but with an order restriction: nucleotide sequences can be matched only in the specified order. Unlike &, the + operator can be used within groups. Note that in this case the matched group also includes all nucleotides between the matched operands. Examples:

minnn extract --pattern "(G1:ATTA + GACA)"
minnn extract --input R1.fastq R2.fastq --pattern "(G1:AAAA + TTTT) + CCCC \ ATTA + (G2:GACA)"
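The difference between & and + can be sketched in Python under strong simplifying assumptions: exact substring matching only, no scoring, and no overlaps between operands (MiNNN itself allows overlaps with a penalty). All function names are hypothetical.

```python
def find_all(target: str, word: str) -> list[int]:
    """Start positions of every exact occurrence of word in target."""
    positions = []
    start = target.find(word)
    while start != -1:
        positions.append(start)
        start = target.find(word, start + 1)
    return positions


def matches_and(target: str, left: str, right: str) -> bool:
    """'&': both operands occur, in either order (non-overlapping in this sketch)."""
    return any(
        l + len(left) <= r or r + len(right) <= l
        for l in find_all(target, left)
        for r in find_all(target, right)
    )


def plus_group(target: str, left: str, right: str):
    """'+': left must come before right.

    Returns what a group wrapped around 'left + right' would span, that is
    both operands plus everything between them, or None if there is no match.
    """
    for l in find_all(target, left):
        for r in find_all(target, right):
            if l + len(left) <= r:
                return target[l:r + len(right)]
    return None
```

On the target CCGACATTTATTAGG, matches_and accepts ATTA & GACA in either operand order, while plus_group matches only GACA + ATTA and returns the span GACATTTATTA, which includes the TTT between the operands.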

The || operator is a logical OR. It is not allowed within groups, but groups with the same name are allowed inside different operands of the || operator. Note that if a group is present in one operand of the || operator and missing in another, this group may appear as not matched in the output even though the entire query is matched. Examples:

minnn extract --pattern "^AAANNN(G1:ATTA) || ^TTTNNN(G1:GACA)"
minnn extract --input R1.fastq R2.fastq --pattern "(G1:AAAA) || TTTT || (G1:CCCC) \ ATTA || (G2:GACA)"

The +, & and || operators can be combined in a single query. The + operator has the highest priority, then &, and || has the lowest. The read separator (\) has lower priority than all 3 of these operators. To change the priority, square brackets [] can be used. Examples:

minnn extract --pattern "^[AAA & TTT] + [GGG || CCC]$"
minnn extract --input R1.fastq R2.fastq --pattern "[(G1:ATTA+GACA)&TTT]+CCC\(G2:AT+AC)"
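The priority rules can be illustrated with a small precedence-climbing parser in Python that adds explicit square brackets to a read-level query. It is a sketch, not MiNNN's parser: it handles only plain nucleotide operands, the three operators and square brackets (no groups, ^ and $ marks, read separators, score thresholds or sequence patterns), and it assumes operators of equal priority group from left to right.

```python
import re

PRECEDENCE = {"||": 1, "&": 2, "+": 3}  # a higher number binds tighter
TOKEN = re.compile(r"\|\||[&+\[\]]|[A-Za-z]+")


def parse(tokens: list, min_prec: int = 1) -> str:
    """Precedence climbing; consumes tokens from the front of the list."""
    left = parse_atom(tokens)
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Only tighter-binding operators may extend the right operand.
        right = parse(tokens, PRECEDENCE[op] + 1)
        left = f"[{left} {op} {right}]"
    return left


def parse_atom(tokens: list) -> str:
    token = tokens.pop(0)
    if token == "[":
        inner = parse(tokens)
        tokens.pop(0)  # drop the closing "]"
        return inner
    return token


def show_grouping(query: str) -> str:
    """Return the query with every operator application bracketed explicitly."""
    tokens = TOKEN.findall(query)
    result = parse(tokens)
    if tokens:
        raise ValueError(f"unsupported syntax near {tokens[0]!r}")
    return result
```

For example, show_grouping("AAA || TTT & CCC + GGG") returns "[AAA || [TTT & [CCC + GGG]]]", which shows + binding first, then &, then ||.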

Square brackets can also be used to create sequences of patterns. A sequence is a special pattern that works like + but with a penalty for gaps between patterns. Examples of sequence patterns:

minnn extract --pattern "[AAA & TTT]CCC"
minnn extract --input R1.fastq R2.fastq --pattern "[(G1:ATTA+GACA)][(G2:TTT)&ATT]\*"

Square brackets also allow setting a separate score threshold for the query inside the brackets. This is done by writing the score threshold value followed by : after the opening bracket. Examples:

minnn extract --pattern "[-14:AAA & TTT]CCC"
minnn extract --input R1.fastq R2.fastq --pattern "[0:(G1:ATTA+GACA)][(G2:TTT)&ATT]\[-25:c{*}]"

Matched operands of &, + and sequence patterns can overlap, but overlaps add a penalty to the match score. You can control the maximum overlap size and the penalty per overlapping letter with the --max-overlap and --single-overlap-penalty parameters. A value of -1 for the --max-overlap parameter means no restriction on the maximum overlap size.

Important: parentheses that are used for groups are not treated as square brackets; instead, they are treated as group edges attached to nucleotide sequences. So the following examples are different: the first example creates a sequence pattern, and the second example adds the end of G1 and the start of G2 to the middle of the sequence TTTCCC.

minnn extract --pattern "[(G1:AAA+TTT)][(G2:CCC+GGG)]"
minnn extract --pattern "(G1:AAA+TTT)(G2:CCC+GGG)"

If some nucleotides on the edge of a nucleotide sequence can be cut without a gap penalty, the tail cut pattern can be used. It looks like repeated < characters at the beginning of the sequence, or repeated > characters at the end of the sequence, or a single < or > character followed by curly braces with the number of repeats. It is often used with the ^ and $ marks. Examples:

minnn extract --input R1.fastq R2.fastq --pattern "^<<<ATTAGACA>>$\[^<TTTT || ^<<CCCC]"
minnn extract --input R1.fastq R2.fastq --pattern "<{6}ACTCACTCGC + GGCTCGC>{2}$\<<AATCC>"

Important: the < and > marks belong to nucleotide sequences and not to complex patterns, so square brackets between < / > and nucleotide sequences are not allowed. Also, the following examples are different: in the first example the edge cut is applied only to the first operand, and in the second example it is applied to both operands.

minnn extract --pattern "<{3}ATTA & GACA"
minnn extract --pattern "<{3}ATTA & <{3}GACA"

High Level Logical Operators

There are operators ~, && and || that can be used with full multi-read queries. Note that the high-level || operator has the same symbol as the read-level OR operator, so square brackets must be used to get the high-level ||.

The || operator is a high-level OR. Groups with the same name are allowed in different operands of this operator. If a group is present in one operand of the || operator and missing in another, this group may appear as not matched in the output even though the entire query is matched. Examples:

minnn extract --pattern "[AA\*\TT] || [*\GG\CG]" --input R1.fastq R2.fastq R3.fastq
minnn extract --pattern "[^(G1:AA) + [ATTA || GACA]$ \ *] || [AT(G1:N{:8})\(G2:AATGC)]" --input R1.fastq R2.fastq

The && operator is a high-level AND. For the AND operator it is not necessary to enclose multi-read queries in square brackets because there is no ambiguity. Groups with the same name are not allowed in different operands of the && operator. Examples:

minnn extract --pattern "AA\*\TT && *\GG\CG" --input R1.fastq R2.fastq R3.fastq
minnn extract --pattern "^(G1:AA) + [ATTA || GACA]$ \ * && AT(G2:N{:8})\(G3:AATGC)" --input R1.fastq R2.fastq

~ is the high-level NOT operator with a single operand. It can sometimes be useful with single-read queries to filter out wrong data. Groups are not allowed in the operand of the ~ operator. Examples:

minnn extract --pattern "~ATTAGACA"
minnn extract --pattern "~[TT \ GC]" --input R1.fastq R2.fastq

Important: the ~ operator always belongs to a multi-read query that includes all input reads, so this example is invalid:

minnn extract --pattern "[~ATTAGACA] \ TTC" --input R1.fastq R2.fastq

Instead, this query can be used:

minnn extract --pattern "~[ATTAGACA \ *] && * \ TTC" --input R1.fastq R2.fastq

Note that if the --try-reverse-order argument is specified, reads are swapped synchronously for all multi-read queries that appear as operands in the entire query, so this query will never match:

minnn extract --pattern "~[ATTA \ *] && ATTA \ *" --input R1.fastq R2.fastq

Square brackets are not required for the ~ operator, but they are recommended for clarity if the input contains more than 1 read. The ~ operator has lower priority than \; && has lower priority than ~; and high-level || has lower priority than &&. Remember, however, that high-level || requires its operands, or the multi-read blocks inside its operands, to be enclosed in square brackets to avoid ambiguity with the read-level OR operator.

Square brackets with score thresholds can be used with high-level queries too:

minnn extract --pattern "~[0: ATTA \ GACA && * \ TTT] || [-18: CCC \ GGG]" --input R1.fastq R2.fastq