Routines

Barcode extraction

Barcode extraction is performed with the extract action. A typical case is a pair of FASTQ files with R1 and R2 reads that contain barcodes. The main task is to write a pattern query for the extract action; barcodes are then extracted from the sequences according to this pattern. Patterns are similar to regular expressions, but with some features specific to nucleotide sequences. The pattern syntax is described in detail in the Pattern Syntax section. Below are example patterns for some simple cases. In these examples we extract barcodes from the data-R1.fastq and data-R2.fastq files and write the results to the barcodes-R1.fastq and barcodes-R2.fastq files. The extract action writes its output in MIF format, so we use the mif2fastq action to convert it to FASTQ format. The extracted barcodes appear in the read description lines of the output FASTQ files.

Example 1. The barcode is the first 8 nucleotides of R1:

minnn extract --pattern "^(barcode:N{8})\*" --input data-R1.fastq data-R2.fastq --output extracted.mif
minnn mif2fastq --input extracted.mif --group R1=barcodes-R1.fastq --group R2=barcodes-R2.fastq

Example 2. There are 2 barcodes: the first starts with ATT, ends with AAA and has length 9; the second starts with GCC, ends with TTT and has length 12. Swapping of R1 and R2 is not allowed, so the first barcode is always in R1 and the second in R2:

minnn extract --pattern "(B1:ATTNNNAAA)\(B2:GCCN{6}TTT)" --input data-R1.fastq data-R2.fastq --output extracted.mif
minnn mif2fastq --input extracted.mif --group R1=barcodes-R1.fastq --group R2=barcodes-R2.fastq
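
Since patterns are similar to regular expressions, the B1 part of Example 2 can be compared with an ordinary regex search. The sketch below does not use MiNNN: it runs grep on a made-up toy FASTQ file and assumes that N can be approximated by the character class [ACGT]. It finds exact matches only and does not model MiNNN scoring, mismatches or named groups.

```shell
# Toy R1 file with two made-up reads (4 lines per FASTQ record).
toy=$(mktemp)
printf '@r1\nCCATTGCAAAAGG\n+\nIIIIIIIIIIIII\n@r2\nCCCCCCCCCCCCC\n+\nIIIIIIIIIIIII\n' > "$toy"

# Keep only sequence lines (line 2 of each record), then print every exact
# match of ATTNNNAAA, with N approximated as [ACGT] (an assumption).
b1=$(awk 'NR % 4 == 2' "$toy" | grep -oE 'ATT[ACGT]{3}AAA')
echo "$b1"
```

Only the first toy read contains the motif, so a single match is printed; the second read would be treated as not matching.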

Example 3. A good sequence starts with ATTAGACA, of which the first 5 nucleotides may be cut off; sequences that start with anything else are skipped. The first barcode, of length 5, immediately follows ATTAGACA; then come GCC and any 5 nucleotides, and then the second barcode, which starts with TTT and has length 12. A good sequence must also end with TTAGC, of which the last 2 nucleotides may be cut off. R1 and R2 may be in reverse order in some reads. We also want to allow substitutions and indels inside sequences, subject to score penalties:

minnn extract --pattern "^<{5}attagaca(B1:n{5})gccn{5}(B2:tttn{9})+ttagc>>$\*" --try-reverse-order --score-threshold -25 --input data-R1.fastq data-R2.fastq --output extracted.mif
minnn mif2fastq --input extracted.mif --group R1=barcodes-R1.fastq --group R2=barcodes-R2.fastq

Demultiplexing

Demultiplexing means splitting one dataset into multiple datasets by barcode values. It is performed with the demultiplex action. This action works with MIF files, so to demultiplex data from FASTQ files you first need to extract barcodes and convert the data to MIF format; see the Barcode extraction section. Output MIF files can be converted to FASTQ with the mif2fastq action. There are 2 common demultiplexing tasks: splitting a file by barcode values, and extracting samples with specified combinations of barcode values.

Example 1. Split data by unique UMI values. The UMI is the first 6 nucleotides of the input reads, and we want to correct the barcodes (see the Correcting UMI sequence section) before demultiplexing:

minnn extract --pattern "^(UMI:N{6})\*" --input data-R1.fastq data-R2.fastq --output extracted.mif
minnn sort --groups UMI --input extracted.mif --output sorted.mif
minnn correct --groups UMI --input sorted.mif --output corrected.mif
minnn demultiplex --by-barcode UMI corrected.mif --demultiplex-log demultiplex.log

Note that splitting data by unique UMI values can produce a very large number of output files!
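
Before demultiplexing by UMI, a rough upper bound on the number of output files can be estimated by counting the distinct 6-nucleotide prefixes in the raw R1 file. The sketch below does not use MiNNN; count_prefixes is a hypothetical helper, the toy reads are made up, and the input is assumed to be an uncompressed FASTQ file with 4 lines per record. Because it counts values before correction, the real number of files after the correct action is expected to be lower.

```shell
# Count distinct sequence prefixes of a given length in a FASTQ file.
# Usage: count_prefixes FILE LENGTH
count_prefixes() {
    awk -v len="$2" 'NR % 4 == 2 { seen[substr($0, 1, len)] = 1 }
                     END { n = 0; for (p in seen) n++; print n }' "$1"
}

# Toy input with three made-up reads; use data-R1.fastq for real data.
toy=$(mktemp)
printf '@r1\nAATTTTCC\n+\nIIIIIIII\n@r2\nAATTTTGG\n+\nIIIIIIII\n@r3\nCCCCCCAA\n+\nIIIIIIII\n' > "$toy"

umi_count=$(count_prefixes "$toy" 6)
echo "distinct 6-nt prefixes: $umi_count"
```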

Example 2. The input data is the same as in the previous example, but we extract only the data with the following UMI values: AATTTT, AAAGGG, CCCCCC, AGACAT, TTTTTA, TTTTTG. For this task we create the following sample file umi_samples.txt:

Sample UMI
value_AATTTT AATTTT
value_AAAGGG AAAGGG
value_CCCCCC CCCCCC
value_AGACAT AGACAT
value_TTTTTA TTTTTA
value_TTTTTG TTTTTG

And then issue the following commands:

minnn extract --pattern "^(UMI:N{6})\*" --input data-R1.fastq data-R2.fastq --output extracted.mif
minnn sort --groups UMI --input extracted.mif --output sorted.mif
minnn correct --groups UMI --input sorted.mif --output corrected.mif
minnn demultiplex --by-sample umi_samples.txt corrected.mif --demultiplex-log demultiplex.log
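
For longer lists of UMI values, a sample file like umi_samples.txt can be generated instead of typed by hand. The sketch below reproduces the file from this example; it assumes single-space column separators, as shown in the listing above, and the value_ prefix is simply the naming used in this example.

```shell
# Write the header, then one "value_<UMI> <UMI>" row per UMI value.
sample_file=umi_samples.txt
{
    printf 'Sample UMI\n'
    for umi in AATTTT AAAGGG CCCCCC AGACAT TTTTTA TTTTTG; do
        printf 'value_%s %s\n' "$umi" "$umi"
    done
} > "$sample_file"

cat "$sample_file"
```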

Example 3. We extracted sequence barcodes with the extract action into the extracted.mif file and named these barcodes SB1 and SB2. Now we want to put sequences with specified combinations of SB1 and SB2 into separate MIF files. Here we use a sample file samples.txt with multiple barcodes:

Sample SB1 SB2
sample1 ATTAGACA CCCCCC
sample2 ATTAGACA GGGGGG
sample3 ATTACCCC TTTTTT

And then issue the following command:

minnn demultiplex --by-sample samples.txt extracted.mif --demultiplex-log demultiplex.log
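
With several barcode columns, a row with a missing value is easy to overlook. The sketch below is a simple sanity check written for this page, not a MiNNN feature: check_samples is a hypothetical helper that assumes whitespace-separated columns and reports every row whose column count differs from the header.

```shell
# Recreate samples.txt from the example above.
cat > samples.txt <<'EOF'
Sample SB1 SB2
sample1 ATTAGACA CCCCCC
sample2 ATTAGACA GGGGGG
sample3 ATTACCCC TTTTTT
EOF

# Print rows whose field count differs from the header row.
# The exit status is 1 if any such row was found, 0 otherwise.
check_samples() {
    awk 'NR == 1 { want = NF; next }
         NF != want { printf "line %d: expected %d columns, got %d\n", NR, want, NF; bad = 1 }
         END { exit bad + 0 }' "$1"
}

if check_samples samples.txt; then
    echo "samples.txt: OK"
fi
```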

Correcting UMI sequence

UMI sequences in input data often contain substitutions and indels. We want to correct such errors so that sequences can be clustered by UMI without creating extra clusters for erroneous variants. Barcode correction is performed with the correct action, after barcode extraction (see the Barcode extraction section). Important: the file must be sorted with the sort action before the correct action is used, and the --groups argument of the sort action must contain the same groups, in the same order, as in the correct action. In common cases you can use the default settings for the sort and correct actions and specify only the input and output files and the list of barcode names in the --groups argument:

minnn sort --groups UMI --input extracted.mif --output sorted.mif
minnn correct --groups UMI --input sorted.mif --output corrected.mif

You can convert the output MIF file to FASTQ with the mif2fastq action, or view statistics for barcode values and positions with the stat-groups and stat-positions actions. To specify custom settings for barcode correction, see the description of available options on the correct action page.

Example. We want to extract and correct the UMI in a pair of FASTQ files that contain R1 and R2. We know that the UMI is the first 6 nucleotides of the read and that it starts with ATT. We then use the following commands:

minnn extract --pattern "^(UMI:ATTNNN)\*" --input R1.fastq R2.fastq --output extracted.mif
minnn sort --groups UMI --input extracted.mif --output sorted-UMI.mif
minnn correct --groups UMI --input sorted-UMI.mif --output corrected-UMI.mif
minnn mif2fastq --input corrected-UMI.mif --group R1=corrected-UMI-R1.fastq --group R2=corrected-UMI-R2.fastq

Consensus assembly

Consensus assembly consists of 6 stages:

  1. Extract barcodes from raw sequences.
  2. Sort sequences by barcode values to group them for further correction.
  3. Correct mismatches and indels in barcodes.
  4. Sort sequences by barcode values to group them for further consensus assembly.
  5. Assemble consensuses for each barcode. There can be one or many consensuses for each barcode, depending on how the original data was obtained.
  6. Export calculated consensuses to FASTQ format.

Example. We have 2 FASTQ files with R1 and R2. We want to assemble consensuses by a UMI that consists of the 8 nucleotides following the first 3 nucleotides, TTT. We also know that there must be only 1 consensus for each UMI. We then use the following commands:

minnn extract --pattern "^TTT(UMI:N{8})\*" --input R1.fastq R2.fastq --output extracted.mif
minnn sort --groups UMI --input extracted.mif --output sorted-1.mif
minnn correct --groups UMI --input sorted-1.mif --output corrected.mif
minnn sort --groups UMI --input corrected.mif --output sorted-2.mif
minnn consensus --groups UMI --max-consensuses-per-cluster 1 --input sorted-2.mif --output consensus.mif
minnn mif2fastq --input consensus.mif --group R1=consensus-R1.fastq --group R2=consensus-R2.fastq
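
The six stages above can be chained in a small wrapper script so that a failed stage stops the run. The sketch below reuses the commands from this example unchanged; run, assemble_consensus and DRY_RUN are hypothetical names introduced for illustration and are not part of MiNNN. With DRY_RUN set to 1 the commands are only printed, which lets you review the plan before starting a long run. The printed lines do not preserve shell quoting.

```shell
# Print a command; execute it only when DRY_RUN is not set to 1.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Stages 1-6 from the list above; "&&" stops the chain at the first failure.
assemble_consensus() {
    run minnn extract --pattern "^TTT(UMI:N{8})\*" --input R1.fastq R2.fastq --output extracted.mif &&
    run minnn sort --groups UMI --input extracted.mif --output sorted-1.mif &&
    run minnn correct --groups UMI --input sorted-1.mif --output corrected.mif &&
    run minnn sort --groups UMI --input corrected.mif --output sorted-2.mif &&
    run minnn consensus --groups UMI --max-consensuses-per-cluster 1 --input sorted-2.mif --output consensus.mif &&
    run minnn mif2fastq --input consensus.mif --group R1=consensus-R1.fastq --group R2=consensus-R2.fastq
}

# Preview the commands without running MiNNN.
plan=$(DRY_RUN=1; assemble_consensus)
printf '%s\n' "$plan"
```

To execute the pipeline for real, call assemble_consensus without setting DRY_RUN.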

To configure the settings for consensus assembly, see the description of available options on the consensus action page.