Tool/Settings Name: IgDiscover with Database 3
Tool Name: IgDiscover
Tool Version: 0.9
Starting Database:

IMGT database, but with two variants of IGHV3-7*02: one extended by two bases (GA), and one with a 3’ end that reads GCGAGGGA.

(See Notes regarding other databases that were used but whose results are not reported.)

Settings

# Type of sequences: Choose 'Ig' or 'TCR'.
#
sequence_type: Ig

# Barcoding settings

# If you have a random barcode sequence (unique molecular identifier) at the 5’ end,
# set this to its length. Leave at 0 when you have no 5’ barcode.
#
barcode_length_5prime: 0

# Same as above, but for the 3’ end of the sequence. Leave at 0 when you have no 3’ barcode.
# Currently, you cannot have a barcode at both ends, so at least one of the two settings
# must be zero.
#
barcode_length_3prime: 0

# When barcoding is enabled, sequences that have identical barcode and CDR3 are
# collapsed into a single consensus sequence.
# If you set this to false, no collapsing and consensus taking is done and
# only the barcode is removed from each sequence.
#
barcode_consensus: true

# When grouping by barcode and CDR3, the CDR3 location is either detected with a
# regular expression or a 'pseudo' CDR3 sequence is used, which is at a
# pre-defined position within the sequence.
#
# Set this configuration option to a region like [-80, -60] to use a pseudo
# CDR3 located at bases 80 to 60 counted from the 3’ end. (Use negative numbers to
# count from the 3’ end, positive ones to count from the 5’ end. The most 5’
# base has index 0.)
#
# Set this to 'detect' (with quotation marks) in order to use CDR3s
# detected by regular expression. This assumes that the input contains
# VH sequences!
#
# Set this to false (no quotation marks) in order to only group by barcode, not by CDR3.
#
cdr3_location: 'detect'  # Works only with VH sequences!
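
The pseudo-CDR3 region and the barcode/CDR3 grouping described above can be illustrated with a minimal Python sketch. It is not IgDiscover's implementation: the function names, the barcode length of 4 in the usage example, and the choice to take the pseudo CDR3 after stripping the barcode are all assumptions, and the region is interpreted with Python slice semantics (start inclusive, stop exclusive).

```python
from collections import defaultdict


def pseudo_cdr3(sequence, region=(-80, -60)):
    """Return the part of `sequence` selected by `region`.

    Assumption: the region behaves like a Python slice, so negative
    numbers count from the 3' end and index 0 is the most 5' base.
    """
    start, stop = region
    return sequence[start:stop]


def group_by_barcode_and_cdr3(sequences, barcode_length, region=(-80, -60)):
    """Group reads sharing the same 5' barcode and pseudo CDR3 (hypothetical helper)."""
    groups = defaultdict(list)
    for seq in sequences:
        barcode = seq[:barcode_length]
        rest = seq[barcode_length:]  # read with the barcode removed
        groups[(barcode, pseudo_cdr3(rest, region))].append(rest)
    return dict(groups)


# Toy reads: 100 x A, then 20 x C (the pseudo CDR3), then 60 x G.
body = "A" * 100 + "C" * 20 + "G" * 60
reads = ["AAAA" + body, "AAAA" + body, "TTTT" + body]
groups = group_by_barcode_and_cdr3(reads, barcode_length=4)
```

In this toy input the two reads with barcode AAAA land in the same group, which is the set that would then be collapsed into one consensus sequence when barcode_consensus is true.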

# When you use a RACE protocol, the sequences have a run of G nucleotides at the beginning,
# which needs to be removed when barcodes are used. If you use RACE, set this to true.
# The G nucleotides are assumed to be at the 5’ end (but after the barcode if it exists).
#
race_g: false

# Primer-related settings

# If set to true, it is assumed that the forward primer is always at the 5’ end
# of the first read and that the reverse primer is always at the 5’ end of the
# second read. If it can also be the other way, set this to false.
# This setting has no effect if no primer sequences are defined below.
#
stranded: false

# List of 5’ primers
#
forward_primers:
  - AGCTACAGAGTCCCAGGTCCA
  - ACAGGYGCCCACTCYSAG
  - TTGCTMTTTTAARAGGTGTCCAGTGTG
  - CTCCCAGATGGGTCCTGTC
  - ACCGTCCYGGGTCTTGTC
  - CTGTTCTCCAAGGGWGTCTSTG
  - CATGGGGTGTCCTGTCACA

# List of 3’ primers
#
reverse_primers:
  - GCAGGCCTTTTTGGCCNNNNNGCCGATGGGCCCTTGGTGGAGGCTGA  # IgG
  - GCAGGCCTTTTTGGCCNNNNNGGGGCATTCTCACAGGAGACGAGGGGGAAAAG  # IgM

# Work only on this number of reads (for quick test runs). Set to false to
# process all reads.
#
#limit: false

# Filter out merged reads that are shorter than this length.
#
minimum_merged_read_length: 300

# Read merging program. Choose either 'pear' or 'flash'.
# pear merges more reads, but is slower.
#
#merge_program: pear

# Maximum overlap (-M) for the flash read merger.
# If you use pear, this is ignored.
#
flash_maximum_overlap: 300

# Do not mention the original FASTA or FASTQ sequence names in the
# assigned.tab files, but instead use names of the form _seq<NUMBER>,
# where <NUMBER> is a running number starting at 1.
#
# true: yes, rename
# false: no, do not rename
#
rename: true

# Whether debugging is enabled or not. Currently, if this is set to true,
# some large intermediate files that would otherwise be deleted will be
# kept.
#
debug: false

# The "seed value" is an arbitrary number used to get reproducible
# runs. Two runs that use the same software version, the same seed
# and otherwise the same configuration will give identical results.
#
# Set this to false in order to use a different seed each run.
# The results will then not be exactly reproducible.
#
seed: 1

# The preprocessing filter is always applied directly after running IgBLAST,
# even if no gene discovery is requested.
#
preprocessing_filter:
  v_coverage: 90   # Match must cover V gene by at least this percentage
  j_coverage: 60   # Match must cover J gene by at least this percentage
  v_evalue: 0.001  # Highest allowed V gene match E-value
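
As a rough illustration of how the three preprocessing thresholds combine, the sketch below applies them to a single assigned read represented as a dictionary. The column names V_covered, J_covered and V_evalue, as well as the function name, are assumptions made for this example and are not taken from the settings above; all three conditions are assumed to be required at once.

```python
# Thresholds copied from the preprocessing_filter settings.
PREPROCESSING_FILTER = {"v_coverage": 90, "j_coverage": 60, "v_evalue": 0.001}


def passes_preprocessing(row, settings=PREPROCESSING_FILTER):
    """Return True if a read satisfies all three thresholds.

    `row` is a dict with the assumed keys V_covered and J_covered
    (percentages) and V_evalue (E-value of the V gene match).
    """
    return (
        row["V_covered"] >= settings["v_coverage"]
        and row["J_covered"] >= settings["j_coverage"]
        and row["V_evalue"] <= settings["v_evalue"]
    )


good = {"V_covered": 98.0, "J_covered": 75.0, "V_evalue": 1e-50}
short_v = {"V_covered": 89.9, "J_covered": 75.0, "V_evalue": 1e-50}
weak_hit = {"V_covered": 98.0, "J_covered": 75.0, "V_evalue": 0.01}
```

Here only the first read would be kept: the second covers too little of the V gene, and the third has a V match E-value above the allowed maximum.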

# Candidate discovery settings

# When discovering new V genes, ignore whether a J gene has been assigned
# and also ignore its %SHM.
#
# true: yes, ignore the J
# false: do not ignore J assignment, do not ignore its %SHM
#
ignore_j: false

# When clustering sequences to discover new genes, subsample to this number of
# sequences. Higher is slower.
#
subsample: 1000
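
The interaction between the subsample and seed settings can be pictured with a short Python sketch. This is only an illustration under assumptions: the function name is hypothetical, and drawing the sample with Python's random module is a choice made here, not something the settings state about the tool.

```python
import random


def subsample(sequences, size=1000, seed=1):
    """Return at most `size` sequences, chosen reproducibly for a given seed."""
    if len(sequences) <= size:
        return list(sequences)  # nothing to subsample
    rng = random.Random(seed)   # private generator, so the global state is untouched
    return rng.sample(sequences, size)


data = list(range(5000))
first = subsample(data)
second = subsample(data)
```

With a fixed seed both calls return the same 1000 items, which mirrors the statement that two runs with the same seed and configuration give identical results.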

# When computing the Ds_exact column, consider only D hits that
# cover the reference D gene sequence by at least this percentage.
#
d_coverage: 70

# V candidate filtering (germline filtering) settings

# Filtering criteria applied to candidate sequences in all iterations except the last.
#
pre_germline_filter:
  unique_cdr3s: 2            # Minimum number of unique CDR3s (within exact matches)
  unique_js: 2               # Minimum number of unique J genes (within exact matches)
  whitelist: true            # Add database sequences to the whitelist
  cluster_size: 0            # Minimum number of sequences assigned to cluster
  differences: 0             # Merge sequences if they have at most this number of differences
  allow_stop: true           # Whether to allow non-productive sequences containing stop codons
  cross_mapping_ratio: 0.02  # Threshold for removal of cross-mapping artifacts (set to 0 to disable)
  clonotype_ratio: 0.12      # Required minimum ratio of clonotype counts between alleles of the same gene
  exact_ratio: 0.12          # Required minimum ratio of "exact" counts between alleles of the same gene
  unique_d_ratio: 0.3        # Minimum Ds_exact ratio between alleles
  unique_d_threshold: 10     # Check Ds_exact ratio only if highest-expressed allele has at least this Ds_exact count

# Filtering criteria applied to candidate sequences in the last iteration.
# These should be more strict than the pre_germline_filter criteria.
#
germline_filter:
  unique_cdr3s: 5            # Minimum number of unique CDR3s (within exact matches)
  unique_js: 3               # Minimum number of unique J genes (within exact matches)
  whitelist: true            # Add database sequences to the whitelist
  cluster_size: 100          # Minimum number of sequences assigned to cluster
  differences: 0             # Merge sequences if they have at most this number of differences
  allow_stop: false          # Whether to allow non-productive sequences containing stop codons
  cross_mapping_ratio: 0.02  # Threshold for removal of cross-mapping artifacts (set to 0 to disable)
  clonotype_ratio: 0.12      # Required minimum ratio of clonotype counts between alleles of the same gene
  exact_ratio: 0.12          # Required minimum ratio of "exact" counts between alleles of the same gene
  unique_d_ratio: 0.3        # Minimum Ds_exact ratio between alleles
  unique_d_threshold: 10     # Check Ds_exact ratio only if highest-expressed allele has at least this Ds_exact count

# J discovery settings

j_discovery:
  allele_ratio: 0.2         # Required minimum ratio between alleles of a single gene
  cross_mapping_ratio: 0.1  # Threshold for removal of cross-mapping artifacts.
  propagate: true           # Use J genes discovered in iteration 1 in subsequent ones
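
One plausible reading of an allele ratio threshold such as allele_ratio: 0.2 is sketched below in Python. It assumes that each allele's count is compared with the count of the most abundant allele of the same gene, and that the gene name is the part of the allele name before the asterisk. The function name, the counts and that interpretation are illustrative assumptions, so the tool's actual computation may differ.

```python
def filter_alleles(counts, allele_ratio=0.2):
    """Keep alleles whose count is at least `allele_ratio` times the count
    of the most abundant allele of the same gene (assumed interpretation).

    `counts` maps allele names such as 'IGHJ4*02' to read counts.
    """
    best = {}
    for name, count in counts.items():
        gene = name.split("*")[0]
        best[gene] = max(best.get(gene, 0), count)
    return {
        name: count
        for name, count in counts.items()
        if count >= allele_ratio * best[name.split("*")[0]]
    }


# Made-up counts for illustration only.
counts = {"IGHJ4*02": 1000, "IGHJ4*01": 150, "IGHJ6*02": 500, "IGHJ6*03": 120}
kept = filter_alleles(counts)
```

With these made-up counts, IGHJ4*01 is dropped (150 is below 0.2 x 1000), while IGHJ6*03 is kept (120 is above 0.2 x 500).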