OGRDB


Tool/Settings Name	IgDiscover with Database 3
Tool Name	IgDiscover
Tool Version	0.9
Starting Database	IMGT database, but with IGHV3-7*02 extended by two bases (GA) and with an additional IGHV3-7*02 variant whose 3’ end reads GCGAGGGA (i.e., two variants of IGHV3-7*02). See Notes regarding other databases that were used but whose results are not reported.
Settings	# Type of sequences: Choose 'Ig' or 'TCR'.
#
sequence_type: Ig

## Barcoding settings

# If you have a random barcode sequence (unique molecular identifier) at the 5' end,
# set this to its length. Leave at 0 when you have no 5' barcode.
#
barcode_length_5prime: 0

# Same as above, but for the 3' end of the sequence. Leave at 0 when you have no 3' barcode.
# Currently, you cannot have a barcode at both ends, so at least one of the two
# settings must be zero.
#
barcode_length_3prime: 0

# When barcoding is enabled, sequences that have identical barcode and CDR3 are
# collapsed into a single consensus sequence.
# If you set this to false, no collapsing and consensus taking is done and
# only the barcode is removed from each sequence.
#
barcode_consensus: true

# When grouping by barcode and CDR3, the CDR3 location is either detected with a
# regular expression or a 'pseudo' CDR3 sequence is used, which is at a
# pre-defined position within the sequence.
#
# Set this configuration option to a region like [-80, -60] to use a pseudo CDR3
# located at bases 80 to 60 counted from the 3' end. (Use negative numbers to count
# from the 3' end, positive ones to count from the 5' end. The most 5' base has index 0.)
#
# Set this to 'detect' (with quotation marks) in order to use CDR3s detected by
# regular expression. This assumes that the input contains VH sequences!
#
# Set this to false (no quotation marks) in order to only group by barcode, not by CDR3.
#
cdr3_location: 'detect'  # Works only with VH sequences!

# When you use a RACE protocol, the sequences have a run of G nucleotides at the
# beginning, which needs to be removed when barcodes are used.
# If you use RACE, set this to true.
# The G nucleotides are assumed to be at the 5' end (but after the barcode if it exists).
#
race_g: false

## Primer-related settings

# If set to true, it is assumed that the forward primer is always at the 5' end
# of the first read and that the reverse primer is always at the 5' end of the
# second read. If it can also be the other way, set this to false.
# This setting has no effect if no primer sequences are defined below.
#
stranded: false

# List of 5' primers
#
forward_primers:
- AGCTACAGAGTCCCAGGTCCA
- ACAGGYGCCCACTCYSAG
- TTGCTMTTTTAARAGGTGTCCAGTGTG
- CTCCCAGATGGGTCCTGTC
- ACCGTCCYGGGTCTTGTC
- CTGTTCTCCAAGGGWGTCTSTG
- CATGGGGTGTCCTGTCACA

# List of 3' primers
#
reverse_primers:
- GCAGGCCTTTTTGGCCNNNNNGCCGATGGGCCCTTGGTGGAGGCTGA  # IgG
- GCAGGCCTTTTTGGCCNNNNNGGGGCATTCTCACAGGAGACGAGGGGGAAAAG  # IgM

# Work only on this number of reads (for quick test runs). Set to false to
# process all reads.
#
#limit: false

# Filter out merged reads that are shorter than this length.
#
minimum_merged_read_length: 300

# Read merging program. Choose either 'pear' or 'flash'.
# pear merges more reads, but is slower.
#
#merge_program: pear

# Maximum overlap (-M) for the flash read merger.
# If you use pear, this is ignored.
#
flash_maximum_overlap: 300

# Do not mention the original FASTA or FASTQ sequence names in the
# assigned.tab files, but instead use names of the form _seq<number>,
# where <number> is a running number starting at 1.
# true: yes, rename
# false: no, do not rename
#
rename: true

# Whether debugging is enabled or not. Currently, if this is set to true,
# some large intermediate files that would otherwise be deleted will be kept.
#
debug: false

# The "seed value" is an arbitrary number used to get reproducible runs.
# Two runs that use the same software version, the same seed
# and otherwise the same configuration will give identical results.
#
# Set this to false in order to use a different seed each run.
# The results will then not be exactly reproducible.
#
seed: 1

# The preprocessing filter is always applied directly after running IgBLAST,
# even if no gene discovery is requested.
#
preprocessing_filter:
  v_coverage: 90   # Match must cover V gene by at least this percentage
  j_coverage: 60   # Match must cover J gene by at least this percentage
  v_evalue: 0.001  # Highest allowed V gene match E-value

## Candidate discovery settings

# When discovering new V genes, ignore whether a J gene has been assigned
# and also ignore its %SHM.
# true: yes, ignore the J
# false: do not ignore J assignment, do not ignore its %SHM
#
ignore_j: false

# When clustering sequences to discover new genes, subsample to this number of
# sequences. Higher is slower.
#
subsample: 1000

# When computing the Ds_exact column, consider only D hits that
# cover the reference D gene sequence by at least this percentage.
#
d_coverage: 70

## V candidate filtering (germline filtering) settings

# Filtering criteria applied to candidate sequences in all iterations except the last.
#
pre_germline_filter:
  unique_cdr3s: 2            # Minimum number of unique CDR3s (within exact matches)
  unique_js: 2               # Minimum number of unique J genes (within exact matches)
  whitelist: true            # Add database sequences to the whitelist
  cluster_size: 0            # Minimum number of sequences assigned to cluster
  differences: 0             # Merge sequences if they have at most this number of differences
  allow_stop: true           # Whether to allow non-productive sequences containing stop codons
  cross_mapping_ratio: 0.02  # Threshold for removal of cross-mapping artifacts (set to 0 to disable)
  clonotype_ratio: 0.12      # Required minimum ratio of clonotype counts between alleles of the same gene
  exact_ratio: 0.12          # Required minimum ratio of "exact" counts between alleles of the same gene
  unique_d_ratio: 0.3        # Minimum Ds_exact ratio between alleles
  unique_d_threshold: 10     # Check Ds_exact ratio only if highest-expressed allele has at least this Ds_exact count

# Filtering criteria applied to candidate sequences in the last iteration.
# These should be stricter than the pre_germline_filter criteria.
#
germline_filter:
  unique_cdr3s: 5            # Minimum number of unique CDR3s (within exact matches)
  unique_js: 3               # Minimum number of unique J genes (within exact matches)
  whitelist: true            # Add database sequences to the whitelist
  cluster_size: 100          # Minimum number of sequences assigned to cluster
  differences: 0             # Merge sequences if they have at most this number of differences
  allow_stop: false          # Whether to allow non-productive sequences containing stop codons
  cross_mapping_ratio: 0.02  # Threshold for removal of cross-mapping artifacts (set to 0 to disable)
  clonotype_ratio: 0.12      # Required minimum ratio of clonotype counts between alleles of the same gene
  exact_ratio: 0.12          # Required minimum ratio of "exact" counts between alleles of the same gene
  unique_d_ratio: 0.3        # Minimum Ds_exact ratio between alleles
  unique_d_threshold: 10     # Check Ds_exact ratio only if highest-expressed allele has at least this Ds_exact count

## J discovery settings

j_discovery:
  allele_ratio: 0.2         # Required minimum ratio between alleles of a single gene
  cross_mapping_ratio: 0.1  # Threshold for removal of cross-mapping artifacts
  propagate: true           # Use J genes discovered in iteration 1 in subsequent ones
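
To make three of the settings above concrete, the following Python sketch shows how the preprocessing_filter thresholds, the pseudo CDR3 region and the barcode-length constraint could be interpreted. It is an illustration only, not IgDiscover's own code. The function names, the record keys ("v_coverage", "j_coverage", "v_evalue") and the reading of a region such as [-80, -60] as a half-open Python slice are assumptions made for this example.

```python
# Illustrative sketch only; not taken from IgDiscover's source code.
# Threshold values are copied from the preprocessing_filter settings above.
PREPROCESSING_FILTER = {
    "v_coverage": 90,   # match must cover V gene by at least this percentage
    "j_coverage": 60,   # match must cover J gene by at least this percentage
    "v_evalue": 0.001,  # highest allowed V gene match E-value
}


def passes_preprocessing_filter(row, settings=PREPROCESSING_FILTER):
    """Return True if an assignment record meets all three thresholds.

    `row` is a hypothetical dict holding the V and J coverage (in percent)
    and the V match E-value; the key names are assumptions.
    """
    return (
        row["v_coverage"] >= settings["v_coverage"]
        and row["j_coverage"] >= settings["j_coverage"]
        and row["v_evalue"] <= settings["v_evalue"]
    )


def pseudo_cdr3(sequence, region=(-80, -60)):
    """Extract a pseudo CDR3 from a fixed region of the sequence.

    Assumes the region behaves like a half-open Python slice, so
    (-80, -60) yields the 20 bases lying 80 to 60 bases from the 3' end.
    """
    start, end = region
    return sequence[start:end]


def check_barcode_lengths(length_5prime, length_3prime):
    """Raise ValueError if both barcode lengths are non-zero.

    Mirrors the stated rule that at least one of the two settings must be 0.
    """
    if length_5prime != 0 and length_3prime != 0:
        raise ValueError("at least one of the two barcode lengths must be zero")
```

With the values listed above, a record with 95% V coverage, 60% J coverage and a V E-value of 1e-50 would pass the filter, while one with 89.9% V coverage would be rejected.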