| Title: | Manage FASTA Reference Databases for Taxonomic Assignment |
|---|---|
| Description: | Download, format, summarize, and modify FASTA reference databases used for taxonomic assignment in metabarcoding pipelines. Supports major databases (UNITE, SILVA, PR2, BOLD, MaarjAM, Eukaryome) and converts between taxonomy header formats (dada2, SINTAX). Part of the 'pqverse' ecosystem. |
| Authors: | Adrien Taudière [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-1088-1182>) |
| Maintainer: | Adrien Taudière <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.0.9000 |
| Built: | 2026-05-15 09:40:26 UTC |
| Source: | https://github.com/adrientaudiere/dbpq |
Counts the lines of a file that match a pattern (with the default pattern, the sequences of a FASTA file). Accepts gzip files. May not work on Windows.
count_pattern_db(file, pattern = ">")
file |
(Character, required) Path to a file (plain or gzip), often a FASTA file. |
pattern |
(Character, default |
An integer, the number of matching lines.
Adrien Taudière
# count_pattern_db("my_database.fasta", "Fungi")
Counts the number of sequences in a FASTA file by counting header lines
(lines starting with >). Accepts gzip files.
count_seq_db(file)
file |
(Character, required) Path to a FASTA file (plain or gzip). |
An integer, the number of sequences.
Adrien Taudière
# count_seq_db("my_database.fasta")
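As a rough illustration of what these two counting functions report, the base R sketch below counts matching lines in a plain or gzip-compressed file. It is not the package's implementation: `count_lines_sketch()` and the temporary toy file are hypothetical names used only for this example.

```r
# Hypothetical helper (not part of dbpq): count lines matching a pattern.
count_lines_sketch <- function(file, pattern = "^>") {
  con <- gzfile(file, open = "rt") # gzfile() also reads uncompressed files
  on.exit(close(con))
  sum(grepl(pattern, readLines(con)))
}

# Toy FASTA written to a temporary gzip file
tmp <- tempfile(fileext = ".fasta.gz")
out <- gzfile(tmp, open = "wt")
writeLines(
  c(">seq1;k__Fungi", "ACGT", ">seq2;k__Fungi", "GGCC", ">seq3;k__Metazoa", "TTAA"),
  out
)
close(out)

n_seq <- count_lines_sketch(tmp)                      # header lines only
n_fungi <- count_lines_sketch(tmp, pattern = "Fungi") # lines mentioning Fungi
```

Counting header lines gives the number of sequences regardless of how many lines each sequence spans.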
Removes primer pairs and their flanking regions from a FASTA reference database using cutadapt. Uses cutadapt linked adapters so that only the region between the forward and reverse primers is kept.
cutadapt_rm_primers_db(ref_fasta, output = NULL, primer_fw = NULL, primer_rev = NULL, discard_untrimmed = TRUE, nproc = 1, verbose = TRUE, cmd_is_run = TRUE, return_file_path = FALSE, start_with_fw = FALSE, output_json = FALSE, error_tolerance = 0.1, args_before_cutadapt = paste0("source ~/miniforge3/etc/profile.d/conda.sh ", "&& conda activate cutadaptenv && "))
ref_fasta |
(Character, required) Path to a FASTA file (plain or gzip). |
output |
(Character) Path to the output FASTA file. If NULL, defaults
to |
primer_fw |
(Character, required) The forward primer DNA sequence. |
primer_rev |
(Character, required) The reverse primer DNA sequence. |
discard_untrimmed |
(Logical, default |
nproc |
(Integer, default |
verbose |
(Logical, default |
cmd_is_run |
(Logical, default |
return_file_path |
(Logical, default |
start_with_fw |
(Logical, default |
output_json |
(Logical, default |
error_tolerance |
(Numeric, default |
args_before_cutadapt |
(Character) Shell commands to run before cutadapt (e.g., conda activation). |
This function is mainly a wrapper around the work of others. Please cite cutadapt (doi:10.14806/ej.17.1.200).
The cutadapt command string, or the output file path if
return_file_path = TRUE.
Adrien Taudière
## Not run: cutadapt_rm_primers_db("database.fasta.gz", output = "db_cutadapted.fasta", primer_fw = "GCATCGATGAAGAACGCAGC", primer_rev = "TCCTCCGCTTATTGATATGC") ## End(Not run)
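To make the idea of a linked adapter more concrete, the sketch below assembles a cutadapt command string in base R. It is only an assumption about what such a command can look like, not the command this function actually builds: `rev_comp_sketch()` and `build_cutadapt_cmd_sketch()` are hypothetical helpers, and reverse-complementing the reverse primer is a choice made for this example. The flags used (`-a`, `-e`, `-j`, `--discard-untrimmed`, `-o`) are standard cutadapt options.

```r
# Hypothetical helper (not part of dbpq): reverse complement of a DNA string.
rev_comp_sketch <- function(seq) {
  # Complement IUPAC bases, then reverse the character order
  comp <- chartr("ACGTRYKMBVDH", "TGCAYRMKVBHD", toupper(seq))
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Hypothetical helper (not part of dbpq): assemble a cutadapt command string.
build_cutadapt_cmd_sketch <- function(ref_fasta, output, primer_fw, primer_rev,
                                      error_tolerance = 0.1, nproc = 1,
                                      discard_untrimmed = TRUE) {
  # Linked adapter syntax: 5' adapter, three dots, 3' adapter
  linked <- paste0(primer_fw, "...", rev_comp_sketch(primer_rev))
  args <- c(
    "cutadapt",
    "-a", shQuote(linked),
    "-e", error_tolerance,
    "-j", nproc,
    if (discard_untrimmed) "--discard-untrimmed", # NULL is dropped by c()
    "-o", shQuote(output),
    shQuote(ref_fasta)
  )
  paste(args, collapse = " ")
}

cmd <- build_cutadapt_cmd_sketch(
  "database.fasta.gz", "db_cutadapted.fasta",
  primer_fw = "GCATCGATGAAGAACGCAGC",
  primer_rev = "TCCTCCGCTTATTGATATGC"
)
```

The string is only built here, never executed, which loosely mirrors inspecting the command before running it.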
Downloads reference sequences from BOLD Systems (Barcode of Life Data).
download_bold_db(dest_dir = ".", marker = "COI-5P", verbose = TRUE)
dest_dir |
(Character, default |
marker |
(Character, default |
verbose |
(Logical, default |
The path to the downloaded file (invisibly).
Adrien Taudière
## Not run: download_bold_db() ## End(Not run)
Downloads the Eukaryome database.
download_eukaryome_db(dest_dir = ".", verbose = TRUE)
dest_dir |
(Character, default |
verbose |
(Logical, default |
The path to the downloaded file (invisibly).
Adrien Taudière
## Not run: download_eukaryome_db() ## End(Not run)
Downloads the MaarjAM database for arbuscular mycorrhizal fungi (AMF).
download_marjaam_db(dest_dir = ".", verbose = TRUE)
dest_dir |
(Character, default |
verbose |
(Logical, default |
The path to the downloaded file (invisibly).
Adrien Taudière
## Not run: download_marjaam_db() ## End(Not run)
Downloads the PR2 protist ribosomal reference database.
download_pr2_db(dest_dir = ".", version = NULL, verbose = TRUE)
dest_dir |
(Character, default |
version |
(Character) PR2 version number. |
verbose |
(Logical, default |
The path to the downloaded file (invisibly).
Adrien Taudière
## Not run: download_pr2_db() ## End(Not run)
Downloads the SILVA ribosomal RNA database (16S/18S).
download_silva_db(dest_dir = ".", version = NULL, target = c("SSU", "LSU"), verbose = TRUE)
dest_dir |
(Character, default |
version |
(Character) SILVA version number (e.g., |
target |
(Character, default |
verbose |
(Logical, default |
The path to the downloaded file (invisibly).
Adrien Taudière
## Not run: download_silva_db() ## End(Not run)
Downloads the latest UNITE fungal ITS database for taxonomic assignment.
download_unite_db(dest_dir = ".", type = c("dynamic", "static"), taxon_group = c("fungi", "eukaryotes"), verbose = TRUE)
dest_dir |
(Character, default |
type |
(Character, default |
taxon_group |
(Character, default |
verbose |
(Logical, default |
The path to the downloaded file (invisibly).
Adrien Taudière
## Not run: download_unite_db() ## End(Not run)
Filters sequences from a FASTA database whose header lines match a given pattern. Accepts gzip files. May not work on Windows.
filter_db(ref_fasta, pattern, output = NULL, force_two_lines_per_seq = TRUE, keep_temporary_files = FALSE)
ref_fasta |
(Character, required) Path to a FASTA file (plain or gzip). |
pattern |
(Character, required) A pattern to search for in sequence headers. |
output |
(Character, required) Path to the output FASTA file (must not be gzipped). |
force_two_lines_per_seq |
(Logical, default |
keep_temporary_files |
(Logical, default |
The path to the output file (invisibly).
Adrien Taudière
# filter_db("database.fasta.gz", "Rhizophydiales", "output.fasta")
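The header-based filtering described above can be sketched in base R on an in-memory character vector. This is an illustrative assumption rather than the package's method, and `filter_fasta_sketch()` is a hypothetical helper; it keeps every line of a record whose header matches, so it also handles sequences wrapped over several lines.

```r
# Hypothetical helper (not part of dbpq): keep records whose header matches.
filter_fasta_sketch <- function(lines, pattern) {
  is_header <- grepl("^>", lines)
  record_id <- cumsum(is_header) # record index of every line
  keep_ids <- record_id[is_header & grepl(pattern, lines)]
  lines[record_id %in% keep_ids]
}

fasta <- c(
  ">s1;k__Fungi;o__Rhizophydiales", "ACGTAC", "GGTT",
  ">s2;k__Fungi;o__Agaricales", "TTGACA",
  ">s3;k__Fungi;o__Rhizophydiales", "CCGGAA"
)

# Keeps s1 (two sequence lines) and s3, drops s2
kept <- filter_fasta_sketch(fasta, "Rhizophydiales")
```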
Converts taxonomy headers to the format expected by
dada2::assignTaxonomy(): Kingdom;Phylum;Class;Order;Family;Genus;.
format2dada2(fasta_db = NULL, taxnames = NULL, output_path = NULL, from_sintax = TRUE, pattern_to_remove = NULL, ...)
fasta_db |
(Character) Path to a FASTA file. Mutually exclusive
with |
taxnames |
(Character vector) Taxonomy header strings. Mutually
exclusive with |
output_path |
(Character) If provided and |
from_sintax |
(Logical, default |
pattern_to_remove |
(Character) Optional regex pattern to remove from the reformatted names. |
... |
Additional arguments passed to |
If taxnames is used, a character vector. If fasta_db is used,
a DNAStringSet with reformatted names. When output_path is provided,
the FASTA file is written and the DNAStringSet is returned invisibly.
Adrien Taudière
See also: format2sintax(), format2dada2_species()
format2dada2(taxnames = "AB123;tax=k:Fungi,p:Ascomycota,c:Sordariomycetes", from_sintax = TRUE)
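The sketch below shows one way a SINTAX header could be turned into a semicolon-separated lineage of the kind dada2::assignTaxonomy() expects. It is an assumption for illustration, not the conversion this function performs: `sintax_to_dada2_sketch()` is a hypothetical helper, and it does not pad missing ranks, which the real function may handle differently.

```r
# Hypothetical helper (not part of dbpq): SINTAX header -> "Kingdom;Phylum;...;"
sintax_to_dada2_sketch <- function(taxnames) {
  tax <- sub("^.*tax=", "", taxnames) # drop accession and "tax=" prefix
  tax <- sub(";$", "", tax)           # drop a trailing semicolon if present
  vapply(strsplit(tax, ",", fixed = TRUE), function(ranks) {
    values <- sub("^[a-z]:", "", ranks) # strip rank codes such as "k:" or "p:"
    paste0(paste(values, collapse = ";"), ";")
  }, character(1))
}

lineage <- sintax_to_dada2_sketch(
  "AB123;tax=k:Fungi,p:Ascomycota,c:Sordariomycetes"
)
```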
Converts taxonomy headers to the format expected by
dada2::addSpecies(): AccessionID Genus Species.
format2dada2_species(fasta_db = NULL, taxnames = NULL, from_sintax = FALSE, output_path = NULL, ...)
fasta_db |
(Character) Path to a FASTA file. Mutually exclusive
with |
taxnames |
(Character vector) Taxonomy header strings. Mutually
exclusive with |
from_sintax |
(Logical, default |
output_path |
(Character) If provided and |
... |
Additional arguments passed to internal functions. |
If taxnames is used, a character vector. If fasta_db is used,
a DNAStringSet with reformatted names.
Adrien Taudière
See also: format2dada2(), format2sintax()
format2dada2_species(taxnames = "AB123;k__Fungi;g__Aspergillus;s__fumigatus", from_sintax = FALSE)
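As a minimal sketch of the `AccessionID Genus Species` layout, the base R code below pulls the accession, genus, and species out of a `k__...;g__...;s__...` header. `species_header_sketch()` is a hypothetical helper that assumes both `g__` and `s__` fields are present. The function documented here may treat edge cases differently.

```r
# Hypothetical helper (not part of dbpq): build "AccessionID Genus Species".
# Assumes every header contains both a g__ and an s__ field.
species_header_sketch <- function(taxnames) {
  accession <- sub(";.*$", "", taxnames)                  # text before first ";"
  genus <- sub("^.*g__([^;]*).*$", "\\1", taxnames)       # value after "g__"
  species <- sub("^.*s__([^;]*).*$", "\\1", taxnames)     # value after "s__"
  paste(accession, genus, species)
}

species_name <- species_header_sketch(
  "AB123;k__Fungi;g__Aspergillus;s__fumigatus"
)
```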
Converts taxonomy headers from the common k__Kingdom;p__Phylum;...
format to the VSEARCH SINTAX format (tax=k:Kingdom,p:Phylum,...).
format2sintax(fasta_db = NULL, taxnames = NULL, pattern_tax = "k__", pattern_sintax = "tax=k:", output_path = NULL)
fasta_db |
(Character) Path to a FASTA file. Mutually exclusive
with |
taxnames |
(Character vector) Taxonomy header strings. Mutually
exclusive with |
pattern_tax |
(Character, default |
pattern_sintax |
(Character, default |
output_path |
(Character) If provided and |
If taxnames is used, a character vector of reformatted names.
If fasta_db is used, a DNAStringSet with reformatted names.
Adrien Taudière
See also: format2dada2(), format2dada2_species()
format2sintax(taxnames = "AB123;k__Fungi;p__Ascomycota;c__Sordariomycetes")
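The header rewrite described above can be approximated with a few string operations in base R. The sketch is an assumption about the transformation, not the package's code: `to_sintax_sketch()` is a hypothetical helper, and it assumes each header contains `pattern_tax` and uses single-letter rank codes.

```r
# Hypothetical helper (not part of dbpq): "k__A;p__B" -> "tax=k:A,p:B".
# Assumes pattern_tax occurs in every header.
to_sintax_sketch <- function(taxnames, pattern_tax = "k__",
                             pattern_sintax = "tax=k:") {
  pos <- regexpr(pattern_tax, taxnames, fixed = TRUE)
  prefix <- substr(taxnames, 1, pos - 1) # e.g. "AB123;"
  tax <- substr(taxnames, pos + nchar(pattern_tax), nchar(taxnames))
  tax <- gsub(";([a-z])__", ",\\1:", tax) # ";p__" becomes ",p:"
  paste0(prefix, pattern_sintax, tax)
}

sintax_name <- to_sintax_sketch(
  "AB123;k__Fungi;p__Ascomycota;c__Sordariomycetes"
)
```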
Gets the extension(s) of a file path.
get_file_extension(file_path)
file_path |
(Character, required) Path to a file. |
A character vector of file extensions.
get_file_extension("my_database.fasta")
get_file_extension("my_database.fasta.gz")
Extracts and counts occurrences of a given taxonomic rank from FASTA
sequence headers. Requires taxonomy encoded in headers following
the convention k__Kingdom;p__Phylum;... or similar.
list_ranks_db(file, rank_prefix = "k__")
file |
(Character, required) Path to a FASTA file (plain or gzip). |
rank_prefix |
(Character, default |
A named integer vector of counts, sorted in decreasing order. Names are the taxonomic rank values.
Adrien Taudière
# list_ranks_db("my_database.fasta", rank_prefix = "p__")
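To illustrate the kind of result described above, the sketch below tabulates one rank from a vector of headers in base R. It is a simplified assumption of the approach, working on headers already in memory: `list_ranks_sketch()` is a hypothetical helper, and headers lacking the requested prefix are simply ignored.

```r
# Hypothetical helper (not part of dbpq): count values of one taxonomic rank.
list_ranks_sketch <- function(headers, rank_prefix = "k__") {
  rx <- paste0(rank_prefix, "[^;]*")
  # First match per header; headers without a match are dropped
  hits <- regmatches(headers, regexpr(rx, headers))
  values <- sub(rank_prefix, "", hits, fixed = TRUE)
  counts <- table(values)
  # Convert to a named integer vector sorted by decreasing count
  sort(setNames(as.integer(counts), names(counts)), decreasing = TRUE)
}

headers <- c(
  ">s1;k__Fungi;p__Ascomycota",
  ">s2;k__Fungi;p__Basidiomycota",
  ">s3;k__Fungi;p__Ascomycota",
  ">s4;k__Metazoa;p__Arthropoda"
)
phyla <- list_ranks_sketch(headers, rank_prefix = "p__")
```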
Provides an overview of a FASTA reference database: number of sequences, sequence length distribution, and taxonomic coverage at each rank.
summarize_db(file, rank_prefixes = c("k__", "p__", "c__", "o__", "f__", "g__", "s__"))
file |
(Character, required) Path to a FASTA file (plain or gzip). |
rank_prefixes |
(Character vector) Taxonomic rank prefixes to summarize. Defaults to kingdom through species. |
A list with components:
n_sequences: total number of sequences
length_summary: summary statistics of sequence lengths
ranks: a named list giving the number of unique values at each rank
Adrien Taudière
# summarize_db("my_database.fasta")
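The three components of the returned list can be mimicked with a short base R sketch operating on FASTA lines held in memory. This is an assumption for illustration only: `summarize_fasta_sketch()` is a hypothetical helper, and the real function's exact contents of `length_summary` and `ranks` may differ.

```r
# Hypothetical helper (not part of dbpq): summarize FASTA lines in memory.
summarize_fasta_sketch <- function(lines, rank_prefixes = c("k__", "p__")) {
  is_header <- grepl("^>", lines)
  record_id <- cumsum(is_header)
  # Total length per record, summing wrapped sequence lines
  seq_lengths <- tapply(nchar(lines[!is_header]), record_id[!is_header], sum)
  headers <- lines[is_header]
  ranks <- lapply(setNames(rank_prefixes, rank_prefixes), function(prefix) {
    hits <- regmatches(headers, regexpr(paste0(prefix, "[^;]*"), headers))
    length(unique(hits)) # number of distinct values at this rank
  })
  list(
    n_sequences = sum(is_header),
    length_summary = summary(as.vector(seq_lengths)),
    ranks = ranks
  )
}

fasta <- c(
  ">s1;k__Fungi;p__Ascomycota", "ACGTAC", "GGTT",
  ">s2;k__Fungi;p__Basidiomycota", "TTGACA",
  ">s3;k__Fungi;p__Ascomycota", "CCGGAA"
)
db_summary <- summarize_fasta_sketch(fasta)
```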