R/count_kmers.R
count_kmers.Rd
This is an in-memory, probabilistic, highly optimized, multi-threaded implementation of a k-mer counting algorithm.
The function supports:
- several types of k-mers (contiguous, gapped, and positional variants),
- all biological sequences (in particular, nucleic acids and proteins),
- two common in-memory representations of sequences, i.e., string vectors and lists of string vectors.
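To make the three k-mer variants concrete, the base R sketch below lists the k-mers of a single sequence using the column-name scheme visible in the examples on this page (for instance `A.C_0`, `2_C.A_0`, `A.A.C_1.2`). The helper `enumerate_kmers` is hypothetical and is not part of seqR; it is a naive enumeration for illustration and does not reflect the hashed, multi-threaded implementation.

```r
# Hypothetical helper, NOT part of seqR: naive enumeration of k-mers (k >= 2).
# `gaps` holds k - 1 gap lengths; the default 0 means a contiguous 2-mer.
enumerate_kmers <- function(sequence, gaps = 0L, positional = FALSE) {
  elements <- strsplit(sequence, "")[[1]]
  # Offsets of the k-mer elements relative to the start position;
  # a gap of length g skips g elements between two consecutive k-mer elements.
  offsets <- c(0, cumsum(gaps + 1))
  last_start <- length(elements) - max(offsets)
  if (last_start < 1) return(character(0))
  suffix <- paste(gaps, collapse = ".")
  vapply(seq_len(last_start), function(start) {
    name <- paste0(paste(elements[start + offsets], collapse = "."), "_", suffix)
    # Positional variant: prefix the name with the 1-based start position.
    if (positional) paste0(start, "_", name) else name
  }, character(1))
}

contiguous <- enumerate_kmers("ACAT")
positional <- enumerate_kmers("ACAT", positional = TRUE)
gapped     <- enumerate_kmers("ACATACTAT", gaps = c(1, 2))
```

Here `contiguous` holds the same three 2-mers that appear as columns for `"ACAT"` in the examples, and `gapped` holds the four gapped 3-mers of `"ACATACTAT"`.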
Moreover, several extra features are provided (for more information, see Details):
- configurable k-mer alphabet (i.e., which elements of a sequence should be considered during the k-mer counting procedure),
- verbose mode,
- configurable batch size (i.e., how many sequences are processed in a single step),
- configurable dimension of the hash value of a k-mer,
- possibility to compute k-mers with or without their frequencies,
- possibility to compute a result k-mer matrix with or without human-readable k-mer (column) names.
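The following base R sketch illustrates three of these options on toy data: restricting k-mers to an alphabet, counts versus presence, and splitting the input into batches. It uses no seqR functions. It assumes that a k-mer containing an element outside the alphabet is skipped (consistent with the `kmer_alphabet` example on this page) and that "without frequencies" means presence/absence; the latter is an assumption, not confirmed by this page.

```r
# Illustrative only: plain base R, not the seqR implementation.
elements <- c("A", "C", "A", "T", "A", "C")
alphabet <- c("A", "C")

# Contiguous 2-mers as (first, second) pairs of neighbouring elements.
first  <- head(elements, -1)
second <- tail(elements, -1)

# Assumption: a k-mer is kept only if all of its elements are in the alphabet.
keep  <- first %in% alphabet & second %in% alphabet
kmers <- paste0(first[keep], ".", second[keep], "_0")

counts   <- table(kmers)        # with frequencies
presence <- (counts > 0) + 0L   # assumed meaning of "without frequencies"

# Batching: process the input in chunks of `batch_size` sequences.
sequences  <- c("ACAT", "ACC", "ACATAC")
batch_size <- 2
batches <- split(sequences, ceiling(seq_along(sequences) / batch_size))
```

A smaller batch size trades speed for a lower peak memory footprint in this kind of scheme, since fewer sequences are held in flight at once.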
count_kmers(
  sequences,
  k = length(kmer_gaps) + 1,
  kmer_alphabet = getOption("seqR_kmer_alphabet_default"),
  positional = getOption("seqR_positional_default"),
  kmer_gaps = c(),
  with_kmer_counts = getOption("seqR_with_kmer_counts_default"),
  with_kmer_names = getOption("seqR_with_kmer_names_default"),
  batch_size = getOption("seqR_batch_size_default"),
  hash_dim = getOption("seqR_hash_dim_default"),
  verbose = getOption("seqR_verbose_default")
)
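Argument | Description |
---|---|
sequences | input sequences of one of two supported types, either a string vector or a list of string vectors |
k | an integer representing the length of a k-mer |
kmer_alphabet | a string vector of the sequence elements that should be considered during the k-mer counting procedure |
positional | a single logical value that determines whether the positional k-mer variant is counted |
kmer_gaps | an integer vector representing the lengths of the gaps between consecutive k-mer elements |
with_kmer_counts | a single logical value that determines whether k-mer frequencies are computed |
with_kmer_names | a single logical value that determines whether the result has human-readable k-mer (column) names |
batch_size | a single integer value representing the number of sequences processed in a single step |
hash_dim | a single integer value representing the dimension of the hash value of a k-mer |
verbose | a single logical value that enables verbose mode |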
A Matrix value that represents the resulting k-mer matrix.
The result is a sparse matrix in order to reduce memory consumption.
The i-th row of the matrix represents the k-mers found in the i-th input sequence.
Each column represents a distinct k-mer.
The column names follow a human-readable schema for k-mers if with_kmer_names = TRUE.
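The shape of the result can be mimicked with the Matrix package. The sketch below builds a small sparse count matrix with one row per sequence and one column per distinct contiguous 2-mer. It is a naive illustration, not the seqR implementation; in particular, the column order here is order of first appearance, which need not match the order `count_kmers` returns.

```r
library(Matrix)

sequences <- c("ACAT", "ACC")

# Contiguous 2-mers of each sequence, named as in the examples on this page.
kmers_per_seq <- lapply(strsplit(sequences, ""), function(el) {
  paste0(head(el, -1), ".", tail(el, -1), "_0")
})
all_kmers <- unique(unlist(kmers_per_seq))

# One (row, column) pair per k-mer occurrence; sparseMatrix() sums the
# values of duplicated pairs, so repeated k-mers become counts.
m <- sparseMatrix(
  i = rep(seq_along(kmers_per_seq), lengths(kmers_per_seq)),
  j = match(unlist(kmers_per_seq), all_kmers),
  x = 1,
  dims = c(length(sequences), length(all_kmers)),
  dimnames = list(NULL, all_kmers)
)
```

Only the non-zero entries are stored, which is why a sparse representation saves memory when most k-mers are absent from most sequences.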
A comprehensive description of the supported features is available in vignette("features-overview", package = "seqR").
Function that counts many k-mer variants in a single invocation: count_multimers
Function that merges several k-mer matrices (rbind): rbind_columnwise
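When two k-mer matrices have different column sets, a plain rbind is not enough, because columns must first be aligned by k-mer name. The base R sketch below shows one way such a merge could work on dense matrices. The helper `merge_by_colnames` is hypothetical; that rbind_columnwise aligns columns by name in this manner is an assumption and is not stated on this page.

```r
# Hypothetical helper, NOT part of seqR: row-bind two matrices after
# aligning their columns by name; k-mers missing from a matrix count as 0.
merge_by_colnames <- function(m1, m2) {
  all_cols <- union(colnames(m1), colnames(m2))
  pad <- function(m) {
    out <- matrix(0, nrow = nrow(m), ncol = length(all_cols),
                  dimnames = list(NULL, all_cols))
    out[, colnames(m)] <- m
    out
  }
  rbind(pad(m1), pad(m2))
}

m1 <- matrix(c(1, 1), nrow = 1, dimnames = list(NULL, c("A.C_0", "C.A_0")))
m2 <- matrix(c(1, 1), nrow = 1, dimnames = list(NULL, c("A.C_0", "C.C_0")))
merged <- merge_by_colnames(m1, m2)
```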
batch_size <- 1

# Counting 1-mers of two DNA sequences
count_kmers(
  c("ACAT", "ACC"),
  batch_size=batch_size)
#> 2 x 3 sparse Matrix of class "dgCMatrix"
#>      C A T
#> [1,] 1 2 1
#> [2,] 2 1 .

# Counting 2-mers of two DNA sequences
count_kmers(
  c("ACAT", "ACC"),
  k=2,
  batch_size=batch_size)
#> 2 x 4 sparse Matrix of class "dgCMatrix"
#>      A.T_0 A.C_0 C.A_0 C.C_0
#> [1,]     1     1     1     .
#> [2,]     .     1     .     1

# Counting positional 2-mers of two DNA sequences
count_kmers(
  c("ACAT", "ACC"),
  k=2,
  positional=TRUE,
  batch_size=batch_size)
#> 2 x 4 sparse Matrix of class "dgCMatrix"
#>      3_A.T_0 1_A.C_0 2_C.A_0 2_C.C_0
#> [1,]       1       1       1       .
#> [2,]       .       1       .       1

# Counting positional 2-mers of two DNA sequences (second representation)
count_kmers(
  list(c("A", "C", "A", "T"), c("A", "C", "C")),
  k=2,
  positional=TRUE,
  batch_size=batch_size)
#> 2 x 4 sparse Matrix of class "dgCMatrix"
#>      1_A.C_0 2_C.A_0 3_A.T_0 2_C.C_0
#> [1,]       1       1       1       .
#> [2,]       1       .       .       1

# Counting 2-mers of two DNA sequences, considering only A and C elements
count_kmers(
  c("ACAT", "ACC"),
  k=2,
  kmer_alphabet=c("A", "C"),
  batch_size=batch_size)
#> 2 x 3 sparse Matrix of class "dgCMatrix"
#>      A.C_0 C.A_0 C.C_0
#> [1,]     1     1     .
#> [2,]     1     .     1

# Counting gapped 3-mers with lengths of gaps 1 and 2
count_kmers(
  c("ACATACTAT", "ACCCCCC"),
  kmer_gaps=c(1,2),
  batch_size=batch_size)
#> 2 x 6 sparse Matrix of class "dgCMatrix"
#>      C.T.T_1.2 T.C.T_1.2 A.A.A_1.2 A.A.C_1.2 A.C.C_1.2 C.C.C_1.2
#> [1,]         1         1         1         1         .         .
#> [2,]         .         .         .         .         1         1