Count k-mers of one particular type for a given collection of sequences

This is an in-memory, probabilistic, highly optimized, and multi-threaded implementation of a k-mer counting algorithm.

The function supports:

several types of k-mers (contiguous, gapped, and positional variants)
all biological sequences (in particular, nucleic acids and proteins)
two common in-memory representations of sequences, i.e., string vectors and lists of string vectors

Moreover, several extra features are provided (for more information, see Details):

configurable k-mer alphabet (i.e., which elements of a sequence should be considered during the k-mer counting procedure)
verbose mode
configurable batch size (i.e., how many sequences are processed in a single step)
configurable dimension of the hash value of a k-mer
possibility to compute k-mers with or without their frequencies
possibility to compute a result k-mer matrix with or without human-readable k-mer (column) names

count_kmers(
  sequences,
  k = length(kmer_gaps) + 1,
  kmer_alphabet = getOption("seqR_kmer_alphabet_default"),
  positional = getOption("seqR_positional_default"),
  kmer_gaps = c(),
  with_kmer_counts = getOption("seqR_with_kmer_counts_default"),
  with_kmer_names = getOption("seqR_with_kmer_names_default"),
  batch_size = getOption("seqR_batch_size_default"),
  hash_dim = getOption("seqR_hash_dim_default"),
  verbose = getOption("seqR_verbose_default")
)

Arguments

sequences	input sequences of one of two supported types, either `string vector` or `list` of `string vectors`
k	an `integer` representing the length of a k-mer
kmer_alphabet	a `string vector` representing the elements that should be used during the construction of k-mers. By default, all elements that are present in sequences are taken into account
positional	a single `logical` value that determines whether the positional k-mer variant should be considered
kmer_gaps	an `integer vector` representing the lengths of gaps between consecutive k-mer elements. The length of the vector should be equal to `k - 1`
with_kmer_counts	a single `logical` value that determines whether the result should contain k-mer frequencies
with_kmer_names	a single `logical` value that determines whether the result should contain human-readable k-mer names
batch_size	a single `integer` value that represents the number of sequences processed in a single step
hash_dim	a single `integer` value (`1 <= hash_dim <= 500`) representing the length of a hash vector that is internally used in the algorithm
verbose	a single `logical` value that denotes whether a user wants to get extra information on the current state of computations
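
To make the interplay of `k`, `kmer_gaps`, and `positional` concrete, here is a minimal base-R sketch that enumerates the k-mers of a single sequence. `naive_kmers` is a hypothetical helper, not part of seqR, and it does not reflect how seqR is implemented. The naming schema it produces is inferred from the outputs in the Examples section, and contiguous 2-mers are assumed to correspond to a gap vector of `0`.

```r
# Hypothetical helper (not part of seqR): list the k-mers of one sequence.
# The name format (elements joined by ".", then "_" and the gap lengths,
# optionally prefixed by the start position) is inferred from the Examples.
naive_kmers <- function(sequence, kmer_gaps = integer(0), positional = FALSE) {
  elements <- strsplit(sequence, "")[[1]]
  # Offsets of the k-mer elements relative to the start position,
  # e.g. gaps c(1, 2) give offsets 0, 2, 5.
  offsets <- c(0, cumsum(kmer_gaps + 1))
  n_starts <- length(elements) - max(offsets)
  if (n_starts < 1) return(character(0))
  vapply(seq_len(n_starts), function(start) {
    name <- paste(elements[start + offsets], collapse = ".")
    if (length(kmer_gaps) > 0) {
      name <- paste0(name, "_", paste(kmer_gaps, collapse = "."))
    }
    if (positional) name <- paste0(start, "_", name)
    name
  }, character(1))
}

# Gapped 3-mers with gaps of length 1 and 2
naive_kmers("ACATACTAT", kmer_gaps = c(1, 2))

# Positional contiguous 2-mers (a single gap of length 0)
naive_kmers("ACAT", kmer_gaps = 0, positional = TRUE)
```

The gap vector has `k - 1` entries because each entry describes the distance between two consecutive elements of the k-mer.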

Value

A `Matrix` value that represents the resulting k-mer matrix. The result is a sparse matrix in order to reduce memory consumption. The i-th row of the matrix represents the k-mers found in the i-th input sequence. Each column represents a distinct k-mer. If `with_kmer_names = TRUE`, the column names conform to a human-readable schema for k-mers.
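
The shape of the result can be illustrated with a naive sketch that assembles a sparse count matrix for contiguous k-mers using `Matrix::sparseMatrix`. `naive_count_matrix` is a hypothetical helper, not part of seqR. It is exact and single-threaded rather than probabilistic, and it uses plain substrings as column names instead of seqR's naming schema, so column names and column order will differ from what `count_kmers` returns.

```r
library(Matrix)

# Hypothetical helper (not part of seqR): naive contiguous k-mer counting.
naive_count_matrix <- function(sequences, k) {
  # All contiguous k-mers of each sequence, as plain substrings.
  kmers_per_seq <- lapply(sequences, function(s) {
    n_starts <- nchar(s) - k + 1
    if (n_starts < 1) return(character(0))
    substring(s, seq_len(n_starts), seq_len(n_starts) + k - 1)
  })
  all_kmers <- unlist(kmers_per_seq)
  kmer_names <- unique(all_kmers)
  # One (row, column) pair per k-mer occurrence.
  rows <- rep(seq_along(sequences), lengths(kmers_per_seq))
  cols <- match(all_kmers, kmer_names)
  # sparseMatrix() sums the values of repeated (row, column) pairs,
  # which turns one entry per occurrence into a count.
  sparseMatrix(i = rows, j = cols, x = rep(1, length(rows)),
               dims = c(length(sequences), length(kmer_names)),
               dimnames = list(NULL, kmer_names))
}

m <- naive_count_matrix(c("ACAT", "ACC"), k = 2)
m
```

Only the nonzero counts are stored, which is why a sparse representation keeps memory use low when most k-mers are absent from most sequences.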

Details

A comprehensive description of the supported features is available in vignette("features-overview", package = "seqR").

Examples


batch_size <- 1

# Counting 1-mers of two DNA sequences
count_kmers(
   c("ACAT", "ACC"),
   batch_size=batch_size)
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> 2 x 3 sparse Matrix of class "dgCMatrix"
#>      C A T
#> [1,] 1 2 1
#> [2,] 2 1 .

# Counting 2-mers of two DNA sequences
count_kmers(
    c("ACAT", "ACC"),
    k=2,
    batch_size=batch_size)
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> 2 x 4 sparse Matrix of class "dgCMatrix"
#>      A.T_0 A.C_0 C.A_0 C.C_0
#> [1,]     1     1     1     .
#> [2,]     .     1     .     1

# Counting positional 2-mers of two DNA sequences
count_kmers(
    c("ACAT", "ACC"),
    k=2,
    positional=TRUE,
    batch_size=batch_size)
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> 2 x 4 sparse Matrix of class "dgCMatrix"
#>      3_A.T_0 1_A.C_0 2_C.A_0 2_C.C_0
#> [1,]       1       1       1       .
#> [2,]       .       1       .       1

# Counting positional 2-mers of two DNA sequences (second representation)
count_kmers(
     list(c("A", "C", "A", "T"), c("A", "C", "C")),
     k=2,
     positional=TRUE,
     batch_size=batch_size)
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> 2 x 4 sparse Matrix of class "dgCMatrix"
#>      1_A.C_0 2_C.A_0 3_A.T_0 2_C.C_0
#> [1,]       1       1       1       .
#> [2,]       1       .       .       1

# Counting 2-mers of two DNA sequences, considering only A and C elements
count_kmers(
    c("ACAT", "ACC"),
    k=2,
    kmer_alphabet=c("A", "C"),
    batch_size=batch_size)
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> 2 x 3 sparse Matrix of class "dgCMatrix"
#>      A.C_0 C.A_0 C.C_0
#> [1,]     1     1     .
#> [2,]     1     .     1

# Counting gapped 3-mers with lengths of gaps 1 and 2
count_kmers(
    c("ACATACTAT", "ACCCCCC"),
    kmer_gaps=c(1,2),
    batch_size=batch_size)
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> 2 x 6 sparse Matrix of class "dgCMatrix"
#>      C.T.T_1.2 T.C.T_1.2 A.A.A_1.2 A.A.C_1.2 A.C.C_1.2 C.C.C_1.2
#> [1,]         1         1         1         1         .         .
#> [2,]         .         .         .         .         1         1
