Introduction

This vignette provides comprehensive overview of the features provided by seqR package with appropriate examples.

The package consists of two functions that facilitate k-mer counting:

  • count_kmers (used for counting k-mers of one type)
  • count_multimers (a wrapper of count_kmers, used for counting k-mers of many types in a single invocation of the function)

and one function that facilitates processing of k-mer matrices:

  • rbind_columnwise (a helper function used for merging several k-mer matrices that do not have same sets of columns)

Supported input sequences

The package supports all types of biological sequences, in particular, nucleic acids and proteins. In fact, there is no constraint regarding the domain of input. However, input sequences must conform to the one of the following representations.

String vectors

Example:

count_kmers(c("AAA", "ACDA"),
            k=2)
#> 2 x 4 sparse Matrix of class "dgCMatrix"
#>      A.A_0 C.D_0 A.C_0 D.A_0
#> [1,]     2     .     .     .
#> [2,]     .     1     1     1

This representation is more efficient and recommended to use. However, it does not support multi-character alphabets, as opposed to the second representation.

List of string vectors

Example:

count_kmers(list(c("A", "A", "A"), c("A", "C", "D", "A")),
            k=2)
#> 2 x 4 sparse Matrix of class "dgCMatrix"
#>      A.A_0 D.A_0 A.C_0 C.D_0
#> [1,]     2     .     .     .
#> [2,]     .     1     1     1

Supported variants of k-mers

A user explicitly specifies k-mer configuration using the following function parameters: k, kmer_gaps, and positional (for count_kmers). There are four major variants of k-mers: contiguous k-mers, gapped k-mers, positional contiguous k-mers, and positional gapped k-mers.

Contiguous k-mers

Contiguous k-mers can be defined as subwords of a fixed length. A user specifies the length of a k-mer with k parameter. The other function parameters should be kmer_gaps = NULL (default) and positional = FALSE (default).

For example, 3-mers of sequence AABC are: AAB and ABC.

Usage

count_kmers(c("DDDVSVAA", "FFSFSVAAA"),
            k=5)
#> 2 x 9 sparse Matrix of class "dgCMatrix"
#>      D.D.V.S.V_0.0.0.0 D.D.D.V.S_0.0.0.0 V.S.V.A.A_0.0.0.0 D.V.S.V.A_0.0.0.0
#> [1,]                 1                 1                 1                 1
#> [2,]                 .                 .                 .                 .
#>      F.S.F.S.V_0.0.0.0 F.S.V.A.A_0.0.0.0 F.F.S.F.S_0.0.0.0 S.V.A.A.A_0.0.0.0
#> [1,]                 .                 .                 .                 .
#> [2,]                 1                 1                 1                 1
#>      S.F.S.V.A_0.0.0.0
#> [1,]                 .
#> [2,]                 1

Gapped k-mers

Gapped k-mers can be defined as subsequences (not necessarily contiguous) of a given sequence. In particular, between two contiguous elements of the sequence there might be a gap of the length specified by a user.

A user specifies the length of each gap separately using kmer_gaps parameter. The other function parameters should be k = length(kmer_gaps) + 1 (default) and positional = FALSE (default).

For example, gapped 3-mers with gaps’ lengths 0 and 1 of sequence AABCCA are: AAC, ABC, BCA.

Usage

count_kmers(c("DDDVSVAA", "FFSFSVAAA"),
            kmer_gaps=c(2, 0)) # gapped 3-mer template: X__XX
#> 2 x 9 sparse Matrix of class "dgCMatrix"
#>      D.S.V_2.0 D.V.A_2.0 D.V.S_2.0 V.A.A_2.0 S.V.A_2.0 F.S.V_2.0 F.F.S_2.0
#> [1,]         1         1         1         1         .         .         .
#> [2,]         .         .         .         .         1         1         1
#>      S.A.A_2.0 F.A.A_2.0
#> [1,]         .         .
#> [2,]         1         1

Positional contiguous k-mers

A positional contiguous k-mer is a subtype of a contiguous k-mer with extra information related to the exact (start) position of the k-mer. It means that two same contiguous k-mers that start at different positions of a given sequence are considered to be different, as opposed to (non-positional) contiguous k-mers.

A user specifies the length of the k-mer with k parameter. The other function parameters should be kmer_gaps = NULL (default) and positional = TRUE.

For example, positional contiguous 3-mers of sequence AABCCA are: 1_AAB, 2_ABC, 3_BCC, 4_CCA.

Usage

count_kmers(c("DDDVSVAA", "FFSFSVAAA"),
            k=5,
            positional=TRUE)
#> 2 x 9 sparse Matrix of class "dgCMatrix"
#>      1_D.D.D.V.S_0.0.0.0 2_D.D.V.S.V_0.0.0.0 4_V.S.V.A.A_0.0.0.0
#> [1,]                   1                   1                   1
#> [2,]                   .                   .                   .
#>      3_D.V.S.V.A_0.0.0.0 2_F.S.F.S.V_0.0.0.0 4_F.S.V.A.A_0.0.0.0
#> [1,]                   1                   .                   .
#> [2,]                   .                   1                   1
#>      3_S.F.S.V.A_0.0.0.0 1_F.F.S.F.S_0.0.0.0 5_S.V.A.A.A_0.0.0.0
#> [1,]                   .                   .                   .
#> [2,]                   1                   1                   1

Positional gapped k-mers

A positional gapped k-mer is a subtype of a gapped k-mer with extra information about the exact (start) position of the k-mer. It means that two same gapped k-mers that start at different positions of a given sequence are considered to be different, as opposed to (non-positional) gapped k-mers.

A user specifies the length of each gap using kmer_gaps parameter. The other function parameters should be k = length(kmer_gaps) + 1 (default) and positional = TRUE.

For example, positional gapped 3-mers with gaps’ lengths 0 and 1 of sequence AABCCA are: 1_AAC, 2_ABC, 3_BCA.

Usage

count_kmers(c("DDDVSVAA", "FFSFSVAAA"),
            kmer_gaps=c(2, 0), # gapped 3-mer template: X__XX
            positional=TRUE)
#> 2 x 9 sparse Matrix of class "dgCMatrix"
#>      1_D.V.S_2.0 3_D.V.A_2.0 2_D.S.V_2.0 4_V.A.A_2.0 2_F.S.V_2.0 4_F.A.A_2.0
#> [1,]           1           1           1           1           .           .
#> [2,]           .           .           .           .           1           1
#>      3_S.V.A_2.0 5_S.A.A_2.0 1_F.F.S_2.0
#> [1,]           .           .           .
#> [2,]           1           1           1

Result type

The result of the k-mer counting function is a k-mer matrix, represented as a sparse matrix (see package Matrix).

The i-th row of the result matrix corresponds to the i-th input sequence and contains the information about particular occurrences of k-mers. On the other hand, columns correspond to k-mers found in input sequences and they are represented in a human-readable form (for more detail see section Human-readable representation of k-mers).

Due to the usage of a sparse matrix representation, a signification memory optimization is achieved, as it can be seen below.

get_random_string_vector <- function(seq_num, seq_len, alphabet, seed=123) {
  set.seed(seed)
  sapply(1:seq_num,
         function(n) paste0(sample(alphabet, seq_len, replace=TRUE), collapse=""))
}

r <- count_kmers(get_random_string_vector(300, 1000, LETTERS),
                 k=4)
print(pryr::object_size(r))
#> 20.3 MB
print(pryr::object_size(as.matrix(r)))
#> 543 MB

Importantly, such a representation is supported by machine learning packages such as ranger and xgboost. Therefore, in those cases, it does not require extra conversions.

Extra features

In order to improve user experience and meet custom needs, several extra features are provided.

Configurable k-mer alphabet

A k-mer alphabet defines which elements of a sequence should be considered during the k-mer counting procedure.

By default, alphabet derived from input sequences is used:

count_kmers(c("VAAAAHDFAA", "KKKKAKAAVA"),
            k=4)
#> 2 x 14 sparse Matrix of class "dgCMatrix"
#>    [[ suppressing 14 column names 'A.A.H.D_0.0.0', 'A.A.A.A_0.0.0', 'A.A.A.H_0.0.0' ... ]]
#>                                 
#> [1,] 1 1 1 1 1 1 1 . . . . . . .
#> [2,] . . . . . . . 1 1 1 1 1 1 1

However, a user can provide a custom k-mer alphabet (c("A", "K")):

count_kmers(c("VAAAAHDFAA", "KKKKAKAAVA"),
            k=4,
            kmer_alphabet = c("A", "K"))
#> 2 x 6 sparse Matrix of class "dgCMatrix"
#>      A.A.A.A_0.0.0 A.K.A.A_0.0.0 K.A.K.A_0.0.0 K.K.A.K_0.0.0 K.K.K.K_0.0.0
#> [1,]             1             .             .             .             .
#> [2,]             .             1             1             1             1
#>      K.K.K.A_0.0.0
#> [1,]             .
#> [2,]             1

Configurable batch size

The internal algorithm processes input sequences in consecutive batches of a given size. The larger the size of a batch is, the more sequences will be used in a single step that processes entries concurrently. Therefore, a user should be aware that in order to take full advantage of multi-threading, they should set the batch size to the number that is greater or equal to the number of CPU cores. However, simultaneously, this number should not be too large in order not to consume too much memory, as during a single step, all sequences of a given batch are encoded, which, in turn, increases memory usage additionally.

Generally, there are two common use cases when a user should explicitly set the size of a batch size (in other cases, the default value is sufficient): 1. a user does not want to process sequences concurrently (then, they should set batch_size = 1) 2. a user needs to tweak the memory consumption accordingly to their custom needs

Usage

sequences <- get_random_string_vector(seq_num = 200, seq_len = 1000, alphabet = LETTERS)
microbenchmark::microbenchmark(
  multithreaded = count_kmers(sequences, k=5, batch_size = 200),
  singlethreaded = count_kmers(sequences, k=5, batch_size = 1),
  times=11
)
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Unit: milliseconds
#>            expr      min      lq     mean   median       uq      max neval
#>   multithreaded 154.1921 162.223 168.1859 163.7308 165.0471 222.0474    11
#>  singlethreaded 223.1620 226.276 230.7504 230.7760 235.7555 237.7434    11

Verbose mode

Verbose mode allows a user to get extra information about the current state of computations, e.g., which batch of sequences is currently processed.

Usage

count_kmers(c("VAAAAHDFAA", "KKKKAKAAVA", "ASDDA"),
            k=3,
            batch_size=1,
            verbose = TRUE)
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Start processing sequences (batch: [1-1])...
#> Start processing sequences (batch: [2-2])...
#> Start processing sequences (batch: [3-3])...
#> 3 x 17 sparse Matrix of class "dgCMatrix"
#>    [[ suppressing 17 column names 'A.H.D_0.0', 'H.D.F_0.0', 'A.A.H_0.0' ... ]]
#>                                       
#> [1,] 1 1 1 1 2 1 1 . . . . . . . . . .
#> [2,] . . . . . . . 1 1 2 1 1 1 1 . . .
#> [3,] . . . . . . . . . . . . . . 1 1 1

Configurable dimension of the hash value of a k-mer

As the core k-mer counting algorithm intensively takes advantage of hashing functions, seqR package is a probabilistic solution. However, due to the application of multi-dimensional hash values, a user can specify the length of the hash vector, and, thus, increase the probability of a correct result.

Nevertheless, for small values of k (k <= 10), the result is certainly correct for the default value of the parameter (i.e., hash_size = 2).

compute k-mers with or without their frequencies

This feature is particularly useful when a user only wants to get information related to presence or absence of k-mers. Then, they might set this feature (with_kmer_counts = FALSE).

Usage

Compare the following two code snippets:

count_kmers(c("AAAAAAADFSSAAAA", "AADADADADDAD"),
            k=3,
            with_kmer_counts=FALSE)
#> 2 x 11 sparse Matrix of class "dgCMatrix"
#>    [[ suppressing 11 column names 'A.D.F_0.0', 'A.A.D_0.0', 'S.S.A_0.0' ... ]]
#>                           
#> [1,] 1 1 1 1 1 1 1 . . . .
#> [2,] . 1 . . . . . 1 1 1 1
count_kmers(c("AAAAAAADFSSAAAA", "AADADADADDAD"),
            k=3,
            with_kmer_counts=TRUE)
#> 2 x 11 sparse Matrix of class "dgCMatrix"
#>    [[ suppressing 11 column names 'A.D.F_0.0', 'A.A.D_0.0', 'S.S.A_0.0' ... ]]
#>                           
#> [1,] 1 1 1 1 1 1 7 . . . .
#> [2,] . 1 . . . . . 4 3 1 1

Compute a result k-mer matrix with or without human-readable k-mer (column) names

The aim of this feature is to optimize memory consumption and CPU time. It is particularly useful when many different k-mer models are tested and there is no need to get a human-readable features.

Usage

Compare the following two code snippets:

count_kmers(c("AAAAAAADFSSAAAA", "AADADADADDAD"),
            k=3,
            with_kmer_names=FALSE)
#> 2 x 11 sparse Matrix of class "dgCMatrix"
#>                           
#> [1,] 1 1 1 1 1 1 7 . . . .
#> [2,] . 1 . . . . . 4 3 1 1
count_kmers(c("AAAAAAADFSSAAAA", "AADADADADDAD"),
            k=3,
            with_kmer_names=TRUE)
#> 2 x 11 sparse Matrix of class "dgCMatrix"
#>    [[ suppressing 11 column names 'A.D.F_0.0', 'A.A.D_0.0', 'S.S.A_0.0' ... ]]
#>                           
#> [1,] 1 1 1 1 1 1 7 . . . .
#> [2,] . 1 . . . . . 4 3 1 1

Count k-mers of several types in a single function invocation

For the sake of convenient usage, an extra wrapper (count_multimers) has been implemented. It makes possible counting k-mers of various type using only a single invocation of a function. In this case, a user must provide sequences (rather than single values) that denote k-mer configurations (for more detail see the documentation of count_multimers).

count_multimers(c("AAAAAAADFSSAAAA", "AADADADADDAD"),
                k_vector=c(1,2,3,4),
                verbose=TRUE)
#> [1] "Processing sequences for config no.1"
#> Start processing sequences (batch: [1-2])...
#> [1] "Processing sequences for config no.2"
#> Start processing sequences (batch: [1-2])...
#> [1] "Processing sequences for config no.3"
#> Start processing sequences (batch: [1-2])...
#> [1] "Processing sequences for config no.4"
#> Start processing sequences (batch: [1-2])...
#> 2 x 37 sparse Matrix of class "dgCMatrix"
#>    [[ suppressing 37 column names 'S', 'D', 'A' ... ]]
#>                                                                                
#> [1,] 2 1 11 1 1 1 1 9 1 1 . . 1 1 1 1 1 1 7 . . . . 1 5 1 1 1 1 1 1 . . . . . .
#> [2,] . 6  6 . . . . 1 5 . 1 4 . 1 . . . . . 4 3 1 1 . . . . . . . . 1 2 1 1 1 3

Merge (rbind) two k-mer matrices derived from distinct sequences

As rbind function does not work correctly if input matrices does not have equivalent sets of columns, seqR package provides a new function (rbind_columnwise) to satisfy this use case:

mA <- count_kmers(c("AAAAAAADFSSAAAA", "AADADADADDAD"),
                  k=1)
mB <- count_kmers(c("VVVVVAAVA", "ADSDDD", "AAAAAV"),
                  k=1)

rbind_columnwise(mA, mB)
#> 5 x 5 sparse Matrix of class "dgCMatrix"
#>      S D  A F V
#> [1,] 2 1 11 1 .
#> [2,] . 6  6 . .
#> [3,] . .  3 . 6
#> [4,] 1 4  1 . .
#> [5,] . .  5 . 1

Human-readable representation of k-mers

Each column of the result represents a single k-mer that has the following form:

[p_]s₁.s₂....sₖ_g₁.g₂...gₖ₋₁

The first part ([p_]) is an integer that is used only in case of positional k-mers and it indicates the exact (start) position of the k-mer in a sequence. The second part (s₁.s₂....sₖ) represents consecutive elements of the k-mer. Finally, the last part (g₁.g₂...gₖ₋₁) represents the consecutive lengths of gaps. In particular, if contiguous k-mers are considered, all elements of this part is equal to 0 (e.g., 0.0.0 for 4-mers). Importantly, for 1-mers, this part is not present.