vignettes/features-overview.Rmd
features-overview.Rmd
This vignette provides comprehensive overview of the features provided by seqR
package with appropriate examples.
The package consists of two functions that facilitate k-mer counting:
count_kmers
(used for counting k-mers of one type)count_multimers
(a wrapper of count_kmers
, used for counting k-mers of many types in a single invocation of the function)and one function that facilitates processing of k-mer matrices:
rbind_columnwise
(a helper function used for merging several k-mer matrices that do not have same sets of columns)The package supports all types of biological sequences, in particular, nucleic acids and proteins. In fact, there is no constraint regarding the domain of input. However, input sequences must conform to the one of the following representations.
Example:
count_kmers(c("AAA", "ACDA"),
k=2)
#> 2 x 4 sparse Matrix of class "dgCMatrix"
#> A.A_0 C.D_0 A.C_0 D.A_0
#> [1,] 2 . . .
#> [2,] . 1 1 1
This representation is more efficient and recommended to use. However, it does not support multi-character alphabets, as opposed to the second representation.
Example:
count_kmers(list(c("A", "A", "A"), c("A", "C", "D", "A")),
k=2)
#> 2 x 4 sparse Matrix of class "dgCMatrix"
#> A.A_0 D.A_0 A.C_0 C.D_0
#> [1,] 2 . . .
#> [2,] . 1 1 1
A user explicitly specifies k-mer configuration using the following function parameters: k
, kmer_gaps
, and positional
(for count_kmers
). There are four major variants of k-mers: contiguous k-mers, gapped k-mers, positional contiguous k-mers, and positional gapped k-mers.
Contiguous k-mers can be defined as subwords of a fixed length. A user specifies the length of a k-mer with k
parameter. The other function parameters should be kmer_gaps = NULL
(default) and positional = FALSE
(default).
For example, 3-mers of sequence AABC are: AAB and ABC.
count_kmers(c("DDDVSVAA", "FFSFSVAAA"),
k=5)
#> 2 x 9 sparse Matrix of class "dgCMatrix"
#> D.D.V.S.V_0.0.0.0 D.D.D.V.S_0.0.0.0 V.S.V.A.A_0.0.0.0 D.V.S.V.A_0.0.0.0
#> [1,] 1 1 1 1
#> [2,] . . . .
#> F.S.F.S.V_0.0.0.0 F.S.V.A.A_0.0.0.0 F.F.S.F.S_0.0.0.0 S.V.A.A.A_0.0.0.0
#> [1,] . . . .
#> [2,] 1 1 1 1
#> S.F.S.V.A_0.0.0.0
#> [1,] .
#> [2,] 1
Gapped k-mers can be defined as subsequences (not necessarily contiguous) of a given sequence. In particular, between two contiguous elements of the sequence there might be a gap of the length specified by a user.
A user specifies the length of each gap separately using kmer_gaps
parameter. The other function parameters should be k = length(kmer_gaps) + 1
(default) and positional = FALSE
(default).
For example, gapped 3-mers with gaps’ lengths 0 and 1 of sequence AABCCA are: AAC, ABC, BCA.
count_kmers(c("DDDVSVAA", "FFSFSVAAA"),
kmer_gaps=c(2, 0)) # gapped 3-mer template: X__XX
#> 2 x 9 sparse Matrix of class "dgCMatrix"
#> D.S.V_2.0 D.V.A_2.0 D.V.S_2.0 V.A.A_2.0 S.V.A_2.0 F.S.V_2.0 F.F.S_2.0
#> [1,] 1 1 1 1 . . .
#> [2,] . . . . 1 1 1
#> S.A.A_2.0 F.A.A_2.0
#> [1,] . .
#> [2,] 1 1
A positional contiguous k-mer is a subtype of a contiguous k-mer with extra information related to the exact (start) position of the k-mer. It means that two same contiguous k-mers that start at different positions of a given sequence are considered to be different, as opposed to (non-positional) contiguous k-mers.
A user specifies the length of the k-mer with k
parameter. The other function parameters should be kmer_gaps = NULL
(default) and positional = TRUE
.
For example, positional contiguous 3-mers of sequence AABCCA are: 1_AAB, 2_ABC, 3_BCC, 4_CCA.
count_kmers(c("DDDVSVAA", "FFSFSVAAA"),
k=5,
positional=TRUE)
#> 2 x 9 sparse Matrix of class "dgCMatrix"
#> 1_D.D.D.V.S_0.0.0.0 2_D.D.V.S.V_0.0.0.0 4_V.S.V.A.A_0.0.0.0
#> [1,] 1 1 1
#> [2,] . . .
#> 3_D.V.S.V.A_0.0.0.0 2_F.S.F.S.V_0.0.0.0 4_F.S.V.A.A_0.0.0.0
#> [1,] 1 . .
#> [2,] . 1 1
#> 3_S.F.S.V.A_0.0.0.0 1_F.F.S.F.S_0.0.0.0 5_S.V.A.A.A_0.0.0.0
#> [1,] . . .
#> [2,] 1 1 1
A positional gapped k-mer is a subtype of a gapped k-mer with extra information about the exact (start) position of the k-mer. It means that two same gapped k-mers that start at different positions of a given sequence are considered to be different, as opposed to (non-positional) gapped k-mers.
A user specifies the length of each gap using kmer_gaps
parameter. The other function parameters should be k = length(kmer_gaps) + 1
(default) and positional = TRUE
.
For example, positional gapped 3-mers with gaps’ lengths 0 and 1 of sequence AABCCA are: 1_AAC, 2_ABC, 3_BCA.
count_kmers(c("DDDVSVAA", "FFSFSVAAA"),
kmer_gaps=c(2, 0), # gapped 3-mer template: X__XX
positional=TRUE)
#> 2 x 9 sparse Matrix of class "dgCMatrix"
#> 1_D.V.S_2.0 3_D.V.A_2.0 2_D.S.V_2.0 4_V.A.A_2.0 2_F.S.V_2.0 4_F.A.A_2.0
#> [1,] 1 1 1 1 . .
#> [2,] . . . . 1 1
#> 3_S.V.A_2.0 5_S.A.A_2.0 1_F.F.S_2.0
#> [1,] . . .
#> [2,] 1 1 1
The result of the k-mer counting function is a k-mer matrix, represented as a sparse matrix (see package Matrix).
The i-th row of the result matrix corresponds to the i-th input sequence and contains the information about particular occurrences of k-mers. On the other hand, columns correspond to k-mers found in input sequences and they are represented in a human-readable form (for more detail see section Human-readable representation of k-mers
).
Due to the usage of a sparse matrix representation, a signification memory optimization is achieved, as it can be seen below.
get_random_string_vector <- function(seq_num, seq_len, alphabet, seed=123) {
set.seed(seed)
sapply(1:seq_num,
function(n) paste0(sample(alphabet, seq_len, replace=TRUE), collapse=""))
}
r <- count_kmers(get_random_string_vector(300, 1000, LETTERS),
k=4)
print(pryr::object_size(r))
#> 20.3 MB
print(pryr::object_size(as.matrix(r)))
#> 543 MB
Importantly, such a representation is supported by machine learning packages such as ranger and xgboost. Therefore, in those cases, it does not require extra conversions.
In order to improve user experience and meet custom needs, several extra features are provided.
A k-mer alphabet defines which elements of a sequence should be considered during the k-mer counting procedure.
By default, alphabet derived from input sequences is used:
count_kmers(c("VAAAAHDFAA", "KKKKAKAAVA"),
k=4)
#> 2 x 14 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 14 column names 'A.A.H.D_0.0.0', 'A.A.A.A_0.0.0', 'A.A.A.H_0.0.0' ... ]]
#>
#> [1,] 1 1 1 1 1 1 1 . . . . . . .
#> [2,] . . . . . . . 1 1 1 1 1 1 1
However, a user can provide a custom k-mer alphabet (c("A", "K")
):
count_kmers(c("VAAAAHDFAA", "KKKKAKAAVA"),
k=4,
kmer_alphabet = c("A", "K"))
#> 2 x 6 sparse Matrix of class "dgCMatrix"
#> A.A.A.A_0.0.0 A.K.A.A_0.0.0 K.A.K.A_0.0.0 K.K.A.K_0.0.0 K.K.K.K_0.0.0
#> [1,] 1 . . . .
#> [2,] . 1 1 1 1
#> K.K.K.A_0.0.0
#> [1,] .
#> [2,] 1
The internal algorithm processes input sequences in consecutive batches of a given size. The larger the size of a batch is, the more sequences will be used in a single step that processes entries concurrently. Therefore, a user should be aware that in order to take full advantage of multi-threading, they should set the batch size to the number that is greater or equal to the number of CPU cores. However, simultaneously, this number should not be too large in order not to consume too much memory, as during a single step, all sequences of a given batch are encoded, which, in turn, increases memory usage additionally.
Generally, there are two common use cases when a user should explicitly set the size of a batch size (in other cases, the default value is sufficient): 1. a user does not want to process sequences concurrently (then, they should set batch_size = 1
) 2. a user needs to tweak the memory consumption accordingly to their custom needs
sequences <- get_random_string_vector(seq_num = 200, seq_len = 1000, alphabet = LETTERS)
microbenchmark::microbenchmark(
multithreaded = count_kmers(sequences, k=5, batch_size = 200),
singlethreaded = count_kmers(sequences, k=5, batch_size = 1),
times=11
)
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> multithreaded 154.1921 162.223 168.1859 163.7308 165.0471 222.0474 11
#> singlethreaded 223.1620 226.276 230.7504 230.7760 235.7555 237.7434 11
Verbose mode allows a user to get extra information about the current state of computations, e.g., which batch of sequences is currently processed.
count_kmers(c("VAAAAHDFAA", "KKKKAKAAVA", "ASDDA"),
k=3,
batch_size=1,
verbose = TRUE)
#> Single-threaded mode enabled. In order to speed up computations, increase defined batch_size or use a default value
#> Start processing sequences (batch: [1-1])...
#> Start processing sequences (batch: [2-2])...
#> Start processing sequences (batch: [3-3])...
#> 3 x 17 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 17 column names 'A.H.D_0.0', 'H.D.F_0.0', 'A.A.H_0.0' ... ]]
#>
#> [1,] 1 1 1 1 2 1 1 . . . . . . . . . .
#> [2,] . . . . . . . 1 1 2 1 1 1 1 . . .
#> [3,] . . . . . . . . . . . . . . 1 1 1
As the core k-mer counting algorithm intensively takes advantage of hashing functions, seqR
package is a probabilistic solution. However, due to the application of multi-dimensional hash values, a user can specify the length of the hash vector, and, thus, increase the probability of a correct result.
Nevertheless, for small values of k
(k <= 10
), the result is certainly correct for the default value of the parameter (i.e., hash_size = 2
).
This feature is particularly useful when a user only wants to get information related to presence or absence of k-mers. Then, they might set this feature (with_kmer_counts = FALSE
).
Compare the following two code snippets:
count_kmers(c("AAAAAAADFSSAAAA", "AADADADADDAD"),
k=3,
with_kmer_counts=FALSE)
#> 2 x 11 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 11 column names 'A.D.F_0.0', 'A.A.D_0.0', 'S.S.A_0.0' ... ]]
#>
#> [1,] 1 1 1 1 1 1 1 . . . .
#> [2,] . 1 . . . . . 1 1 1 1
count_kmers(c("AAAAAAADFSSAAAA", "AADADADADDAD"),
k=3,
with_kmer_counts=TRUE)
#> 2 x 11 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 11 column names 'A.D.F_0.0', 'A.A.D_0.0', 'S.S.A_0.0' ... ]]
#>
#> [1,] 1 1 1 1 1 1 7 . . . .
#> [2,] . 1 . . . . . 4 3 1 1
The aim of this feature is to optimize memory consumption and CPU time. It is particularly useful when many different k-mer models are tested and there is no need to get a human-readable features.
Compare the following two code snippets:
count_kmers(c("AAAAAAADFSSAAAA", "AADADADADDAD"),
k=3,
with_kmer_names=FALSE)
#> 2 x 11 sparse Matrix of class "dgCMatrix"
#>
#> [1,] 1 1 1 1 1 1 7 . . . .
#> [2,] . 1 . . . . . 4 3 1 1
count_kmers(c("AAAAAAADFSSAAAA", "AADADADADDAD"),
k=3,
with_kmer_names=TRUE)
#> 2 x 11 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 11 column names 'A.D.F_0.0', 'A.A.D_0.0', 'S.S.A_0.0' ... ]]
#>
#> [1,] 1 1 1 1 1 1 7 . . . .
#> [2,] . 1 . . . . . 4 3 1 1
For the sake of convenient usage, an extra wrapper (count_multimers
) has been implemented. It makes possible counting k-mers of various type using only a single invocation of a function. In this case, a user must provide sequences (rather than single values) that denote k-mer configurations (for more detail see the documentation of count_multimers
).
count_multimers(c("AAAAAAADFSSAAAA", "AADADADADDAD"),
k_vector=c(1,2,3,4),
verbose=TRUE)
#> [1] "Processing sequences for config no.1"
#> Start processing sequences (batch: [1-2])...
#> [1] "Processing sequences for config no.2"
#> Start processing sequences (batch: [1-2])...
#> [1] "Processing sequences for config no.3"
#> Start processing sequences (batch: [1-2])...
#> [1] "Processing sequences for config no.4"
#> Start processing sequences (batch: [1-2])...
#> 2 x 37 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 37 column names 'S', 'D', 'A' ... ]]
#>
#> [1,] 2 1 11 1 1 1 1 9 1 1 . . 1 1 1 1 1 1 7 . . . . 1 5 1 1 1 1 1 1 . . . . . .
#> [2,] . 6 6 . . . . 1 5 . 1 4 . 1 . . . . . 4 3 1 1 . . . . . . . . 1 2 1 1 1 3
As rbind
function does not work correctly if input matrices does not have equivalent sets of columns, seqR
package provides a new function (rbind_columnwise
) to satisfy this use case:
mA <- count_kmers(c("AAAAAAADFSSAAAA", "AADADADADDAD"),
k=1)
mB <- count_kmers(c("VVVVVAAVA", "ADSDDD", "AAAAAV"),
k=1)
rbind_columnwise(mA, mB)
#> 5 x 5 sparse Matrix of class "dgCMatrix"
#> S D A F V
#> [1,] 2 1 11 1 .
#> [2,] . 6 6 . .
#> [3,] . . 3 . 6
#> [4,] 1 4 1 . .
#> [5,] . . 5 . 1
Each column of the result represents a single k-mer that has the following form:
[p_]s₁.s₂....sₖ_g₁.g₂...gₖ₋₁
The first part ([p_]
) is an integer that is used only in case of positional k-mers and it indicates the exact (start) position of the k-mer in a sequence. The second part (s₁.s₂....sₖ
) represents consecutive elements of the k-mer. Finally, the last part (g₁.g₂...gₖ₋₁
) represents the consecutive lengths of gaps. In particular, if contiguous k-mers are considered, all elements of this part is equal to 0 (e.g., 0.0.0
for 4-mers). Importantly, for 1-mers, this part is not present.