Next generation sequencing technologies have offered new opportunities to study metagenomics and understand various microbial communities such as human guts, rumen and soil. Due to the lack of reference genomes, de novo assembly of metagenomics data (short reads) is a beneficial and almost inevitable step for metagenomics analysis (Qin et al., 2010). This step is, however, constrained by the heavy requirement of computational resources, especially for large and complex datasets encountered in environmental metagenomics (Howe et al., 2014).

The soil metagenomics dataset recently published by Howe et al. (2014) comprises 252 Gbp even after trimming low quality bases. The dataset was successfully assembled with pre-processing steps including partitioning and digital normalization. At present no de novo assembler can assemble the data as a whole using a feasible amount of computer memory. The estimated memory requirement for SOAPdenovo2 (Luo et al., 2012) and IDBA-UD (Peng et al., 2012) to assemble the soil data is at least 4 TB. As the volume of metagenomics data keeps growing, we are motivated to develop MEGAHIT, an assembler that can assemble large and complex metagenomics data in a time- and cost-efficient manner, especially on a single-node server (current maximum memory capacity 768 GB for a 2-socket server).

MEGAHIT makes use of succinct de Bruijn graphs (SdBG; Bowe et al., 2012), a compressed representation of de Bruijn graphs. An SdBG encodes a graph with m edges in O(m) bits, and supports O(1) time traversal from a vertex to its neighbors. Our implementation has added a bit-vector of length m to mark the validity of each edge (so as to support dynamic removal of edges efficiently), and an auxiliary vector of 2kt bits (where k is the k-mer size and t is the number of zero-indegree vertices) to store the sequence of zero-indegree vertices, so that the graph remains lossless.

Despite its advantages, constructing an SdBG efficiently is non-trivial. MEGAHIT is rooted in a fast parallel algorithm for SdBG construction; the bottleneck is sorting a set of (k+1)-mers that are the edges of an SdBG in reverse lexicographical order of their length-k prefixes (k-mers). MEGAHIT exploits the parallelism of a graphics processing unit (GPU, CUDA-enabled) by adapting the recent BWT-construction algorithm CX1 (Liu et al., 2014), which takes advantage of a GPU to sort the suffixes of a set of reads very efficiently. Limited by the relatively small size of the GPU's on-board memory, we adopt a block-wise strategy that partitions the k-mers according to their length-l prefix (l = 8 in our implementation). The k-mers in consecutive partitions that fit within the GPU memory are sorted together. Leveraging the parallelism of the GPU, MEGAHIT speeds up the construction by 3–5 times over its CPU-only counterpart.

Notably, sequencing error is problematic, because a single base of sequencing error leads to k erroneous k-mer singletons, which increases the memory consumption of MEGAHIT significantly. To cope with the problem, before graph construction, all (k+1)-mers from the input reads are sorted and counted, and only (k+1)-mers that appear at least d (2 by default) times are kept as solid-kmers. This method removes many spurious edges, but may be risky for metagenomics assembly since many low-abundance species may have been sequenced at very low depth. Thus we introduce a mercy-kmer strategy to recover these low-depth edges. The strategy considers two solid (k+1)-mers x and y from the same read, where x has no outdegree and y has no indegree.
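The solid-kmer filter and the block-wise sorting strategy described in this section can be illustrated with a minimal CPU-only sketch. It is not MEGAHIT's implementation: the function names (`kplus1_mers`, `solid_kmers`, `blockwise_sort`) and the `budget` parameter, which stands in for the GPU's on-board memory limit, are hypothetical. For simplicity the sketch sorts in plain lexicographical order rather than the reverse lexicographical order of length-k prefixes used for SdBG construction, and it ignores reverse complements.

```python
from collections import Counter, defaultdict


def kplus1_mers(read, k):
    """Yield every (k+1)-mer of a read, from left to right."""
    for i in range(len(read) - k):
        yield read[i:i + k + 1]


def solid_kmers(reads, k, d=2):
    """Return the (k+1)-mers that occur at least d times across all reads."""
    counts = Counter(m for read in reads for m in kplus1_mers(read, k))
    return {m for m, c in counts.items() if c >= d}


def blockwise_sort(kmers, l, budget):
    """Sort k-mers by partitioning on their length-l prefix.

    Consecutive partitions are gathered into one batch while the batch
    holds at most `budget` k-mers (an assumed stand-in for GPU memory),
    and each batch is then sorted on its own. Because batches cover
    consecutive prefix ranges, concatenating them gives a full sort.
    A single partition larger than `budget` is still sorted alone here.
    """
    partitions = defaultdict(list)
    for m in kmers:
        partitions[m[:l]].append(m)

    result, batch = [], []
    for prefix in sorted(partitions):
        block = partitions[prefix]
        if batch and len(batch) + len(block) > budget:
            result.extend(sorted(batch))  # on a GPU this would be one device sort
            batch = []
        batch.extend(block)
    result.extend(sorted(batch))
    return result


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGA", "TTTTCCCC"]
    solid = solid_kmers(reads, k=3, d=2)
    print(blockwise_sort(solid, l=2, budget=2))
```

With d = 2, any (k+1)-mer seen only once is discarded, which is exactly why edges from species sequenced at very low depth can be lost and why a recovery step such as the mercy-kmer strategy is needed.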