Sambamba: process your BAM data faster!


For very large BAM files (>100 GB), sorting with samtools is not only slow, often taking a day or more, but the run can also end in errors like the one below. In my tests Sambamba performed much better and saved a great deal of time. As the amount of data I handle keeps growing, even seemingly simple steps take a long time, and the sheer data volume is not the only problem!
A better-formatted version of this post is also available at http://blog.csdn.net/msw521sg/article/details/65938377

[bam_sort_core] merging from 3288 files...
[E::hts_open_format] fail to open file 'CS_RNA_seq.sorted.bam.tmp.1020.bam'
[bam_merge_core] fail to open file CS_RNA_seq.sorted.bam.tmp.1020.bam
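For comparison, a hedged sketch of how the same sort can be run with Sambamba instead of samtools (thread count, memory limit, temporary directory and file names are only illustrative, not the exact command from the failing run above):

$ sambamba sort -t 16 -m 20G --tmpdir=./sort_tmp -o CS_RNA_seq.sorted.bam CS_RNA_seq.bam

Raising -m lets each in-memory chunk be larger, which keeps the number of temporary files low and avoids the "merging from 3288 files" situation shown in the error.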

1、Download

https://github.com/lomereiter/sambamba/releases

2、Installation

No compilation is needed: simply copy the pre-built binary to a directory on your PATH.
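A minimal sketch, assuming a Linux x86-64 machine and that ~/bin is already on $PATH (the archive and binary names below are illustrative; pick the current release from the page above):

$ tar xjf sambamba_v0.6.6_linux.tar.bz2    # extract the downloaded release archive (name is illustrative)
$ chmod +x sambamba_v0.6.6
$ mv sambamba_v0.6.6 ~/bin/sambamba        # ~/bin is assumed to be on $PATH
$ sambamba                                 # running it with no arguments prints the version and subcommands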

3、Usage

The most commonly used subcommands are described below: view, sort, merge, slice, flagstat, and markdup.

4、view

sambamba-view - tool for extracting information from SAM/BAM/CRAM files

sambamba view [OPTIONS] <input.bam | input.sam | input.cram> [region1 […]]

sambamba view efficiently filters SAM/BAM/CRAM files for alignments satisfying various conditions, and can also output the SAM header and information about reference sequences. To make these data readily available for consumption by scripts in Perl/Python/Ruby, JSON output is provided.

By default, the tool expects a BAM file as input. To read CRAM, specify -C, and for SAM, specify -S|--sam-input on the command line; the tool does NOT try to guess the file format from the extension. Beware that when reading SAM, the tool will skip tags that don't conform to the SAM/BAM specification and set invalid fields to their default values.
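For instance, a minimal sketch of converting SAM to BAM by setting the input and output formats explicitly (reads.sam and reads.bam are placeholder names):

$ sambamba view -S reads.sam -f bam -o reads.bam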

FILTERING

Filtering can be done in two ways. First, you can specify a condition with the -F option, using a special filter-expression language described at

https://github.com/lomereiter/sambamba/wiki/%5Bsambamba-view%5D-Filter-expression-syntax

Second, if you have an indexed BAM file, several regions can be specified as well. The syntax for regions is the same as in samtools: chr:beg-end where beg and end are 1-based start and end of a closed-end interval on the reference chr.
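A small sketch combining both mechanisms, a filter expression plus a region (input.bam is a placeholder and must be indexed, i.e. input.bam.bai must exist, for region queries to work):

$ sambamba view -F "mapping_quality >= 30 and not duplicate" input.bam chr1:1000-2000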

JSON

Alignment record JSON representation is a hash with keys 'qname', 'flag', 'rname', 'pos', 'mapq', 'cigar', 'rnext', 'pnext', 'tlen', 'seq', 'qual', 'tags', e.g.

{"qname":"EAS56_57:6:190:289:82","flag":69,"rname":"chr1","pos":100,"mapq":0,"cigar":"*","rnext":"=","pnext":100,"tlen":0,"seq":"CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA","qual":[27,27,27,22,27,27,27,26,27,27,27,27,27,27,27,27,23,26,26,27,22,26,19,27,26,27,26,26,26,26,26,24,19,27,26],"tags":{"MF":192}}

JSON representation mimics SAM format except quality is given as an array of integers.

Postprocessing JSON output is best accomplished with https://stedolan.github.io/jq/

The output is one line per read; to build a proper JSON array, pipe the output into jq --slurp.
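For example, a hedged sketch that computes the mean mapping quality of a file with jq (input.bam is a placeholder name):

$ sambamba view --format=json input.bam | jq --slurp 'map(.mapq) | add / length'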

OPTIONS

  • -F, --filter=FILTER Set custom filter for alignments.

  • -f, --format=FORMAT Specify output format. FORMAT must be one of sam, bam, cram, or json (in lowercase). Default is SAM.

  • -h, --with-header  Print SAM header before reads. This is always done for BAM output.

  • -H, --header Print only SAM header to STDOUT. If FORMAT is sam or bam, its text version is printed, otherwise JSON object is written.

  • -I, --reference-info Output to STDOUT reference sequence names and lengths in JSON (see EXAMPLES).

  • -L, --regions=BEDFILE Intersect a file with regions specified in the BED file.

  • -c, --count Output to STDOUT only the number of matching records, -hHI options are ignored.

  • -v, --valid Output only valid reads.

  • -S, --sam-input Specify that the input is SAM file (default is BAM for all operations).

  • -C, --cram-input Specify that input is in CRAM format

  • -p, --show-progress Show progressbar in STDERR. Works only for BAM files, and with no regions specified, i.e. only when reading full file.

  • -l, --compression-level=COMPRESSION_LEVEL Set compression level for BAM output, a number from 0 to 9.

  • -o, --output-filename=FILENAME Specify output filename (by default everything is written to STDOUT).

  • -t, --nthreads=NTHREADS Number of threads to use.

EXAMPLES

Print basic reference sequence information:

$ sambamba view --reference-info ex1_header.bam
[{"name":"chr1","length":1575},{"name":"chr2","length":1584}]

Count reads with mapping quality not less than 50:

$ sambamba view -c -F "mapping_quality >= 50" ex1_header.bam
3124

Count properly paired reads overlapping 100..200 on chr1:

$ sambamba view -c -F "proper_pair" ex1_header.bam chr1:100-200
39

Output header in JSON format:

$ sambamba view --header --format=json ex1_header.bam
{"format_version":"1.3","rg_lines":[],"sq_lines":[{"sequence_length":1575,"species":"","uri":"","sequence_name":"chr1","assembly":"","md5":""},{"sequence_length":1584,"species":"","uri":"","sequence_name":"chr2","assembly":"","md5":""}],"sorting_order":"coordinate","pg_lines":[]}

5、sort

sambamba-sort  - tool for sorting BAM files

sambamba sort [OPTIONS] <input.bam>

BAM files can have either 'coordinate' sort order or 'qname' sort order.

The first one means to sort the file by (integer) reference ID, and for each reference sort corresponding reads by start coordinate.

'qname' sorting order is when reads are sorted lexicographically by their names.

sambamba sort does an external stable sort of the input file. That means it reads the source BAM file in chunks that fit into memory, sorts them and writes them to a temporary directory, and then merges them. After merging, the temporary files are removed automatically.

Both sorting orders are supported. The default is 'coordinate', because that is the order needed for building an index later. To switch to 'qname' order, use the -n|--sort-by-name flag.
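A minimal sketch of both modes, assuming input.bam is an unsorted BAM file (file names, thread count and memory limit are illustrative):

$ sambamba sort -t 8 -m 4G --tmpdir=./sort_tmp -o input.sorted.bam input.bam    # coordinate sort (default)
$ sambamba sort -n -t 8 -o input.nameSorted.bam input.bam                       # sort by read name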

OPTIONS

  • -m, --memory-limit=LIMIT

    Sets an upper bound on memory use; however, this is very approximate. The default memory limit is 2 GB. Increasing it allows larger chunk sizes and reduces the number of I/O seeks, improving overall performance. LIMIT must be a number with an optional suffix specifying the unit of measurement. The following suffixes are recognized: K, KiB, KB, M, MiB, MB, G, GiB, GB.

  • --tmpdir=TMPDIR

    Use TMPDIR to output sorted chunks. Default behaviour is to use system temporary directory.

  • -o, --out=OUTPUTFILE Output file name. If not provided, the result is written to a file with .sorted.bam extension.

  • -n, --sort-by-name Sort by read name lexicographically

  • -N, --natural-sort Sort by read name using natural order (like samtools)

  • -F, --filter=FILTER Keep only reads that pass FILTER. More on that in sambamba-view documentation.

  • -l, --compression-level=COMPRESSION_LEVEL

    Compression level to use for sorted BAM, from 0 (known as uncompressed BAM in samtools) to 9.

  • -u, --uncompressed-chunks

    Write sorted chunks as uncompressed BAM. Default behaviour is to write them with compression level 1, because that reduces time spent on I/O, but in some cases using this option can give you a better speed. Note, however, that the disk space needed for sorting will typically be 3-4 times more than without enabling this option.

  • -p, --show-progress Show a wget-like progress bar in STDERR (in fact two of them, one after another: the first for sorting and the second for merging).

  • -t, --nthreads=NTHREADS Number of threads to use.

6、merge

sambamba-merge - tool for merging several BAM files into one

sambamba merge [OPTIONS] <output.bam> <input1.bam> <input2.bam> […]

DESCRIPTION

sambamba merge is used for merging several sorted BAM files into one. The sorting order of all the files must be the same, and it is maintained in the output file.

SAM headers are merged automatically, as in the Picard merging tool.
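A minimal sketch, assuming the inputs were all coordinate-sorted beforehand (file names and thread count are illustrative):

$ sambamba merge -t 8 merged.bam sample1.sorted.bam sample2.sorted.bam sample3.sorted.bam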

OPTIONS

  • -t, --nthreads=NTHREADS

    Specify number of threads to use for compression/decompression tasks.

  • -l, --compression-level=COMPRESSION_LEVEL

    Specify compression level of output file as a number from 0 to 9

  • -H, --header Output only merged header to stdout in SAM format (e.g. for debugging purposes). Other options are ignored.

  • -p, --show-progress Show progressbar in STDERR.

7、slice

sambamba-slice - copying a slice of BAM file

sambamba slice [OPTIONS] <input.bam> region

DESCRIPTION

Outputs reads overlapping the specified region into a new BAM file (the default destination is STDOUT). The input file must be coordinate-sorted and indexed.

While the same can be done with sambamba view, that would be much slower due to the amount of compression/decompression work. Instead of this naive method, sambamba slice leverages knowledge of the BAM file structure and recompresses only a few BGZF blocks at the beginning and the end of the region; the BGZF blocks in the middle are copied as is. As such, this tool doesn't offer any options for the number of threads or compression level, since most of the time is spent on I/O operations.
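A minimal sketch (input.sorted.bam is a placeholder for a coordinate-sorted, indexed BAM; the region is illustrative):

$ sambamba slice -o chr2_slice.bam input.sorted.bam chr2:1000000-2000000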

OPTIONS

  • -o, --output-filename=OUTPUTFILE

    Name of output file

8、flagstat

sambamba-flagstat - getting flag statistics from BAM file

sambamba flagstat [OPTIONS] <input.bam>

DESCRIPTION

Outputs some statistics drawn from read flags.
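A minimal invocation sketch (input.bam is a placeholder for the BAM file to inspect):

$ sambamba flagstat -t 4 input.bam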

The first line contains the numbers of QC-passed and QC-failed reads. Then come pairs of numbers, the former for QC-passed reads and the latter for QC-failed ones:

  • duplicates

  • mapped reads (plus percentage relative to the numbers from the first line)

  • reads with 'is_paired' flag set

  • paired reads which are first mates

  • paired reads which are second mates

  • paired reads with 'proper_pair' flag set (plus percentage relative to the numbers of QC-passed/failed reads with 'is_paired' flag set)

  • paired reads where both mates are mapped

  • paired reads where read itself is unmapped but mate is mapped

  • paired reads where mate is mapped to a different chromosome

  • the same as previous but mapping quality is not less than 5

OPTIONS

  • -t, --nthreads=NTHREADS

    Specify number of threads to use for BAM decompression

  • -p, --show-progress

    Show progressbar in STDERR

9、markdup

sambamba-markdup - finding duplicate reads in BAM file

sambamba markdup OPTIONS <input.bam> <output.bam>

Marks (by default) or removes duplicate reads. For determining whether a read is a duplicate or not, the same criteria as in Picard are used.
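Two hedged sketches, one for marking and one for removing duplicates (file names and thread count are illustrative; the input is assumed to be coordinate-sorted):

$ sambamba markdup -t 8 input.sorted.bam input.markdup.bam    # mark duplicates only (default)
$ sambamba markdup -r -t 8 input.sorted.bam input.dedup.bam   # remove duplicates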

OPTIONS

  • -r, --remove-duplicates

    remove duplicates instead of just marking them

  • -t, --nthreads=NTHREADS

    number of threads to use

  • -l, --compression-level=N

    specify compression level of the resulting file (from 0 to 9)

  • -p, --show-progress

    show progressbar in STDERR

  • --tmpdir=TMPDIR

    specify directory for temporary files; default is /tmp

  • --hash-table-size=HASHTABLESIZE

    size of hash table for finding read pairs (default is 262144 reads); will be rounded down to the nearest power of two; should be > (average coverage) * (insert size) for good performance

  • --overflow-list-size=OVERFLOWLISTSIZE

    size of the overflow list where reads, thrown away from the hash table, get a second chance to meet their pairs (default is 200000 reads); increasing the size reduces the number of temporary files created

  • --io-buffer-size=BUFFERSIZE

    controls sizes of two buffers of BUFFERSIZE megabytes each, used for reading and writing BAM during the second pass (default is 128)

BUGS

External sorting is not implemented, so memory consumption grows by about 2 GB per 100 million reads. Check that you have enough RAM before running the tool.



