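In response to questions from researchers about third-generation sequencing data analysis, today's post covers the basic-information part of the analysis: sequencing data statistics, quality control (QC) assessment, the data summary, and so on. The terms are explained below with reference to project cases: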
General - Filtering Report
* Polymerase Read Bases : The number of bases in the polymerase read.
That is, the total amount of sequence data obtained, including adaptor sequences.
* Polymerase Reads : The number of polymerases generating high quality reads. Polymerase reads are trimmed to the high quality region and include bases from adaptors, as well as potentially multiple passes around a circular template.
That is, the high-quality sequencing reads, which include adaptors and the multiple subreads produced by repeated passes around the template.
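* Polymerase Read N50 : 50% of all bases are in polymerase reads longer than this value.
That is, the reads at least as long as the N50 together contain half of all sequenced bases. The N50 is a base-weighted statistic, not the median read length.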
* Polymerase Read Length : The mean trimmed read length of all polymerase reads. The value includes bases from adaptors as well as multiple passes around a circular template.
The mean length of the sequencing reads, including adaptors and multiple subreads.
* Polymerase Read Quality : The mean single-pass read quality of all polymerase reads.
The mean single-pass read quality of the sequencing reads.
* Post-Filter Polymerase Read Bases : The number of bases in the polymerase reads after filtering, including adaptors.
The number of bases in the sequencing reads after filtering, including adaptors and multiple subreads.
* Post-Filter Polymerase Reads : The number of polymerases generating trimmed reads after filtering. Polymerase reads include bases from adaptors and multiple passes around a circular template.
The number of sequencing reads after filtering. The filtered reads still include adaptors and multiple subreads.
* Post-Filter Polymerase Read Length : The mean trimmed read length of all polymerase reads after filtering. The value includes bases from adaptors as well as multiple passes around a circular template.
The mean length of the sequencing reads after filtering. The filtered reads still include adaptors and multiple subreads.
* Post-Filter Polymerase Read Quality : The mean single-pass read quality of all polymerase reads after filtering.
The mean single-pass read quality of the sequencing reads after filtering.
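The length-based values in this report can be recomputed from a plain list of read lengths. The following is a minimal sketch, assuming the conventional base-weighted N50 definition; the function names and the example lengths are hypothetical and are not taken from SMRT Analysis itself.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths.

    N50 is the length L such that reads of length >= L together
    contain at least half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def summarize_reads(lengths):
    """Compute read count, total bases, mean length and N50."""
    total = sum(lengths)
    return {
        "reads": len(lengths),
        "bases": total,
        "mean_length": total / len(lengths) if lengths else 0.0,
        "n50": n50(lengths),
    }


# Hypothetical polymerase read lengths in bp, for illustration only.
example = [2000, 3000, 4000, 5000, 6000]
print(summarize_reads(example))
```

Running the same function on the pre-filter and post-filter length lists gives the two sets of values that the report shows side by side.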
Definitions of the terms used in the other output reports follow:
Diagnostic - Adapters Report
Adapter Dimers (%): The % of pre-filter ZMWs which have observed inserts of 0-10bp. These are likely adapter dimers.
Adapter dimers (%): the share of pre-filter ZMWs whose observed insert is 0-10 bp long; these are most likely adapter dimers.
Short Inserts (%): The % of pre-filter ZMWs which have observed inserts of 11-100bp. These are likely short fragment contamination.
Short inserts (%): the share of pre-filter ZMWs whose observed insert is 11-100 bp long; these are most likely short contaminating fragments.
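Both percentages are simple counts over the observed insert length of each pre-filter ZMW. A rough sketch, assuming the insert lengths have already been extracted into a list (the `adapter_report` helper and the sample values are hypothetical):

```python
def adapter_report(insert_lengths):
    """Return adapter-dimer and short-insert percentages.

    insert_lengths: observed insert length (bp) for each pre-filter ZMW.
    Thresholds follow the report definitions: 0-10 bp and 11-100 bp.
    """
    total = len(insert_lengths)
    if total == 0:
        return {"adapter_dimers_pct": 0.0, "short_inserts_pct": 0.0}
    dimers = sum(1 for n in insert_lengths if 0 <= n <= 10)
    short = sum(1 for n in insert_lengths if 11 <= n <= 100)
    return {
        "adapter_dimers_pct": 100.0 * dimers / total,
        "short_inserts_pct": 100.0 * short / total,
    }


# Hypothetical insert lengths, for illustration only.
observed = [0, 5, 10, 11, 100, 101, 2500, 8000, 12000, 15000]
print(adapter_report(observed))
```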
Diagnostic - Spike-In Control Report
Control Sequence: The name of the control sequence.
Information identifying the control sequence or sample.
Control Reads (%): The percent of post-filter polymerase reads that are from the control sample. The formula for this is: (total # of control reads)/(total # of post-filter reads).
The proportion of post-filter reads that are control reads. The formula is: (total # of control reads)/(total # of post-filter reads).
Control Polymerase Read Length: The mean mapped read length of the polymerase reads from the control sample.
The mean mapped length of the sequencing reads from the control sample.
Control Reads: The total number of polymerase reads from the control sample that passed filtering.
The total number of control-sample sequencing reads that remain after filtering.
Control Subread Accuracy: The mean single-pass accuracy of the mapped polymerase reads from the control sample.
The mean single-pass accuracy of the mapped sequencing reads from the control sample.
Control Polymerase Read Length 95%: The 95th percentile of mapped read length of the polymerase reads from the control sample.
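The 95th percentile of mapped read length in the control sample: 95% of the mapped reads are no longer than this value. It is a length, not an alignment rate.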
SMRT Cell ID: ID number of the SMRT Cell(s) used in this run.
The ID number(s) of the SMRT Cell(s) used in this run.
Productive ZMWs: The number of ZMWs for this SMRT Cell that produced results with Productivity = 1.
The number of zero-mode waveguides (ZMWs) in this SMRT Cell that produced sequence results, i.e. those with Productivity = 1.
Productivity 0 (%): Percentage of ZMWs that are empty, with no polymerase.
The ZMW is empty: it was not loaded with a polymerase.
Productivity 1 (%): Percentage of ZMWs that are productive and sequencing.
The ZMW is loaded with a polymerase and is sequencing productively.
Productivity 2 (%): Percentage of ZMWs that are not P0 (empty) or P1 (productive). This may occur for a variety of reasons and the sequence data is not usable.
The ZMW is neither P0 (empty) nor P1 (productive). This can happen for many reasons, and the sequence data is not usable.
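The loading and control percentages above are plain ratios. A small sketch, assuming each ZMW has already been assigned a productivity class of 0, 1 or 2; the function names and the counts are hypothetical:

```python
from collections import Counter


def productivity_breakdown(zmw_states):
    """Return the P0/P1/P2 percentages for one SMRT Cell.

    zmw_states: iterable of 0, 1 or 2, one value per ZMW.
    """
    counts = Counter(zmw_states)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no ZMWs supplied")
    return {f"P{p}": 100.0 * counts.get(p, 0) / total for p in (0, 1, 2)}


def control_reads_pct(control_reads, post_filter_reads):
    """(total # of control reads) / (total # of post-filter reads), in %."""
    return 100.0 * control_reads / post_filter_reads


# Hypothetical cell with 10 ZMWs: 3 empty, 5 productive, 2 unusable.
states = [0] * 3 + [1] * 5 + [2] * 2
print(productivity_breakdown(states))
print(control_reads_pct(50, 2000))
```

The number of Productive ZMWs is simply the count behind the P1 percentage.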
Resequencing - Coverage Report
Coverage: The mean depth of coverage across the reference sequence.
The average coverage (mean sequencing depth) of the total sequencing data across the reference genome sequence.
Missing Bases (%): The percentage of the reference sequence that has zero coverage.
The share of the reference genome sequence that is not covered at all, i.e. where the sequencing depth is 0.
Post-Filter Reads: The number of reads that passed filtering.
The number of reads remaining after filtering.
Mapped Reads: The number of post-filter reads that mapped to the reference sequence.
Among the post-filter reads, the number that map to the reference genome sequence.
Mapped Subreads: The number of post-filter subreads that mapped to the reference sequence.
Among the post-filter subreads, the number that map to the reference genome sequence.
Mapped CCS Reads: The number of post-filter CCS reads that mapped to the reference sequence.
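CCS stands for circular consensus sequence: a consensus built by aligning the subreads from a single ZMW.
Here it means the number of post-filter CCS reads that map to the reference genome sequence.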
Mapped Subread Bases: The number of post-filter bases from all subreads that mapped to the reference sequence. This does not include adapters.
The total number of bases in the post-filter subreads that map to the reference genome sequence. Adapters are not included.
Mapped CCS Read Bases: The number of post-filter CCS read bases that mapped to the reference sequence. This does not include adapters.
The total number of bases in the post-filter CCS reads that map to the reference genome sequence. Adapters are not included.
Mapped Subread Accuracy: The mean accuracy of post-filter subreads that mapped to the reference sequence.
The mean accuracy of the post-filter subreads that map to the reference genome sequence.
Mapped CCS Read Accuracy: The mean accuracy of post-filter CCS reads that mapped to the reference sequence.
The mean accuracy of the post-filter CCS reads that map to the reference genome sequence.
Mapped Subread Length: The mean read length of post-filter subreads that mapped to the reference sequence. This does not include adapters.
The mean length of the post-filter subreads that map to the reference genome sequence. Adapters are not included.
Mapped Read Length of Insert: The mean read length of all insert sequences, which includes only mapped sequences. The read length of insert is approximately the longest subread length per ZMW.
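The mean length of all post-filter insert sequences that map to the reference genome sequence. Within a single ZMW, the insert length is approximately the length of that ZMW's longest subread.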
Mapped Polymerase Read Length: The mean read length of post-filter polymerase reads that mapped to the reference sequence. This includes adapters.
The mean length of the post-filter sequencing reads that map to the reference genome sequence. Polymerase reads include adapters.
Mapped Polymerase Read Length 95%: The 95th percentile of read length of post-filter polymerase reads that mapped to the reference sequence.
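The 95th percentile of the length of post-filter polymerase reads that map to the reference genome sequence: 95% of the mapped polymerase reads are no longer than this value. It is a length, not an alignment rate.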
Mapped Polymerase Read Length Max: The maximum read length of post-filter polymerase reads that mapped to the reference sequence.
The length of the longest post-filter polymerase read that maps to the reference genome sequence.
Mapped Full Subread Length: The average of the lengths of full subreads that mapped to the reference sequence. Full subreads are subreads flanked by two adapters.
The mean length of the post-filter full subreads that map to the reference genome sequence. A full subread is flanked by an adapter on both sides.
Reference: The name of the reference sequence.
Reference Length: The length of the reference sequence.
Bases Called (%): The percentage of reference sequence that has ≥ 1x coverage. % Bases Called + % Missing Bases should equal 100.
Consensus Accuracy: The accuracy of the consensus sequence compared to the reference.
Base Coverage: The mean depth of coverage across the reference sequence.
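Coverage, Missing Bases (%) and Bases Called (%) all derive from the per-position depth along the reference. A minimal sketch, assuming the depth at every reference position is already available as a list (the `coverage_summary` helper and the depth values are hypothetical):

```python
def coverage_summary(depths):
    """Summarize per-base depth across a reference sequence.

    depths: one non-negative integer per reference position.
    """
    n = len(depths)
    if n == 0:
        raise ValueError("empty reference")
    missing = sum(1 for d in depths if d == 0)
    return {
        "mean_coverage": sum(depths) / n,
        "missing_bases_pct": 100.0 * missing / n,
        "bases_called_pct": 100.0 * (n - missing) / n,
    }


# Hypothetical 10 bp reference with two uncovered positions.
depths = [0, 0, 10, 12, 8, 10, 11, 9, 10, 10]
summary = coverage_summary(depths)
print(summary)
```

Because every position is either covered at least once or not covered at all, the two percentages add up to 100, which is a quick sanity check on any report.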
Analysis - Top Variants Report
Sequence: The name of the reference sequence.
Position: The position of the variant along the reference sequence.
Variant: The variant position, type, and affected nucleotide.
Type: The variant type: Insertion, Deletion, or Substitution.
Coverage: The coverage at position.
Confidence: The confidence of the variant call.
Genotype: Includes the full number of chromosomes (diploid) or half the number (haploid).
Assembly Iterations: The number of iterations of overlap-layout-consensus performed by the de novo or hybrid assembly algorithm.
Assembly - Draft Assembly Report
Draft Contigs: The number of contigs output by Celera Assembler, which may include singleton and degenerate contigs. After assembly polishing with Quiver, the final number of contigs may be smaller.
N50 Contig Length: The length L such that 50% of all bases in the final contigs are in contigs of length L or greater.
Reads Assembled (%): The fraction of all reads that are assembled into contigs in the final assembly.
Max Contig Length: The length of the longest contig in the final assembly.
Sum of Contig Lengths: The sum of the lengths of all contigs in the final assembly.
Hybrid Assembly - Assembly Iterations Report
Input Contigs: The number of contigs used as input to the AHA algorithm.
Min Align Score: The minimum alignment score between a read and a contig to use the alignment for scaffolding.
Min Link Redundancy: The minimum number of reads that must link two contigs for those contigs to be connected in a scaffold.
Min Subread Length: The minimum length required for a subread to be used by the AHA algorithm.
Min Contig Length: The minimum length required for a contig to be used by the AHA algorithm.
Scaffolds Across Assembly Iterations: The number of scaffolds at a particular iteration of the AHA algorithm.
Linking Reads Across Assembly Iterations: The number of linking reads at a particular iteration of the AHA algorithm.
Hybrid Assembly - Final Assembly Report
Number: The number of scaffolds, contigs, or gaps in the initial or final assembly.
Max Length: The length of the longest scaffold, contig, or gap in the initial or final assembly.
N50 Length: The length L such that 50% of all bases in the initial or final scaffolds, contigs, or gaps are in scaffolds, contigs, or gaps of length L or greater.
Sum Length: The sum of the lengths of all scaffolds, contigs, or gaps in the initial or final assembly.
Initial Scaffolds: The distribution of the lengths of the scaffold sequences before completing the AHA algorithm. Scaffolds are composed of contigs optionally separated by gap sequences.
Final Scaffolds: The distribution of the lengths of the scaffold sequences after completing the AHA algorithm. Scaffolds are composed of contigs optionally separated by gap sequences.
Initial Contigs: The distribution of the lengths of the contig sequences before completing the AHA algorithm. Contigs are stretches of continuous sequence that do not contain gaps.
Final Contigs: The distribution of the lengths of the contig sequences after completing the AHA algorithm. Contigs are stretches of continuous sequence that do not contain gaps.
Initial Gaps: The distribution of the lengths of the gaps between contig sequences before completing the AHA algorithm.
Final Gaps: The distribution of the lengths of the gaps between contig sequences after completing the AHA algorithm.
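The relationship between scaffolds, contigs and gaps can be illustrated by splitting a scaffold sequence into its parts. This sketch assumes that gaps are written as runs of `N` in the scaffold FASTA, which is a common convention but is not stated in the report itself; `split_scaffold` is a hypothetical helper.

```python
import re


def split_scaffold(sequence):
    """Return (contig_lengths, gap_lengths) for one scaffold sequence.

    Assumption: a gap is any run of N/n; everything else is contig sequence.
    """
    contigs = [len(m.group()) for m in re.finditer(r"[^Nn]+", sequence)]
    gaps = [len(m.group()) for m in re.finditer(r"[Nn]+", sequence)]
    return contigs, gaps


# Hypothetical scaffold: three contigs separated by two gaps.
scaffold = "ACGTACGT" + "N" * 5 + "ACGT" + "N" * 3 + "ACGTAC"
contigs, gaps = split_scaffold(scaffold)
print(contigs, gaps)
```

Collecting these lengths over all scaffolds gives the contig and gap length distributions; the Number, Max Length and Sum Length columns are then the count, maximum and sum of each list.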
Base Modifications - Motifs Report
Motif: The nucleotide sequence of the methyltransferase recognition motif, using the standard IUPAC nucleotide alphabet.
Modified Position: The position within the motif that is modified. The first base is "1". Example: The modified adenine in GATC is at position 2.
Modification Type: The type of chemical modification most commonly identified at that motif. These are: 6mA, 4mC, 5mC, or modified_base (a modification not recognized by the software).
% Motifs Detected: The percentage of times that this motif was detected as modified across the entire genome.
# Of Motifs Detected: The number of times that this motif was detected as modified across the entire genome.
# Of Motifs In Genome: The number of times this motif occurs in the genome.
Mean Modification QV: The mean modification QV for all instances where this motif was detected as modified.
Mean Motif Coverage: The mean coverage for all instances where this motif was detected as modified.
Partner Motif: For motifs that are not self-palindromic, this is the complementary sequence.
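Several of these columns can be checked by hand. The sketch below assumes that the "complementary sequence" of a partner motif means the reverse complement read 5' to 3', and that % Motifs Detected is the detected count divided by the genome count; the function names and the counts are hypothetical.

```python
# Complement table for the standard IUPAC nucleotide alphabet.
IUPAC_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}


def reverse_complement(motif):
    """Reverse complement of a motif written in IUPAC codes."""
    return "".join(IUPAC_COMPLEMENT[b] for b in reversed(motif.upper()))


def is_self_palindromic(motif):
    """True when the motif equals its own reverse complement."""
    return reverse_complement(motif) == motif.upper()


def modified_base(motif, position):
    """Base at the 1-based modified position within the motif."""
    return motif[position - 1]


def pct_motifs_detected(detected, in_genome):
    """Detected instances as a percentage of all instances in the genome."""
    return 100.0 * detected / in_genome


print(modified_base("GATC", 2))        # the modified adenine
print(is_self_palindromic("GATC"))     # no partner motif needed
print(reverse_complement("GAAGG"))     # partner of a non-palindromic motif
print(pct_motifs_detected(950, 1000))  # hypothetical counts
```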
Assembly - Pre-Assembly Report
Seed Bases: The number of bases from seed reads.
Pre-Assembled Yield: The percentage of seed read bases that were successfully aligned to generate pre-assembled reads.
Pre-Assembled Read Length: The average length of the pre-assembled reads.
Length Cutoff: Reads with lengths greater than the length cutoff are used as seed reads for pre-assembly.
Pre-Assembled Bases: The number of bases in the pre-assembled reads.
Pre-Assembled Reads: The number of reads output by the pre-assembler. Pre-assembled reads are very long, highly accurate reads that can be used as input to a de novo assembler.
Pre-Assembled N50: The N50 read length of the pre-assembled reads.
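The seed selection and the yield can be sketched as follows. This assumes that Pre-Assembled Yield is the number of pre-assembled bases divided by the number of seed bases, which is one reading of the definition above; the function names, read lengths and base counts are hypothetical.

```python
def seed_stats(read_lengths, length_cutoff):
    """Select seed reads (length > cutoff) and total their bases."""
    seeds = [n for n in read_lengths if n > length_cutoff]
    return {"seed_reads": len(seeds), "seed_bases": sum(seeds)}


def pre_assembled_yield(pre_assembled_bases, seed_bases):
    """Pre-assembled bases as a percentage of seed bases (assumed formula)."""
    return 100.0 * pre_assembled_bases / seed_bases


# Hypothetical read lengths (bp) and a hypothetical 6 kb length cutoff.
lengths = [1500, 4000, 6500, 8000, 12000]
stats = seed_stats(lengths, 6000)
print(stats)
print(pre_assembled_yield(21200, stats["seed_bases"]))
```

Raising the length cutoff leaves fewer but longer seed reads, so the cutoff trades seed coverage against pre-assembled read length.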
To be continued.