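Data Analysis for the Life Sciences is the textbook for Harvard's PH525x series, Biomedical Data Science. The whole course uses R for both statistical theory and hands-on analysis. The book is written in R Markdown, which keeps it easy to read while keeping the analyses reproducible, in line with the current push for reproducible computation in science. It offers a systematic route through the statistics a biologist needs, lets you implement each analysis yourself in code, and meets the reproducible-code standards expected for publication in CNS (Cell, Nature, Science) journals.
Traditional statistics texts focus on mathematical principles, whereas this book focuses on carrying out data analysis on a computer. It explains the mathematics through worked examples and provides code so that readers can run each analysis themselves. Because the entire text is written in R Markdown, readers can reproduce every analysis.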
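About the authors:
Rafael A Irizarry is a professor of biostatistics and computational biology at the Harvard School of Public Health and the Dana-Farber Cancer Institute, with 17 years of experience analyzing genomic data.
Michael I Love is an assistant professor in the Departments of Biostatistics and Genetics at the University of North Carolina at Chapel Hill. His research uses statistical models to uncover biological patterns in genomic data, and he develops open-source statistical software for Bioconductor.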
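Course source code: https://github.com/genomicsclass/labs (all course source code, test data, and results)
Web version of the book: https://genomicsclass.github.io/book/ (the rendered Rmd pages, plus per-section navigation and download links for the Rmd source)
E-book: https://leanpub.com/dataanalysisforthelifesciences/ (downloadable in several formats for reading on mobile devices)
Interestingly, you can choose to get the e-book for free or pay the authors up to $80.
Course outline: https://genomicsclass.github.io/book/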
PH525x series - Biomedical Data Science

Links and resources
  R markdown source files
  ePub version on Leanpub
  Links to the HarvardX class pages
  External resources and books
  Finding more help for data analysis

Chapter 0 - Introduction
  Introduction [Rmd]
  Getting started [Rmd]
  Getting started exercises
  dplyr introduction [Rmd]
  dplyr introduction exercises
  Mathematical notation [Rmd]

Chapter 1 - Inference
  Random variables [Rmd]
  Random variables exercises
  Populations and samples [Rmd]
  Populations and samples exercises
  CLT and t-distribution [Rmd]
  CLT and t-distribution exercises
  CLT in practice [Rmd]
  CLT in practice exercises
  t-test in practice [Rmd]
  Confidence intervals [Rmd]
  Power calculations [Rmd]
  Power calculations exercises
  Monte carlo [Rmd]
  Monte carlo exercises
  Permutation tests [Rmd]
  Permutation tests exercises
  Association tests [Rmd]
  Association tests exercises

Chapter 2 - Exploratory Data Analysis
  Exploratory data analysis [Rmd]
  Plots to avoid [Rmd]
  Exploratory data analysis exercises

Chapter 3 - Robust Statistics
  Robust summaries [Rmd]
  Rank tests [Rmd]
  Robust summaries exercises

Chapter 4 - Matrix Algebra
  Introduction to using regression [Rmd]
  Introduction to using regression exercises
  Matrix notation [Rmd]
  Matrix notation exercises
  Matrix operations [Rmd]
  Matrix operations exercises
  Matrix algebra examples [Rmd]
  Matrix algebra examples exercises

Chapter 5 - Linear Models
  Linear models introduction [Rmd]
  Linear models introduction exercises
  Expressing design formula [Rmd]
  Expressing design formula exercises
  Linear models in practice [Rmd]
  Linear models in practice exercises
  Standard errors [Rmd]
  Standard errors exercises
  Interactions and contrasts [Rmd]
  Interactions and contrasts exercises
  Collinearity [Rmd]
  Collinearity exercises
  QR and regression [Rmd]
  Linear models going further [Rmd]

Chapter 6 - Inference for High-Dimensional Data
  Introduction to high-throughput data [Rmd]
  Introduction to high-throughput data exercises
  Inference for high-throughput data [Rmd]
  Inference for high-throughput data exercises
  Multiple testing [Rmd]
  Multiple testing exercises
  EDA for high-throughput data [Rmd]
  EDA for high-throughput data exercises

Chapter 7 - Statistical Modeling
  Modeling [Rmd]
  Modeling exercises
  Bayes theorem [Rmd]
  Bayes theorem exercises
  Hierarchical models [Rmd]
  Hierarchical models exercises

Chapter 8 - Distance and Dimension Reduction
  Distance [Rmd]
  Distance exercises
  PCA motivation [Rmd]
  SVD [Rmd]
  SVD exercises
  Projections [Rmd]
  Rotations [Rmd]
  MDS [Rmd]
  MDS exercises
  PCA [Rmd]

Chapter 9 - Practical Machine Learning
  Clustering and heatmaps [Rmd]
  Clustering and heatmaps exercises
  Conditional expectation [Rmd]
  Conditional expectation exercises
  Smoothing [Rmd]
  Smoothing exercises
  Machine learning [Rmd]
  Crossvalidation [Rmd]
  Crossvalidation exercises

Chapter 10 - Batch Effects
  Introduction to batch effects [Rmd]
  Confounding [Rmd]
  Confounding exercises
  EDA with PCA [Rmd]
  EDA with PCA exercises
  Adjusting with linear models [Rmd]
  Adjusting with linear models exercises
  Factor analysis [Rmd]
  Factor analysis exercises
  Adjusting with factor analysis [Rmd]
  Adjusting with factor analysis exercises

Chapter 11 - Introduction to Bioconductor
  Mike Love's general reference card
  Motivations and core values (optional)
  Installing Bioconductor and finding help [Rmd]
  Data structure and management for genome scale experiments [Rmd]
    Coordinating multiple tables: ExpressionSet
    Institutional archives: GEO, ArrayExpress
    Interlude: Working with general genomic features using GenomicRanges
    IRanges introduced
    Intra-range operations
    Inter-range operations
    GRanges
    Calculating overlaps
    Range-oriented solutions for current experimental paradigms
    SummarizedExperiment: for RNA-seq and 450k methylation
    External storage for very large assays
    GenomicFiles for families of BAM or BED
    DNA Variants: VCF handling with VariantAnnotation and VariantTools
    Handling multiomic archives like TCGA
    Cloud-oriented solutions: e.g., Google BigQuery
  Short read mapping/alignment software (optional) [Rmd]

Chapter 12 - Genomic Annotation with Bioconductor
  More details on GRanges [Rmd]
    Run-length encoding, views
    Application to genomic landmarks
    Application to 450k methylation array visualization
  General overview of Bioconductor annotation [Rmd]
    Levels: reference sequence, regions of interest, pathways
    Discovering reference sequence
    A build of the human genome
    Gene/Transcript/Exon catalogs from UCSC and Ensembl
    Importing and exporting regions and scores
    AnnotationHub: brokering thousands of annotation resources
    OrgDb: simple interface to annotation databases
    Finding and managing gene sets
    OrganismDb: unifying diverse annotation
  Cheat sheet on Bioconductor annotation [Rmd]
  Translating addresses between genome builds: liftOver [Rmd]

Chapter 13 - Genome-scale hypothesis testing with Bioconductor
  Distinguishing biological and technical variability [Rmd]
    An experiment with pooled and individual samples
    Measuring technical variation
    Measuring biological variation
    Interpretation
  Multiple comparisons with genewise t-tests [Rmd]
    Gene-wise testing
    Naive enumeration of genes
    Demonstrating danger of multiple testing with a set of sham comparisons
    Adjusting for multiplicity with qvalue
    Adjusted counts in the sham case
  Moderated t tests via limma [Rmd]
    A spike-in dataset
    Naive t-tests
    Three steps with limma: lmFit, eBayes, topTable
    Exposing the spiked-in genes
    A view of the shrinkage of variance estimates
  Introducing gene sets and gene set analysis [Rmd]
    Data wrangling
    A dataset for comparing expression by gender
    Finding surrogate variables/batch effect correction
    The Broad Institute MsigDb
    Identifier remapping
    Categorical testing
    Statistical summaries for sets: Wilcoxon
    Statistical summaries for sets: t statistics
    Adjusting for within-set correlation
    A permutation procedure

Chapter 14 - Visualization of genome scale data
  A basic overview of visualization tasks and strategies [Rmd]
    Gene models
    Gene models plus data
    Driving visualizations with functions
    Using the browser to drive visualization functions via shiny
    Queriable dynamic displays with plotly
  Annotation-oriented visualizations
    Sketching the binding landscape over chromosomes with ggbio's karyogram layout [Rmd]
    Plotting data in the context of genomic features with Gviz [Rmd]
    Visualizing NGS data [Rmd]
  Interactive visualization
    Graphical user interfaces for multivariate data with shiny [Rmd]
    Clustering gene expression data with shiny [Rmd]
  Final remarks on visualization [Rmd]

Chapter 15 - Pursuing scalability in genomic analysis: parallelism and out-of-memory data
  Parallel computing with R and Bioconductor [Rmd]
    Demonstrating simple speedup in multicore environments
    Implicit parallelism with BiocParallel and GenomicAlignments
  External data: data interfaces that spare RAM [Rmd]
    SQLite for annotation
    Tabix-indexed BAM
    HDF5
  An illustration of NoSQL with S4: mongodb and RaggedMongoExpt [Rmd]
  Benchmarking various out-of-memory solutions [Rmd]
  Introduction to Bioconductor's Amazon Machine Instance for cluster creation and use in EC2 [Rmd]
  Sharded GRanges for scalable integrative analysis [Rmd]

Chapter 16 - Multi-omic data integration
  Basic examples of multi-omic integration [Rmd]
    Transcription factor (TF) binding and gene coexpression in yeast
    TF binding and GWAS hits in humans
  Using RTCGAToolbox outputs to integrate clinical, mutation, expression and methylation assays [Rmd]
    Basic data acquisition
    Working with clinical data
    Defining a severity marker
    Extracting survival times
    Working with mutations
    Curation tasks for discrepant identifier formats
    Working with expression data
    Associating tumor stage with expression patterns
    Linking DNA methylation with expression patterns
  Application to visualization: kataegis and rainfall plot [Rmd]

Chapter 17 - Fostering reproducible genome-scale analysis
  Overview of unit on reproducibility [Rmd]
    Basic definitions
    Infrastructure requirements
    Statistical aspects of reproducibility
    Analysis of reproducibility probability (Boos and Stefanski 2011)
    Costs of highly reproducible designs
  Package structure, creation, installation, management [Rmd]
    What is a package?
    Using package.skeleton
    Using makeOrganismPackage
    Using devtools create() to set up folders and DESCRIPTION
    Composing documentation plus code document(), install()
  Conclusions, including a link to a recent Nature Toolbox article on Bioconductor

How to study
We chose to read the web version of the book online and practice alongside the source code.
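https://genomicsclass.github.io/book/ can be read section by section. There is a lot of material, so readers can simply pick the chapters that suit them.
Every hands-on section comes with its Rmd source code; download it and open it in a local RStudio.
Bulk-downloading all resources
Windows download: https://github.com/genomicsclass/labs/archive/master.zip
On Linux, download with git or wget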
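# Method 1: download the zip archive; unzipping produces the labs-master directory
wget -c https://github.com/genomicsclass/labs/archive/master.zip
unzip master.zip
# Method 2: clone into the labs directory (HTTPS works without a GitHub SSH key;
# the SSH form is git@github.com:genomicsclass/labs.git)
git clone https://github.com/genomicsclass/labs.git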
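Once the download finishes, it can help to see which Rmd files actually arrived before opening them in RStudio. The following is a minimal sketch, not part of the course materials: the helper names `list_rmd` and `count_rmd_by_folder` are hypothetical, and it assumes the repository was unpacked to `labs-master` (Method 1) or cloned to `labs` (Method 2).

```shell
# Hypothetical helper: list every R Markdown source under a downloaded copy of the labs.
# Usage: list_rmd labs-master   (or: list_rmd labs)
list_rmd() {
  dir="${1:-labs-master}"
  find "$dir" -type f -name '*.Rmd' | sort
}

# Hypothetical helper: count the .Rmd files inside each top-level folder.
# Files sitting directly in the root of the download are skipped (NF > 2).
count_rmd_by_folder() {
  dir="${1:-labs-master}"
  ( cd "$dir" && find . -type f -name '*.Rmd' ) \
    | awk -F/ 'NF > 2 { n[$2]++ } END { for (d in n) print d, n[d] }' \
    | sort
}
```

For example, `count_rmd_by_folder labs` prints one line per folder with the number of Rmd files it contains, which makes it easier to pick a chapter to start with.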
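To repost this article, please contact the original author for permission and state that it comes from Liu Yongxin's ScienceNet blog. Link: https://blog.sciencenet.cn/blog-3334560-1131943.html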