Shared by 大工至善|大学至真 http://blog.sciencenet.cn/u/lcj2212916


[Repost] [Computer Science] [2020.05] Computational Protein Structure Prediction Using Deep Learning

186 views | 2020-9-17 17:09 | Category: Research Notes | Source: Repost

This is a doctoral dissertation from the University of Missouri, USA (author: Zhaoyu Li), 136 pages in total.

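Protein structure prediction is of great importance in bioinformatics and computational biology. Over the past 30 years, many machine learning methods have been developed for it, within both homology-based and ab initio approaches. In recent years, deep learning has been applied successfully and has outperformed earlier methods. In modeling the complex mapping from a protein's primary amino acid sequence to its 2-D or 3-D structure, deep learning methods can handle high-dimensional feature inputs effectively.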

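This dissertation proposes new deep learning methods and networks for three problems in protein structure prediction: loop modeling, contact map prediction, and contact map refinement. They have been implemented in the state-of-the-art MUFOLD software and yield significant performance improvements. Loop modeling aims to predict the conformation of a relatively short stretch of protein backbone; for this task a new method based on a generative adversarial network (GAN), MUFOLD-LM, is proposed. A protein's 3-D structure can be represented by the 2-D distance map of its Cα atoms, so a missing region in the structure becomes a missing region in the distance map. The generator network fills in the missing regions of the distance map from the surrounding context, and the discriminator network takes both the predicted complete distance map and the ground truth as input and learns to tell them apart. By using both the features and the context of the missing loop region, the method predicts the 3-D structure of the loop better. In experiments on the commonly used 8-Res and 12-Res benchmark datasets, MUFOLD-LM significantly outperformed previous methods, by up to 43.9% and 4.13% in RMSD, respectively. To the best of the author's knowledge, this is the first successful application of a GAN to protein structure prediction.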
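The loop results above are reported in RMSD (root-mean-square deviation), where lower is better. As a rough illustration of the metric only, the sketch below computes a plain coordinate RMSD between a predicted and a reference loop. It assumes both coordinate sets are already expressed in the same frame and does no superposition; the function name `rmsd` is hypothetical, and the dissertation's exact evaluation protocol is not given in this post.

```python
import numpy as np


def rmsd(pred, ref):
    """Plain coordinate RMSD between two (N, 3) arrays.

    Assumption: both structures are already in the same reference frame,
    so no superposition (e.g. Kabsch alignment) is performed here.
    """
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    if pred.shape != ref.shape:
        raise ValueError("coordinate arrays must have the same shape")
    # Squared Euclidean distance per atom, then mean, then square root.
    sq_dist = np.sum((pred - ref) ** 2, axis=1)
    return float(np.sqrt(sq_dist.mean()))
```

If the two structures are not pre-aligned, a superposition step would have to come first; that step is left out here to keep the sketch short.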


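Contact map prediction aims to predict whether the distance between two Cβ atoms (Cα for glycine) in a protein falls within a given threshold. It helps determine the protein's overall structure and so assists the 3-D modeling process. The dissertation proposes MUFOLD_Contact, a new two-stage multi-branch neural network built on a fully convolutional network and a dilated residual network, which formulates the task as a pixel-wise regression and classification problem. The first stage predicts distance maps for short-, medium-, and long-range residue pairs. The second stage takes the distances predicted in the first stage, together with other features, as input and predicts a binary contact map. The method uses the distance distribution information in the feature set to improve the binary predictions. In experiments on CASP13 targets, the new method outperformed single-stage networks and was comparable to the best existing tools. Beyond predicting contacts directly with deep neural networks, a further method, TPCref (Template Prediction Correction refinement), is proposed to refine the output of a contact predictor using protein templates. Borrowing the idea of collaborative filtering from recommender systems, TPCref first finds multiple template sequences for the target sequence, then uses the templates' structures and the contact maps that a contact predictor produces for those templates to build a contact-map filter for the target, and finally applies that filter to refine the predicted contact map. In experiments on recently released PDB proteins, TPCref significantly improved the results of existing predictors, raising MUFOLD_Contact, MetaPSICOV, and CCMPred by 5.0%, 12.8%, and 37.2%, respectively.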

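The proposed methods have been implemented in MUFOLD, a comprehensive platform for protein structure prediction. It offers a rich set of functions, including database generation, secondary and supersecondary structure prediction, beta-turn and gamma-turn prediction, contact map prediction and refinement, protein 3-D structure prediction, loop modeling, model quality assessment, and model refinement. The dissertation designs and develops a new modular MUFOLD pipeline in which the modules are decoupled from one another and expose standard communication protocol interfaces that other programs can call. This modularity makes it easy to integrate new algorithms and tools and so to iterate quickly during research. In addition, a new web portal for MUFOLD has been designed and implemented to offer the tools to the community as online services or APIs.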

Protein structure prediction is of great importance in bioinformatics and computational biology. Over the past 30 years, many machine learning methods have been developed for this problem in homology-based and ab-initio approaches. Recently, deep learning has been successfully applied and has outperformed previous methods. Deep learning methods could effectively handle high dimensional feature inputs in modeling the complex mapping from protein primary amino acid sequences to protein 2-D or 3-D structures. In this dissertation, new deep learning methods and deep learning networks have been proposed for three problems in protein structure prediction: loop modeling, contact map prediction, and contact map refinement. They have been implemented in the state-of-the-art MUFOLD software and obtained significant performance improvement.

The goal of loop modeling is to predict the conformation of a relatively short stretch of protein backbone. A new method based on Generative Adversarial Network (GAN), called MUFOLD-LM, is proposed. The protein 3-D structure can be represented using the 2-D distance map of Cα atoms. The missing region in the structure will be a missing region in the distance map correspondingly. Our network uses the Generator Network to fill in the missing regions in the distance map based on the context, and the Discriminator Network will take both the predicted complete distance map and the ground truth as input to distinguish between them. The method utilizes both the features and context of the missing loop region to make better prediction of the 3-D structure of the loop region. In experiments using commonly used benchmark datasets 8-Res and 12-Res, MUFOLD-LM outperformed previous methods significantly, up to 43.9% and 4.13% in RMSD, respectively. To the best of our knowledge, it is the first successful GAN application in protein structure prediction.
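To make the distance-map representation concrete, here is a minimal NumPy sketch. It builds the pairwise Cα distance map from an (N, 3) coordinate array and blanks out the rows and columns of a missing loop, which is the region a generator would be asked to fill in. The function names `distance_map` and `mask_loop`, and the use of NaN to mark missing entries, are illustrative assumptions and are not taken from MUFOLD-LM.

```python
import numpy as np


def distance_map(coords):
    """Pairwise Euclidean distances between N atoms, as an (N, N) matrix."""
    coords = np.asarray(coords, dtype=float)
    # Broadcasting yields an (N, N, 3) array of coordinate differences.
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))


def mask_loop(dmap, start, end):
    """Mark residues start..end-1 as missing.

    Returns a copy of the map with the affected rows and columns set to NaN
    (an assumed placeholder), plus the boolean mask of missing entries.
    """
    missing = np.zeros(dmap.shape, dtype=bool)
    missing[start:end, :] = True
    missing[:, start:end] = True
    masked = dmap.copy()
    masked[missing] = np.nan
    return masked, missing
```

A missing stretch of residues shows up as a cross-shaped band in the matrix, which is why the task resembles image inpainting: the known entries around the band supply the context.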

The goal of contact map prediction is to predict whether the distance between two Cβ atoms (Cα for Glycine) in a protein falls within a certain threshold. It can help to determine the global structure of a protein in order to assist the 3D modeling process. In this work, a new two-stage multi-branch neural network based on Fully Convolutional Network and Dilated Residual Network, called MUFOLD_Contact, is proposed. It formulates the problem as a pixel-wise regression and classification problem. The first stage predicts distance maps for short-, medium-, and long-range residue pairs. The second stage takes the predicted distances from stage 1 along with other features as input to predict a binary contact map. The method utilizes the distance distribution information in the feature set to improve the binary prediction results. In experiments using CASP13 targets, the new method outperformed single stage networks and is comparable with the best existing tools. In addition to predicting contact directly using deep neural networks, a new method, called TPCref (Template Prediction Correction refinement), is proposed to refine and improve the prediction results of a contact predictor using protein templates. Based on the idea of collaborative filtering from recommendation system, TPCref first finds multiple template sequences based on the target sequence and uses the templates' structures and the templates' predicted contact map generated by a contact predictor to form a target contact-map filter using the idea of collaborative filtering. Then the contact-map filter is used to refine the predicted contact map. In experimental results using recently released PDB proteins, TPCref significantly improved the contact prediction results of existing predictors, improving MUFOLD_Contact, MetaPSICOV, and CCMPred by 5.0%, 12.8%, and 37.2%, respectively.
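The relationship between a distance map and a binary contact map can be sketched in a few lines. The abstract only says "a certain threshold" and names short-, medium-, and long-range pairs without giving cut-offs, so the values below are assumptions: an 8 Å threshold and sequence separations of 6 to 11, 12 to 23, and 24 or more, which are conventions commonly used in CASP-style contact evaluation. The function names are hypothetical.

```python
import numpy as np


def contact_map(dist, threshold=8.0):
    """Binary contact map: True where the pairwise distance is below threshold.

    The 8.0 default (in angstroms) is an assumed, commonly used cut-off.
    """
    dist = np.asarray(dist, dtype=float)
    return dist < threshold


def range_masks(n, short=6, medium=12, long_=24):
    """Boolean (n, n) masks selecting residue pairs by sequence separation.

    The separation boundaries are assumed conventions, not values from
    the dissertation.
    """
    idx = np.arange(n)
    sep = np.abs(idx[:, None] - idx[None, :])
    return {
        "short": (sep >= short) & (sep < medium),
        "medium": (sep >= medium) & (sep < long_),
        "long": sep >= long_,
    }
```

Under these assumptions, combining `contact_map(dist) & range_masks(n)["long"]` would select the long-range contacts, the category usually regarded as the most informative for 3-D modeling.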

The proposed new methods have been implemented in MUFOLD, a comprehensive platform for protein structure prediction. It provides a rich set of functions, including database generation, secondary and supersecondary structure prediction, beta-turn and gamma-turn prediction, contact map prediction and refinement, protein 3D structure prediction, loop modeling, model quality assessment, and model refinement. In this work, a new modularized MUFOLD pipeline has been designed and developed. Each module is decoupled from each other and provides standard communication protocol interfaces for other programs to call. The modularization provides the capability to easily integrate new algorithms and tools to have a fast iteration during research. In addition, a new web portal for MUFOLD has been designed and implemented to provide online services or APIs of our tools to the community.

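1. Introduction

2. Background and Related Work

3. MUFOLD-LM: Protein Loop Modeling Using Generative Adversarial Networks

4. MUFOLD-CONTACT: Protein Residue-Residue Contact Map Prediction

5. TPCREF: Refinement of Predicted Contact Maps

6. MUFOLD Platform Development

7. Summary and Outlook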






http://blog.sciencenet.cn/blog-69686-1250982.html
