Computational methods for comparing and integrating multiple probing assays to predict RNA secondary structure

Autor:	Saaidi, Afaf, Allouche, Delphine, Ponty, Yann, Sargueil, Bruno, Regnier, Mireille
Přispěvatelé:	Laboratoire d'informatique de l'École polytechnique [Palaiseau] ( LIX ), Centre National de la Recherche Scientifique ( CNRS ) -École polytechnique ( X ), Algorithms and Models for Integrative Biology ( AMIB ), Centre National de la Recherche Scientifique ( CNRS ) -École polytechnique ( X ) -Centre National de la Recherche Scientifique ( CNRS ) -École polytechnique ( X ) -Laboratoire de Recherche en Informatique ( LRI ), Université Paris-Sud - Paris 11 ( UP11 ) -Institut National de Recherche en Informatique et en Automatique ( Inria ) -CentraleSupélec-Centre National de la Recherche Scientifique ( CNRS ) -Université Paris-Sud - Paris 11 ( UP11 ) -Institut National de Recherche en Informatique et en Automatique ( Inria ) -CentraleSupélec-Centre National de la Recherche Scientifique ( CNRS ) -Inria Saclay - Ile de France, Institut National de Recherche en Informatique et en Automatique ( Inria ), Laboratoire de cristallographie et RMN biologiques ( LCRB - UMR 8015 ), Université Paris Descartes - Paris 5 ( UPD5 ) -Centre National de la Recherche Scientifique ( CNRS ), Laboratoire d'informatique de l'École polytechnique [Palaiseau] (LIX), Centre National de la Recherche Scientifique (CNRS)-École polytechnique (X), Algorithms and Models for Integrative Biology (AMIB ), Centre National de la Recherche Scientifique (CNRS)-École polytechnique (X)-Centre National de la Recherche Scientifique (CNRS)-École polytechnique (X)-Laboratoire de Recherche en Informatique (LRI), CentraleSupélec-Université Paris-Sud - Paris 11 (UP11)-Centre National de la Recherche Scientifique (CNRS)-CentraleSupélec-Université Paris-Sud - Paris 11 (UP11)-Centre National de la Recherche Scientifique (CNRS)-Inria Saclay - Ile de France, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Laboratoire de cristallographie et RMN biologiques (LCRB - UMR 8015), Université Paris Descartes - Paris 5 (UPD5)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS), École polytechnique (X)-Centre National de la Recherche Scientifique (CNRS), École polytechnique (X)-Centre National de la Recherche Scientifique (CNRS)-École polytechnique (X)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Inria Saclay - Ile de France, Centre National de la Recherche Scientifique (CNRS)-Université Paris Descartes - Paris 5 (UPD5)
Jazyk:	angličtina
Rok vydání:	2016
Předmět:	Probing data [ INFO.INFO-BI ] Computer Science [cs]/Bioinformatics [q-bio.QM] [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM] RNA secondary structure
Zdroj:	Doctorial journey Interface, Ecole polytechnique, Palaiseau Doctorial journey Interface, Ecole polytechnique, Palaiseau, Nov 2016, Palaiseau, France. 10
Popis:	International audience; 1- Introduction:RNA structure is a key to understand retroviruses’s mechanisms e.g. HIV. Many prediction approaches suggesting accurate structures are available but they could be further improved by both taking advantage of Next generation sequencing technology and new experimental techniques (Enzymatic and SHAPE).2 - Experimental probing dataIn this poster, we present an integrative approach based on using many experimental data, resulting from sequencing, to direct predictions with the aim to find an accurate structure lying in the intersection of different sources of experiments. From one side, to reveal single nucleotide, reactivity profiles resulting from a SHAPE technology were used as “soft constraints”, meaning that the reactivity values were translated into pseudo-energies as described (Lorenz et al, 2016). From the other side, RNAses cleavage was used with two enzymes V and T targeting respectively paired and unpaired nucleotides. Reactivity scores resulting from those two experiments are used as hard constraints, forcing positions that exceed a specific threshold to be paired(case of Enzymatic-V) or unpaired(case of Enzymatic-T).3-stochastic sampling:At the thermodynamic equilibrium, a given RNA can have many alternative structures, where each structure could be characterized by a probability within the space of all the possible conformations (Boltzmann ensemble). This probability is related to the energy of the structure, the highest the energy needed to break pairs present in the structure the highest is its probability in the ensemble.We admit that the optimal structure(s) should be energetically stable and supported by several experimental data. For this reason, we coupled a stochastic sampling from the Boltzmann ensembles associated with the experimentally derived constraints, with a clustering across experimental conditions, to generate a structural models that are well-supported by available data.4-The work-flow description: 1. Experimental data from different conditions(SHAPE,Enzymatic-T, Enzymatic-V) were analysed to extract reactivity profiles that will serve as constraints. 2. We sampled 2000 structures per condition: We perform a Boltzmann sampling (Ding et al, 2005) to generate a predefined number of stable structures, compatible with the constraints derived for each condition. We used the stochastic sampling mode of RNAsubopt (-p option) to generate energetically stable structures that are either fully compliant with constraints derived from enzymatic data (hard constraints(Mathews et al,2004)), or constitute reasonable trade-offs between thermodynamic stability and compatibility with SHAPE data (soft constraints, using thepseudo-potentials of Deigan et al. (Deigan et al, 2009), see (Lorenz et al,2016) for details). 3. We merge the structures while keeping labels to retain the origin of each structure. 4. In order to detect structures with affinity to each other, The merged sets of models were clustered, using the base-pair distance as a measure of dissimilarity, the distance between two structures corresponds to the number of base pairs needed to break and to build in order to go from a structure to an other. A clustering algorithms (affinity propagation (Wang et al, 2007) implemented in the scikit-learn Python package (Pedregosa et al, 2011) is used to agglomerate and identify recurrent structures. One of the advantages of affinity propagation resides in its low computational requirements. 5. The next step consists on identifying clusters that are homogeneous, stable and well supported by experimental evidences, leading to the identification of the following objective criteria:-Present conditions that informs as about the diversity of the cluster: Our primary target are clusters compatible with multiple experimental conditions. However, the larger sampled sets required for reproducibility tend to populate each cluster with structures fromall conditions. We thus associated with each cluster the number of represented conditions, defined as the number of conditions for whichthe accumulated Boltzmann probability in the cluster exceeds a predefined threshold.-Boltzmann weight that is a measure of stability: Structures that are found in a given cluster may be unstable, and should be treated asoutliers. For this reason, we computed the cumulated normalized Boltzmann probabilities within the cluster, to favor stable clustersconsisting of stable structures;- Average Cluster Distance to count for coherence: We observed a general tendency of clustering algorithms to create heterogeneousclusters when faced with noisy data. We thus associated with each cluster the mean distance between pairs of structures, estimated asthe average distance to the MEA (Lu et al, 2009) for the sake of efficiency, in order to neglect clusters that were too diverse.6. The next steps consist on choosing cluster(s) with high coherence, diversity and stability. for this purpose we restricted our analysis to clusters that were found on the 3D Pareto Frontier (Mattson Messac, 2005) with respect to the three mentioned above criteria .7. After detecting the optimal Pareto cluster(s), we need to identify representative structure for each cluster. We chose the maximum expectedaccuracy (MEA) structure (Lu et al, 2009) as the representative structure for each cluster, which is defined as the secondary structure whose structural elements have highest accumulated Boltzmann probability within the cluster.5 Results:This resulted in 2 structures which we narrowed down to a single candidate using compatibility with the 1M7 SHAPE data as a final discriminatory criterion.
Databáze:	OpenAIRE
Externí odkaz:	https://explore.openaire.eu/search/publication?articleId=dedup_wf_001::be13fe6d0a2fb80936f0b0ec3d2776bb https://hal.archives-ouvertes.fr/hal-01558227 Zobrazit plný text záznamu