Defining the statistical metrics of a pangenome=Καθορίζοντας τις στατιστικές μετρικές ενός Πανγονιδιώματος

Asterios Mpatziakas

Abstract

Advances in sequencing techniques have massively increased the amount of publicly accessible genome data, enabling more extensive research on genome diversity at increasing levels of detail. The concept of the pangenome refers to the union of gene families shared by a set of genomes. Several studies have implemented pangenome analyses for a variety of organisms, ranging from microbes to viruses and plants, leading to genomic projects of various scales. These projects have advanced the general understanding of evolutionary mechanisms and produced usable knowledge across sectors such as health, medicine and agriculture. A pangenome analysis can be described as the identification and construction of three distinct subsets of gene families: the Core genome, consisting of all gene families shared amongst all genomes; the Dispensable or Accessory genome, consisting of gene families present in the majority of the genomes; and the Peripheral or Cloud genome, consisting of genes present in only one genome. Other names and overlapping definitions have been used in the literature to describe a pangenome. However, the essential part of this type of analysis is the use of data in an encompassing way, instead of the traditionally linear approaches evident in targeted genome studies. A variety of tools is currently available, covering several computational aspects of the pangenome approach; most are aimed primarily at the study of prokaryote genomes. We present pasaR, a package written for the statistical programming language R, which is usable in the later stages of such an analysis, i.e. after the construction of the gene families for a given set of genomes, and is based on information about the full complement of gene families.
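The three subsets described above can be illustrated with a minimal Python sketch. This is not the pasaR implementation; the function name `partition_gene_families`, the dictionary input format and the toy data are assumptions made for illustration. In particular, the sketch labels every family present in at least two genomes, but not in all of them, as accessory, which is a simplification of the definition given in the abstract.

```python
from typing import Dict, List


def partition_gene_families(presence: Dict[str, List[int]]) -> Dict[str, List[str]]:
    """Split gene families into core, accessory and cloud subsets.

    `presence` maps a gene family name to one 0/1 flag per genome
    (hypothetical input format, chosen for this sketch only).
    """
    groups: Dict[str, List[str]] = {"core": [], "accessory": [], "cloud": []}
    for family, row in presence.items():
        n_genomes = len(row)
        n_present = sum(1 for flag in row if flag > 0)
        if n_present == 0:
            continue  # family absent from every genome: not part of the pangenome
        if n_present == n_genomes:
            groups["core"].append(family)       # shared by all genomes
        elif n_present == 1:
            groups["cloud"].append(family)      # found in exactly one genome
        else:
            groups["accessory"].append(family)  # assumption: everything in between
    return groups


# Toy presence/absence data for four genomes (invented values).
toy = {
    "famA": [1, 1, 1, 1],
    "famB": [1, 1, 1, 0],
    "famC": [0, 1, 0, 0],
    "famD": [1, 0, 1, 0],
}
result = partition_gene_families(toy)
print(result)
```

A stricter reading of "present in the majority of the genomes" would add a frequency threshold to the accessory branch; the sketch leaves that choice open.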
We propose a complete methodology, suitable for sets of genomes of varying complexity, that optimizes and enriches an assortment of existing measures from micropan, the only R package currently available on CRAN for such studies. Furthermore, we propose a new technique using the Sorensen distance, referred to as fluidity in the context of a pangenome analysis, that allows the identification of distinct subsets of genomes in a given dataset, based on their inferred commonalities at the gene family level. Finally, we demonstrate the methodology using publicly available data from UniProt and additional reference databases.
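As a rough illustration of the distance mentioned above, the Sorensen distance between two genomes can be computed from their sets of gene families as the number of families unique to either genome divided by the sum of the two set sizes. The Python sketch below is an assumption-laden example, not the pasaR code: the function names and toy genomes are hypothetical, and averaging the pairwise values to obtain a single fluidity figure follows the common definition of genomic fluidity rather than anything stated in this abstract.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def sorensen_distance(a: Set[str], b: Set[str]) -> float:
    """Sorensen distance: families unique to either genome over the summed set sizes."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # two empty genomes are treated as identical
    unique = len(a - b) + len(b - a)
    return unique / total


def pairwise_distances(genomes: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Distance for every unordered pair of genomes (names sorted for stable keys)."""
    return {
        (g1, g2): sorensen_distance(genomes[g1], genomes[g2])
        for g1, g2 in combinations(sorted(genomes), 2)
    }


def mean_fluidity(genomes: Dict[str, Set[str]]) -> float:
    """Average pairwise Sorensen distance over all genome pairs."""
    distances = pairwise_distances(genomes)
    if not distances:
        raise ValueError("at least two genomes are required")
    return sum(distances.values()) / len(distances)


# Toy gene family sets for three genomes (invented values).
toy_genomes = {
    "g1": {"famA", "famB", "famC"},
    "g2": {"famA", "famB", "famD"},
    "g3": {"famA", "famE"},
}
distances = pairwise_distances(toy_genomes)
fluidity = mean_fluidity(toy_genomes)
print(distances, fluidity)
```

The resulting pairwise matrix could then be passed to any clustering routine to look for distinct subsets of genomes; which clustering method the package uses is not stated here.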

Keywords: pangenome, genome diversity, comparative genomics, R statistical language

Advances in sequencing techniques have increased the publicly available volume of genome information, enabling further and more in-depth research on genomic diversity. The concept of the pangenome refers to the union of gene families that are shared among a set of genomes. Numerous studies have applied pangenome analysis to various organisms, from microbes to viruses and plants. These studies have helped advance the general understanding of evolutionary mechanisms, leading to practical knowledge in sectors such as health, pharmacology and agriculture. While a variety of tools is available for carrying out a pangenome analysis, the majority are primarily intended for the study of prokaryotic genomes. This work presents software written in the statistical programming language R, named pasaR, which can be used in the final stages of such an analysis, i.e. after the construction of the gene families for a set of genomes. A complete methodology is proposed for the analysis of genomic data of varying complexity, optimizing and enriching existing tools from the micropan package, the only comparable package available for the R language. In addition, a new technique is proposed that uses the Sorensen distance, also known as fluidity in the context of pangenome analysis, with the aim of identifying distinct subgroups of genomes within a given dataset. Finally, this methodology is applied to publicly available data from the UniProt and Ensembl databases.

Keywords: pangenome, genomic diversity, comparative genomics, R statistical language
