Statistical analysis of synonymous and stop codons in pseudo-random and real sequences as a function of GC content

Abstract

Knowledge of the frequencies of synonymous triplets in protein-coding and non-coding DNA stretches can be used in gene finding. These frequencies depend on the GC content of the genome or of parts of it. Stop codons are a case of particular interest because they are relevant for the definition of Open Reading Frames. A generic case is provided by pseudo-random sequences, especially when they code for complex proteins or when they are non-coding and not subject to selection pressure. Here, we calculate, for such sequences and for all 25 known genetic codes, the frequency of each amino acid and of the stop signal, based on their sets of codons and as a function of GC content. The amino acids can be classified into five groups according to the GC content at which their expected frequency reaches its maximum. We determine the overall Shannon information based on groups of synonymous codons and show that it reaches its maximum at a GC content of 43.3% (for the standard code). This is in line with the observation that in most fungi, plants, and animals, the genomic GC content lies in the range from 35 to 50%. By analysing natural sequences, we show that there is a clear bias for triplets corresponding to stop codons near the 5′- and 3′-splice sites in the introns of various clades.
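The calculation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code and rests on assumptions that the abstract does not spell out: bases are drawn independently, G and C each occur with probability GC/2 and A and T each with probability (1 − GC)/2, and the Shannon information is taken as the entropy over the 21 groups of synonymous codons (20 amino acids plus the stop signal). Only the standard code (NCBI translation table 1) is used, and the function names `group_frequencies` and `shannon_bits` are hypothetical.

```python
import math
from itertools import product

# Standard genetic code (NCBI translation table 1); '*' marks a stop codon.
# Codons are enumerated with bases in the order T, C, A, G, last position fastest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def group_frequencies(gc):
    """Expected frequency of each amino acid and of stop ('*') at GC content gc.

    Assumption: independent bases with P(G) = P(C) = gc / 2 and
    P(A) = P(T) = (1 - gc) / 2.
    """
    base_prob = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    freqs = {}
    for codon, aa in CODON_TABLE.items():
        prob = base_prob[codon[0]] * base_prob[codon[1]] * base_prob[codon[2]]
        freqs[aa] = freqs.get(aa, 0.0) + prob
    return freqs


def shannon_bits(gc):
    """Shannon entropy (in bits) over the 21 synonymous-codon groups."""
    return -sum(f * math.log2(f) for f in group_frequencies(gc).values() if f > 0)


if __name__ == "__main__":
    # Scan GC content in steps of 0.1 percentage points and report the maximum.
    grid = [i / 1000 for i in range(1, 1000)]
    best = max(grid, key=shannon_bits)
    print(f"entropy is maximal at GC = {best:.1%}")
    print(f"stop-codon frequency at GC = 50%: {group_frequencies(0.5)['*']:.4f}")
```

Under these assumptions the expected stop-codon frequency has the closed form (1 − GC)²(1 + GC)/8, which decreases monotonically as GC content rises. The grid maximum printed by the script can be compared with the 43.3% stated in the abstract, although the authors' exact definition of the Shannon information may differ from the one assumed here.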

Details

Title

Statistical analysis of synonymous and stop codons in pseudo-random and real sequences as a function of GC content

Author

Wesp, Valentin¹; Theißen, Günter²; Schuster, Stefan¹

¹ Friedrich Schiller University Jena, Department of Bioinformatics, Matthias Schleiden Institute, Jena, Germany (GRID:grid.9613.d) (ISNI:0000 0001 1939 2794)
² Friedrich Schiller University Jena, Department of Genetics, Matthias Schleiden Institute, Jena, Germany (GRID:grid.9613.d) (ISNI:0000 0001 1939 2794)

Pages

22996

Publication year

2023

Publication date

2023

Publisher

Nature Publishing Group

e-ISSN

2045-2322

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1038/s41598-023-49626-9

ProQuest document ID

2906651300

© The Author(s) 2023. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
