The 4th Central German Meeting on Bioinformatics, known as the Mittelerde Meeting 2019, will be held on June 13 and 14 at the Biotechnology Center TU Dresden (BIOTEC). The BIOTEC is part of the Center for Molecular and Cellular Bioengineering (CMCB) and develops innovative technologies driving the progress of modern life sciences. It is the perfect place to host and strengthen the computational biology community of central Germany. We sincerely invite bioinformatics researchers and all those interested to take part in the meeting and to submit a contribution. On Friday evening, the event will be complemented by the Dresden Long Night of Science, which allows you to vividly explore the excellent research conducted at the CMCB. Beyond science, the conference is followed by the famous cultural festival Bunte Republik Neustadt. Mittelerde 2019 is the perfect occasion for an extended stay in Dresden.
After genomic sequencing, an important step in deciphering information from sequences is the detection of open reading frames (ORFs). The concept of ORFs is important for identifying potential protein-coding genes. The term is used frequently in biology and, in particular, in bioinformatics.
However, in many textbooks, not much effort is spent on defining the term, or it is not perfectly clear-cut. Most commonly, an ORF is defined as a sequence stretch that is bounded by a start and stop codon and not interrupted by internal stop codons in the considered reading frame. Often, an ORF is considered equal to the corresponding coding sequence, and this definition already reaches its limits when introns are involved.
Surprisingly, there is no single agreed definition of the term, and at least three definitions with different boundaries are in use. This demonstrates that it is worth questioning the established ORF definition. We present several molecular biological and bioinformatics aspects and discuss advantages and disadvantages of the different definitions. In the end, we recommend using the definition in which an ORF both starts and ends with a stop codon.
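The recommended stop-to-stop definition can be illustrated with a short sketch. It is not taken from the abstract: it assumes the standard stop codons TAA, TAG and TGA, scans only the forward strand, reports the stretch between two consecutive in-frame stop codons without the stops themselves, and ignores stretches that run into the sequence ends. The function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def stop_to_stop_orfs(seq, min_codons=1):
    """Return (frame, start, end) for every stretch between two consecutive
    in-frame stop codons on the forward strand.

    start is the index of the first base after the upstream stop codon,
    end is the index of the first base of the downstream stop codon,
    so seq[start:end] is the ORF without its bounding stops.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        previous_stop_end = None  # no upstream stop seen yet in this frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if previous_stop_end is not None:
                    n_codons = (pos - previous_stop_end) // 3
                    if n_codons >= min_codons:
                        orfs.append((frame, previous_stop_end, pos))
                previous_stop_end = pos + 3
    return orfs


if __name__ == "__main__":
    example = "TAAATGAAATAG"
    for frame, start, end in stop_to_stop_orfs(example):
        print(frame, start, end, example[start:end])
```

Under this definition no start codon is needed to delimit the ORF, so the result does not depend on which ATG (or alternative start codon) is actually used for translation.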
The development of higher plants requires an exquisite spatio-temporal regulation of their transcriptomes. Although organ-specific and cell-type-specific analyses of global gene expression patterns in the model plant Arabidopsis thaliana were performed a decade ago [1, 2], the dynamic nature of the non-coding transcriptome is poorly understood. Recent studies [3, 4, 5] extended the knowledge of coding and non-coding RNAs in A. thaliana, but these studies are restricted to only one or a few tissue types in typically only one organism. Here, we build a workflow to annotate novel coding and non-coding transcripts from developmental transcriptomes of nine different tissues from seven related plant species based on strand-specific total RNA-Seq experiments. To date, evidence for an in vivo function has been shown for only a small fraction of non-coding RNAs and splice variants in plants, so the constructed annotations, in combination with the corresponding expression profiles, may prove useful for deepening our understanding of the developmental transcriptomes of higher plants.
[1] Kenneth Birnbaum et al. A gene expression map of the Arabidopsis root. Science (New York, N.Y.), 302(5652):1956–1960, 2003.
[2] Markus Schmid et al. A gene expression map of Arabidopsis thaliana development. Nature Genetics, 37(5):501–506, 2005.
[3] Song Li et al. Integrated detection of natural antisense transcripts using strand-specific RNA sequencing data. Genome Research, 23:1730–1739, 2013.
[4] Song Li et al. High resolution expression map of the Arabidopsis root reveals alternative splicing and lincRNA regulation. Developmental Cell, in press(4):508–522, 2016.
[5] Chia-Yi Cheng et al. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. The Plant Journal, 89:789–804, 2017.
Check out the BIOTEC cafeteria on the ground floor or explore the surroundings with the help of the Map of Mittelerde: http://tiny.cc/MapOfME19
in the courtyard of the BIOTEC and CRTD
In silico prediction of aptamer binders to enhance Systematic Evolution of Ligands
by EXponential Enrichment (SELEX) technologies has attracted a lot of interest in
recent years. Molecular docking has emerged as an important tool in
computational chemistry and computer-aided drug design. The goal of
small-molecule/aptamer docking is to identify favored binding modes of a ligand with
an aptamer of a given three-dimensional structure obtained experimentally or
predicted from nucleotide sequences.
We propose a workflow involving 2D structure prediction, 3D RNA/DNA aptamer
modeling and docking of small molecules to the target aptamer. This study
focuses on describing several approaches and algorithms used to find the optimal
conformation of the resulting aptamer/ligand complex. It also aims to provide an
overview and assessment of two commonly used docking engines. We tested
the PatchDock and AutoDock Vina programs on a set of aptamers to evaluate their
accuracy. The initial aptamer and ligand structures were converted where
necessary and docked with each program, and the results were compared with
The procedure was first validated on seven different aptamer structures obtained
from the RCSB Protein Data Bank. Further, several statistical analyses of the docked
poses were carried out to evaluate and compare newly predicted positions and their
corresponding energies. The programs were often able to find correct binding
positions with the best score (PatchDock) or lowest binding energy (AutoDock Vina)
for the closest predicted ligands. Nevertheless, the newly predicted positions were
still more than 1 Å away from the original binding site in the
experimentally determined aptamer-ligand interactions.
Further work may include refinement of docking results using molecular dynamics
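A common way to quantify how far a docked pose lies from the experimentally determined ligand position is the root-mean-square deviation (RMSD) over matched atoms. The abstract does not state which distance measure was used, so the sketch below is an assumption. It expects both coordinate lists to be in the same reference frame (the aptamer is kept fixed during docking) and in the same atom order, and it performs no superposition.

```python
import math


def ligand_rmsd(reference, pose):
    """RMSD in the units of the input coordinates (e.g. Å).

    reference, pose: sequences of (x, y, z) tuples with identical atom order.
    No alignment is performed; both are assumed to share the receptor frame.
    """
    if len(reference) != len(pose) or not reference:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(reference, pose):
        squared += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(squared / len(reference))


if __name__ == "__main__":
    # Hypothetical two-atom ligand shifted by 1 Å along z.
    crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    docked = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(ligand_rmsd(crystal, docked))
```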
Classifying the proteinogenic amino acids is crucial for understanding their commonalities and differences, which may hint at why life settled on precisely these amino acids. It is equally important for predicting electrostatic, hydrophobic, stacking and other interactions, for assessing conservation in multiple alignments, and for many other applications. While several methods have been proposed to find “the” optimal classification, they have shortcomings such as a lack of efficiency and interpretability or an unnecessarily high number of discriminating features. In this study, we propose a novel method based on repeated binary separation using a combination of five features, the minimum number required, each expressed as a numerical value for an amino acid characteristic (such as hydrophobicity or gyration radius). The features are extracted from the AAindex database. We find four such combinations by simple separation at the medians and extend our analysis to separations other than at the median. We further score the combinations by how natural the separations performed by their features are, and identify several high-ranking combinations in the process. Finally, we examine an experimental study from the literature in which unnatural amino acids were incorporated into the genetic code to enhance antibody binding to an HIV coat protein. By counting how often candidates occupy vectors left unoccupied by the natural amino acids in our high-ranking combinations, our method is able to suggest the most diverse amino acids for incorporation.
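The idea of separation at the medians can be sketched as follows. Each feature splits the items into "above median" (1) and "not above median" (0), so n features assign every item a binary vector of length n. Twenty amino acids need at least five features because 2^4 = 16 < 20 <= 2^5 = 32. The code is an illustration only: the function names are hypothetical, the toy values are invented placeholders and not AAindex data, and the handling of values equal to the median (coded 0) is an assumption.

```python
from statistics import median


def median_codes(table):
    """table maps item name -> list of feature values (same length for all).
    Returns item name -> tuple of bits, 1 where the value exceeds the median."""
    items = list(table)
    n_features = len(table[items[0]])
    medians = [median(table[item][f] for item in items) for f in range(n_features)]
    return {
        item: tuple(int(table[item][f] > medians[f]) for f in range(n_features))
        for item in items
    }


def separates_all(table):
    """True if every item receives a unique binary vector."""
    codes = median_codes(table)
    return len(set(codes.values())) == len(codes)


if __name__ == "__main__":
    # Invented toy values for four placeholder items and two features.
    toy = {"w": [1.0, 10.0], "x": [2.0, 30.0], "y": [3.0, 20.0], "z": [4.0, 40.0]}
    print(median_codes(toy))
    print(separates_all(toy))
```

A binary vector that no natural amino acid occupies then marks a region of feature space that an unnatural amino acid could fill.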
In recent years, the study of neuronal connectivity has gained interest. Knowledge of neuronal connectivity is expected to reveal how ageing or neurological diseases affect brain structure and how memory traces are physically stored. It is beyond question that new methods for data acquisition will produce large amounts of neuronal image data, which will exceed the zettabyte range and be impossible to annotate manually for visualization tools. Machine learning algorithms, and especially deep convolutional neural networks, are now heavily used in medical imaging and computer vision. This study focuses on a new workflow for dense reconstruction of neurons in the mouse neocortex. Stacks of electron microscopic images are used as input to produce a three-dimensional reconstruction.
The workflow consists of three major parts: image processing using consecutive deep convolutional networks, a pixel-grouping step called connected components, and 3D visualization. The data used as ground truth for our deep convolutional neural networks was published in 2013 as part of the IEEE International Symposium on Biomedical Imaging conference. A hallmark of current imaging data of neuronal tissue is its anisotropy. Stained and embedded tissue blocks are either sliced, with the sections imaged separately, or block-face imaged and milled using ion beams. Thus, the acquired electron microscopic image stacks possess a high resolution in the x- and y-directions, whereas the resolution in the z-direction is low. The basic idea of our method is to use deep convolutional neural networks to detect cellular structures such as cytoplasm as well as overlapping cross sections in two consecutive images. This knowledge helps us to increase the information available in the z-direction, which is essential for the subsequent pixel-grouping algorithms and for tracing neurons through the image stack. This pipeline could be used to clarify complex connectomics problems in neuronal tissue and neuronal circuits, which are often correlated with neurological diseases.
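The pixel-grouping step can be illustrated with a minimal connected-components labelling of a binary 3D stack. This is a generic breadth-first sketch, not the authors' implementation. It assumes the network output has already been thresholded to 0/1, that the stack is a nested list indexed as [z][y][x], and that voxels are connected through shared faces (6-connectivity).

```python
from collections import deque


def label_components(volume):
    """Label face-connected foreground voxels in a binary [z][y][x] stack.

    Returns (labels, count) where labels has the same shape as volume,
    background is 0 and components are numbered 1..count.
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    labels = [[[0] * nx for _ in range(ny)] for _ in range(nz)]
    neighbours = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    count = 0
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if not volume[z][y][x] or labels[z][y][x]:
                    continue
                count += 1  # start a new component at this unlabelled voxel
                labels[z][y][x] = count
                queue = deque([(z, y, x)])
                while queue:
                    cz, cy, cx = queue.popleft()
                    for dz, dy, dx in neighbours:
                        pz, py, px = cz + dz, cy + dy, cx + dx
                        inside = 0 <= pz < nz and 0 <= py < ny and 0 <= px < nx
                        if inside and volume[pz][py][px] and not labels[pz][py][px]:
                            labels[pz][py][px] = count
                            queue.append((pz, py, px))
    return labels, count


if __name__ == "__main__":
    # Two sections of 2x2 pixels; one structure continues across sections.
    stack = [[[1, 0], [0, 0]], [[1, 0], [0, 1]]]
    print(label_components(stack))
```

Because neighbouring sections are linked only through voxels that overlap in z, any extra overlap information recovered between consecutive images directly affects which cross sections end up in the same component.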
In recent years, many pathway characteristics determining the time-optimal
regulation of metabolic pathways were studied by dynamic optimization.
Among these, a key finding of our work was that toxic intermediates influence
the position of highly regulated enzymes and guide us to valuable antimicrobial
targets. We propose a disturbance of the optimal regulation as an
antimicrobial strategy to provoke an endogenous accumulation of a toxic
Among pathogenic microbes, fungal species are an underestimated threat to human
health and are difficult to treat owing to the small number of antifungal drugs. Since
the virulence of fungi relies heavily on their metabolic versatility, we decided
to explore the landscape of toxic intermediates and drug targets in the
metabolic networks of pathogenic fungi. To do this, we employed machine
learning to create a new toxicity prediction tool for fungal species. The
identification of drug targets is supported by the integration of toxicity
prediction with metabolic networks from the KEGG database as well as estimates
of enzyme regulation.
With these resources, which will be available as a web service, we analyzed
fungus-specific pathways for toxic intermediates. As a key result, we
identified the toxic intermediate glyoxylate as a target for accumulation in the
pathogen Candida albicans. The intermediate is part of the glyoxylate shunt,
a known virulence factor that allows C. albicans to survive in the glucose-poor
phagolysosome of host macrophages. Interestingly, experimental investigation
shows that C. albicans relies on multiple enzymes that control glyoxylate
accumulation, providing new targets for antifungal drugs.
In many parts of the world, the cultivation of rice is important for ensuring nutrition of the population. Plant-pathogenic Xanthomonas bacteria cause diseases on various crop plants including rice, where Xanthomonas infections can lead to a harvest loss of up to 50%. Thus, it is important to investigate the pathogenicity of these bacteria in more detail to find ways to protect the rice plants or to breed resistant ones.
Xanthomonas bacteria express proteins called transcription activator-like effectors (TALEs) that bind to the promoters of plant genes and activate their transcription. The binding domain of TALEs consists of tandem repeats of approximately 34 amino acids. Each repeat contains two hypervariable amino acids at positions 12 and 13, which together are called the repeat variable di-residue (RVD). Each RVD recognizes one nucleotide of its target DNA, and the consecutive array of RVDs determines TALE target specificity.
Here, we present our novel approach for TALE target prediction to identify potential virulence targets. Our approach accounts for recent findings concerning TALE targeting, including frame-shift binding by repeats of aberrant lengths and the flexible strand orientation of target boxes relative to the transcription start of the downstream target gene. The computational model can account for dependencies between adjacent RVD positions. Model parameters are learned from the wealth of quantitative data that have been generated over recent years.
We benchmark the novel approach, termed PrediTALE, using RNA-seq data after Xanthomonas infection in rice, and find an overall improvement of prediction performance compared with previous approaches. Using PrediTALE, we are able to predict novel putative virulence targets. However, we also observe that for several TALEs no target genes are predicted by any prediction tool; we term these orphan TALEs. We postulate that one explanation for orphan TALEs is incomplete gene annotation and, hence, propose to replace promoterome-wide scans with genome-wide scans for target boxes. We demonstrate that targets reported from promoterome-wide predictions are also recovered in genome-wide scans, but we also find differentially expressed regions at loci that do not overlap with annotated genes. These could be protein-coding genes that are missing from the current annotation, but could also be putative non-coding RNAs, which might have regulatory activity or other functions that foster bacterial infection.
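The one-RVD-per-nucleotide principle and the scan for target boxes on both strands can be sketched in a few lines. This is deliberately not the PrediTALE model: it uses only the widely reported preferred nucleotides of four common RVDs (HD for C, NI for A, NG for T, NN simplified to G), counts mismatches instead of using learned quantitative specificities, and ignores aberrant repeats and dependencies between adjacent RVDs. The function names and the `max_mismatches` parameter are hypothetical.

```python
# Simplified preferred base per RVD (assumption: NN is reduced to G only).
RVD_PREFERENCE = {"HD": "C", "NI": "A", "NG": "T", "NN": "G"}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def scan_target_boxes(rvds, sequence, max_mismatches=0):
    """Return (strand, start, box, mismatches) for every window whose bases
    match the preferred bases of the RVD array within max_mismatches.

    For the "-" strand, start refers to the reverse-complemented sequence.
    """
    preferred = [RVD_PREFERENCE[rvd] for rvd in rvds]
    width = len(preferred)
    forward = sequence.upper()
    hits = []
    for strand, seq in (("+", forward), ("-", reverse_complement(forward))):
        for start in range(len(seq) - width + 1):
            box = seq[start:start + width]
            mismatches = sum(1 for want, got in zip(preferred, box) if want != got)
            if mismatches <= max_mismatches:
                hits.append((strand, start, box, mismatches))
    return hits


if __name__ == "__main__":
    print(scan_target_boxes(["NI", "HD", "NG"], "GGACTGG"))
    print(scan_target_boxes(["NI", "HD", "NG"], "GGAGTGG"))
```

Running such a scan over a whole genome rather than over annotated promoters is what allows boxes near unannotated transcripts to be found at all.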
Elevated expression levels of epidermal growth factor receptor (EGFR) are associated with prognosis and clinical outcomes of patients in a variety of tumor types.
There are at least four mRNA splice variants encoding four protein isoforms of EGFR in humans, named I through IV.
EGFR isoform I is the full-length membrane protein, whereas isoforms II-IV are shorter protein isoforms.
Nevertheless, all EGFR isoforms are capable of binding the ligand epidermal growth factor (EGF).
Although EGFR is an essential target of long-established and successful tumor therapeutics, the function and biomarker potential of alternative EGFR isoforms II-IV are unclear, motivating more in-depth analysis.
We analyzed transcriptome data from the glioblastoma cell line SF767 to identify target genes regulated by EGFR isoforms II-IV, but not by EGFR isoform I or by other receptors such as HER2.
Using microarrays, we analyzed the differential expression of potential target genes in a glioblastoma cell line in two RNAi experimental conditions and one negative control, contrasting expression with EGF stimulation against expression without EGF stimulation.
In one experiment, we selectively knocked down EGFR splice variant I only, while in the other we knocked down all four EGFR splice variants.
Due to the nested experimental design, the associated effects of EGFR II-IV knockdown can only be calculated in an indirect manner.
For this type of nested experimental design, we developed a two-step bioinformatics approach, named the Bayesian Gene Selection Criterion (BGSC) approach, for identifying putative target genes of EGFR isoforms II-IV, using a novel algorithm based on the Bayesian Information Criterion.
Finally, we experimentally validated a set of six putative target genes, and we found that qPCR validations confirmed the predictions in all cases.
By performing RNAi experiments for three poorly investigated EGFR isoforms, we were able to successfully classify 1,140 putative target genes specifically regulated by EGFR isoforms II-IV using the BGSC approach.
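The Bayesian Information Criterion on which the BGSC approach builds is BIC = k ln(n) - 2 ln(L), with k free parameters, n observations and maximized likelihood L; the model with the lowest BIC is preferred. The sketch below shows only this generic selection step for one gene and is not the BGSC algorithm itself. It assumes Gaussian noise with a shared variance, and the grouping of measurements into conditions with a common mean is a hypothetical stand-in for the competing regulation hypotheses.

```python
import math


def gaussian_bic(values, groups):
    """BIC of a Gaussian model with one mean per group and a shared variance.

    values: measurements for one gene (e.g. log expression changes).
    groups: group label for each measurement; equal labels share a mean.
    """
    n = len(values)
    means = {}
    for label in set(groups):
        members = [v for v, g in zip(values, groups) if g == label]
        means[label] = sum(members) / len(members)
    rss = sum((v - means[g]) ** 2 for v, g in zip(values, groups))
    variance = max(rss / n, 1e-12)  # maximum-likelihood estimate, floored
    log_likelihood = -0.5 * n * (math.log(2 * math.pi * variance) + 1)
    k = len(means) + 1  # one mean per group plus the variance
    return k * math.log(n) - 2 * log_likelihood


def best_model(values, candidate_groupings):
    """Return the name of the grouping with the lowest BIC."""
    return min(candidate_groupings,
               key=lambda name: gaussian_bic(values, candidate_groupings[name]))


if __name__ == "__main__":
    # Invented values: the last three measurements are clearly shifted.
    values = [0.0, 0.1, -0.1, 2.0, 2.1, 1.9]
    models = {
        "no effect": ["a"] * 6,
        "effect": ["a", "a", "a", "b", "b", "b"],
    }
    print(best_model(values, models))
```

The ln(n) penalty is what keeps the richer grouping from winning when the extra mean does not improve the fit.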
The lipid membrane is a component of every living organism. It acts as a boundary that defines the cell shape and protects the cell interior against environmental stresses. To fulfill these tasks, the membrane evolved into a very dynamic and robust organelle that can sense a plethora of signals, selectively transport chemical compounds and remodel its composition when needed.
Still, the basic principles of how the membrane is remodelled and how its properties are adjusted to environmental conditions are far from being understood.
To address this question, bacteria are commonly used as an experimental system because of their simplicity.
Usually, bacterial samples are measured in bulk solution and a single value of the studied quantity is provided. This approach, however, might not reflect the whole complexity of the sample. We argued that acquiring data at the single-cell level would provide more information and would better characterize the system.
In this work, we used automated microscopy to characterize biophysical properties of the bacterial membrane at the single-cell level. In that way, we were able to extract distributions of several quantities describing the bacterial population. We applied this methodology to study membrane properties of a wild-type and a mutant strain of Methylobacterium extorquens. Our results show that there is a statistically significant difference between these two populations.
The basal average cAMP concentration in eukaryotic cells is about 1 µmol/L. The reported cAMP concentration required to half-maximally activate protein kinase A (PKA) in vitro is about 200 nmol/L. Taken together, these values suggest that PKA should be constantly active and unable to respond to signaling. However, in vivo studies have determined the sensitivity of PKA to be significantly lower. A promising hypothesis for this apparently low sensitivity is that cAMP abundance is highly regulated by concentration gradients. As a model system, we chose collecting duct principal cells, which require PKA signaling for the transport of vesicles storing water channel proteins.
We modeled the interplay of localized cAMP pools, PKA, and phosphodiesterase and their effect on the regulation of vesicle transport in a spatial model. To model the movement and behavior of vesicles as well as reaction kinetics and diffusion, a novel simulation technique was devised. We found that cAMP forms localized sinks around vesicles under very defined circumstances. The locally reduced concentration of cAMP prevents unwarranted PKA activation and, in turn, transport initiation. Further, the cAMP concentration is decreased along the path of traveling vesicles. These paths might temporarily prevent other vesicles from following the initial vesicles and thereby regulate the throughput of vesicles to the membrane.
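The sensitivity argument that motivates this work can be made concrete with a small calculation. The abstract gives only the two concentrations (about 1 µmol/L basal cAMP and about 200 nmol/L for half-maximal activation); the Hill-type dose-response form and the Hill coefficient of 1 used below are assumptions for illustration, and the function name is hypothetical.

```python
def fractional_activation(concentration, half_max, hill=1.0):
    """Fraction of maximal activation for a Hill-type dose-response curve.

    concentration and half_max must use the same unit (here nmol/L).
    """
    return concentration ** hill / (concentration ** hill + half_max ** hill)


if __name__ == "__main__":
    basal_camp = 1000.0  # about 1 µmol/L, expressed in nmol/L
    half_max = 200.0     # reported in vitro half-maximal activation, nmol/L
    print(fractional_activation(basal_camp, half_max))
```

Under these assumptions the basal concentration alone would already give roughly five sixths of maximal activation, which is why a locally reduced cAMP concentration around vesicles is an attractive explanation.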
Nowadays, the RNA world hypothesis explaining the origin of life is supported by the majority of researchers. This hypothesis assumes that life originated from RNA sequences capable of self-replication. ssRNA viruses are seen as the most primitive organisms using RNA replication. Thus, a crucial question is how their single-stranded RNA sequences differ from random sequences, i.e. which RNA characteristics provide biological functionality. We tackle this problem by training a machine learning classifier to distinguish nucleotide sequences of ssRNA viruses from randomly generated nucleotide sequences with similar nucleotide frequency statistics. Applying machine learning methods for classification requires comparing nucleotide sequences by means of mathematical (dis-)similarity measures. Therefore, we investigated several measures reflecting different aspects and features, such as statistical distributions of nucleotides, their co-occurrences, or information-theoretic quantities.
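One possible information-theoretic dissimilarity of the kind mentioned above is the Jensen-Shannon divergence between k-mer frequency distributions. The abstract does not name the measures that were investigated, so this choice, the function names, and the use of shuffling to generate a random sequence with identical nucleotide frequencies are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def kmer_distribution(seq, k):
    """Relative frequencies of all overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits; 0 for identical, 1 for disjoint."""
    divergence = 0.0
    for key in set(p) | set(q):
        pi, qi = p.get(key, 0.0), q.get(key, 0.0)
        mi = 0.5 * (pi + qi)
        if pi > 0:
            divergence += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            divergence += 0.5 * qi * math.log2(qi / mi)
    return divergence


def shuffled(seq, seed=0):
    """Random sequence with exactly the same nucleotide counts as seq."""
    bases = list(seq)
    random.Random(seed).shuffle(bases)
    return "".join(bases)


if __name__ == "__main__":
    original = "AUGGCUAGCUAGGAUCCGAUAUGGCUAGC"  # invented example sequence
    control = shuffled(original)
    for k in (1, 2, 3):
        d = jensen_shannon(kmer_distribution(original, k), kmer_distribution(control, k))
        print(k, round(d, 4))
```

For k = 1 the divergence between a sequence and its shuffled control is zero by construction, so any signal a classifier picks up must come from co-occurrences (k > 1) or higher-order structure.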
Carbohydrates play a key role in a wide range of biological processes, and a detailed understanding of the principles of their recognition by carbohydrate-binding proteins is of particular interest to researchers. Biomimetic carbohydrate receptors [1-3] provide valuable model systems to study the underlying principles of carbohydrate-based molecular recognition events, and their development is also strongly motivated by the belief that such artificial carbohydrate-binding agents could be used for the detection and treatment of diseases. Although effective artificial receptors have been developed, exact prediction of their binding strength and selectivity remains a distant goal, and it is hoped that combined theoretical and experimental studies will contribute significantly to the solution of this problem.
Our previous studies showed that receptors consisting of both a macrocyclic building block and flexible side-arms [4-6] exhibit strong selectivity towards β-D-glucoside and represent particularly interesting objects for systematic binding studies. The binding capabilities of these receptor molecules were determined by investigations in two-phase systems, such as liquid-liquid extractions of sugars from water into an organic phase, and by studies in homogeneous media, including ¹H NMR, fluorescence and microcalorimetric titrations.
We present a computational study of the selectivity of macrocyclic carbohydrate receptors towards methyl β-D-glycosides, such as gluco- and galactopyranoside, using popular semiempirical and ab initio approaches combined with an analysis of the potential energy surface (PES). The PES of the glycosides was sampled using molecular dynamics and simulated annealing, while the PES of the complexes of the macrocyclic receptors with the bound substrates was sampled using a simplified docking procedure. The selectivity towards the carbohydrates is then calculated at different levels of theory. Although the calculated selectivity indicates better binding of β-methylglucoside for most of these samples, the results highlight the difficulties of modelling flexible molecules (including carbohydrates), and more studies have to be carried out.
[1] Mazik, M. RSC Adv. 2012, 2, 2630.
[2] Davis, A. P. Org. Biomol. Chem. 2009, 7, 3629.
[3] Mazik, M. Chem. Soc. Rev. 2009, 38, 935.
[4] Lippe, J.; Mazik, M. J. Org. Chem. 2015, 80, 1427.
[5] Lippe, J.; Mazik, M. J. Org. Chem. 2013, 78, 9013.
[6] Amrhein, F.; Lippe, J.; Mazik, M. Org. Biomol. Chem. 2016, 14, 10648.
Liquid Chromatography Tandem Mass Spectrometry (LC-MS/MS) is one of the predominant experimental platforms for untargeted metabolomics, but searching acquired tandem spectra in spectral libraries will only identify a small portion of the measured metabolites.
Here, we present the new release of the SIRIUS software, based on its publication in Nature Methods 2019. SIRIUS 4 is the best-in-class software method for de novo molecular formula annotation and structure elucidation. SIRIUS 4 integrates high-resolution isotope pattern analysis and fragmentation trees for molecular formula identification. CSI:FingerID is seamlessly integrated via a RESTful webservice to search MS/MS spectra in a molecular structure database.
We will give an overview of SIRIUS 4 and its latest advancements, such as the novel isotope pattern scoring, a deep neural network for rare element detection and new kernels and fingerprints for improved identification performance.
SIRIUS 4 was evaluated on 208 compounds from the CASMI 2016 challenge. SIRIUS 4 correctly identified the molecular formula in 195 of the cases (93.75 %). Out of the 127 compounds in positive ion mode, CSI:FingerID correctly identified 73 of the structures (57.5 %) when searching in PubChem and 94 (74 %) when searching in a smaller database of 0.5 million structures of biological interest. The best competitor correctly identified 32 of the 127 compounds. Furthermore, we compared the performance of CSI:FingerID with its first release from 2015 by exactly repeating the evaluation setup of Dührkop et al. (2015). We found that, due to the methodological improvements, structure identification rates increased from 31.8 % to 40.4 %.
Finding the fragmentation tree that best explains the fragmentation spectrum is NP-hard. We report multiple algorithm engineering approaches to enable analysis of complete datasets within minutes. These approaches include candidate exclusion based on isotope scoring and fragmentation tree heuristics. Excluding highly unlikely candidates decreases running times dramatically.
The introduction of a new job system leads to a further speedup due to increased efficiency in terms of resource management and job scheduling. The resulting parallelization is effective regardless of which ILP solver is used to calculate fragmentation trees.
Combined with the algorithm engineering, this results in a speedup of more than two orders of magnitude. Apart from performance enhancements, the new job system allows for smooth manual job management with the ability to cancel jobs and view logs.
Drug repositioning aims to identify new indications for known drugs. With the growing number of 3D structures of drug-target complexes, it is now possible to study drug promiscuity at the structural level and to screen vast numbers of drug-target interactions to predict side effects, polypharmacological potential, and repositioning opportunities. Here, we developed a structure-based drug repositioning approach, which extends the scope of the search to novel chemical scaffolds by exploiting the binding mode similarities between drugs. We applied this approach to identify drugs inactivating B-cells, whose dysregulation can function as a driver of autoimmune diseases. As an initial step, an RNAi screen over 500 kinases identified 22 proteins whose knockdown impeded the activation of B-cells. Our drug repositioning approach was applied to the structures of those targets, revealing a well-known cancer drug as a micromolar inhibitor. The repositioning is explained by a specific pattern of noncovalent interactions shared between the original and the predicted target. The novel inhibitor was finally validated, showing a very high therapeutic and selectivity index in B-cell inactivation. Overall, the repositioning approach was able to predict these findings at a fraction of the time and cost of a conventional screen.
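Binding mode similarity of the kind exploited here can be quantified, for example, by comparing the sets of noncovalent interactions observed in two drug-target complexes. The abstract does not say how similarity was computed, so the Tanimoto (Jaccard) coefficient and the interaction labels below are assumptions for illustration only, and the function name is hypothetical.

```python
def tanimoto(interactions_a, interactions_b):
    """Tanimoto coefficient of two collections of interaction labels.

    Returns the size of the intersection divided by the size of the union,
    or 0.0 if both collections are empty.
    """
    a, b = set(interactions_a), set(interactions_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Invented interaction patterns for a known and a predicted target.
    original_target = {"hbond:hinge", "pi-stacking:gatekeeper", "salt-bridge:lys"}
    predicted_target = {"hbond:hinge", "pi-stacking:gatekeeper", "hydrophobic:back-pocket"}
    print(tanimoto(original_target, predicted_target))
```

Because the comparison is made on interaction patterns rather than on ligand structure, two chemically unrelated scaffolds can still score as similar, which is what makes scaffold-hopping predictions possible.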
Transcription Factors act across disease risk loci: EBNA2 in autoimmunity
Meet at CRTD Entrance Hall for Dresden Long Night of Science just around the corner.
Meet at Martin-Luther-Platz for party at Bunte Republik Neustadt.