APPLICATIONS OF DEEP LEARNING IN THE FIELD OF BIOINFORMATICS
Manipal School of Life Sciences
Machine learning has emerged as the prime contributor to the extensive application of artificial intelligence (AI). Deep learning (DL) is an important branch of machine learning based on artificial neural networks (ANNs). One of its important features is the ability to mine hidden relationships in large biomedical and biological datasets.
In the last decade, large volumes of data have been generated, especially imaging and natural language data, from which ANNs can readily extract information. In Bioinformatics, high-throughput biological data, including next-generation sequencing, metabolomic, proteomic, and electron microscopy structural data, are analysed with deep ANNs, aided by the development of computing devices such as GPUs and FPGAs for parallel computing.
Before applying ANNs such as convolutional neural networks (CNNs) to imaging data or recurrent neural networks (RNNs) to natural language data, the computational models must be understood and their accuracy assessed (Fig 1). Listed below are a few tools developed this year that use deep learning models and algorithms and can be applied in Bioinformatics and Computational Biology for various research purposes.
Fig 1. Overview of applications of DL in Bioinformatics. A) Inputs and research objectives; B) An example of splice junction prediction in DNA sequence data with a deep neural network; C) An example of finger joint detection from X-ray images with a CNN (imaging); D) An example of lapse detection from EEG signals with an RNN (signal processing) (Min et al., 2017).
PlasGUN: is the first gene prediction tool for plasmid metagenomic short-read data. Plasmids are among the most important mobile genetic elements discovered in metagenomic short-read data. Because plasmids are mobile, recovering them from a sequence assembly rather than directly from short reads can be tedious. The DL algorithm behind the tool is built on a CNN architecture with multiple inputs. PlasGUN extracts all candidate ORFs (Open Reading Frames) from the input short reads and then classifies each ORF as coding or non-coding. Compared with conventional gene prediction tools designed for chromosome-derived short reads, the tool performed better on an artificial short-read dataset and also gave reliable results on real plasmid metagenomic data.
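The candidate-ORF step can be illustrated with a minimal Python sketch. This is not PlasGUN's code. It assumes that a candidate ORF is any stop-codon-free stretch in one of the six reading frames of a read, trimmed to whole codons, and it keeps stretches that run off the end of the read because short reads often contain only part of a gene. The function names and the `min_len` threshold are hypothetical.

```python
# Illustrative sketch only; the names and the min_len default are assumptions,
# not PlasGUN's actual implementation.
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (non-ACGT symbols are kept as-is)."""
    return seq.translate(_COMPLEMENT)[::-1]


def candidate_orfs(read, min_len=60):
    """Collect stop-free stretches from all six reading frames of a read.

    Returns a list of (strand, frame, sequence) tuples. Stretches may be
    incomplete (no start or stop codon), since a short read can cover only
    part of a gene.
    """
    read = read.upper()
    results = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for frame in range(3):
            start = frame
            # Walk the frame codon by codon and cut at every stop codon.
            for pos in range(frame, len(seq) - 2, 3):
                if seq[pos:pos + 3] in STOP_CODONS:
                    if pos - start >= min_len:
                        results.append((strand, frame, seq[start:pos]))
                    start = pos + 3
            # Keep the trailing stretch, trimmed to a whole number of codons.
            end = start + ((len(seq) - start) // 3) * 3
            if end - start >= min_len:
                results.append((strand, frame, seq[start:end]))
    return results


if __name__ == "__main__":
    for strand, frame, orf in candidate_orfs("ATGAAATAAGGGCCC", min_len=6):
        print(strand, frame, orf)
```

In a pipeline of this kind, each returned stretch would then be passed to a classifier that labels it as coding or non-coding; that classifier is outside the scope of this sketch.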
MusiteDeep: is an online server tool for protein post-translational modification (PTM) site prediction and visualization using a DL framework. The framework uses CNNs along with a novel 2D attention mechanism (Fig 2). Only protein sequences are required as input, without any more complex attributes, and results are returned in real time for many proteins at once. For each PTM type, it takes less than 3 minutes to process 1000 protein sequences. The output is given at the amino acid level for the selected PTM types. Users can compare the predicted PTM sites with known annotated sites and 3D protein structures through a homology-based search. The tool is described as more accurate and faster than earlier approaches and offers additional features.
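As a rough illustration of how a sequence-only input can yield predictions at the amino acid level, the sketch below cuts a fixed-length window around each candidate residue. This is a generic preprocessing pattern for site predictors, not MusiteDeep's documented pipeline. The function name, the target residues (S, T, Y, as for phosphorylation), the flank size, and the padding symbol are all assumptions made for the example.

```python
# Illustrative sketch only; window size, target residues and padding symbol
# are assumptions, not MusiteDeep's actual settings.
def residue_windows(sequence, targets="STY", flank=16, pad="-"):
    """Return (position, residue, window) for every candidate residue.

    position is 1-based; window has length 2 * flank + 1 and is centred on
    the candidate residue, padded with `pad` near the sequence ends.
    """
    sequence = sequence.upper()
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue in targets:
            # In `padded`, residue i sits at index i + flank, so the slice
            # starting at i is centred on it.
            windows.append((i + 1, residue, padded[i:i + 2 * flank + 1]))
    return windows


if __name__ == "__main__":
    for position, residue, window in residue_windows("MSKT", flank=2):
        print(position, residue, window)
```

Each window would then be encoded numerically and scored by the model, giving one prediction per candidate residue.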
DeepTorrent: is a DL-based method to predict DNA N4-methylcytosine (4mC) sites from DNA sequences. 4mC is one of the main epigenetic modifications and plays a vital role in DNA expression and replication. However, detecting 4mC sites with experimental methods is time-consuming and expensive. DeepTorrent combines four feature encoding schemes to represent the DNA sequences. It uses multiple CNNs together with a bidirectional long short-term memory module to learn higher-order feature representations. An attention mechanism and transfer learning are additionally used to train a robust predictor. DeepTorrent significantly improves 4mC site prediction compared with other such methods.
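To make the idea of a sequence encoding scheme concrete, here is a minimal Python sketch of one-hot encoding, a widely used way of turning a DNA string into a numeric matrix that a CNN can consume. It is shown only as an example of the concept; the text above does not say which four schemes DeepTorrent combines, and the function name and the all-zero handling of ambiguous bases are assumptions.

```python
# Illustrative sketch only; not taken from DeepTorrent's code.
NUCLEOTIDES = "ACGT"


def one_hot_encode(seq):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order.

    Symbols outside ACGT (for example N) become all-zero rows, which is an
    assumption made for this example.
    """
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

When several encodings are combined, each scheme produces its own matrix for the same sequence, and the matrices are either concatenated or fed to separate network branches.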
Compound2Drug: is a DL-based approach to infer the pharmacological activity of a PubChem compound from its biological network. The DL algorithms were trained on compound-target interactions retrieved from BindingDB and can be used to predict the drug targets of a given compound. The tool also supports in silico modelling of the compound and its drug target, and performs molecular docking using AutoDock MGLTools to obtain their interaction profiles. The results are then stored in the user's working directory.