



These bioinformatics notes explain in detail the biological databases and tools used in bioinformatics.
Typology: Lecture notes
In simple language, a database is a systematic collection of data or information that is stored and accessed electronically on a computer system. A biological database is therefore an organised collection of biological information which can be accessed, managed and updated easily. There are different kinds of biological databases, such as nucleotide databases, gene databases, protein databases and metabolic pathway databases. Biological databases are broadly of two types, primary and secondary; both terms are used in the descriptions that follow.
The features of a database are:
PubMed is a literature database created and maintained by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), which is part of the National Institutes of Health (NIH). It mainly contains citations and abstracts of journal articles in the life sciences, chemical sciences, bioinformatics and related biomedical fields, drawn largely from MEDLINE. It also provides links to websites related to the search. All citations in MEDLINE are assigned MeSH Terms and Publication Types from NLM's controlled vocabulary. The biggest disadvantage of PubMed is that it does not itself contain the full text of most articles; it may instead link a bibliographic record to the full text on the journal website. Whether an article is free to read depends on the journal or publisher and on whether a copy has been deposited in an open archive such as PubMed Central.
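As a rough illustration of how a MeSH-based PubMed search can be expressed programmatically, the sketch below builds a query URL for the NCBI E-utilities ESearch service. It only constructs the URL and does not contact the server. The function name `build_pubmed_query` and the example term are illustrative choices, not part of these notes.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(mesh_term: str, max_results: int = 20) -> str:
    """Return an ESearch URL that searches PubMed by MeSH term (hypothetical helper)."""
    params = {
        "db": "pubmed",                        # search the PubMed database
        "term": f"{mesh_term}[MeSH Terms]",    # restrict the term to the MeSH field
        "retmax": max_results,                 # maximum number of record IDs to return
        "retmode": "json",                     # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_pubmed_query("Computational Biology", max_results=5)
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return a list of PubMed identifiers. The abstracts themselves would need a second request to a retrieval service.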
GenBank is a publicly available, comprehensive database of nucleotide sequences and their protein translations. These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. It is a primary database. It exchanges data daily with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ), which ensures worldwide coverage. GenBank is accessible through the National Center for Biotechnology Information (NCBI) Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed.
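To give a feel for what a GenBank record looks like once downloaded, here is a minimal sketch that reads a few fields from a flat-file record. The record below is made up for illustration and is far shorter than a real entry. The parser is a simplification (for example, it assumes the DEFINITION fits on one line), so an established parsing library would be a better choice for real work.

```python
# A made-up, heavily shortened record in GenBank flat-file layout.
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up record for illustration only.
ACCESSION   EXAMPLE01
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_genbank(text: str) -> dict:
    """Extract accession, definition and sequence from one flat-file record."""
    record = {"accession": None, "definition": None, "sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # '//' terminates a record
            break
        if in_sequence:
            # Sequence lines carry a position number and space-separated blocks;
            # keep only the letters.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):    # the sequence starts after this keyword
            in_sequence = True
    return record


record = parse_genbank(SAMPLE_RECORD)
print(record["accession"], len(record["sequence"]), "bp")
```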
The Universal Protein Resource (UniProt) is a freely accessible database used for finding information about protein sequences and their functions. The core activities of UniProt are manual curation of protein sequences assisted by computational analysis, sequence archiving, development of a user-friendly UniProt website, and the provision of additional value-added information through cross-references to other databases. UniProt is a joint effort of the European Bioinformatics Institute (EBI), the Protein Information Resource (PIR) and the Swiss Institute of Bioinformatics (SIB). UniProt comprises four components for different uses: the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), the UniProt Archive (UniParc) and the UniProt Metagenomic and Environmental Sequences (UniMES). UniProt is a good resource for students and for anyone working in bioinformatics, as it interconnects information from large and disparate sources and is the most comprehensive catalogue of protein sequences and functional annotation.
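UniProtKB sequences are commonly distributed in FASTA format, where the header line carries the database section, the accession, the entry name and the source organism. The sketch below pulls those fields out of one header. The header string is given as an illustrative example of the layout, and the helper name `parse_uniprot_header` is hypothetical.

```python
import re

# Layout assumed here: >db|ACCESSION|ENTRY_NAME Protein name OS=Organism OX=TaxID ...
HEADER_RE = re.compile(
    r"^>(sp|tr)\|([^|]+)\|(\S+)\s+(.*?)\s+OS=(.+?)\s+OX=(\d+)"
)

# 'sp' marks the manually reviewed Swiss-Prot section, 'tr' the unreviewed TrEMBL section.
SECTIONS = {"sp": "Swiss-Prot (reviewed)", "tr": "TrEMBL (unreviewed)"}


def parse_uniprot_header(header: str) -> dict:
    """Split a UniProtKB FASTA header into named fields (hypothetical helper)."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError(f"not a UniProtKB FASTA header: {header!r}")
    db, accession, entry_name, protein, organism, taxid = match.groups()
    return {
        "section": SECTIONS[db],
        "accession": accession,
        "entry_name": entry_name,
        "protein": protein,
        "organism": organism,
        "taxonomy_id": int(taxid),
    }


example = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
info = parse_uniprot_header(example)
print(info["accession"], info["entry_name"], info["organism"])
```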
RefSeq is a publicly available database of annotated genomic, transcript and protein sequence records. It is maintained and curated by the National Center for Biotechnology Information (NCBI). It is a secondary database. It provides a set of stable, non-redundant reference sequences, and this non-redundancy is its biggest advantage.
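RefSeq accessions carry a prefix that indicates which of the three record kinds (genomic, transcript or protein) an entry belongs to. The sketch below maps a few commonly seen prefixes to those kinds. The table is deliberately incomplete, and the function name `classify_refseq` is an illustrative choice rather than anything defined by RefSeq.

```python
# A partial table of RefSeq accession prefixes (not exhaustive).
REFSEQ_PREFIXES = {
    "NC_": "genomic (complete molecule)",
    "NG_": "genomic (region)",
    "NM_": "transcript (mRNA)",
    "NR_": "transcript (non-coding RNA)",
    "NP_": "protein",
    "XM_": "transcript (mRNA, predicted)",
    "XP_": "protein (predicted)",
}


def classify_refseq(accession: str) -> str:
    """Return the record kind implied by a RefSeq accession prefix."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown prefix")


for acc in ("NC_000011.10", "NM_000518.5", "NP_000509.1", "AB123456.1"):
    print(acc, "->", classify_refseq(acc))
```

The last accession in the loop has no RefSeq-style prefix and is reported as unknown, which is the expected behaviour for records that come from a primary database instead.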
PROSITE is a secondary protein database that collects the patterns and profiles found in protein sequences rather than the complete sequences. With the help of these patterns and profiles, it can rapidly and reliably indicate which protein family or domain class a sequence belongs to. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids. PROSITE is largely used for the annotation of domain features of UniProtKB/Swiss-Prot entries.
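PROSITE patterns use a compact notation: elements are separated by `-`, `x` stands for any residue, `[...]` lists allowed residues, `{...}` lists forbidden residues, and `(n)` or `(n,m)` gives a repeat count. The sketch below translates that notation into a Python regular expression and scans a made-up sequence. It is a simplified converter under those assumptions (it does not handle every corner of the syntax, such as `>` inside square brackets), and `prosite_to_regex` is a hypothetical helper name.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}' into a Python regex."""
    regex_parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):        # '<' anchors the match at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # '>' anchors the match at the C-terminus
            suffix, element = "$", element[:-1]

        # Separate the residue specification from an optional repeat, e.g. x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high = match.groups()

        if core == "x":                    # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # forbidden residues become a negated set
        # '[...]' sets and single letters are already valid regex syntax.

        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        regex_parts.append(prefix + core + repeat + suffix)
    return "".join(regex_parts)


# Pattern of the N-glycosylation site entry; the sequence is made up.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
hit = re.search(regex, "MKNGSAL")
print(regex, hit.start(), hit.group())   # N[^P][ST][^P] 2 NGSA
```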
The Protein Data Bank (PDB) is a database which provides information about the 3D structures of biological molecules such as nucleic acids and proteins. It offers insights into the 3D structures of protein targets that are important for the discovery of new drugs. The PDB was established in 1971 as the first open-access digital data resource in biology and is now managed by the Worldwide PDB (wwPDB) partnership. RCSB PDB, the Research Collaboratory for Structural Bioinformatics PDB, operates the US data center for the global PDB archive. The structures in the PDB are usually obtained by X-ray crystallography and NMR. It is a valuable tool for anyone working in bioinformatics, biotechnology or biomedicine.
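Structures in the legacy PDB file format are stored as fixed-width text, with one `ATOM` line per atom and the coordinates at fixed column positions. The sketch below reads atom name, residue, chain and x, y, z coordinates from such lines. The two atom lines use made-up coordinates, and the helper `parse_atoms` is hypothetical; real files contain many more record types, which this sketch simply skips.

```python
# Made-up lines in PDB fixed-column layout (coordinates are illustrative only).
SAMPLE_PDB = """\
REMARK   made-up coordinates for illustration
ATOM      1  N   MET A   1      27.340  24.430   2.614
ATOM      2  CA  MET A   1      26.266  25.413   2.842
END
"""


def parse_atoms(text: str) -> list:
    """Return one dict per ATOM record, using the fixed column positions."""
    atoms = []
    for line in text.splitlines():
        if not line.startswith("ATOM"):
            continue                             # skip REMARK, END and other records
        atoms.append({
            "name": line[12:16].strip(),         # columns 13-16: atom name
            "residue": line[17:20].strip(),      # columns 18-20: residue name
            "chain": line[21],                   # column 22: chain identifier
            "residue_number": int(line[22:26]),  # columns 23-26: residue number
            "x": float(line[30:38]),             # columns 31-38: x coordinate
            "y": float(line[38:46]),             # columns 39-46: y coordinate
            "z": float(line[46:54]),             # columns 47-54: z coordinate
        })
    return atoms


atoms = parse_atoms(SAMPLE_PDB)
for atom in atoms:
    print(atom["name"], atom["residue"], atom["chain"], atom["x"], atom["y"], atom["z"])
```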
Structural Classification of Proteins (SCOP) is a secondary structural database of proteins. Its main purpose is to describe the structural and evolutionary relationships between proteins, i.e. how proteins have evolved over time. It also helps the user find similarities between proteins. The source of these protein structures is the PDB. The unit of classification is usually the protein domain. The classification of proteins in SCOP on hierarchical levels is done as follows:
CATH stands for Class, Architecture, Topology, Homologous superfamily. It is a free protein structure classification database. It classifies proteins on the basis of:
proteins, those whose structure is essentially formed by β-sheets; α/β proteins, those with α-helices and β-strands;
#(H)omologous superfamily: Proteins with high structural similarity are placed at this hierarchy level, which suggests that they have evolved from a common ancestor.
One big disadvantage of CATH is that it classifies only the protein structures that are deposited in the PDB.
CATH-Gene3D: CATH is a protein database which takes its structures from the PDB, and Gene3D uses the protein structure information from CATH. The structures are split into their consecutive polypeptide chains where applicable, and their protein domains are then identified and classified according to the CATH hierarchy levels.
Uses of CATH: It shows how secondary structures are connected to each other and how proteins have evolved, helps in finding conserved sites, and helps in predicting the 3D structure of a protein.
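The four CATH levels are reflected in the dotted numeric identifiers that CATH assigns to superfamilies, with one number per level. The sketch below splits such an identifier into its Class, Architecture, Topology and Homologous superfamily parts and reports the deepest level two identifiers share. The identifiers are used purely as examples of the format, and both helper names are hypothetical.

```python
# The four levels, in order from most general to most specific.
CATH_LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def split_cath_id(cath_id: str) -> dict:
    """Map each CATH level to the identifier prefix that names it."""
    parts = cath_id.split(".")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected four dot-separated numbers, got {cath_id!r}")
    return {
        level: ".".join(parts[: index + 1])
        for index, level in enumerate(CATH_LEVELS)
    }


def deepest_shared_level(id_a: str, id_b: str):
    """Return the most specific level two identifiers share, or None if they share none."""
    a, b = split_cath_id(id_a), split_cath_id(id_b)
    shared = None
    for level in CATH_LEVELS:
        if a[level] != b[level]:
            break
        shared = level
    return shared


print(split_cath_id("3.40.50.720"))
print(deepest_shared_level("3.40.50.720", "3.40.50.300"))   # Topology
```

Two domains that agree down to the Homologous superfamily level are the ones CATH treats as likely to share a common ancestor; agreement only at higher levels indicates structural resemblance without that evolutionary claim.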
KEGG (Kyoto Encyclopedia of Genes and Genomes) is a biological database which provides information about genes and genomes, chemical reactions, biological systems, and diseases and drugs. It is a group of sixteen databases which are