
Tuesday 1 March 2011

BIO-INFORMATICS





 



Abstract:


The human body is a wonderful machine. Attempts have been made since time immemorial to understand the functioning of this machine. The convergence of biology, computational science, electronics and mathematics in the bioinformatics domain is creating new hope for patients with life-threatening diseases. There are other fields, for example medical imaging and image analysis, which might be considered part of bioinformatics. There is also a whole other discipline of biologically inspired computation: genetic algorithms, AI, neural networks.
What almost all bioinformatics has in common is the processing of large amounts of biologically derived information, whether DNA sequences or breast X-rays. Bioinformatics has become such a mainstay of genomics, proteomics, and all other *.omics (such as phenomics) that many information technology companies have entered the business or are considering entering it (see Bioinformatics world), creating a convergence of IT (information technology) and BT (biotechnology).
There are fields related to bioinformatics such as Computational biology, Genomics, Proteomics, Pharmacogenomics, Pharmacogenetics, Cheminformatics, and Medical Informatics.
This paper deals with biological databases, in which genes of interest are compared with the gene sequences stored in the database. Apart from biological databases, it also covers the programs and tools used in bioinformatics. Bioinformatics tools are software programs designed to extract meaningful information from the mass of data and to carry out this analysis step.



Bio-Informatics
1.    Introduction to Bioinformatics

1.1 What is Bioinformatics?

Roughly, bioinformatics describes any use of computers to handle biological information. In practice the definition used by most people is narrower: to them, bioinformatics is a synonym for "computational molecular biology", the use of computers to characterize the molecular components of living things.

                                                

The Tight Definition is:

“The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information.”

The Loose Definition is:
There are other fields, for example medical imaging and image analysis, which might be considered part of bioinformatics. There is also a whole other discipline of biologically inspired computation: genetic algorithms, AI, neural networks. Bioinformatics is currently defined as the study of information content and information flow in biological systems and processes.

It has evolved to serve as the bridge between observations (data) in diverse biologically-related disciplines and the derivations of understanding (information) about how the systems or processes function, and subsequently the application.

The National Center for Biotechnology Information (NCBI 2001) defines bioinformatics as:

“Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. There are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics with which to assess relationships among members of large data sets; the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information.”

1.2  Origin & History of Bioinformatics:

The history of bioinformatics can be traced back more than a century to Gregor Mendel, an Austrian monk known as the "Father of Genetics". He cross-fertilized plants of the same species that had different flower colors. Mendel illustrated that the inheritance of traits could be more easily explained if it was controlled by factors passed down from generation to generation.

Marvin Caruthers and Leroy Hood made a huge leap in bioinformatics when they invented a method for automated DNA sequencing. In 1988, the Human Genome Organisation (HUGO) was founded. This is an international organization of scientists involved in the Human Genome Project. In 1995, the first complete genome sequence of a free-living organism, the bacterium Haemophilus influenzae, was published. Bioinformatics was fuelled by the need to create huge databases, such as GenBank, EMBL and the DNA Data Bank of Japan (DDBJ), to store and compare the DNA sequence data erupting from the human genome and other genome sequencing projects.

2.    Biological Databases:
A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system.
A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record associated with a nucleotide sequence database typically contains information such as contact name; the input sequence with a description of the type of molecule; the scientific name of the source organism from which it was isolated; and, often, literature citations associated with the sequence. For researchers to benefit from the data stored in a database, two additional requirements must be met:
§         Easy access to the information
§         A method for extracting only that information needed to answer a specific biological question.
These databases include both "public" repositories of gene and protein data, like GenBank or the Protein Data Bank (PDB), and private databases, like those used by research groups involved in gene mapping projects or those held by biotech companies.
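The idea of a simple database as a collection of records with the same fields, plus a way to extract only the records that answer a specific question, can be sketched in Python as follows. This is a minimal illustration, not how any real repository is implemented: the class name `SequenceRecord`, its field names, the `query` helper and all record contents are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SequenceRecord:
    """One record of a toy nucleotide database (field names are illustrative)."""
    accession: str
    contact: str
    molecule_type: str
    organism: str
    sequence: str
    citations: List[str] = field(default_factory=list)


# Invented records, for illustration only.
DATABASE = [
    SequenceRecord("X0001", "A. Researcher", "DNA", "Escherichia coli", "ATGGCGTAA"),
    SequenceRecord("X0002", "B. Researcher", "mRNA", "Homo sapiens", "AUGGCCUGA"),
    SequenceRecord("X0003", "A. Researcher", "DNA", "Homo sapiens", "ATGCCCTAG"),
]


def query(records, **criteria):
    """Return only the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, name) == value for name, value in criteria.items())
    ]


human_dna = query(DATABASE, organism="Homo sapiens", molecule_type="DNA")
print([r.accession for r in human_dna])  # ['X0003']
```

Real biological databases add indexing, cross-references and controlled vocabularies on top of this basic record-and-query idea.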
2.1 List of Biological Databases:
GenBank:
GenBank (Genetic Sequence Databank) is one of the fastest-growing repositories of known genetic sequences. It has a flat file structure: each entry is an ASCII text file, readable by both humans and computers. In addition to sequence data, GenBank files contain information such as accession numbers and gene names, phylogenetic classification and references to published literature.
SwissProt:
This is a protein sequence database that provides a high level of integration with other databases and also has a very low level of redundancy.
GDB:
The GDB Human Genome Data Base supports biomedical research, clinical medicine, and professional and scientific education by providing for the storage and dissemination of data about genes and other DNA markers, map location, genetic disease and locus information, and bibliographic information.
EMBL:
The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications and directly submitted by researchers and sequencing groups. Data collection is done in collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ).
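To make the "flat file readable by both humans and computers" idea from the GenBank entry above concrete, here is a small Python sketch that reads a heavily simplified, GenBank-style record. The sample record is invented and omits most of what a real entry contains (features, references, multi-line fields); the function name `parse_flat_record` is hypothetical. The sketch only relies on three conventions of the format: top-level keywords start in the first column, the sequence follows an `ORIGIN` line, and `//` ends the record.

```python
# Invented, heavily simplified record in the style of a GenBank flat file.
SAMPLE = """\
LOCUS       TOY0001        12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Invented example record, not a real GenBank entry.
ACCESSION   TOY0001
ORIGIN
        1 atggcgtacg ta
//
"""


def parse_flat_record(text):
    """Collect top-level header fields and the sequence from one record."""
    fields = {}
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):        # record terminator
            break
        if line.startswith("ORIGIN"):    # sequence lines follow this keyword
            in_sequence = True
            continue
        if in_sequence:
            # Drop the leading position number, keep the groups of bases.
            sequence_parts.extend(line.split()[1:])
        elif line[:1].isalpha():         # keyword lines start in column 1
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    fields["sequence"] = "".join(sequence_parts).upper()
    return fields


record = parse_flat_record(SAMPLE)
print(record["ACCESSION"], record["sequence"])  # TOY0001 ATGGCGTACGTA
```

For real work, an established parser such as the ones shipped with BioPerl or BioJava (both discussed later in this paper) is the safer choice.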
2.2 The Database Industry:
Because of the high rate of data production and the need for researchers to have rapid access to new data, public databases have become the major medium through which genome sequence data are published. EMBL and GenBank are the two major nucleotide databases: EMBL is the European database and GenBank is the American one. EMBL and GenBank collaborate and synchronize their databases so that both contain the same information.
The principal requirements on the public data services are:
§         Data quality - data quality has to be of the highest priority. However, because the data services in most cases lack access to supporting data, the quality of the data must remain the primary responsibility of the submitter.
§         Supporting data - database users will need to examine the primary experimental data, either in the database itself, or by following cross-references back to network-accessible laboratory databases.
§         Deep annotation - deep, consistent annotation comprising supporting and ancillary information should be attached to each basic data object in the database.
§         Timeliness - the basic data should be available on an Internet-accessible server within days (or hours) of publication or submission.
§         Integration - each data object in the database should be cross-referenced to representation of the same or related biological entities in other databases.
2.2 The Creation of Sequence Databases:
Most biological databases consist of long strings of nucleotides (guanine, adenine, thymine, cytosine and uracil) and/or amino acids (threonine, serine, glycine, etc.). Each sequence of nucleotides or amino acids represents a particular gene or protein (or section thereof), respectively. Sequences are represented in shorthand, using single letter designations. This decreases the space necessary to store information and increases processing speed for analysis.
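The single-letter shorthand can be illustrated with a short Python sketch that converts three-letter amino acid names to the standard one-letter codes. The table holds the 20 standard amino acids; the function name `to_one_letter` and the example peptide are made up for illustration.

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(residues):
    """Convert a list of three-letter residue names to a one-letter string."""
    return "".join(THREE_TO_ONE[r.capitalize()] for r in residues)


peptide = ["Thr", "Ser", "Gly", "Met", "Lys"]
print(to_one_letter(peptide))  # TSGMK
```

Five residues that take fifteen characters in three-letter form fit in five characters, and the resulting string can be searched and compared with ordinary text-processing tools.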
Oligonucleotide synthesis provided researchers with the ability to construct short fragments of DNA with sequences of their own choosing. These oligonucleotides could then be used in probing vast libraries of DNA to extract genes containing that sequence.
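As a rough computational analogue of probing a DNA library with an oligonucleotide, the Python sketch below reports which library entries contain a probe sequence on either strand. In the laboratory, probing relies on hybridization rather than exact string matching, so this is only an illustration; the clone names and sequences are invented, and `probe_library` is a hypothetical helper.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def probe_library(library, probe):
    """Return names of library entries containing the probe on either strand."""
    rc = reverse_complement(probe)
    return [name for name, seq in library.items() if probe in seq or rc in seq]


# Invented clone sequences, for illustration only.
LIBRARY = {
    "clone_1": "TTGACCATGGCGTACGTTAA",
    "clone_2": "GGGTTTAAACCCGGGTTTAA",
    "clone_3": "AATACGTACGCCATTT",
}

print(probe_library(LIBRARY, "ATGGCGTACG"))  # ['clone_1', 'clone_3']
```

Here `clone_1` carries the probe sequence itself, while `clone_3` carries its reverse complement, which is why both are reported.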
For researchers to benefit from all this information, however, two additional things were required:
1)   Ready access to the collected pool of sequence information
2)   A way to extract from this pool only those sequences of interest to a given researcher.
Searching by hand through the sequences collected for a project in order to find related genes or proteins could take a researcher weeks or even months. Computer technology has provided the obvious solution to this problem.
2.3 Acquisition of sequence data:
Bioinformatics tools can be used to obtain sequences of genes or proteins of interest, either from material obtained, labeled, prepared and examined in electric fields by individual researchers/groups or from repositories of sequences from previously investigated material.


2.4 Analysis of data:
Both types of sequence can then be analyzed in many ways with bioinformatics tools. They can be assembled. Sequencing can only be performed for relatively short stretches of a biomolecule and finished sequences are therefore prepared by arranging overlapping "reads" of monomers (single beads on a molecular chain) into a single continuous passage of "code". They can be compared, usually by aligning corresponding segments and looking for matching and mismatching letters in their sequences.
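The assembly step described above, arranging overlapping reads into one continuous sequence, can be sketched with a simple greedy strategy in Python: repeatedly merge the two reads with the longest suffix-to-prefix overlap. This is a toy version under strong assumptions (error-free reads, all from the same strand, no repeats), not the method used by any particular assembler; the function names and reads are invented.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise by largest overlap until no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps; leave the contigs separate
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


reads = ["ATGGCGT", "GCGTACG", "TACGTTA"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTA']
```

Real reads contain sequencing errors and come from both strands, so practical assemblers use far more elaborate data structures and scoring than this sketch.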
3. Tools & Programs of Bioinformatics:
Bioinformatics tools are software programs designed to extract meaningful information from the mass of data and to carry out this analysis step.
Factors that must be taken into consideration when designing these tools are:
§         The end user (the biologist) may not be a frequent user of computer technology.
§         These software tools must be made available over the internet given the global distribution of the scientific research community.
3.1 Major categories of Bioinformatics Tools:
There are both standard and customized products to meet the requirements of particular projects. There is data-mining software that retrieves data from genomic sequence databases, and there are visualization tools to analyze and retrieve information from proteomic databases.
3.2 Homology and Similarity Tools:
Homologous sequences are sequences that are related by divergence from a common ancestor. The degree of similarity between two sequences can be measured, whereas their homology is a case of being either true or false. This set of tools can be used to identify similarities between novel query sequences of unknown structure and function and database sequences whose structure and function have been elucidated.
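Because similarity is a measurable quantity, it can be computed directly. The Python sketch below calculates percent identity for two sequences that have already been aligned (gaps written as "-"). Several conventions exist for the denominator; this sketch assumes the number of alignment columns, ignoring columns where both sequences have a gap. The function name and sequences are illustrative.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns in which both sequences carry the same letter."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = list(zip(aligned_a, aligned_b))
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    columns = sum(1 for x, y in pairs if not (x == "-" and y == "-"))
    return 100.0 * matches / columns if columns else 0.0


print(percent_identity("ATGC-GTA", "ATGCAGTT"))  # 75.0
```

A high score suggests, but does not prove, homology: the inference that two sequences share a common ancestor is a separate, yes-or-no conclusion drawn from the measured similarity.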



3.3 Protein Function Analysis:
This group of programs allows you to compare your protein sequence to the secondary (or derived) protein databases that contain information on motifs, signatures and protein domains.
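Motif databases often describe a motif as a short pattern of allowed residues. As a small illustration in Python, the sketch below searches a protein sequence for the N-glycosylation site pattern N-{P}-[ST]-{P} (asparagine, anything but proline, serine or threonine, anything but proline), rewritten as a regular expression. The protein sequence and the function name `find_motif` are invented; a lookahead is used so that overlapping matches are also reported.

```python
import re

# N-{P}-[ST]-{P} expressed as a regular expression; the lookahead
# allows overlapping occurrences to be found.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein, pattern=N_GLYCOSYLATION):
    """Return (1-based position, matched residues) for every motif occurrence."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(protein)]


print(find_motif("MKNVSANPTLNGSQ"))  # [(3, 'NVSA'), (11, 'NGSQ')]
```

The candidate at position 7 (NPTL) is skipped because proline follows the asparagine. A pattern match only flags a potential site; whether it is functional has to be established by other evidence.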
3.4 Structural Analysis:
This set of tools allows you to compare structures with the known structure databases. The function of a protein is more directly a consequence of its structure than of its sequence, and structural homologs tend to share functions. The determination of a protein's 2D/3D structure is therefore crucial in the study of its function.
                                     
3.5 Sequence Analysis:
This set of tools allows you to carry out further, more detailed analysis on your query sequence including evolutionary analysis, identification of mutations, hydropathy regions, CpG islands and compositional biases.
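Two of the simplest measures behind compositional-bias and CpG-island analysis are GC content and the observed/expected CpG ratio, where the expected count assumes that C and G occur independently: (number of CpG × length) / (number of C × number of G). The Python sketch below computes both for a whole sequence. The function names are illustrative, and real tools apply these measures in sliding windows with thresholds that are not covered here.

```python
def gc_content(dna):
    """Fraction of bases that are G or C."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna) if dna else 0.0


def cpg_obs_exp(dna):
    """Observed/expected ratio of the CG dinucleotide."""
    dna = dna.upper()
    c, g = dna.count("C"), dna.count("G")
    cpg = dna.count("CG")
    return cpg * len(dna) / (c * g) if c and g else 0.0


seq = "CGCGCGAT"
print(gc_content(seq))             # 0.75
print(round(cpg_obs_exp(seq), 2))  # 2.67
```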
3.6 Some examples of Bioinformatics Tools:
BLAST:
BLAST (Basic Local Alignment Search Tool) comes under the category of homology and similarity tools. It is a set of search programs that is not tied to a single platform: it can be used through the NCBI web server or run as standalone programs on Unix-like systems, Windows and macOS. It performs fast similarity searches regardless of whether the query is a protein or a DNA sequence, and a nucleotide query can be compared against the sequences in a nucleotide database.
FASTA:
FASTA (FAST-All) performs fast homology searches on both protein and nucleotide sequences. The program is one of the many heuristic algorithms proposed to speed up sequence comparison. The basic idea is to add a fast prescreen step to locate the highly matching segments between two sequences, and then extend these matching segments to local alignments using more rigorous algorithms such as Smith-Waterman.
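The two-stage idea, a fast prescreen followed by a rigorous local alignment, can be sketched in Python as follows. This is a simplified illustration and not the actual FASTA implementation: the prescreen counts shared k-letter words per diagonal, and any database sequence with enough hits on its best diagonal is then scored with a full Smith-Waterman dynamic program (the real program restricts the alignment to a band around the best region). The scoring values, the `min_hits` threshold, the function names and the sequences are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def best_diagonal(query, target, k=3):
    """Prescreen: return (diagonal, shared k-mer count) for the best diagonal."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals.most_common(1)[0] if diagonals else None


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (linear gap penalty), row by row."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            score = max(0,
                        prev[j - 1] + (match if x == y else mismatch),
                        prev[j] + gap,
                        curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def search(query, database, k=3, min_hits=2):
    """Score only the sequences that pass the k-mer prescreen."""
    results = []
    for name, target in database.items():
        hit = best_diagonal(query, target, k)
        if hit and hit[1] >= min_hits:
            results.append((name, smith_waterman_score(query, target)))
    return sorted(results, key=lambda r: -r[1])


DATABASE = {"s1": "TTACGTACGTTT", "s2": "GGGGGGGGGGGG"}  # invented sequences
print(search("ACGTACGT", DATABASE))  # [('s1', 16)]
```

The speed-up comes from the fact that the cheap word lookup discards most database sequences (here `s2`) before the expensive dynamic programming step is ever run.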
EMBOSS:
EMBOSS (European Molecular Biology Open Software Suite) is a software-analysis package. It can work with data in a range of formats and also retrieve sequence data transparently from the Web.
3.7 Application of Programming Languages in Bioinformatics:
JAVA in Bioinformatics:
Since research centers are scattered all around the globe ranging from private to academic settings, and a range of hardware and OSs are being used, Java is emerging as a key player in bioinformatics. Physiome Sciences' computer-based biological simulation technologies and Bioinformatics Solutions' PatternHunter are two examples of the growing adoption of Java in bioinformatics.
Perl in Bioinformatics:
String manipulation, regular expression matching, file parsing and data format interconversion are the common text-processing tasks performed in bioinformatics. Perl excels in such tasks and is used by many developers. BioPerl is a project that provides Perl tools for bioinformatics.
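The text-processing tasks listed above (file parsing, format interconversion and regular expression matching) can be illustrated with a short sketch. It is written in Python rather than Perl only to keep the examples in this paper in one language; the same steps map directly onto Perl. The sequences are invented and the function names are hypothetical. The sketch parses FASTA-formatted text, converts it to a tab-delimited table, and uses a regular expression to locate the EcoRI recognition site GAATTC.

```python
import re

# Invented sequences in FASTA format: a ">" header line, then sequence lines.
FASTA_TEXT = """\
>seq1 invented example
ATGGCG
TACGTA
>seq2 another invented example
GGGAATTC
"""


def parse_fasta(text):
    """File parsing: map each sequence name to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def to_tab_delimited(records):
    """Format interconversion: one 'name<TAB>sequence' line per record."""
    return "\n".join(f"{n}\t{s}" for n, s in records.items())


records = parse_fasta(FASTA_TEXT)
print(to_tab_delimited(records))

# Regular expression matching: 0-based start positions of EcoRI sites.
sites = {n: [m.start() for m in re.finditer("GAATTC", s)]
         for n, s in records.items()}
print(sites)  # {'seq1': [], 'seq2': [2]}
```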
3.8 Bioinformatics Projects:
BioJava:
The BioJava Project is dedicated to providing Java tools for processing biological data which includes objects for manipulating sequences, dynamic programming, file parsers, simple statistical routines, etc.
BioPerl:
The BioPerl project is an international association of developers of Perl tools for bioinformatics and provides an online resource for modules, scripts and web links for developers of Perl-based software.
4.    Employment Opportunities in Bioinformatics:
4.1 Career Outlook:
The need for bioinformaticians to make use of these data is well known among industrialists and academicians in the area. Demand for these trained and skilled personnel is high in academic institutions and in the bioindustries, and has opened up bioinformatics as a new career option.
4.2 Graduate Employment Opportunities:
There is a growing need nationally and internationally for bioinformaticians, especially graduates with a good grounding in computer science and software engineering, and an appreciation of the biological aspects of the problems to be solved. The activities of such individuals will include working closely with bioinformaticians to:
§         Elucidate requirements.
§         Develop new algorithms.
§         Implement computer programs and tools for bio data analysis and display of results.
§         Design databases for bio data.
§         Participate in data analysis.
4.3 Career Outlook in India:
The bioinformatics market is expected to grow by about 10% annually for years to come. Industry watchers say that the growing interest in bioinformatics may even lead to an acute shortage of experts in this segment worldwide, which perhaps also explains the poaching of staff that organizations like CCMB have to face.
5.    Application Areas for Bioinformatics:
Molecular medicine:
The human genome will have profound effects on the fields of biomedical research and clinical medicine. Every disease has a genetic component. This may be inherited or a result of the body's response to an environmental stress which causes alterations in the genome (e.g. cancers, heart disease and diabetes). The completion of the human genome means that we can search for the genes directly associated with different diseases and begin to understand the molecular basis of these diseases more clearly. This new knowledge of the molecular mechanisms of disease will enable better treatments, cures and even preventative tests to be developed.
Personalized medicine:
Personalized medicine builds on pharmacogenomics, the study of how an individual's genetic inheritance affects the body's response to drugs. At present, some drugs fail to make it to the market because a small percentage of the clinical patient population shows adverse effects to a drug due to sequence variants in their DNA.
Gene therapy:
In the not too distant future, the potential for using genes themselves to treat disease may become a reality. Gene therapy is the approach used to treat, cure or even prevent disease by changing the expression of a person's genes.
Microbial genome applications:
The arrival of the complete genome sequences and their potential to provide a greater insight into the microbial world and its capacities could have broad and far reaching implications for environment, health, energy and industrial applications. The US Department of Energy (DOE) initiated the MGP (Microbial Genome Project) to sequence genomes of bacteria useful in energy production, environmental cleanup, industrial processing and toxic waste reduction.
Climate change Studies:
Recently, the DOE (Department of Energy, USA) launched a program to decrease atmospheric carbon dioxide levels. One method of doing so is to study the genomes of microbes that use carbon dioxide as their sole carbon source.
                                                    
                            









