Study will help our genetic understanding of dangerous new viruses. Image: University of Bristol
Bristol, UK (Scicasts) – Scientists studying the genes and proteins of human cells infected with a common cold virus have developed a new gene identification technique that could increase the genetic information we hold on animals by around 70 to 80 per cent.
The findings, published in Nature Methods, could revolutionize our understanding of animal genetics and disease, and improve our knowledge of dangerous viruses such as SARS that jump the species barrier from animals to humans.
Modern advances in genome sequencing ― the process of determining the genetic information and variation controlling everything from our eye colour to our vulnerability to certain diseases ― have enabled scientists to uncover the genetic codes of a wide range of animals, plants and insects.
Until now, correctly identifying the genes and proteins hidden inside the genetic material of a newly sequenced species has been a monumental undertaking requiring the careful observation and cataloguing of vast amounts of data about the thousands of individual genes that make up any given animal, plant or insect.
 
Dr. David Matthews, the study’s lead author and a Senior Lecturer in Virology at the University of Bristol’s School of Cellular and Molecular Medicine, said:  “Gene identification is mainly led by computer programmes which search the genome for regions that look like genes already identified in other animals or humans. However, this type of analysis is not always effective.”
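To make the idea concrete, here is a deliberately simplified Python sketch of that kind of similarity-based search. The sequences, the 20-base seed length and the window size are invented for illustration; the programs Dr Matthews refers to rely on far more sophisticated alignment and statistical models.

```python
# Toy homology-style scan: flag windows of a "new" genome that share
# exact k-mers with genes already annotated in another species.
# All sequences and parameters below are invented placeholders.

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def candidate_gene_regions(genome, known_genes, k=20, window=300):
    """Return (start, end) windows containing a k-mer seen in a known gene."""
    known = set()
    for gene in known_genes:
        known |= kmers(gene, k)
    hits = []
    for start in range(0, len(genome) - window + 1, window):
        if kmers(genome[start:start + window], k) & known:
            hits.append((start, start + window))
    return hits

if __name__ == "__main__":
    genome = "ATG" + "ACGT" * 200 + "TGA"        # stand-in "newly sequenced" genome
    known_genes = ["ACGTACGTACGTACGTACGTACGT"]   # stand-in annotated gene
    print(candidate_gene_regions(genome, known_genes))
```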

The Bristol team has now discovered a more effective way of detecting the genetic information present in animals, plants and insects using cutting-edge analysis tools to directly observe the genes and all the proteins they make.
To prove their technique worked, the researchers conducted an experiment to see how good their process was at gene discovery. Human cells were infected with a well-understood common cold bug to mimic a newly discovered virus. These infected cells were then analyzed using the technique as if they were cells from a newly sequenced organism infected with a newly discovered virus.
The resulting list of “discovered” genes and proteins, when compared with the genetic information already known about humans and the cold virus, showed the approach to be extremely successful and demonstrated the power of the method.
A similar analysis of hamster cells provided directly observed evidence for the existence of thousands of genes and proteins in hamsters in a single, relatively inexpensive experiment. Direct evidence for almost all of these genes and proteins is not available in the ‘official’ lists of hamster genes and proteins.
Dr Matthews added:  “These findings open up the potential to take powerful analysis tools currently used to study human diseases and apply them to study any animal, insect or even plants – something previously either very challenging or simply not possible. This technique will also make it easier and much more efficient for scientists to study anything from farm animals and their diseases to insect pests that damage crops.
“In recent years, a number of dangerous new viruses have been transmitted from animals to humans including Influenza, SARS, Ebola, Hendra and Nipah viruses.  Earlier this year three people became seriously ill and two of them died when they contracted a new SARS-like virus in the Middle East which is thought to have come directly from bats.
“Why bats harbour these viruses with limited ill effect is a mystery as the genetic make-up of these creatures is poorly understood. We are starting to apply our technique to laboratory grown bat cells to analyze the genetic and protein content of bats to gain more insight into their genetics and to understand how they are able to apparently co-exist with these viruses which all too often prove fatal in humans.”




Nature 478, 143-145 (2011); doi:10.1038/nj7367-143a. This article was originally published in the journal Nature.
Trainees in bioinformatics and computational biology should seek depth of knowledge over breadth.
This January, Alexander Sczyrba and his colleagues published what was at the time the largest metagenome ever assembled (M. Hess et al. Science 331, 463–467; 2011). Collecting and collating genetic material from environmental samples is always a challenge; in this case, the metagenome came from parts of a cow's stomach, and contained more than 27,000 biomass-degrading genes and 15 microbe genomes. It totalled 268 gigabases. “We had to develop new algorithms to run analyses on computer clusters, or clouds, as using traditional methods would have taken 80 years on a single computer,” says Sczyrba.
Sczyrba wants to focus his career on similar complex, leading-edge analyses. But the path hasn't been straightforward; when he was looking for a postdoc in 2008, it was tough to find institutions that could generate or analyse such large data sets. He landed a post at the US Department of Energy Joint Genome Institute (JGI) in Walnut Creek, California: a large-scale sequencing facility that offered access to data, computing resources and brain power. In 2010 alone, the JGI sequenced 170 metagenomes.
Soon, however, big sequencing centres won't be the only sources of data. “With next-generation sequencing, everybody can produce sequences; it's the analysis that is getting more important,” says Sczyrba. Modern biologists need to be able to manage large data sets and explore new computational tools.

Finding a path

Qualified candidates are hard to find, say recruiters in both industry and academia. That may be because, so far, there hasn't been a typical career path for bioinformaticians or computational biologists. “Often we find that it's the people motivated to simply roll up their sleeves and figure out on their own how to work with these data that have the strongest skills,” says Jim Bristow, deputy director of programmes at the JGI. As more departments are established, the often circuitous routes once required to attain such skills will probably be replaced by more direct paths. The challenge is finding a training programme that will help researchers to keep pace in a rapidly changing, technology-driven field.
By conventional definitions, bioinformaticians develop new ways to acquire, organize and analyse biological data, whereas computational biologists develop mathematical models or simulation techniques to work out the data's biological significance. But these lines are blurring, and departments and training programmes are both proliferating and combining the fields.
“The demand for computational-biology training that we have today is way more than was expected a decade ago,” says Burkhard Rost, president of the International Society for Computational Biology, which is based in La Jolla, California.

Not just skin deep

The most obvious training route — pursuing an undergraduate degree in bioinformatics — isn't necessarily the best for a budding researcher. Some undergraduate programmes fail to provide the depth of knowledge sought by employers. “Often these trainees come with great-looking CVs, but when we press them on what they are capable of doing, they tend to be rather weak,” says Nick Goldman, research and training coordinator at the European Bioinformatics Institute in Hinxton, UK. Goldman is most impressed by applicants who have actively pursued training in both informatics and the area of research in which they're interested — for example, someone with a computing degree who has done a molecular-biology project (see 'Talent checklist').
Box 1: Basic skills: Talent checklist
  • Be at least conversant in the broad range of disciplines contributing to bioinformatics — from statistics to molecular biology to computer science.
  • Most work, especially in industry, is done in teams, so communication skills are always in demand.
  • Get experience in handling massive data sets. Learn to parse data or run analyses in parallel — using, for example, cloud computing.
  • Learn to write programmes in languages such as Perl or R (see the short sketch after this list).
  • Cultivate a deep knowledge of at least one area of biology. V.G.
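The checklist names Perl and R; purely as a hedged illustration of the same two skills in Python (the input filename and the GC-content task are hypothetical), here is a sketch that parses FASTA records and fans the per-record work out across local cores:

```python
# Sketch: parse a FASTA file and compute per-record GC content in parallel.
# "example.fasta" is a hypothetical input; the pattern (stream records,
# then hand CPU work to a process pool) is the point.
from concurrent.futures import ProcessPoolExecutor

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def gc_content(record):
    header, seq = record
    gc = sum(seq.upper().count(base) for base in "GC")
    return header, (gc / len(seq) if seq else 0.0)

if __name__ == "__main__":
    records = list(read_fasta("example.fasta"))   # hypothetical input file
    with ProcessPoolExecutor() as pool:
        for header, gc in pool.map(gc_content, records):
            print(f"{header}\t{gc:.3f}")
```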
Goldman says that students should be wary of learning only the latest software or genome-mining tool without gaining a full understanding of the biological topics. Recruiters want savvy scientists who understand technology's ability to address questions. Steve Cleaver, head of quantitative biology at Novartis Institutes for BioMedical Research in Cambridge, Massachusetts, says that the key to a sustainable career in the field is the ability to turn a scientific question into a statistical hypothesis. “But those who can ride the tech waves are well positioned to find career success,” he adds. Without a doubt, he says, the next generation of biologists will be more conversant in bioinformatics. “It's all about cross-training — getting the appropriate training in both analytical science and biology during graduate school to make a meaningful contribution,” says Cleaver.
Picking a programme with comprehensive training modules in statistics, computer science and/or biology can be an effective strategy. But Søren Brunak, director of the Center for Biological Sequence Analysis at the Technical University of Denmark in Lyngby, says that researchers should avoid training programmes that focus on just a few data types. With the expansion in high-throughput sequencing of genomes, proteins and metabolites, programmes that focus on a single area, such as genomics, don't adequately prepare students for the job market, says Brunak. “Analyses conducted now are much more reliant on combinations of data types — for example, combining molecular-level data with patient records — than they were before,” he notes.
Alexander Sczyrba: "We don't know where we'll be in ten years because the technologies and ideas are moving so fast".
Aspiring principal investigators can go one step further to find the best graduate training for the career they want, by deciding whether to focus on developing tools, such as algorithms to analyse data, or applying those tools to turn data into knowledge. “The most important decision a trainee can make is what kind of research programme they want to build,” says Robert Murphy, founding director of a computational-biology PhD programme run jointly between the University of Pittsburgh in Pennsylvania and Carnegie Mellon University, also in Pittsburgh.
The University of California, Los Angeles (UCLA), has a bioinformatics PhD programme designed to shape the tool developers. It accepts only candidates who demonstrate a core strength in an analytical field such as computer science or maths, or have a dual degree combining one of these fields with biology. Christopher Lee, director of the programme, says that many bioinformatics courses are affiliated with data-rich biology labs on campus, supplying the students needed to tackle a flood of data. They often lack, however, the matrix of expertise necessary to conduct innovative analyses. Lee hopes that the UCLA programme will foster such expertise.
A few graduate training programmes, notably those at the Netherlands Bioinformatics Center in Nijmegen, cater to students with backgrounds in either computer science or biology. “We want to train the tool shapers as well as the people more into applying the tools in a biological setting,” says Celia van Gelder, the centre's education project leader. “Over the past 10–20 years, the field of biology has become more computational, with bioinformatics serving as an interdisciplinary field that links researchers who can't otherwise readily talk to one another.” The scope of work is widening, she says. As a result, demand for bioinformatics training continues to increase across Europe — with greater emphasis placed on data analysis at all levels. “We produce trainees who have multidisciplinary training in molecular-biology principles as well as algorithms to deal with data,” says Jaap Heringa, the centre's scientific director for bioinformatics education. “Things move so fast in bioinformatics, we are constantly innovating our courses,” he adds.

Murphy agrees; Carnegie Mellon and the University of Pittsburgh offer in-depth training. “We are pretty clear in the application materials that our programme is not for people who want to get enough of a smattering of computational biology to get a job,” says Murphy.

Expanding options

Celia van Gelder: "We want to train the tool shapers as well as the people more into applying the tools".
This trend towards creating more comprehensive, interdisciplinary training programmes has gained momentum at biology strongholds in the United States. In July 2010, Dartmouth Medical School in Hanover, New Hampshire, established the Institute for Quantitative Biological Sciences in nearby Lebanon. Its graduate offerings combine modules in bioinformatics, biostatistics and epidemiology. “We have created what we think is a model of the future — training computational-biology students to speak multiple languages beyond bioinformatics,” says the centre's director, Jason Moore. He adds that the key is assuming complexity rather than simplicity when approaching a problem.
In August, Moore secured funding to create a US National Institutes of Health (NIH) Center for Biomedical Research Excellence, through which he will mentor five early-career bioinformatics faculty members, to be recruited over the next 3–4 years. After two years of learning how to secure competitive funding, among other things, trainees will be required to submit an application for an R01 grant, the NIH's main funding mechanism. “We really want to provide a well rounded education so that our new recruits can secure funding for — and conduct — well designed studies in computational biology,” says Moore.
Other medical schools are also taking the plunge. Duke University School of Medicine in Durham, North Carolina, formed its Department of Biostatistics and Bioinformatics in 2000. This year, it opens its first master's programme, says Elizabeth Delong, chair of the department.
And in September, the University of Michigan Medical School in Ann Arbor established a computational-medicine and bioinformatics department to help attract new faculty members and trainees. In June, Emory University School of Medicine in Atlanta, Georgia, launched a biomedical-informatics department with the goal of combining expertise in imaging, computer science and biology to improve patient care. It will recruit four or five researchers over the next few years. “Our particular strength is training computer scientists who want to transition into biomedical informatics, and bringing them together with clinicians to use informatics to treat disease,” says department chair Joel Saltz.
Qualified postdocs remain in demand. “It can be very difficult for individual investigators to hire a postdoc in bioinformatics,” says Tom Tullius, interim chair of the bioinformatics programme at Boston University in Massachusetts. He attributes the paucity of candidates in part to efforts over the past several years to build large teams at high-powered institutes — such as the Broad Institute in Cambridge, Massachusetts, or the Wellcome Trust Sanger Institute in Cambridge, UK — leaving smaller labs struggling to find talent. The growth of training programmes could ease this.
Now that sequencing centres are no longer the sole providers of data, individual researchers, particularly at medical centres, will have ample data to fuel research and training. “We've passed out of the period of genome projects where there were amazing public data raining down from the heavens; it's now possible to do exciting work without being associated with data-generating centres,” says Lee.
Sczyrba, who begins a junior faculty position in metagenomics at the University of Bielefeld Center for Biotechnology in Germany this autumn, says that unpredictability is what makes the discipline so exciting. “We don't know where we will be in ten years because the technologies and ideas are moving so fast,” he says. As Cleaver notes: “Perhaps the best career strategy is to stay flexible and curious.”




John Hawks at ScienceOnline2012 - Photo by Russ Creech

About the author


JOHN HAWKS is Associate Professor of Anthropology at the University of Wisconsin–Madison. He was trained as a paleoanthropologist, studying human evolution from an integrative perspective. His research focuses on the processes affecting human genetic evolution across the last 6 million years.
I was talking with a scientist last week who is in charge of a massive dataset. He told me he had heard complaints from many of his biologist friends that today's students are trained to be computer scientists, not biologists. Why, he asked, would we want to do that when the amount of data we handle is so trivial?
Now, you have to understand, to this person a dataset of 1000 whole genomes is trivial. He said, don't these students understand that in a few years all the software they wrote to handle these data will be obsolete? They certainly aren't solving interesting problems in computer science, and in a short time, they won't be able to solve interesting problems in biology.
I said, well, yeah. I've been through this once already -- fifteen years ago, the hot thing was setting up a wet lab for sequencing -- or worse, RFLP. That sure looked like a lot of data at the time, and a lot of students spent a lot of time figuring out how to do it. Some of them successfully started careers, got grants, and moved on with the times. Others fell by the wayside. Meanwhile, clusters of people at the DOE, Whitehead Institute, Wellcome Trust and several private companies were spending their time figuring out faster and faster ways of automating sequencing. Now one machine can do the work of ten thousand 1990s graduate students.
Anyway, I was thinking about that conversation. And then I ran across an article by Nova Spivack, describing the new Wolfram Alpha.
Stephen Wolfram is building something new -- and it is really impressive and significant. In fact it may be as important for the Web (and the world) as Google, but for a different purpose. It's not a "Google killer" -- it does something different. It's an "answer engine" rather than a search engine.
...
Wolfram Alpha is a system for computing the answers to questions. To accomplish this it uses built-in models of fields of knowledge, complete with data and algorithms, that represent real-world knowledge.
For example, it contains formal models of much of what we know about science -- massive amounts of data about various physical laws and properties, as well as data about the physical world.
Based on this you can ask it scientific questions and it can compute the answers for you. Even if it has not been programmed explicitly to answer each question you might ask it.
This sounds very pie-in-the-sky. And indeed, commenters on the article (as well as this article by Cycorp head Doug Lenat) come up with lots of questions that would be impossible for such a system to answer.
But I'm not really interested in the things that will stump the system. Compared to restaurant reviews and kinship systems, bioinformatics is pretty simple. Right now, there are two things that make it a multi-year effort to learn: mutually incompatible databases, and the various kludges necessary to model ascertainment bias.
I'm a Mathematica user, and am familiar with its theorem-proving capabilities. Mathematica already has genome lookup utilities, which I use quite often -- it's just easier to do a lookup on my own system than to plow through two or three webpages to get to the query. It really wouldn't take that much to bring intelligent and interactive genome analysis into the system.
Alpha could turn into an online robot armed with basic genetics knowledge. And if not Alpha -- genetics is a logical priority for Wolfram, but it may not be the first or primary one -- certainly some other system using similar technology will emerge. Put it to work on public databases of genetic information, and you have a system that can resolve the incompatibilities by adding semantic knowledge. A bit of effort on existing databases would allow the resolution of discrepancies in ascertainment. Or, more likely, another couple of years of whole-genome sequencing will solve most ascertainment biases by drowning them in new data.
So it's not a stretch for me to imagine a year from now entering this search query:
"List all human genes with significant evidence of positive selection since the human-chimpanzee common ancestor, where either the GO category or OMIM entry includes 'muscle'"
It seems to me that bioinformatics is what generates the output to that query. What you do with the output of that query is evolutionary biology.
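As a rough sketch of the bioinformatics that would sit behind such a query today, here is a hedged Python example. Every file name and column below is hypothetical, standing in for a selection scan, GO annotations and OMIM entries that in reality live in separate, mutually incompatible databases.

```python
# Hypothetical tables standing in for real resources:
#   selection.tsv : gene, selection_p   (genome-wide scan vs. chimpanzee)
#   go.tsv        : gene, go_term
#   omim.tsv      : gene, omim_entry
import pandas as pd

selection = pd.read_csv("selection.tsv", sep="\t")
go = pd.read_csv("go.tsv", sep="\t")
omim = pd.read_csv("omim.tsv", sep="\t")

# "Significant evidence of positive selection" reduced to a p-value cut-off.
significant = selection[selection["selection_p"] < 0.01]

# Genes whose GO category or OMIM entry mentions 'muscle'.
muscle_go = go[go["go_term"].str.contains("muscle", case=False)]["gene"]
muscle_omim = omim[omim["omim_entry"].str.contains("muscle", case=False)]["gene"]
muscle_genes = set(muscle_go) | set(muscle_omim)

answer = significant[significant["gene"].isin(muscle_genes)]
print(answer.sort_values("selection_p").to_string(index=False))
```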
So that raises the obvious question. Tomorrow's high-throughput plain-English bioinformatics tool will do the work of ten thousand 2009 graduate students. If a freely-available (or heck, even a paid) service can do the bioinformatics, what should today's graduate students be learning?
UPDATE (2009-03-19):
Some folks have interesting reactions to this post, including Thomas Mailund and Dan MacArthur. They make good points.
I will add that I'm not arguing against modeling or simulation in biology. There are lots of interesting things in evolutionary biology you can do -- must do, in all practical terms -- with computers. But I don't like the five-year degree program in genetics where only one semester is given to population genetics, and most of the student's time is spent learning scripting, doing data entry, and figuring out ten or twelve database formats.
I come back to my first example -- fifteen years ago, people were telling you how essential and wonderful sequencing would always be. If you're pursuing a five-year degree program and two or three years of postdoc, I hope you're thinking about what skills you'll need fifteen years from now.


According to the new market research report “Bioinformatics Market – Advanced Technologies, Global Forecast and Winning Imperatives”, published by MarketsandMarkets, the global bioinformatics market is expected to reach $8.3 billion by 2014, growing at a high CAGR of 24.8%. While knowledge management formed the largest submarket in 2009, at $1.3 billion, the bioinformatics platforms market is expected to have the greatest market share in 2014, at an estimated $3.9 billion, due to rising demand from the U.S. and Europe.
Browse the in-depth Table of Contents of the Bioinformatics Market report. 
Early buyers will receive 10% customization of reports 

The bioinformatics platforms market is growing at a significant pace with increasing demand from the U.S. and Europe. This trend is supported by rising demand for sequencing platforms as life science research makes greater use of techniques such as gene expression analysis, sequence analysis, and protein expression analysis.
Bioinformatics uses information technology, statistics, and algorithms to integrate biological data. Pharmaceutical companies are now adopting automated technologies to manufacture effective therapies and drugs due to increasing concerns about drug safety and the stringent regulations that govern clinical trials for drug discovery. Pharmaceutical companies have increased their focus on process improvement and quality in the current competitive scenario, as there is little scope for price escalation and product differentiation.
SCOPE AND FORMAT 
The report segments the global bioinformatics market as follows: 

  • Bioinformatics platforms
  • Content/knowledge management tools
  • Bioinformatics services
  • Bioinformatics applications.
About MarketsandMarkets 
MarketsandMarkets (M&M) is a global market research and consulting company based in the U.S. We publish strategically analyzed market research reports and serve as a business intelligence partner to Fortune 500 companies across the world. MarketsandMarkets also provides multi-client reports, company profiles, databases, and custom research services.

M&M covers thirteen industry verticals, including advanced materials, automotive and transportation, banking and financial services, biotechnology, chemicals, consumer goods, energy and power, food and beverages, industrial automation, medical devices, pharmaceuticals, semiconductor and electronics, and telecommunications and IT.
We at MarketsandMarkets are inspired to help our clients grow by providing apt business insight with our huge market intelligence repository. To know more about us and our reports, please visit our website http://www.marketsandmarkets.com
Contact: 
Mr. Rohan 
North - Dominion Plaza, 
17304 Preston Road, 
Suite 800, Dallas, TX 75252 
Tel: +1-888-6006-441 
Email: sales(at)marketsandmarkets(dot)com 





About the author 

Bioinformatics programmer at a pediatric hospital


Spending $55k on a 512GB machine (a Big-Ass Server™, or BAS™) can be a tough pitch for a bioinformatics researcher to make to a department head.
Speaking as someone who keeps his copy of CLR safely stored in the basement, ready to help rebuild society after a nuclear holocaust, I am painfully aware of the importance of algorithm development in the history of computing, and the possibilities for parallel computing to make problems tractable.

Having recently spent 3 years in industry, however, I am now more inclined to just throw money at problems. In the case of hardware, I think this approach is more effective than clever programming for many of the current problems posed by NGS.

From an economic and productivity perspective, I believe most bioinformatics shops doing basic research would benefit more from having access to a BAS™ than a cluster. Here's why:
  • The growth in multicore/multiprocessor machines and in memory capacity has outpaced the speed of networks. NGS analyses tend to be memory-bound and I/O-bound rather than CPU-bound, so relying on a cluster of smaller machines can quickly overwhelm a network.
  • NGS has pushed high-performance computing beyond BLAST and protein structure prediction into dozens of different little analyses, with tools that change on a monthly basis or are homegrown to deal with special circumstances. There isn't the time or the ability to rewrite each of these for parallel architectures.
If those don't sound very convincing, here is my layman's guide to dealing with the myths you might encounter concerning NGS and clusters:

Myth: Google uses server farms. We should too.


Google has to focus on doing one thing very well: search.

Bioinformatics programmers have to explore a number of different questions for any given experiment. There is no time to develop a parallel solution for many of these questions, as most will lead to dead ends.

Many bioinformatic problems, de novo assembly being a prime example, are notoriously difficult to divide among several machines without being overwhelmed by messaging. Imagine trying to divide a jigsaw puzzle among friends sitting at several tables: you would spend more time talking about the pieces than fitting them together.

Myth: Our development setup should mimic our production setup


An experimental computing setup with a BAS™ allows researchers to freely explore big data without having to think about how to divide it efficiently. If an experiment is successful and there is a need to scale up to a clinical or industrial platform, that can happen later.

Myth: Clusters have been around a long time so there is a lot of shell-based infrastructure to distribute workflows


There are tools for queueing jobs, but they are often of little help in managing workflows that mix parallel and serial steps - for example, waiting for a set of steps to finish before merging their results.
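As a minimal sketch of that missing glue (sample names and the per-sample "work" are placeholders, not a real pipeline), the fan-out-then-merge pattern looks like this in plain Python, with no cluster scheduler involved:

```python
# Fan out per-sample work across local cores, then run a serial merge
# step only after every parallel step has finished.
from concurrent.futures import ProcessPoolExecutor

SAMPLES = ["sample_A", "sample_B", "sample_C"]   # placeholder sample names

def process_sample(name):
    # stand-in for an alignment or counting step
    return name, sum(ord(c) for c in name)

def merge(results):
    # serial step: runs once, after all parallel steps are done
    return {name: value for name, value in results}

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(process_sample, SAMPLES))  # implicit barrier
    print(merge(results))
```

The call to pool.map acts as the barrier: the serial merge step cannot start until every parallel step has returned.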

Various programming languages have features to take advantage of clusters. For example, R has SNOW. But Rsamtools requires you to load BAM files into memory, so a BAS™ is not just preferable for NGS analysis with R, it's required.

Myth: The rise of cloud computing and Hadoop means that homegrown clusters are irrelevant, and also that we don't need a BAS™


The popularity of cloud computing in bioinformatics is also driven by the newfound ability to rent time on a BAS™. The main problem with cloud computing is the bottleneck posed by transferring gigabytes of data to the cloud.
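A back-of-the-envelope calculation makes the point; both numbers below are assumptions rather than figures from this post:

```python
# Rough upload time for an NGS dataset; both inputs are assumptions.
dataset_gb = 200        # e.g. a few lanes of FASTQ
uplink_mbps = 100       # a typical institutional uplink
hours = dataset_gb * 8 * 1000 / uplink_mbps / 3600
print(f"{hours:.1f} hours before any analysis starts")   # roughly 4.4 hours
```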

Myth: Crossbow and Myrna are based on Hadoop, so we can develop similar tools


Ben Langmead, Cole Trapnell, and Michael Schatz, alums of Steven Salzberg's group at UMD, have developed NGS solutions using the Hadoop MapReduce framework.
  • Crossbow is a Hadoop-based implementation of Bowtie.
  • Myrna is an RNA-Seq pipeline.
  • Contrail is a de novo short read assembler.
These are difficult programs to develop, and these examples are also somewhat limited experimental proofs of concept, or are married to components that may be undesirable for certain analyses. The Bowtie stack (Bowtie, TopHat, Cufflinks), while revolutionary in its implementation of the Burrows-Wheeler transform, is itself built around the limitations of computers in the year 2008. For many it lacks the sensitivity to deal with, for example, 1000 Genomes data.
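For readers who have not met the transform Bowtie is built on, here is a minimal, naive Python sketch of the Burrows-Wheeler transform. Real aligners do not sort rotations like this; they construct the transform via suffix arrays and wrap it in an FM-index.

```python
# Naive Burrows-Wheeler transform: sort all rotations of the text and
# take the last column. Quadratic in time and memory, for illustration only.
def bwt(text: str) -> str:
    text = text + "$"  # unique end-of-string marker
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("ACAACG"))
```

The transform tends to cluster identical characters together, which is what makes the compressed, searchable indexes behind Bowtie possible.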

The dynamic scripting languages used by most bioinformatics programmers are not as well suited to Hadoop as Java is. To imply that we can all develop tools of this sophistication is unrealistic. Many bioinformatics programs are not even threaded, much less designed to work across several machines.

Myth: Embarrassingly parallel problems imply a cluster is needed

 

A server with 4 quad-core processors is often adequate for handling these embarrassing problems. Dividing the work just tends to lead to further embarrassments.

 

Here is a particularly telling quote from Biohaskell developer Ketil Malde on Biostar:
In general, I think HPC are doing the wrong thing for bioinformatics. It's okay to spend six weeks to rewrite your meteorology program to take advantage of the latest supercomputer (all of which tend to be just a huge stack of small PCs these days) if the program is going to run continuously for the next three years. It is not okay to spend six weeks on a script that's going to run for a couple of days.

In short, I keep asking for a big PC with a bunch of the latest Intel or AMD cores, and as much RAM as we can afford.

Myth: We don't have money for a BAS™ because we need a new cluster to handle things like BLAST


IBM System x3850 X5 expandable to 1536GB, mouse not included
Even the BLAST setup we think of as the essence of parallelism (a segmented genome index - every node gets a part of the genome) is often not the one that many institutions have settled on. Many rely on farming out queries to a cluster in which every node has the full genome index in memory.
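A hedged sketch of that query-farming approach on a single many-core machine: it assumes NCBI BLAST+ is installed, and the query-chunk and database names are hypothetical.

```python
# Farm out query chunks in parallel; every worker searches the *full*
# database, mirroring the "whole index on every node" setup described above.
# Requires NCBI BLAST+ on PATH; file and database names are hypothetical.
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_blast(chunk):
    out = chunk.replace(".fasta", ".tsv")
    subprocess.run(
        ["blastn", "-query", chunk, "-db", "my_genome_db",
         "-outfmt", "6", "-out", out],
        check=True,
    )
    return out

chunks = sorted(glob.glob("query_chunk_*.fasta"))
with ThreadPoolExecutor(max_workers=8) as pool:   # blastn itself does the CPU work
    outputs = list(pool.map(run_blast, chunks))

with open("all_hits.tsv", "w") as merged:
    for path in outputs:
        merged.write(open(path).read())
```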

Secondly, mpiBLAST appears to be better suited to dividing an index among older machines than among today's, which typically have >32GB of RAM. Here is a telling FAQ entry: 

I benchmarked mpiBLAST but I don't see super-linear speedup! Why?!

mpiBLAST only yields super-linear speedup when the database being searched is significantly larger than the core memory on an individual node. The super-linear speedup results published in the ClusterWorld 2003 paper describing mpiBLAST are measurements of mpiBLAST v0.9 searching a 1.2GB (compressed) database on a cluster where each node has 640MB of RAM. A single node search results in heavy disk I/O and a long search time.


Nature 470, 295-296 (2011); doi:10.1038/nj7333-295a. This article was originally published in the journal Nature.
With biological databases growing in size and number, curators are needed to update and correct their contents. For those who prefer computers to pipettes, there are opportunities.
Biologist and self-confessed bookworm Klemens Pichler thinks that he has found his ideal vocation. Pichler is a biocurator at the European Bioinformatics Institute (EBI) in Hinxton, UK, working on the Universal Protein Resource (UniProt) database. Some scientists would find it onerous to spend their days reading papers and sifting through and cross-referencing data. Pichler sees it as satisfying detective work, with a well-organized database as the result.
Biocurators are an unusual type of biologist. Their job is to make sure that the data such as gene or protein sequences entered into large biological databases are standardized and annotated so that other biologists can understand them. “Once you have generated a sequence and identified a gene, there is an enormous amount of pre-existing data that you search that gene against. You need an expert to refine that information and make it usable,” says Owen White, a bioinformatician at the University of Maryland School of Medicine in Baltimore. White developed the first genome-annotation software in 1995, and has been involved in several high-profile genome-sequencing projects.
At present, the number of biocurators is small — the International Society of Biocuration, founded in late 2008, has just 300 members who work at some 100 organizations. But the number is likely to increase as sequencing becomes easier and biological data continue to roll in. By July 2008, more than 18 million articles had been indexed in the PubMed biomedical database, and nucleotide sequences from more than 260,000 organisms had been submitted to the GenBank database (see Nature 455, 47–50; 2008). Started in 2008, the 1000 Genomes project has added to the data influx.
Pichler started work at UniProt after completing a fairly typical early academic career path: a degree in biology at the University of Vienna; postgraduate lab experience at Harvard University in Cambridge, Massachusetts; and a PhD in virology at the University of Erlangen-Nürnberg in Germany, followed by a brief postdoc position there. It was during his postdoc that Pichler realized that he was on the wrong track. “I had grown tired of the frustrations of lab work,” he says. He read around and discovered biocuration; this was the change he had been looking for. “I've always been fond of computers but I never got round to integrating that into my career,” he says. Biocuration, Pichler found, was a way to make use of his training and move towards bioinformatics.
“It's a wonderful career,” says Judy Blake, a bioinformatician at the Jackson Laboratory in Bar Harbor, Maine. Blake is a principal investigator on the Mouse Genome Informatics project, which employs 31 biocurators across multiple sites. She says that biocuration provides access to intellectual science without the stresses and responsibilities of finding funding and producing publishable results. Some researchers-turned-biocurators also relish the opportunity to be more of a generalist after academic careers that had a narrow scope.

Practical understanding

Klemens Pichler: "You have to like reading and delving into matters, rummaging around looking for clues."
Although a PhD is not required, prospective biocurators need to be well trained in biology, with at least an undergraduate degree in a biological science and some related lab work. “Lab experience is important,” says Sandra Orchard, a senior scientific database curator at the EBI. “You can teach people curation but you can't go back and teach them ten years at the bench.” Such experience helps biocurators to understand the data that they're curating and how those data were generated.
Some universities offer specialist degree courses in biological information and the more software-design oriented bioinformatics, but none has a formal curation degree course specific to biological data. General data-curation programmes are available at the University of Illinois at Urbana-Champaign and the Digital Curation Centre in Edinburgh, UK, which offers short courses.
At UniProt, which employs almost 70 curators in Britain, Switzerland and the United States, Pichler spends half his time digging around to find out more about the protein sequences — the order of amino acids in a given protein — that are sent to the project from researchers around the world. He takes all the information he receives with each sequence and compares it with existing entries in the database. He also does a thorough literature search. “You have to be a bit of a bookworm; you have to like reading and delving into matters and rummaging around and looking for clues,” says Pichler. He routinely scours the literature to find, for example, germane bits of information about the structure and function of a protein sequence. Next, he organizes and standardizes that information so others can interpret and understand it. “I concoct a new database entry, which then undergoes several rounds of quality control before it ends up being publicly available,” he says.
The other half of Pichler's job is more technical, veering towards bioinformatics and software. He writes 'rules' so that computer programmes can annotate sequences with the structure and function of the genes or proteins. Researchers can then use these rules on their computers to predict protein function and structure from sequence data. Similar tasks are required for other databases, from those focused on gene-sequencing, such as Blake's mouse-genome project, to efforts such as the Gene Ontology project, which aims to standardize gene representation across species.
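As a toy illustration of what such a rule might look like in code (the motif, its name and the annotation text are invented placeholders, not real UniProt rules):

```python
# Toy annotation rule: if a protein sequence matches a motif, attach a
# function annotation automatically. The motif and annotation are invented;
# real curation rules combine many conditions and evidence codes.
import re

RULES = [
    # (rule name, regular-expression motif, annotation to attach)
    ("toy_zinc_finger", re.compile(r"C..C.{12}H..H"), "putative zinc-finger domain"),
]

def annotate(sequence):
    annotations = []
    for name, motif, note in RULES:
        if motif.search(sequence):
            annotations.append((name, note))
    return annotations

example = "MKVCAACSTPEELKRQWAAHSSHNATH"   # invented sequence
print(annotate(example))
```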
The extent of curation depends on the database — the needs of a simple repository for information will differ from those of a comprehensive catalogue that combines information from direct submissions and published literature. Dealing directly with the scientists who produce the data — and can explain and modify the information on request — is easier than having to sift through the literature, says Orchard. “When working from a paper, you are dependent on it being well written in the first place and the data being complete and fully described. This is often not the case,” she says.

International community

Sandra Orchard: "Lab experience is important. You can teach curation but you can't teach ten years at the bench."
Most large databases, and consequently curation jobs, are based in Europe and the United States, but that is changing, says Tadashi Imanishi, leader of the integrated-database and systems-biology team at the Biomedicinal Information Research Center in Tokyo, part of the National Institute of Advanced Industrial Science and Technology. The International Society of Biocuration has helped curators in Japan and other countries be part of the community. “By joining the society, they have the chance to communicate with curators in many other databases in the world,” says Imanishi, noting that Japan now has some 100 biocurators working on projects such as the DNA Database of Japan, which employs about 20 biocurators, and the H-Invitational, an international effort to catalogue all human genes.
At the moment, most jobs are at universities. But industry is beginning to offer biocuration services. For example, Ingenuity Systems in Redwood City, California, founded in 1998 by Stanford University graduate students, employs biocurators in its offices in Germany, Switzerland, France, Britain and Japan. They look after the Ingenuity Knowledge Base, which the company claims is the world's largest curated database of biological networks, documenting the relationships between proteins, genes, complexes, cells, tissue, drugs, disease and biological pathways.
Because of the skew towards academia, one of the biggest challenges to the growing field is its dependence on grant money. “Right now there is poor recognition for the value of curation,” says White. Funding agencies should factor the cost of curation into grants, he says, although this can be difficult given tight budgets and the field's relative infancy. “We're in a very, very competitive market and have to work hard to justify curation to agencies,” he says. Yet, he adds, “this kind of librarianship is critical”. Sequencing may be increasingly cheap and sequenced genomes plentiful, but without curation the data mean little.
Although long-term funding can be elusive, jobs can be lucrative. US biocurators in their first positions earn around $65,000, says Blake — more than a postdoctoral researcher. In Britain, salaries start at around £31,000 (US$48,000). And there is scope for advancement, says Orchard — a biocurator could end up running a database or training users. Curation also could be a doorway to computer programming and bioinformatics. Biocurators need not have any software-engineering expertise, but they do work closely with the people who write the programmes they use, and anyone interested in software design could move in that direction.
Blake says those considering a career in biocuration should know that it will move them away from the lab, which could pose a problem for those wishing to re-establish independent research, build a publication record or find grant funding. “None of these aspects is an integral part of the duties or outcomes of a biocurator position,” she says.
“There's no doubt it's a desk job,” Pichler concedes. But many don't mind. They like the continued focus on science, as well as the occasional opportunity to attend conferences, give a talk or write an academic paper about their database, says Blake. “Curators,” she says, “do novel work that is required by everyone doing science.”
