Universidad Politécnica de Madrid

New system for identifying, extracting and retrieving genetic sequences from scientific literature

Research developed at the Universidad Politécnica de Madrid's Facultad de Informática is able to detect and annotate genetic sequences belonging to micro-organisms causing infectious diseases reported in any manuscript. It accepts several paper formats, including PDF, the most common document representation format used by researchers.

Researchers reused the same genetic sequence detection techniques to index all papers stored in PubMed Central (PMC), the bibliographic database that provides free access to biomedical and biological scientific publications. Researchers have associated the detected genetic sequences with each PubMed Central article.

The sequence detection method has high precision (97.98%) and recall (95.77%) rates. The annotation system satisfactorily located a high percentage of associations between micro-organism names and genetic sequences: 83.29% of sequences were correctly annotated with the organism name. Note that annotation was not possible in 15.45% of the cases because the sequences to be annotated did not belong to micro-organisms causing infectious diseases.
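As a rough illustration of how precision and recall rates like those above are computed, the following Python sketch derives both from counts of correct and incorrect detections. The function name and the counts are hypothetical and chosen only for the example; they are not the figures from the study.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) as fractions between 0 and 1."""
    predicted = true_positives + false_positives   # everything the detector reported
    actual = true_positives + false_negatives      # everything it should have reported
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts for illustration only (not taken from the study).
p, r = precision_recall(true_positives=90, false_positives=10, false_negatives=30)
print(f"precision={p:.2%} recall={r:.2%}")
```

Precision answers "how many of the reported sequences were real?", while recall answers "how many of the real sequences were reported?".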

Researchers managed to find only 44.32% of gene names because the database did not always contain this information, which may not yet have been discovered.

Finally, researchers reused the genetic sequence detection and annotation techniques to associate each PubMed Central article with the genetic sequences that it contains. At the time of indexing, PMC contained 176,672 downloadable articles. The XML versions of the articles were used because XML takes less time to process than PDF.

Although the system developed at the UPM's Facultad de Informática is fully operational and provides useful and precise outputs, it opens up new avenues of research for further improvement of functionality, outcomes and performance.

The results of this research have been published in BMC Bioinformatics. Researchers involved in the project are members of the Department of Artificial Intelligence, Biomedical Informatics Group and Department of Computer Languages and Systems, all from the UPM's Facultad de Informática. The Bioinformatics Unit at the Instituto de Salud Carlos III also participated in the research.

Authors from the UPM's Facultad de Informática are Miguel García-Remesal, Alejandro Cuevas, Guillermo de la Calle, Diana de la Iglesia, David Pérez-Rey, José Crespo and Víctor Maojo. Authors from the Instituto de Salud Carlos III are Victoria López-Alonso, Guillermo López-Campos and Fernando Martín-Sánchez.

The research spawned another paper on the PubDNA Finder application, the first search engine for scientific articles reporting nucleic acid sequences. This research was published in the leading bioinformatics journal, Bioinformatics, and reported in another press release.

Importance of this research

Molecular technologies are often used in clinical practice to identify micro-organisms and detect the presence of virulence factors, antibiotic resistance and parasite-patient interactions. There are a great many technologies that use relatively short nitrogen base strands known as primers and probes.

Both primers and probes are nucleic acid sequences, and there is no standard representation of this type of strand in scientific articles. Scientific literature in the biology field is a major source of information on primers and probes for diagnosing and prescribing treatment for infectious diseases.
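To give a sense of why the lack of a standard representation makes detection hard, here is a deliberately naive Python sketch that looks for candidate sequences in running text. It is not the method used by the researchers: the function name, the minimum length and the assumption that sequences appear as uppercase nucleotide letters, optionally split by single spaces or hyphens, are all illustrative choices.

```python
import re

# Assumption: a candidate is a run of uppercase nucleotide letters (A, C, G, T, U),
# possibly broken up by single spaces or hyphens, e.g. "ACGT TTGA-CCAT".
CANDIDATE = re.compile(r"[ACGTU](?:[ \-]?[ACGTU])+")


def find_candidate_sequences(text, min_len=15):
    """Return nucleotide runs of at least min_len letters, separators removed."""
    candidates = []
    for match in CANDIDATE.finditer(text):
        seq = re.sub(r"[ \-]", "", match.group())
        if len(seq) >= min_len:  # short runs are usually ordinary words or acronyms
            candidates.append(seq)
    return candidates


sample = "Primer 5'-AGAGTTTGATCCTGGCTCAG-3' and probe ACGT TTGA CCAT GGAC were used."
print(find_candidate_sequences(sample))
```

A real system would also have to cope with sequences in tables, lowercase notation, degenerate bases and line breaks introduced by PDF extraction, none of which this sketch handles.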

The molecular diagnosis of infectious diseases is based on the fact that, in order to reproduce, viruses inject their genome into the affected cell, something which can be viewed as a viral signature. Therefore, physicians can tell whether or not patients are infected with virus X by analysing their DNA and checking whether the viral signature associated with virus X is present. To run such an analysis, they place multiple copies of the probe sequence known to identify the viral signature associated with the micro-organism that they are looking for in a test tube. The probe copies are chemically marked to ease later identification.

The mixture of genetic material is then subjected to high temperatures, which leads to the denaturation (separation) of the patient's DNA double helix. When the temperature returns to normal, the nucleic acid strands reunite with each other (renaturation). If, at the end of this process, any of the probes have bound (hybridized) to a patient DNA strand (as indicated by the chemical markers), then physicians can conclude that the micro-organism has infected the patient. The primers also play a key role in this process, as they are used to drive DNA amplification by the polymerase chain reaction (PCR) technique.
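The binding step described above can be pictured with a small Python sketch. It assumes a strongly simplified model in which a probe binds only where the strand contains its exact reverse complement; real binding tolerates mismatches and depends on temperature and chemistry. The function names and the toy sequences are hypothetical.

```python
# Watson-Crick pairing for DNA: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the strand that pairs with seq, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def probe_binds(probe, strand):
    """True if the probe pairs perfectly with some stretch of strand (simplified model)."""
    return reverse_complement(probe) in strand


# Toy sequences for illustration only.
print(probe_binds("ATGC", "TTGCATAA"))
print(probe_binds("ATGC", "TTTTTTTT"))
```

In this model, the first call reports a match because the strand contains GCAT, the reverse complement of the probe ATGC, while the second strand offers no complementary stretch.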

The respective sequences for probes are often documented in the scientific literature. However, physicians have to manually compile, screen and analyse a great many scientific articles to locate these sequences. This is an extremely labour-intensive and time-consuming job.

Over recent years, different text mining, information extraction and knowledge engineering techniques have proved useful for extracting, analysing and visualizing biological information reported in biomedical research literature. Even though text mining applied to biological data is an active research field, these techniques have not yet been used to create methods and tools for automatically extracting primers and probes from scientific papers.

This research helps researchers to identify and locate primer and/or probe sequences. This saves a lot of time which they can then spend on improving healthcare quality and/or research. This is why the new system for identifying, extracting and retrieving genetic sequences from scientific literature is so important.

SOURCE: FIUPM