Challenges in Developing an Annotated Mobile Genetic Element Database Workflow
Guan Xue Lee 1, Anna E. Sheppard 1
1 School of Biological Sciences, University of Adelaide, Adelaide, Australia
Bacterial mobile genetic elements (MGEs) exert significant influence on microbial genomes, facilitating horizontal gene transfer and shaping genetic diversity, thereby contributing to bacterial evolution and resilience in diverse environments. Despite years of research and identification efforts, the complexity of these elements and the evolving state of bacterial MGE annotation have led to the adoption of varied annotation standards and formats. A custom annotation workflow was designed to integrate multiple source MGE databases (Transposon Registry, ISFinder, MEFinder, ICEberg 2.0), which contain sequences only (FASTA format), into an enhanced MGE database containing annotations (GenBank format). Sequences present in more than one source database were deduplicated, reducing the combined FASTA database from 6520 to 5659 sequences. Annotation was performed by extracting accession numbers from the FASTA sequences and fetching the corresponding annotations from NCBI. BLAST was then used to align and identify the relevant regions within each query FASTA sequence. Hits were filtered to retain the best ones, which were used to generate new, accurate GenBank annotations. During the development of this workflow, several challenges were encountered. A primary issue was maximising the number of GenBank entries in the final database relative to the number of FASTA sequences provided as input. Another significant challenge involved inconsistencies in MGE annotations, including cases where known insertion sequences shared an accession number with the parent transposon, yet had higher sequence identity to isoforms of that transposon.
Additionally, ambiguity in the BLAST parameters used during filtering, particularly "max_target_seqs", whose description in the BLAST manual is vague, led to misinterpretations that took considerable effort to rectify.
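As an illustration of the hit-filtering step, the following minimal Python sketch selects the best hit per query explicitly, so that the result does not depend on how a hit-limiting parameter is interpreted. It is not the workflow's actual code. It assumes BLAST tabular output (-outfmt 6) with the default twelve columns and ranks hits by bit score; the function name best_hits, the min_identity threshold and the example rows are hypothetical.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text, min_identity=0.0):
    """Return {query id: best hit row}, choosing the highest bit score.

    min_identity is a hypothetical threshold (percent identity), not a
    value taken from the workflow described above.
    """
    best = {}
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        # Discard hits below the identity threshold.
        if float(row["pident"]) < min_identity:
            continue
        current = best.get(row["qseqid"])
        # Keep the hit with the highest bit score seen so far for this query.
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


# Made-up example rows, for illustration only.
example = (
    "query1\tsubjA\t98.5\t1200\t18\t0\t1\t1200\t5\t1204\t0.0\t2100\n"
    "query1\tsubjB\t99.9\t1200\t1\t0\t1\t1200\t1\t1200\t0.0\t2210\n"
    "query2\tsubjC\t91.0\t800\t72\t0\t1\t800\t10\t809\t1e-150\t1050\n"
)
hits = best_hits(example)
print({query: hit["sseqid"] for query, hit in hits.items()})
# prints {'query1': 'subjB', 'query2': 'subjC'}
```

Selecting the best hit after the search, rather than asking BLAST to return a single hit, keeps the selection criterion visible in the workflow itself.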