Dfam is proud to announce the release of Dfam 3.2. This release represents a significant step in the expansion of Dfam by providing early access to uncurated, de novo generated families. As a demonstration of this new capability, we imported a set of 336 RepeatModeler generated libraries produced by Fergal Martin and Denye Ogeh at the European Bioinformatics Institute (EBI). Also in this release, Dfam now provides family alignments to the RepeatMasker TE protein database aiding in the discovery of related families and in the classification of uncurated TEs.
I would like to thank my employer, the Institute for Systems Biology (ISB), which has always been supportive of remote work and especially so during the ongoing COVID-19 pandemic. Thanks especially to: our communications and HR teams, for clear messaging during these difficult times; those at ISB who are not working from home and are doing COVID-19 research, for helping find effective vaccines and treatments; and our IT, facilities, janitorial, operations, and other departments, for keeping everything running smoothly.
I haven't really described my work on my blog before, so here goes my (long) elevator pitch:
- Transposable Elements (TEs) are DNA sequences that can copy themselves or move around within a genome, and over time they can come to make up a large fraction of the genome's DNA. TEs can also become inactive and stop copying or moving, but their DNA still remains as a genetic "fossil". TE biology is a diverse subject, and there are many different aspects to study, such as the relationship between TEs and diseases and the ways TEs evolve over time.
- I mainly work on the website and backend behind Dfam, a database of TEs and other repetitive DNA sequences. TE databases are frequently used for masking (removing) TEs from a sequence so that they do not interfere with other data analyses. They are also used for annotating a genome to find where TE activity is occurring now or occurred in the past.
- RepeatModeler is a tool for making a TE library from a genome. It is a de novo tool, which means it does not use existing libraries as a starting point. RepeatModeler makes a library of TE families, but these libraries are uncurated and very crude; they are decent for masking but not very good for annotation.
- A lot of work is needed to curate a de novo library to the same high level of quality as has been done for well-studied species such as humans. This work has always included significant effort by human experts. One long-term goal of our group is to help create easily accessible guidelines and tools for curation, and to automate some of the "chores" involved so that the TE research community can make the most effective use of their time and knowledge.
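To make the masking idea from the list above a bit more concrete, here is a toy Python sketch. It assumes the repeat locations are already known as 0-based, half-open `(start, end)` intervals; the function names `hard_mask` and `soft_mask` are made up for illustration and are not taken from Dfam or any real tool. Real masking tools also have to find those intervals in the first place, which is the hard part.

```python
def hard_mask(sequence, repeat_intervals, mask_char="N"):
    """Replace every base inside a repeat interval with mask_char."""
    masked = list(sequence)
    for start, end in repeat_intervals:  # 0-based, half-open [start, end)
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = mask_char
    return "".join(masked)


def soft_mask(sequence, repeat_intervals):
    """Lowercase bases inside repeat intervals; uppercase everything else."""
    masked = list(sequence.upper())
    for start, end in repeat_intervals:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = masked[i].lower()
    return "".join(masked)


if __name__ == "__main__":
    # Hypothetical sequence and repeat annotations, for illustration only.
    seq = "ACGTACGTACGT"
    repeats = [(2, 5), (8, 10)]
    print(hard_mask(seq, repeats))
    print(soft_mask(seq, repeats))
```

Hard masking throws the repeat sequence away, while soft masking keeps it but marks it by case, so downstream tools can choose whether to ignore it.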
Our small team at ISB has been hard at work on Dfam 3.2, especially over the last two or three months. Dfam now has 40 times as many entries as last year, so this was a perfect opportunity to test our infrastructure. Despite our preparations, the sheer size of the dataset revealed bugs in our scripts and tools, and I, for one, am happy that things have calmed down a bit. With this release we have a baseline data set to curate in the future, developing and sharing our curation techniques and inviting collaboration from the community along the way.
The opinions expressed herein are my own and do not necessarily represent the views of ISB or any of its collaborators.