Lofi Amazon Rainforest Beats to Hack/AI to

AI & ML interests

GainForest EcoHackathon, AI for Biodiversity Track

Welcome to the Lofi Amazon Rainforest Beats to Hack/AI to organization page. You can read more about our DNA Identifier Tool in the README.

Introduction

Understanding biodiversity is crucial for effective conservation, yet the effort is hampered by the immense number of unidentified species and the inherent complexity of taxonomic identification. Traditional methods of collecting observational data, such as setting up camera traps, are labor-intensive and often impractical, especially in remote or densely forested areas. A key question therefore arises: how can we model species distributions and biodiversity without direct observation? Environmental DNA (eDNA) samples from water, soil, or sediment allow DNA to be extracted directly from the environment without capturing or observing the organism itself, offering a much less labor-intensive way to monitor biodiversity, provided that the DNA sequences can be identified. In this project we therefore incorporate eDNA into machine learning models to aid species identification and understanding. Furthermore, we develop a foundational barcode model trained on global DNA barcodes, and we investigate how the inclusion of ecological layers influences its predictive capability.

We make the following contributions:

  • We introduce the largest DNA barcode model to date, LofiAmazon/BarcodeBERT-Entire_BOLD, trained on a comprehensive dataset of over five million direct sample sequences gathered from around the world via BOLD Systems.
  • We also present a unique dataset of species observations from the Amazon rainforest (LofiAmazon/BOLD-Embeddings-Ecolayers-Amazon), which includes sequenced DNA, embeddings generated by our model, and seven ecological layers.
  • Our results demonstrate that even with modest training, BarcodeBERT-Entire_BOLD successfully learns to cluster DNA from different species in the embedding space.
  • Moreover, we show that fine-tuning a downstream model (LofiAmazon/BarcodeBERT-Finetuned-Amazon) on the DNA embeddings together with the ecological layers achieves a test accuracy of 82%.
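
To make the idea of combining DNA embeddings with ecological layers concrete, here is a minimal sketch in plain Python. It is not our downstream model: a nearest-centroid classifier stands in for the fine-tuned network, and the function names (`fuse`, `fit_centroids`, `predict`), the three-dimensional embeddings, the two ecological values per sample, and the species labels are all made up for illustration. It only shows the general pattern of concatenating the two feature sources into one vector before classification.

```python
import math
from collections import defaultdict


def fuse(embedding, eco_layers):
    """Concatenate a DNA embedding with ecological-layer values (hypothetical helper)."""
    return list(embedding) + list(eco_layers)


def fit_centroids(features, labels):
    """Compute the mean feature vector for each label."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in total] for y, total in sums.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x in Euclidean distance."""
    return min(centroids, key=lambda y: math.dist(centroids[y], x))


# Toy training data: (embedding, ecological layers, label). All values are invented.
samples = [
    ([0.9, 0.1, 0.0], [0.2, 0.8], "species_a"),
    ([0.8, 0.2, 0.1], [0.3, 0.7], "species_a"),
    ([0.1, 0.9, 0.0], [0.9, 0.1], "species_b"),
    ([0.2, 0.8, 0.1], [0.8, 0.2], "species_b"),
]

features = [fuse(emb, eco) for emb, eco, _ in samples]
labels = [label for _, _, label in samples]
centroids = fit_centroids(features, labels)

# Classify a new sample from its fused embedding and ecological features.
query = fuse([0.7, 0.3, 0.0], [0.25, 0.75])
prediction = predict(centroids, query)
print(prediction)
```

In practice the ecological layers would likely need to be scaled to a range comparable with the embedding values, since otherwise one feature source can dominate the distance.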