Development of tools for in silico prioritization of pesticide TPs for soil microbial ecotoxicity testing

Knowledge of the environmental biotransformation of organic contaminants is essential for chemical risk management, bioremediation of contaminated sites, and the development of green chemical alternatives. Pesticide transformation products (TPs), formed in the environment, can have similar or even more serious adverse environmental effects than their parent pesticides. The project consists mainly of two parts: the in silico prediction of pesticide transformation pathways and the development of QSAR models for toxicity evaluation.

For the in silico prediction tools, one key challenge is extracting reaction rules from biotransformation reaction databases. In our project, we developed an automatic rule generation tool called enviRule, which automatically extracts and updates rules using three main functional modules: the reaction clusterer, the reaction adder, and the rule generator. It clusters biotransformation reactions into groups based on the similarity of their reaction fingerprints, and then automatically extracts and generalizes one rule per reaction group, expressed as SMIRKS. Rather than being set arbitrarily, the genericity of the automatic rules is optimized against the downstream TP prediction task, so that it is tuned for the best possible prediction performance.

For the QSAR models for toxicity evaluation, we propose to use transfer learning: deep learning models pretrained on large chemistry datasets will be applied to predict the soil toxicity of pesticides and their TPs to ammonia-oxidizing microorganisms (AOM). Given the limited availability of in vitro test data, the two most prevalent deep learning architectures for chemistry-related problems, namely graph-based models (e.g., GCN) and sequence-based language models (e.g., BERT), will be pretrained on large unlabeled datasets, so that the knowledge they acquire eases model training on small in vitro datasets for toxicity prediction.


Fig. 1: enviRule: An End-to-end System for Automatic Extraction of Reaction Patterns from Environmental Contaminant Biotransformation Pathways.

This project aims to develop an in silico pipeline for a comprehensive evaluation of soil microbial toxicity of pesticides and their transformation products.

Fig. 2: Transformation products prediction.

Transformation products prediction

To extract rules from biotransformation reactions, the reactions are first sent to the Reaction Clusterer module of enviRule, which clusters them into reaction groups based on the similarity of their fingerprints. The Rule Generator then produces one rule for each group of reactions. To update rules, new reactions are sent to the Reaction Adder module of enviRule, which either adds each reaction to an existing reaction group or places it in a new group, depending on fingerprint similarity. The rules corresponding to the expanded reaction groups are updated, while new rules are created for the new reaction groups.
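The grouping and updating logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not enviRule's actual implementation: fingerprints are represented as sets of "on" bit indices, similarity is measured with the Tanimoto coefficient, group membership uses a single-linkage criterion, and the names `tanimoto`, `add_reaction`, `cluster_reactions`, and the threshold value are all hypothetical.

```python
from typing import Dict, List, Set

# Assumption: a reaction fingerprint is modeled as the set of its "on" bit indices.
Fingerprint = Set[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two bit sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def add_reaction(groups: List[List[str]], fps: Dict[str, Fingerprint],
                 rid: str, threshold: float) -> int:
    """Add reaction `rid` to the most similar group, or open a new group.

    Returns the index of the group the reaction ended up in.
    """
    best_idx, best_sim = -1, 0.0
    for idx, group in enumerate(groups):
        # Single linkage (an assumption): compare with the closest group member.
        sim = max(tanimoto(fps[rid], fps[member]) for member in group)
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    if best_idx >= 0 and best_sim >= threshold:
        groups[best_idx].append(rid)   # existing group expands; its rule needs updating
        return best_idx
    groups.append([rid])               # new group; a new rule needs to be generated
    return len(groups) - 1


def cluster_reactions(fps: Dict[str, Fingerprint], threshold: float) -> List[List[str]]:
    """Greedy clustering: feed reactions one by one through add_reaction."""
    groups: List[List[str]] = []
    for rid in fps:
        add_reaction(groups, fps, rid, threshold)
    return groups


# Toy fingerprints (made up for illustration only).
fps = {"r1": {1, 2, 3, 4}, "r2": {1, 2, 3, 5}, "r3": {10, 11, 12}}
groups = cluster_reactions(fps, threshold=0.5)   # [['r1', 'r2'], ['r3']]

# A newly reported reaction is routed to an existing group when it is similar enough.
fps["r4"] = {10, 11, 13}
target = add_reaction(groups, fps, "r4", threshold=0.5)   # joins the group of r3
```

In this sketch the same routine serves both initial clustering and later updates, which mirrors the idea that only the groups touched by new reactions need their rules regenerated.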

Fig. 3: QSAR toxicity.

QSAR toxicity

The QSAR models will be built in two steps: pretraining and fine-tuning. In the first step, models will be trained in an unsupervised manner on a masked atom recovery task, which helps them gain a fundamental understanding of the chemical structures of compounds. In the second step, models will be trained on labeled toxicity data from in vitro tests. The features produced by the pretrained models will be fed into a classifier built on top of them, which predicts whether a compound is toxic to the soil microorganisms under study.
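To make the masked atom recovery task more concrete, the following sketch shows how training examples for a sequence-based model could be prepared from a SMILES string. It is a simplified illustration, not the project's pipeline: the tokenizer regex covers only a small subset of SMILES, and the `[MASK]` token, the masking rate, and the function names are assumptions introduced here.

```python
import random
import re
from typing import List, Optional, Tuple

# Simplified SMILES tokenizer (assumption: covers common organic-subset syntax only).
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+/\\.:@]")
# Tokens that count as atoms and are therefore eligible for masking.
ATOM_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]")
MASK = "[MASK]"  # hypothetical mask token


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into atom, bond, ring-closure and branch tokens."""
    return TOKEN_RE.findall(smiles)


def mask_atoms(tokens: List[str], mask_rate: float,
               rng: random.Random) -> Tuple[List[str], List[Optional[str]]]:
    """Replace a random subset of atom tokens with MASK.

    Returns (masked_tokens, labels), where labels holds the original atom at
    each masked position and None everywhere else.
    """
    atom_positions = [i for i, tok in enumerate(tokens) if ATOM_RE.fullmatch(tok)]
    if not atom_positions:
        return list(tokens), [None] * len(tokens)
    n_mask = max(1, round(mask_rate * len(atom_positions)))
    chosen = set(rng.sample(atom_positions, n_mask))
    masked = [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in chosen else None for i, tok in enumerate(tokens)]
    return masked, labels


# Chlorobenzene, used purely as an illustrative input.
tokens = tokenize("Clc1ccccc1")
masked, labels = mask_atoms(tokens, mask_rate=0.3, rng=random.Random(0))
```

During pretraining, a model would receive `masked` as input and be trained to predict the entries of `labels` that are not `None`; no toxicity labels are needed at this stage.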


We have now completed the first part of the project, the prediction of transformation products (TPs) of pesticides. We developed a tool that automatically extracts reaction rules from biotransformation reactions, which significantly improves TP prediction. In our experiments, models trained with automatic rules outperformed models trained with manually curated rules by 30% in the area under the precision-recall curve (AUC) for TP prediction. Additionally, when new reactions are added, enviRule determines which existing rules need to be adapted and which new rules need to be created by comparing reaction fingerprints. The coverage of newly added reactions increased from 42% to 76% after automatic rule updates by enviRule. To the best of our knowledge, enviRule is the first tool that implements automatic rule updates to deal with the growing number of reported biotransformation reactions.
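For readers unfamiliar with the metric, the area under the precision-recall curve can be estimated by average precision, as in the minimal sketch below. Which estimator was used for the reported results is not stated here, so this choice is an assumption; the function name and the toy data are hypothetical and unrelated to the actual experiments.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of precision values at the rank of each true positive.

    labels: 1 if a predicted TP was actually observed, 0 otherwise.
    scores: model confidence for each predicted TP (higher = more likely).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(label for _, label in ranked)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank   # precision at this recall step
    return total / n_pos


# Toy example: positives ranked 1st and 3rd give (1/1 + 2/3) / 2.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6])
```

A model that ranks all observed TPs ahead of all unobserved candidates would score 1.0 under this estimator.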