The purpose of this document is to introduce you how to use the MOSAICbioacc application. This application is based on the R software1 and especially the rjags
library (version 4.10)2 andthe jagsUI
library (version 1.5.1)3, to estimate parameters of Toxico-Kinetic (TK) models under a Bayesian framework. MOSAICbioacc is developed as an R-Shiny interface (version 1.6.0)4. If you want to be kept informed, please email us: sandrine.charles@univ-lyon1.fr.
The MOSAICbioacc application is a turn-key web tool providing bioaccumulation metrics (BCF/BMF/BSAF) from a TK model fitted to accumulation and depuration data. It is designed to fulfill the requirements of regulators when examining applications for market authorization of active substances.
Toxico-Kinetic/Toxico-Dynamic (TKTD) models are used to describe and predict the toxicity and the effects of chemical substances on individual traits based on experimental data. The TK part describes the relationship between the exposure medium and the organism, considering various processes such as ADME (accumulation, depuration, metabolization and excretion)5. Regulation No 283/2013 (EU)6 defines the data requirements for active substances of plant protection products in marketing authorization applications. In particular, a bioaccumulation study on fish is required following OECD guideline 3057. Achieved in agreement with EFSA’s scientific opinion on good modeling practices8,9, this ready-to-use on-line service allows to easily estimate BCF/BMF/BSAF as required in a regulatory framework, accounting for bioaccumulation of parent compounds and their metabolites through biotransformation. MOSAICbioacc does not expect any input besides the accumulation-depuration data sets according to exposure concentrations. The service automatically fits the TK model, initially defined from the appropriately user data and optimizes the estimation of its parameters. Then, the service provides the corresponding bioaccumulation metrics, as well as all goodness-of-fit criteria required to carefully check the reliability of the results10. All calculations are based on the JAGS software and its companion R packages rjags
2,11 and jagsUI
3.
If you use the web interface (https://mosaic.univ-lyon1.fr/bioacc), you don’t need to install anything.
However, if you want to run the R script (downloadable from the application) by yourself, you need to install:
rjags
package2. You can install it directly from the R software > Tools > Install Packages > rjags or from the CRAN website http://cran.r-project.org/web/packages/rjags/index.html.jagsUI
package3. You can install it directly from the R software > Tools > Install Packages > jagsUI or from the CRAN website https://cran.r-project.org/web/packages/jagsUI/index.html.tidyverse
, gridExtra
, ggmcmc
, GGally
, ggmcmc
, stringr
and DT
.Here is an example of the R code to install the required packages:
if(is.element('rjags', installed.packages()[,1]) == FALSE)
{install.packages('rjags')}
if(is.element('jagsUI', installed.packages()[,1]) == FALSE)
{install.packages('jagsUI')}
if(is.element('tidyverse', installed.packages()[,1]) == FALSE)
{install.packages('tidyverse')}
if(is.element('gridExtra', installed.packages()[,1]) == FALSE)
{install.packages('gridExtra')}
if(is.element('ggmcmc', installed.packages()[,1]) == FALSE)
{install.packages('ggmcmc')}
if(is.element('GGally', installed.packages()[,1]) == FALSE)
{install.packages('GGally')}
if(is.element('stringr', installed.packages()[,1]) == FALSE)
{install.packages('stringr')}
if(is.element('DT', installed.packages()[,1]) == FALSE)
{install.packages('DT')}
When using MOSAICbioacc, the first step is to upload input data (Fig. 1.1):
You can upload your own data (click on “Browse”) by taking care about the format specification of your file. MOSAICbioacc expects to receive data as a .txt file or a .csv file (comma, semicolon or tabular separator). Each line of the table corresponds to a time point for a given replicate and a given exposure concentration of the contaminant. The table must contain the four following columns, with exact header names (Table 1.1):
According to your data, further columns can be added in the file:
Then be careful to the units:
time | conc | expw | replicate |
---|---|---|---|
0 | 0.000 | 0.0044 | 2 |
3 | 0.225 | 0.0044 | 2 |
7 | 0.355 | 0.0044 | 2 |
14 | 0.553 | 0.0044 | 2 |
21 | 0.658 | 0.0044 | 2 |
28 | 0.785 | 0.0044 | 2 |
Once the upload is complete, you have to manually select the corresponding separator, the appropriate time unit and the duration of the accumulation phase.
Instead of using your own data, you can try MOSAICbioacc from several example files that are provided. Three data sets use a simple TK model (one exposure route and one elimination process), whereas two data sets consider a more complex TK model, accounting for metabolization and growth (Fig. 1.2).
Please note that more data are, the more the TK model is complex, and the more the calculations take time. Thus, we indicated the approximative mean time calculation for each example data sets (Fig.1.2).
In case you upload a data set with several exposure concentrations, select the one for which you want to see the results. We propose two types of visualization to check if the file has correctly been uploaded: as a table or a plot (Fig. 1.3 and 1.4).
According to your data, you can visualize the plot for the parent compound, the metabolite(s) and/or for growth (Fig. 1.5).
Don’t forget to select the appropriate duration of the accumulation phase and the growth unit (if required) before to continue (Fig. 1.3 and 1.4, left side).
The most complete model and its corresponding parameters are automatically selected according to the experimental design as given within the uploaded data set (Fig. 2.1). Users can also deselect some of the parameters (based on biological hypotheses related to the most probable exposure route or by neglecting one elimination process, for example). These choices lead to the automatic building of a nested TK sub-model to fit again on the data in clicking on the ‘Refresh’ button. The equations of the most complete model are provided to the users on-line (Fig. 2.2).
For the accumulation phase:
For the depuration phase:
Also, \(\sigma\) are the expected standard deviations of the measured contaminant concentration or growth of the organisms:
A mathematical model is composed of two parts: the deterministic and the stochastic parts. In the case of TK models fitted to concentration measurements, it can be written as follows (Eq. (1)): \[
y = f(x, \theta) + \varepsilon \quad (1)
\] with \(f\) the function describing the mean relationship between \(x\) and \(y\), \(y\) the observed variable, \(x\) the controlled variable, \(\theta\) the parameter vector to estimate and \(\varepsilon\) the random variable describing the variability of the data around the mean tendency.
The deterministic part
The organisms are here considered as single compartments for which a generic first-order kinetic bioaccumulation model can be expressed as follows12 (Eqs. (2) and (3) for the accumulation phase and Eqs. (4) and (5) for the elimination phase):
\(\left\{\begin{array}{l}\frac{d C_p(t)}{d t} = U - (E + M) C_p(t) \quad (2) \\ \frac{d C_{m_\ell}(t)}{d t} = k_{m_\ell} C_p(t) - k_{e_\ell} C_{m_\ell}(t) \quad \forall \ell=1 \ldots M \quad (3) \end{array}\right. \quad \text { for } 0 \leqslant t \leqslant t_{c}\)
\(\left\{\begin{array}{l}\frac{d C_p(t)}{d t} = - (E + M) C_p(t) \quad (4)\\ \frac{d C_{m_\ell}(t)}{d t} = k_{m_\ell} C_p(t) - k_{e_\ell} C_{m_\ell}(t) \quad \forall \ell=1\ldots M \quad (5)\end{array}\right.\quad\; \ \ \ \ \ \text { for } t>t_{c}\)
with:
Symbol | Meaning |
---|---|
\(I\) | total number of exposure sources |
\(J\) | total number of elimination processes |
\(L\) | total number of metabolites |
\(i\) | index of exposure sources, \(i = 1 \dots I\) |
\(j\) | index of elimination processes, \(j = 1 \dots J\) |
\(\ell\) | index of metabolites, \(\ell = 1 \dots L\) |
\(t\) | time (expressed in time units) |
\(c_i\) | exposure concentration of route \(i\) (in \(\mu g.m L^{-1}\)) |
\(C_p(t)\) | internal concentration of the parent compound at time \(t\) (in \(\mu g.g^{-1}\)) |
\(C_{m_\ell}(t)\) | internal concentration of metabolite \(\ell\) (in \(\mu g.g^{-1}\)) |
\(k_{u_i}\) | uptake rate of exposure source \(i\) (expressed per time units) |
\(k_{e_j}\) | elimination rates of elimination process \(j\) (expressed per time units) |
\(k_{e_{m\ell}}\) | elimination rates of metabolite \(\ell\) (expressed per time units) |
\(k_{m_\ell}\) | metabolization rate of metabolite \(\ell\) (expressed per time units) |
\(t_c\) | duration of the accumulation phase |
\(U = \sum_{i=1}^{I} k_{u_i} c_i\) | sum of all uptake terms |
\(E = \sum_{j=1}^{J} k_{e_j}\) | sum of all elimination terms for the parent compound |
\(M = \sum_{\ell=1}^{L} k_{m_\ell}\) | sum of all elimination terms for metabolite \(\ell\) |
The simplest model in MOSAICbioacc application considers only one exposure route (for example by water, parameter \(k_{u_w}\)) with the corresponding elimination rate (excretion, parameter \(k_{e_e}\)), as given by Eq. (6):
\[
\left \{\begin{array}{l}
\frac{dC_p(t)}{dt} = k_{u_w} c_{w} - k_{e_e} C_p(t) \quad \text { for } 0 \leqslant t \leqslant t_{c} \\
\frac{dC_p(t)}{dt} = - k_{e_e} C_p(t) \quad\quad\quad\quad\quad\: \: \: \text { for } t>t_{c}
\end{array}\right. \quad (6)
\]
where \(k_{u_w}\) is the uptake rate from water (\(time^{-1}\)), \(c_w\) is the mean contaminant concentration in water (\(\mu g.mL^{-1}\)) and \(k_{e_e}\) is the rate related to the excretion process (\(time^{-1}\)).
The more complex model in MOSAICbioacc application considers a maximum of four exposure routes (water: \(k_{u_w}\), pore water: \(k_{u_{pw}}\), sediment: \(k_{u_s}\) and food: \(k_{u_f}\)) and a maximum of three elimination processes (excretion, \(k_{e_e}\), biotransformation: \(k_{m\ell}\), growth dilution: \(k_{e_g}\)), as given by Eq. (7) for parent compound, Eq. (8) for metabolite \(\ell\) and Eq. (9) for growth (Von Bertalanffy equations, classically used for fishes):
\(\left \{\begin{array}{l} \frac{dC_p(t)}{dt} = k_{u_w} c_{w} + k_{u_{pw}} c_{pw} + k_{u_s} c_{s} + k_{u_f} c_{f} - (k_{e_e} + k_{e_g} + k_{m_\ell}) C_p(t) \quad \text { for } 0 \leqslant t \leqslant t_{c} \\ \frac{dC_p(t)}{dt} = - (k_{e_e} + k_{e_g} + k_{m_\ell}) C_p(t) \quad\quad\quad\quad\quad\: \: \: \text { for } t>t_{c}\end{array}\right. \:(7)\)
\(\frac{dC_{m_\ell}(t)}{dt} = k_{m_\ell} C_p(t) - k_{e_{m\ell}} C_{m_\ell}(t) \quad \quad (8)\)
\(\frac{dG(t)}{dt} = k_{e_g} (g_{max}- G(t)) \quad (9)\)
with \(G(t)\) the measured growth of the organism at time \(t\) (in growth unit) and \(g_{max}\) the asymptotic growth \(\ell\) (in \(\mu g.g^{-1}\)).
More details and especially the exact solutions of these equations are given here.
The stochastic part
We assumed a Gaussian (normal) probability distribution of the concentration measurements within the organism as follows (Eqs. (10) to (12)):
\(\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad C_{obs_p}(t) \sim \mathcal{N}(C_p(t),\sigma_p^2) \quad (10)\)
\(\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad C_{obs_{m_\ell}}(t) \sim \mathcal{N}(C_{m_\ell}(t),\sigma_{m_\ell}^2) \quad (11)\)
\(\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad G_{obs}(t) \sim \mathcal{N}(G(t),\sigma_g^2) \quad (12)\)
where \(\mathcal{N}\) stands for the normal law, \(C_{obs_p}(t)\) and \(C_{obs_{m\ell}}(t)\) correspond to the measured parent and metabolite \(\ell\) concentrations within the organism measured at time \(t\), \(C_p(t)\) is the parent compound concentration at time \(t\) predicted by the model based on to Eqs. (2) and (4), \(C_{m_\ell}(t)\) is the concentration of metabolite \(\ell\) at time \(t\) predicted by the model based on Eqs. (3) and (5), \(G_{obs}(t)\) is the measured growth of the organism at time \(t\), \(G(t)\) is the growth at time \(t\) predicted by the model based on Eq. (9), \(\sigma\) are the expected standard deviations of the measured contaminant concentration or growth of the organisms (\(\mu g.g^{-1}\) or growth unit).
A Directed Acyclic Graph (DAG) is given on Fig. 2.3, which symbolize the deterministic links between parameters and variables for the generic TK model (Eqs.(2) to (5)) and the stochastic links between the observed and predicted data (Eqs. (11) to (13)).
In MOSAICbioacc, prior choice is hidden to the user. However, we give here some information to help the user to understand the model behind. Before conducting an experimental study, prior distributions are defined for each parameter according to information available from the literature, expert knowledge and/or previous experiments. Depending on the sources where the information come from, informative, semi-informative or non-informative prior distributions can be used. If a parameter was already estimated in previous studies or if previous data are available, a prior distribution can easily be characterized with an appropriate probability distribution. However, if no information is available but an order of magnitude is (positive only, for example), it is possible to use a weakly informative prior. If any information is available on the order of magnitude of a parameter, its prior is preferably defined on a decimal logarithm scale in order to consider with equal probability both low and high expected values.
As MOSAICbioacc application has to be the most generic as possible, priors were assumed to be non-informative log10-uniform distributions within [-5, 5] for all uptake and elimination rate constants. For growth, a uniform prior distribution [\(10 \times mean(G)/4\), \(10 \times mean(G)\)] was assumed for parameter \(g_{max}\) and a uniform prior distribution [\(0\), \((10 \times mean(G))/(8-10^{-10})\)] for parameter \(g_{0}\). We assumed a non-informative [\(0\),\(A\)] uniform prior for all \(\sigma\) with a large A, here defined as five times the maximum internal measured concentration, which is then removed from the data set, as usually proceeded13.
The Bayesian approach considers that data are fixed but that the parameters are unknown random variables following a probabilistic distribution. This leads to the following practical implications: (i) the Bayesian process optimises the probability of parameter vector \(\theta\) given the data set \(\mathrm{Y}\) used for calibration (the so-called posterior distribution) not only the likelihood (see below); (ii) there is a need to provide reasonable prior information, then updating this information by accounting for the data. Below is a short introduction to Bayesian principles14.
In short, the Bayesian approach requires the following steps:
• Choose the prior distributions on parameters based on previous results, literature or expert knowledge (without looking at the data to fit): \(\mathrm{P}(\theta)\);
• Define the probabilistic model from the data, that is the random variable whose data would be one realisation assuming known values of parameters, namely the likelihood: \(\mathrm{P(Y} \, \mid\, \theta)\);
• Calculate the joint posterior distribution of the parameters given the data via the Bayes formula: \(\mathrm{P}(\theta \, \mid \, \mathrm{Y})\);
• Provide statistical summaries of parameter estimates (namely, appropriate quantiles);
• Get any function of the parameter estimates as posterior probability distribution, like for example BCF calculations or predictions of new observations.
Basic principles
The keystone of the Bayesian approach is the Bayes formula (Eq. (13): \[
\mathrm{P}(\theta \mid \mathrm{Y})=\frac{\mathrm{P}(\theta) \mathrm{P}(\mathrm{Y} \mid \theta)}{\mathrm{P}(\mathrm{Y})} \quad (13)
\] where Y are the observed data; \(\mathrm{P}(\theta \mid \mathrm{Y})\) is the joint posterior distribution of parameter vector ; \(\mathrm{P}(\mathrm{Y} \mid \theta)\) is the likelihood of the data given the parameters; \(\mathrm{P}(\theta)\) is the joint prior distribution of parameter vector \(\theta\). Given that \(\mathrm{P(Y)}\) is known and fixed, it is often not considered as it does not depend on and will not influence the posterior distribution. Hence (Eq. (14)): \[
\mathrm{P}(\theta \mid \mathrm{Y}) \simeq \mathrm{P}(\theta) \mathrm{P}(\mathrm{Y} \mid \theta) \quad (14)
\]
with \(\mathrm{P}(\theta)\mathrm{P}(Y|\theta)\) the unormalised posterior density leading to (Eq. (15)):
\[
\mathrm{P(Y)}=\int \mathrm{P}(\theta) \mathrm{P(Y} \mid \theta) d \theta \quad (15)
\] The prior distribution \(\mathrm{P}(\theta)\) expresses the available parameter information without knowing the observed data, while the posterior distribution \(\mathrm{P}(\theta \, \mid\, \mathrm{Y})\) combines this prior information (which may be more or less informative depending on what is known about the value of the parameters beforehand) with evidence from the data (expressed through the likelihood) into a joint posterior density probability distribution for the parameters. The overall expectation is to get a narrower posterior distribution compared to the prior one after the computing of the Bayesian algorithm: the difference between the two distributions reflects the information provided by the data. When the non-informative prior is vague (translated for example into a flat uniform distribution), and the data sufficiently informative, the results are similar than those obtained under a frequentist approach.
Joint posterior distribution
The joint posterior distribution has the dimension of the number of parameters times the number of iterations within the Monte Carlo Markov Chain (MCMC) chains. It can be plotted in planes of parameter pairs to visualise correlations between parameters. In an example case with two binormally distributed parameters, the joint posterior distribution can be plotted in the 2D-parameter space as illustrated by ellipses on Fig. 2.4; in this example, parameters \(\theta_1\) and \(\theta_2\) appear slightly positively correlated. From the joint posterior distribution, the marginal posterior distributions for each parameter (as illustrated by grey normal distributions on bottom and left sides of Fig. 2.4) can be extracted. Then, from the marginal posterior distributions, some statistical summaries on parameter estimates can be extracted, usually the median (illustrated by vertical and horizontal plain grey lines on Fig. 2.4) as well as 2.5% and 97.5% quantiles to serve as 95% credible intervals (illustrated by vertical and horizontal dotted grey lines on Fig. 2.4). Another advantage of having the joint posterior distribution is that the posterior distribution of any function of the parameters can also be obtained.
Parameter uncertainties
One implication of adopting a Bayesian approach is that the uncertainty on a parameter can easily be expressed as a probability distribution from which a credible interval (also called a Bayesian confidence interval) can be extracted. For example, the 95% credible interval delimits a range of values where the parameters lies with a 95% probability.
Numerical computation
Many numerical methods have been developed to approximately compute the joint posterior distribution, mainly based on simulations by Monte Carlo Markov Chain (MCMC) sampling methods used to generate random numbers from complex joint distributions. MCMC algorithms are a general method based on drawing values of parameter vector \(\theta\) from approximate distributions and then correcting those draws to better approximate the target posterior distribution \(\mathrm{P}(\theta \, \mid\, \mathrm{Y})\). The sampling is done sequentially, with the distribution of the sampled draws depending only on the last value drawn; hence, the draws form a Markov Chain. The key to the method success, however, is not the Markov property but rather the fact that the approximate distributions are improved at each step of the simulation, in the sense that it finally converges to the target posterior distribution after an enough number of iterations. Indeed, with MCMC algorithms, the simulation process must run long enough so that the distribution of the current draws is close enough to the desired target posterior distribution.
MCMC algorithms use random walk algorithms, among which the Metropolis algorithm (and its generalisation, the Metropolis–Hasting algorithm) as an adaptation of a random walk with an acceptance/rejection rule to converge to the specified target distribution15,16. The Gibbs sampler is a special case of the Metropolis–Hastings algorithm applicable when the joint distribution is not known explicitly, or when it is difficult to directly sample from, while the conditional distribution of each parameter is known and it is easy (or at least, easier) to sample from17.
Several tools are available to automatically perform these computations. In MOSAICbioacc, JAGS11 (version 4.3.0. (2017-08-10)) and R software1 (version 4.0.2 (2020-06-22)) are used. The models are fitted to bioaccumulation data using Bayesian inference via Monte Carlo Markov Chain (MCMC) sampling based on a Gibbs-type algorithm. For each model, we start by running a short sampling phase with three chains (5,000 iterations after a burn-in phase of 10,000 iterations) then using the Raftery and Lewis18 method to set the necessary thinning and number of iterations to reach an accurate estimation of the joint posterior distribution.
Once data uploaded and the model stated, you can click on button “Calculate and Display.” We recommend you to pay attention to the selected exposure concentration before to proceed to calculations. Besides, calculations can take several minutes for the most complicated models.
You will find at the end of this user guide an appendix (section 5) which gathers together other types of results which can be obtained with other data sets and how to interpret them.
To illustrate the result section, here we used the example file ‘Pimephales_two.csv,’ where Pimephales promelas are exposed to a highly hydrophobic substance (logKow = 9) spiked water at \(0.0044\) \(\mu g.mL^{-1}\) and where the duration of the accumulation phase is equal to 49 days.
As a first result, we provide the bioaccumulation metrics (BCF, BCFpw, BSAF and BMF). Please, note that this section is reactive according to your data. If required, change tab to get the other bioaccumulation metrics (Fig. 3.1).
As a first result, we provide the kinetic bioconcentration factor (\(BCF_{k}\)). The BCF at steady-state (\(BCF_{ss}\)) can be asked if you consider a steady-state is almost reached at the end of the accumulation phase. These factors are mathematically given by Equations (16) and (17) respectively:
\[ BCF_{k} =\frac{k_{u_w}}{\sum_{j=1}^{J} k_{e_j} + \sum_{\ell=1}^{L} k_{m_\ell}} \quad (16) \]
\[ BCF_{ss} =\frac{C_p(t_{c})}{c_{w}} \quad (17) \]
where \(C_p(t_c)\) is the contaminant concentration within the organism at steady-state that is at the end of the accumulation phase (\(\mu g. g^{-1}\)) and \(c_w\) is the contaminant concentration in water (\(\mu g . mL^{-1}\)).
BCF are given as probability distributions (Fig. 3.2) and summarized with their median and their 95% uncertainty limits (95% credible intervals, Table 3.1).
2.5% | 50% | 97.5% | |
---|---|---|---|
\(BCF_k\) | 232 | 276 | 339 |
\(BCF_{ss}\) | 147 | 232 | 315 |
In the above example, the steady-state is almost reached at time \(t_c\) (Fig. 3.3), thus it is reasonable to ask for the \(BCF_{ss}\).
Note that the OECD guideline 3057 reports that “greater emphasis must be on kinetic BCF estimate (when possible) next to estimating the BCF at steady-state.” Thus we recommend to preferentially consider the \(BCF_k\) rather than the \(BCF_{ss}\).
When appropriate, we provide the kinetic pore water bioconcentration factor (\(BCF_{pw_{k}}\)).Again, the corresponding BCF at steady-state (\(BCF_{pw_{ss}}\)) can be asked. These factors are mathematically given by Equations (18) and (19) respectively:
\[ BCF_{pw_{k}} =\frac{k_{u_{pw}}}{\sum_{j=1}^{J} k_{e_j} + \sum_{\ell=1}^{L} k_{m_\ell}} \quad (18) \]
\[ BCF_{pw_{ss}} =\frac{C_p(t_{c})}{c_{pw}} \quad (19) \]
where \(C_p(t_c)\) is the contaminant concentration within the organism at time \(t_c\) (\(\mu g. g^{-1}\)) and \(c_{pw}\) is the contaminant concentration in pore water (\(\mu g . mL^{-1}\)). BCFpw are given as probability distributions and summarized with their median and their 95% uncertainty limits (95% credible intervals).
When appropriate, we provide the kinetic biota-sediment factor (\(BSAF_{k}\)) and the BSAF at steady-state (\(BSAF_{ss}\)) can be asked. These factors are mathematically given by Equations (20) and (21) respectively:
\[ BSAF_{k} =\frac{k_{u_s}}{\sum_{j=1}^{J} k_{e_j} + \sum_{\ell=1}^{L} k_{m_\ell}} \quad (20) \]
\[ BSAF_{ss} =\frac{C_p(t_{c})}{c_{s}} \quad (21) \]
where \(C_p(t_c)\) is the contaminant concentration within the organism at time \(t_c\) (\(\mu g. g^{-1}\)) and \(c_s\) is the contaminant concentration in sediment (\(\mu g . g^{-1}\)). BSAF are given as probability distributions and summarized with their median and their 95% uncertainty limits (95% credible intervals).
When appropriate, we provide the kinetic biomagnification factor (\(BMF_{k}\)) and the BMF at steady-state (\(BSAF_{ss}\)) can be asked. These factors are mathematically given by Equations (22) and (23) respectively:
\[ BMF_{k} =\frac{k_{u_f}}{\sum_{j=1}^{J} k_{e_j} + \sum_{\ell=1}^{L} k_{m_\ell}} \quad (22) \]
\[ BMF_{ss} =\frac{C_p(t_{c})}{c_{f}} \quad (23) \]
where \(C_p(t_c)\) is the contaminant concentration within the organism at time \(t_c\) (\(\mu g. g^{-1}\)) and \(c_f\) is the contaminant concentration in food (\(\mu g . g^{-1}\)). BMF are given as probability distributions and summarized with their median and their 95% uncertainty limits (95% credible intervals).
A coefficient of variation (\(CV\)) for each bioaccumulation metric can be calculated as follows (Eq. (24)):
\[ CV =\frac{Q_{k}97.5-Q_{k}2.5}{4 \times Q_{k}50} \quad (24) \]
with \(Q_{k}2.5\), \(Q_{k}50\) and \(Q_{k}97.5\) the 2.5%, 50% and 97.5% quantiles of the kinetic bioaccumulation metric. Based on our experiment, the coefficient of variation is expected to not exceed \(0.5\).
If the bioaccumulation metric at steady-state is asked, the corresponding \(CV\) is given (Eq. (25)):
\[ CV =\frac{Q_{ss}97.5-Q_{ss}2.5}{4 \times Q_{ss}50} \quad (25) \]
with \(Q_{ss}2.5\), \(Q_{ss}50\) and \(Q_{ss}97.5\) the 2.5%, 50% and 97.5% quantiles of the steady-state bioaccumulation metric.
We first provide the fitted curve superimposed to the observations (Fig. 3.3, black dots): the orange plain line is the median curve, the gray zone is the uncertainty band delimited by 2.5% and 97.5% quantiles of predictions in orange dotted lines. This section is reactive according to your data: if there is biotransformation or growth, the fitted curve superimposed to the observations for parent compound, metabolite(s) and/or growth data are given (for example from the example file “Male_Gammarus_seanine.csv,” Fig. 3.4).
From the joint posterior distribution, we can obtain the marginal posterior distributions for each parameter, which can be summarized by their medians and their 95% credible intervals (Table 3.2).
2.5% | 50% | 97.5% | ||
---|---|---|---|---|
\(k_{u_w}\) | 7.307 | 10.55 | 15.34 | \(d^{-1}\) |
\(k_{e_e}\) | 0.02288 | 0.03837 | 0.06033 | \(d^{-1}\) |
\(\sigma_p\) | 0.1243 | 0.1685 | 0.2464 | \(\mu g.g^{-1}\) |
Goodness-of-fit criteria are given below in our prioritised order; the Posterior Predictive Check (PPC) and the prior-posterior comparison are the most important to check; if they do not correspond to the expectation, you must consider your results with an even more particular attention. As an indication, if at least two criteria are fulfilled, the results obtained can be considered as good enough. We suggest that you refer to the appendix at the end of the document for more information about cases not in accordance with the classical expectations (section 5).
The PPC shows the observed values against their corresponding estimated predictions (black dots), along with their 95% credible interval (vertical segments). If the fit is correct, we expect to see 95% of the data within the intervals. Ideally, observations and predictions should coincide, so we would expect to see black dots along the first bisector \(y = x\) (plain black line). The 95% credible intervals are colored in green if they overlap this line, in red otherwise. In the following example (Fig. 3.5), 95.24% of the measured concentrations (\(n=20/21\)) are in the 95% credible intervals of their predictions.
An example of prior and posterior distributions is illustrated in Fig. 3.6. The prior distribution is represented by the gray area and the posterior distribution by the orange area. The precision of the model parameter estimation can be visualized by comparing prior and posterior distributions: the overall expectation is to get a narrower posterior distribution compared to the prior one, what reflects that data contributed enough to precisely estimate parameters. In the example of Fig. 3.6, marginal posterior distributions for \(k_{u_w}\) and \(k_{e_e}\) are narrower (orange area) than their respective prior distributions (grey area). Within the application, you can chose to visualize either the deterministic or the stochastic parameters by selecting the corresponding tab.
MOSAICbioacc gives a colored matrix in order to highlight correlations between parameters (Fig. 3.7, 3.8). This output allows you to see at a glance the most correlated or anti-correlated parameters, in order to diagnose potential problems of precision due to highly correlated parameters.
Correlations between parameters can also be visualized by projecting the joint posterior distribution in a plot matrix with planes of parameter pairs (Fig. 3.9, lower triangular elements), marginal posterior distribution of each model parameter (Fig. 3.9, diagonal) and Pearson correlation coefficients (Fig. 3.9, upper triangular elements). Correlations are expected to be low (reflected by “potatoid” shapes of density lines in orange, e.g., \(k_{e_e}\) and \(\sigma_p\) in Fig. 3.9); a leaning elliptical shape translates high correlations (positive if leaning to the right, e.g., \(k_{u_w}\) and \(k_{e_e}\) in Fig. 3.9, negative if leaning to the left).
In the example of Fig. 3.9, \(k_{u_w}\) and \(k_{e_e}\) are highly positively correlated (\(r = 0.941\)), meaning that the estimate obtained for one of these two parameters will strongly influence the estimate of the other parameter. This high correlation may be due to the model structure (e.g., by definition \(k_{u_w}\) and \(k_{e_e}\) are correlated), or comes from data not fully appropriate to precisely estimate the parameters.
Convergence of the MCMC can be checked with the Gelman-Rubin diagnostic expressed with the potential scale reduction factor (PSRF). Approximate convergence is diagnosed when the PSRF is close to \(1.0\) (Fig. 3.10)19. In the example of Fig. 3.10, the PSRF is equal to \(1.0\) for each model parameter, thus the convergence of the MCMC was correctly achieved.
Information criteria offer a computationally appealing way of estimating the generalization performance of the model. A fully Bayesian criterion is the widely applicable information criterion (WAIC) by Watanabe a penalized deviance statistics accounting for the uncertainty in the parameters and can be used also for singular models. WAIC is widely used in model comparison for a same dataset (e.g., with or without \(k_{e_e}\)). Sub-models with lower WAIC values will be preferred.
For example, for highly hydrophobic substances, two models can be compared, one considering water and food exposure (which we call here model (a)) and one model considering food-only exposure due to the physico-chemical properties of the substance (model (b)). If the WAIC of model (a) is smaller than the WAIC of model (b), then model (a) will be preferred.
In this version of MOSAICbioacc, the users can deselect some of the parameters (based on biological hypotheses related to the most probable exposure route or by neglecting one elimination process, for example). In these specific cases, tested TK sub-models (e.g., with all selected parameters corresponding to the complete TK model and sub-model with some parameter(s) deselected) can be compared with the WAIC criterion for each tested sub-model.
This criterion, denoted DIC, is a penalized deviance statistics accounting for the number of parameters for use in model comparison for a same data set (e.g., with or without \(k_{e_e}\)). Sub-models with lower DIC values will be preferred18. DIC value can be negative. However, DIC value itself is not important, what matters is the difference between two DICs, what can help in deciding which model is the most appropriate.
For example, for highly hydrophobic substances, two models can be compared, one considering water and food exposure (which we call here model (a)) and one model considering food-only exposure due to the physico-chemical properties of the substance (model (b)). If the DIC of model (a) is smaller than the DIC of model (b), then model (a) will be preferred.
In this version of MOSAICbioacc, the users can deselect some of the parameters (based on biological hypotheses related to the most probable exposure route or by neglecting one elimination process, for example). In these specific cases, tested TK sub-models (e.g., with all selected parameters corresponding to the complete TK model and sub-model with some parameter(s) deselected) can be compared with the DIC criterion. You can get more information of the use of the DIC in Ratier et al. (2019)12.
A traceplot is also an essential plot for assessing convergence and diagnosing of MCMC. It shows the time series of the sampling process leading to the joint posterior distribution. Different colors are used for each of the chains (here three) to assess the within-chain convergence. The user must check whether all MCMC converge towards the same distribution limit (overlapping of the chains). This can be verified visually by observing the simulated values for each node of interest as a function of the number of iterations (Fig. 3.11). In the following example, the three MCMC overlap and converge towards the same distribution limit for each model parameter. Thus, the algorithm has suitably converged.
You can download all plots as displayed by the application in several formats (.png, .jpg, .pdf, .svg, .tiff and .eps): equations, bioaccumulation metrics, fitting results and goodness-of-fit criteria. They can be downloadable separately or simultaneously in a .zip file.
You can also download table results in .txt or .csv, as for example the BCF numerical values and the joint posterior distribution for all parameters (columns) and all iterations of the MCMC algorithm (lines).
You can easily download a full report of your calculations, which summarize all the results in a .pdf, .html or .docx file. We warn you that the creation of the report may take some time depending on your data set.
The R script can be downloaded as a gateway to identically reproduce all calculations provided within the application (this guarantees the full reproducibility of the results) and to perform further calculations directly within the R software with your own computer.
In practice, you may encounter situations where the results are not “ideal,” conversely to those presented in sections 1 to 5 of this document. This appendix will allow you to better interpret the results in such cases. For the goodness-of-fit criteria, we suggest to consider as enough the fact that two criteria as given by MOSAICbioacc are receivable.
If you asked for the \(BCF_{ss}\), \(BCF_{pw_{ss}}\), \(BSAF_{ss}\) or \(BMF_{ss}\) but the median value and the 95% credible interval are not of the same order of magnitude than for the \(BCF_{k}\), \(BCF_{pw_{k}}\), \(BSAF_{k}\) or \(BMF_{k}\), then ensure the accumulation phase really reached the steady-state (i.e., at least three successive measured concentrations with no statistical differences).
On Fig. 5.1, you can see a counter-example for asking for the \(BCF_{ss}\):
Indeed, the steady-state was not reached at the end of the accumulation phase (day 4, Fig. 5.2). Thus, the median value for \(BCF_{ss}\) is totally different from the one of \(BCF_{k}\). The same consideration can be applied for the other bioaccumulation metrics.
Large 95% credible intervals can sometimes be obtained for some parameters, especially for parameter \(k_{e_e}\) (Table 5.1). Such a situation leads to non precise estimate of these parameters, what should question their use for predictions. This can be due to the model structure or the experimental data themselves. Thus, it is an information to consider when you interpret the fitting results.
2.5% | 50% | 97.5% | ||
---|---|---|---|---|
\(k_{u_w}\) | 641.9 | 727.9 | 840.5 | \(d^{-1}\) |
\(k_{e_e}\) | 1.629e-05 | 0.006325 | 0.02038 | \(d^{-1}\) |
\(\sigma_p\) | 0.04673 | 0.06299 | 0.09 | \(\mu g.g^{-1}\) |
If the fit is correct, it is expected to get 95% of the data within the 95% credible intervals of their predictions. So, if the range of the percentage of data within the credible intervals is between 92 and 96%, calculations and predictions can be considered as good enough. If the percentage is under 92%, predictions are considered as underestimated. If the percentage is upper 96%, predictions are considered as overestimated.
On Fig. 5.3, you can see a counter-example with large uncertainties of the model predictions leading to 100% of the data within their credible intervals.
We remind you that prior distributions are defined by default to be the most generic as possible. However, it can happen that your data would require other prior distributions (e.g., inspired by literature, expert knowledge or by a previous study leading to parameter estimations outside of the default values as used in MOSAICbioacc).
The precision of the model parameter estimation can be visualized by comparing prior and posterior distributions: the overall expectation is to get a narrower posterior distribution compared to the prior one, what reflects that data contributed enough to precisely estimate parameters.
If one of the posterior distribution for a model parameter has bounds close to the lower or the upper bound of the prior distributions (e.g. \(log10(\theta) \simeq -5\) or \(log10(\theta) \simeq 5\) with \(\theta\) being one of the model parameter), then the prior distribution may be not well defined. For example, if a distribution tail is observed at these bounds as illustrated on the left of Fig. 5.4 (here the distribution tail of \(log10(k_{e_e})\) is too close to \(-5\), the lower bound), then the inference process needs to be questioned. Conversely, the fit can be considered as correct if you obtain prior and posterior distributions as illustrated on the right of Fig. 5.4.
In MOSAICbioacc, it is not possible to change the prior distributions of parameters directly within the application. To do this, we suggest you to download the R code and to change the prior distributions directly in the R software. We remind you that to define the prior distributions you should not have a look at your data, but only on previous experiments, literature data or expert knowledge.
You can also check this goodness-of-fit criterion through the estimated values of the model parameters. If one of the model parameter is closed to the lower or the upper bound of its 95% credible interval (\(\theta\simeq10^{-5}\) or \(\theta\simeq10^{5}\) with \(\theta\) one of the model parameter), the prior distribution may be not appropriate. As for example illustrated in Table 5.1, the distribution tail of \(log10(k_{e_e})\) is too closed to the lower bound \(-5\).
If a high correlation is obtained between two parameters (e.g., more than 0.7 or less than -0.7 for the Pearson correlation coefficient), it is an information to consider, not necessarily a bad result. It means that the estimate obtained for one of these two parameters will strongly influence the estimate of the other one. Such a high correlation may be due to the model structure itself (for example, by definition \(k_{u_w}\) and \(k_{e_e}\) are correlated, Fig. 5.5, or to the data so that it cannot be avoided).
Sometimes, you may get a bimodal posterior distribution for one or several parameters what translates into a double maximum on density plots (e.g., parameters \(k_{u_w}\) and \(k_{e_e}\) in Fig. 5.5). To make the application as generic as possible, we defined priors for each parameter the most global as possible. However, depending on the experimental conditions, the parameters may not be really included within the prior chosen range of values. In such a case, we recommend you to contact sandrine.charles@univ-lyon1.fr if you are not experimented with Bayesian inference and R software, or to change the downloadable R script by yourself.
This criterion must be as close as possible to 1 for each model parameter to ensure that the between-chain variability is small compared to the within-chain variability. Based on our experience, from a value of 1.03, the results should be questioned. Most often, such a case appears when priors are not fully appropriate or when the data do not contain enough information. One of the solution may be to increase the number of iterations in the MCMC by using the R script directly.
The WAIC or the DIC are not criteria to consider to check the goodness-of-fit. However, they are crucial criteria to consider when two models or more are compared after fitting on a same data set.
In practice, users just need to choose the parameters they want to appear in sub-models. According to the experimental conditions, several sub-models can indeed be considered and compared depending on the hypotheses to test either on the exposure routes or on the elimination processes. Organisms may have been exposed via several media (water and sediment in the following case study. By default, MOSAICbioacc fits the full TK model. Then users can test different TK sub-models, for example sub-models with only one exposure route (e.g., water or sediment, and compare them to the full model based on both the Deviance Information Criteria (DIC) and the Watanabe-Akaike information criterion (WAIC) delivered by MOSAICbioacc.
In order to illustrate this framework, the data set of Physa acuta exposed to AgNO3 spiked water and clean sediment for 7 days20 was uploaded in MOSAICbioacc.
First hypothesis: complete TK model
A first run of analyses was performed where parameters were automatically selected according to the corresponding experimental design (Eqs. A1 and A2), as given within the uploaded data set (i.e., water and sediment for the exposure routes).
\(\begin{array}{c} \frac{d C_{p}(t)}{d t}=k_{u_w} \times c_{w}+k_{u_s} \times c_{s}-\left(k_{ee}\right) \times C_{p}(t) \quad \text { for } 0 \leq t \leq t_{c} \quad (A1)\\ \frac{d C_{p}(t)}{d t}=-\left(k_{e_e}\right) \times C_{p}(t) \quad \quad \quad\quad \quad \quad\quad\quad \quad\quad \: \: \: \quad\text { for } t>t_{c} \quad (A2) \end{array}\)
Second hypothesis: only account for water exposure
A second run of analyses was performed on the same data set that for the first run, but without considering the uptake rate from sediment (Eqs. A3 and A4), because the concentration of the chemical in the sediment is at environmental level.
\(\begin{array}{c} \frac{d C_{p}(t)}{dt}=k_{u_w} \times c_{w}-\left(k_{e_e}\right) \times C_{p}(t) \quad \text { for } 0 \leq t \leq t_{c} \quad (A3)\\ \frac{d C_{p}(t)}{dt}=-\left(k_{e_e}\right) \times C_{p}(t)\quad \quad \quad\quad \quad \quad\quad\text { for } t>t_{c} \quad (A4) \end{array}\)
For each analyses, the parameter estimates are summarized in Table . The WAIC and the DIC are penalized deviance statistics accounting for the number of parameters for use in model comparison for a same data set19. In this example (Table ), the WAIC and DIC are similar between the two runs, as well as the estimations of (\(k_{e_e}\)), considering or not the sediment exposure route. Thus, for this data set, and applying the parsimony principle, it can be deduced that P. acuta accumulate AgNO3 principally by water. Silva et al. (2020)20 also concluded that when accounting for double exposure via both water and clean sediment, water was likely to be the main route.
Run 1 | Run 2 | ||
---|---|---|---|
\(k_{u_w}\) | days\(^{-1}\) | \(\boldsymbol{1.492}\) [\(1.874 \times 10^{-5}\);\(60480\)] | \(\boldsymbol{679}\) [\(568.9\);\(972.6\)] |
\(k_{u_s}\) | days\(^{-1}\) | \(\boldsymbol{624.6}\) [\(30.6\);\(903.6\)] | \(\boldsymbol{0}\) |
\(k_{e_e}\) | days\(^{-1}\) | \(\boldsymbol{5.605 \times 10^{-3}}\) [\(1.421\times10^{-5}\);\(0.07212\)] | \(\boldsymbol{3.952\times10^{-3}}\) [\(1.382\times10^{-5}\);\(0.07106\)] |
WAIC | - | \(177.9\) | \(178\) |
DIC | - | \(178.3\) | \(178.6\) |
You must check whether the MCMC converge towards the same distribution limit (overlapping of the chains). As shown in Fig. 5.6, the three MCMC do not overlap and do not converge towards the same distribution limit for two model parameters (\(k_{u_w}\) and \(k_{e_e}\)). If at least two of the other criteria are good enough, this can be disregarded. If not, your experimental data could be not sufficient to performed bioaccumulation metric calculations and to estimate parameters of a TK model.
Bioconcentration factor (BCF): BCF is a parameter describing bioaccumulation of water-associated organic compounds or metals into tissues of ecological receptors.
Biomagnification factor (BMF): BMF is a parameter describing bioaccumulation of food (or in the predator’s prey)-associated organic compounds or metals into tissues of ecological receptors.
Biota-Sediment Accumulation Factor (BSAF): BSAF is a parameter describing bioaccumulation of sediment-associated organic compounds or metals into tissues of ecological receptors.
Credible Interval (CI): A credible interval is the interval in which an parameter has a given probability. It is the Bayesian equivalent of the confidence interval.
Directed Acyclic Graph (DAG): It symbolize the deterministic links between parameters and variables for a model and the stochastic links between the observed and predicted data.
Monte Carlo Markov Chain (MCMC): A method which comprise a class of algorithms for sampling from a probability distribution. By constructing a Markov chain that has the desired distribution as its equilibrium distribution, one can obtain a sample of the desired distribution by recording states from the chain.
Potential Scale Reduction Factor (PSRF): Gelman-Rubin diagnostic to check the convergence of the MCMC.
Toxico-Kinetic model (TK model): Toxico-kinetics is the mathematical description of the uptake and disposition of a chemical in the organism. TK modeling is usually implemented by describing the time course of the amount or concentration of the parent substance and its metabolites in all the organism.