Parameter Estimation Methods

The main function of PEtab.jl is to create parameter estimation problems and provide runtime-efficient gradient and Hessian functions for estimating unknown model parameters using suitable numerical optimization algorithms. Specifically, the parameter estimation problems considered by PEtab.jl are of the form:

\[\min_{\mathbf{x} \in \mathbb{R}^N} -\ell(\mathbf{x}), \quad \text{subject to} \\ \mathbf{lb} \leq \mathbf{x} \leq \mathbf{ub}\]

Here, since PEtab.jl works with likelihoods (see the API documentation on PEtabObservable), $-\ell(\mathbf{x})$ is a negative log-likelihood, and $\mathbf{lb}$ and $\mathbf{ub}$ are the lower and upper parameter bounds. For a good introduction to parameter estimation for ODE models in biology (which is applicable to other fields as well), see [9].
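To make the objective concrete, the following is a sketch of the negative log-likelihood under the assumption of independent, normally distributed measurement noise. The symbols $y_i$ (measurement $i$), $h_i(\mathbf{x})$ (the corresponding model output), $\sigma_i$ (the noise standard deviation, which may itself depend on $\mathbf{x}$), and $M$ (the number of measurements) are introduced here for illustration and do not appear elsewhere on this page:

\[-\ell(\mathbf{x}) = \sum_{i=1}^{M} \left[ \frac{1}{2}\log\left(2\pi\sigma_i^2\right) + \frac{\left(y_i - h_i(\mathbf{x})\right)^2}{2\sigma_i^2} \right]\]

Under this assumption, minimizing $-\ell(\mathbf{x})$ with fixed $\sigma_i$ reduces to weighted least squares, while an estimated $\sigma_i$ is penalized by the logarithmic term.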

This advanced section of the documentation focuses on PEtab.jl's parameter estimation functionality. Before reading this part, we recommend going through the starting tutorial. Specifically, this section of the documentation covers available and recommended optimization algorithms, how to plot optimization results, and how to perform automatic model selection. First though, it covers how to perform parameter estimation. While the PEtabODEProblem contains all the necessary information for wrapping a suitable optimizer to solve the problem (see here), manually wrapping optimizers is cumbersome. Therefore, PEtab.jl provides convenient wrappers for:

Single-start parameter estimation
Multi-start parameter estimation
Creating an OptimizationProblem to access the solvers in Optimization.jl

As a working example, we use the Michaelis-Menten enzyme kinetics model from the starting tutorial. Even though the code below presents the model as a ReactionSystem, everything works the same if the model is provided as an ODESystem.

using Catalyst, PEtab

# Create the dynamic model
rn = @reaction_network begin
    @parameters S0 c3=1.0
    @species S(t)=S0
    c1, S + E --> SE
    c2, SE --> S + E
    c3, SE --> P + E
end
speciemap = [:E => 50.0, :SE => 0.0, :P => 0.0]

# Observables
@unpack E, S = rn
obs_sum = PEtabObservable(S + E, 3.0)
@unpack P = rn
@parameters sigma
obs_p = PEtabObservable(P, sigma)
observables = Dict("obs_p" => obs_p, "obs_sum" => obs_sum)

# Parameters to estimate
p_c1 = PEtabParameter(:c1)
p_c2 = PEtabParameter(:c2)
p_s0 = PEtabParameter(:S0)
p_sigma = PEtabParameter(:sigma)
pest = [p_c1, p_c2, p_s0, p_sigma]

# Simulate measurement data with 'true' parameters
using OrdinaryDiffEq, DataFrames
ps = [:c1 => 1.0, :c2 => 10.0, :c3 => 1.0, :S0 => 100.0]
u0 = [:S => 100.0, :E => 50.0, :SE => 0.0, :P => 0.0]
tspan = (0.0, 10.0)
oprob = ODEProblem(rn, u0, tspan, ps)
sol = solve(oprob, Rodas5P(); saveat = 0:0.5:10.0)
obs_sum = (sol[:S] + sol[:E]) .+ randn(length(sol[:E]))
obs_p = sol[:P] .+ randn(length(sol[:P]))
df_sum = DataFrame(obs_id = "obs_sum", time = sol.t, measurement = obs_sum)
df_p = DataFrame(obs_id = "obs_p", time = sol.t, measurement = obs_p)
measurements = vcat(df_sum, df_p)

model = PEtabModel(rn, observables, measurements, pest; speciemap = speciemap)
petab_prob = PEtabODEProblem(model)

Single-Start Parameter Estimation

Single-start parameter estimation is an approach where a numerical optimization algorithm is run from a starting point x0 until it hopefully reaches a local minimum. When performing parameter estimation, the objective function generated by a PEtabODEProblem expects the parameters to be in a specific order. The most straightforward way to obtain a correctly ordered vector is via the get_x function:

PEtab.get_x — Function

get_x(prob::PEtabODEProblem; linear_scale = false)::ComponentArray

Get the nominal parameter vector with parameters in the correct order expected by prob for parameter estimation/inference. Nominal values can optionally be specified when creating a PEtabParameter, or in the parameters table if the problem is provided in the PEtab standard format.

For ease of interaction (e.g., changing values), the parameter vector is returned as a ComponentArray. For how to interact with a ComponentArray, see the documentation and the ComponentArrays.jl documentation.
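As a minimal sketch of what interacting with such a vector can look like, the snippet below builds a stand-in ComponentArray by hand instead of calling get_x, so it runs without a PEtabODEProblem. The parameter names and values are assumptions chosen for illustration; the names that get_x returns for a given problem may differ.

```julia
using ComponentArrays

# Stand-in for the vector returned by get_x; names and values are illustrative.
x = ComponentArray(log10_c1 = 0.0, log10_c2 = 1.0, log10_S0 = 2.0, log10_sigma = 0.0)

# Read and update entries by name instead of by position.
x.log10_c1 = 0.5
x.log10_S0 = log10(100.0)

# The order of the entries is preserved, so the underlying data can still be
# passed to code that expects a plain vector.
println(collect(keys(x)))
println(getdata(x))
```

Accessing entries by name avoids having to remember the position of each parameter in the vector.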

Multi-Start Parameter Estimation

Multi-start parameter estimation is an approach where n parameter estimation runs are initiated from n random starting points. The rationale is that a subset of these runs should, hopefully, converge to the global optimum. While simple, empirical benchmark studies have shown that this method performs well for ODE models in biology [2, 3].
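The idea can be illustrated independently of PEtab.jl with a toy problem. The sketch below is not how PEtab.jl implements multi-start estimation: the one-dimensional objective f, the projected gradient-descent optimizer local_descent, and the helper multistart are all hypothetical names introduced for this example.

```julia
using Random

# Toy objective with two local minima; its global minimum is near x = -1.3.
f(x) = x^4 - 3x^2 + x
df(x) = 4x^3 - 6x + 1   # analytic derivative of f

# Projected gradient descent: a simple local optimizer that respects the bounds.
function local_descent(x0; lb = -2.0, ub = 2.0, step = 0.01, iters = 2_000)
    x = x0
    for _ in 1:iters
        x = clamp(x - step * df(x), lb, ub)
    end
    return x
end

# Run the local optimizer from n random starting points and keep the best run.
function multistart(rng::AbstractRNG, n::Integer; lb = -2.0, ub = 2.0)
    starts = lb .+ (ub - lb) .* rand(rng, n)
    xs = [local_descent(x0; lb = lb, ub = ub) for x0 in starts]
    fs = f.(xs)
    ibest = argmin(fs)
    return xs[ibest], fs[ibest], fs
end

xbest, fbest, fs = multistart(MersenneTwister(1), 20)
println("best x = ", round(xbest; digits = 3), ", best f = ", round(fbest; digits = 3))
```

Runs started to the right of the local maximum end in the worse local minimum near x = 1.13, which is why a single start can be misleading and several starts are used.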

The first step for multi-start parameter estimation is to generate n starting points. While random uniform sampling may initially seem like a good approach, random points tend to cluster. Instead, it's better to use a Quasi-Monte Carlo method, such as Latin hypercube sampling, to generate more spread-out starting points. This approach has been shown to improve performance [2]. The difference can be seen clearly when generating 100 random points and 100 Latin hypercube-sampled points on the plane.

using Distributions, QuasiMonteCarlo, Plots
s1 = QuasiMonteCarlo.sample(100, [-1.0, -1.0], [1.0, 1.0], Uniform())
s2 = QuasiMonteCarlo.sample(100, [-1.0, -1.0], [1.0, 1.0], LatinHypercubeSample())
p1 = plot(s1[1, :], s1[2, :], title = "Uniform sampling", seriestype=:scatter)
p2 = plot(s2[1, :], s2[2, :], title = "Latin Hypercube Sampling", seriestype=:scatter)
plot(p1, p2)
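To show why Latin hypercube samples are more evenly spread, here is a minimal sketch of the basic construction using only the Random standard library. It is not the QuasiMonteCarlo.jl implementation, and latin_hypercube is a hypothetical helper written for this illustration: each dimension is split into n equal strata, and each stratum receives exactly one point.

```julia
using Random

# Latin hypercube sample of n points in the box [lb, ub] (one point per column).
function latin_hypercube(rng::AbstractRNG, n::Integer, lb::Vector{Float64}, ub::Vector{Float64})
    d = length(lb)
    samples = Matrix{Float64}(undef, d, n)
    for i in 1:d
        # Random assignment of the n strata of dimension i to the n points.
        perm = randperm(rng, n)
        for j in 1:n
            # Uniform draw inside stratum perm[j], on the unit interval.
            u = (perm[j] - 1 + rand(rng)) / n
            # Rescale from [0, 1) to [lb[i], ub[i]).
            samples[i, j] = lb[i] + u * (ub[i] - lb[i])
        end
    end
    return samples
end

s = latin_hypercube(MersenneTwister(42), 10, [-1.0, -1.0], [1.0, 1.0])
println(size(s))
```

Because every stratum along each axis holds exactly one point, no region of either axis is left empty, which plain uniform sampling does not guarantee.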

For a PEtabODEProblem, Latin hypercube sampled points within the parameter bounds can be generated with the get_startguesses function:

PEtab.get_startguesses — Function

get_startguesses([rng::AbstractRNG], prob::PEtabODEProblem, n::Integer; kwargs...)

Generate n random parameter vectors within the parameter bounds in prob.

rng is optional and if omitted defaults to Random.default_rng(). If n = 1, a single random vector is returned. For n > 1, a vector of random parameter vectors is returned. In both cases, parameter vectors are returned as a ComponentArray. For details on how to interact with a ComponentArray, see the documentation and the ComponentArrays.jl documentation.

See also calibrate and calibrate_multistart.

Keyword Arguments

sampling_method = LatinHypercubeSample(): Method for sampling a diverse (spread out) set of parameter vectors. Any algorithm from QuasiMonteCarlo is allowed, but the default LatinHypercubeSample is recommended as it usually performs well.
sample_prior::Bool = true: Whether to sample random parameter values from the prior distribution if a parameter has a prior.
allow_inf::Bool = false: Whether to return parameter vectors for which the likelihood cannot be computed (typically happens because the ODEProblem cannot be solved). Often it only makes sense to use starting points with a computable likelihood for parameter estimation, hence it typically does not make sense to change this option.


For our working example, we can generate 50 starting guesses with:

x0s = get_startguesses(petab_prob, 50)

In principle, x0s can now be used together with calibrate to perform multi-start parameter estimation. But, to further simplify this process, PEtab.jl provides a convenient function, calibrate_multistart, which combines start-guess generation and parameter estimation in one step:

PEtab.calibrate_multistart — Function

calibrate_multistart([rng::AbstractRNG], prob::PEtabODEProblem, alg, nmultistarts::Integer;
                     nprocs = 1, dirsave=nothing, kwargs...)::PEtabMultistartResult

Perform nmultistarts parameter estimation runs from randomly sampled starting points using the optimization algorithm alg to estimate the unknown model parameters in prob.

A list of available and recommended optimization algorithms (alg) can be found in the package documentation and in the calibrate documentation.

As with get_startguesses, the rng controlling the generation of starting points is optional; if omitted, Random.default_rng() is used. For reproducible starting points, pass a seeded rng (e.g., MersenneTwister(42)).

If nprocs > 1, the parameter estimation runs are performed in parallel using the pmap function from Distributed.jl with nprocs processes. If parameter estimation on a single process (nprocs = 1) takes longer than 5 minutes, we strongly recommend setting nprocs > 1, as this can greatly reduce runtime. Note that nprocs should not be larger than the number of cores on the computer.

If dirsave is provided, intermediate results for each run are saved in dirsave. It is strongly recommended to provide dirsave for larger models, as parameter estimation can take hours (or even days!), and without dirsave, all intermediate results will be lost if something goes wrong.

Different ways to visualize the parameter estimation result can be found in the documentation.

Keyword Arguments

sampling_method = LatinHypercubeSample(): Method for sampling a diverse (spread out) set of starting points. See the documentation for get_startguesses, which is the function used for generating starting points.
sample_prior::Bool = true: See the documentation for get_startguesses.
options = DEFAULT_OPTIONS: Configurable options for alg. See the documentation for calibrate.


Two important keyword arguments for calibrate_multistart are dirsave and nprocs. If nprocs > 1, the parameter estimation runs are performed in parallel using the pmap function from Distributed.jl with nprocs processes. Even though pmap introduces some overhead because it must load and compile the code on each process, setting nprocs > 1 often reduces runtime when the parameter estimation is expected to take longer than 5 minutes. Meanwhile, dirsave specifies an optional directory to continuously save the results from each individual run. We strongly recommend providing such a directory, as parameter estimation for larger models can take hours or even days. If something goes wrong with the computer during that time, it is, to put it mildly, frustrating to lose all the results. For our working example, we can perform 50 multistarts in parallel on two processes with:

using Optim  # provides the IPNewton optimizer
ms_res = calibrate_multistart(petab_prob, IPNewton(), 50; nprocs = 2,
                              dirsave="path_to_save_directory")

PEtabMultistartResult
---------------- Summary ---------------
min(f)                = 7.58e+01
Parameters estimated  = 4
Number of multistarts = 50
Optimiser algorithm   = Optim_IPNewton
Results saved at path_to_save_directory

The results are returned as a PEtabMultistartResult, which, in addition to printout statistics, contains relevant information for each run:

PEtab.PEtabMultistartResult — Type

PEtabMultistartResult

Parameter estimation statistics from multi-start optimization with calibrate_multistart.

Fields

xmin: Best minimizer across all runs.
fmin: Best minimum across all runs.
alg: Parameter estimation algorithm.
nmultistarts: Number of parameter estimation runs.
sampling_method: Sampling method used for generating starting points.
dirsave: Path of directory where parameter estimation run statistics are saved if dirsave was provided to calibrate_multistart.
runs: Vector of PEtabOptimisationResult with the parameter estimation results for each run.

PEtabMultistartResult(dirres::String; which_run::String="1")

Import multistart parameter estimation results saved at dirres.

Each time a new optimization run is performed, results are saved with unique numerical endings. Results from a specific run can be retrieved by specifying the numerical ending with which_run.


Finally, a common approach to evaluate the result of multi-start parameter estimation is through plotting. One widely used evaluation plot is the waterfall plot, which shows the final objective values for each run:

plot(ms_res; plot_type=:waterfall)

In the waterfall plot, each plateau corresponds to a distinct local optimum (represented by different colors). Since many runs (dots) lie on the plateau with the smallest objective value, we can be reasonably confident that the global optimum has been found. In addition to waterfall plots, more plotting options can be found on this page.
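The grouping behind a waterfall plot can be sketched in a few lines of plain Julia. The objective values below are made up for illustration and are not results from the working example; plateaus is a hypothetical helper, and the tolerance tol that decides when two values belong to the same plateau is an assumption.

```julia
# Hypothetical final objective values from eight runs (illustrative only).
fvals = [1.0, 5.0, 1.0, 3.0, 1.0, 5.0, 1.0, 3.0]

# Group sorted values into plateaus: a new plateau starts when the gap to the
# previous value exceeds `tol`.
function plateaus(fvals::Vector{Float64}; tol::Float64 = 0.1)
    sorted = sort(fvals)
    groups = [[sorted[1]]]
    for fval in sorted[2:end]
        if fval - groups[end][end] > tol
            push!(groups, [fval])
        else
            push!(groups[end], fval)
        end
    end
    return groups
end

groups = plateaus(fvals)
println("plateaus: ", length(groups), ", runs on best plateau: ", length(groups[1]))
```

A large share of runs on the first (lowest) plateau is the pattern that supports trusting the best minimum found.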

Creating an OptimizationProblem

Optimization.jl is a Julia package that provides a unified interface for over 100 optimization algorithms (see their documentation for the complete list). While Optimization.jl is undoubtedly useful, it is currently undergoing heavy updates, so at the moment we do not recommend it as the default choice for parameter estimation.

The central object in Optimization.jl is the OptimizationProblem, and PEtab.jl directly supports converting a PEtabODEProblem into an OptimizationProblem:

PEtab.OptimizationProblem — Function

OptimizationProblem(prob::PEtabODEProblem; box_constraints::Bool = true)

Create an Optimization.jl OptimizationProblem from prob.

To use algorithms not compatible with box constraints (e.g., Optim.jl NewtonTrustRegion), set box_constraints = false. Note that with this option, optimizers may move outside the bounds, which can negatively impact performance. More information on how to use an OptimizationProblem can be found in the Optimization.jl documentation.


For our working example, we can create an OptimizationProblem with:

using Optimization
opt_prob = OptimizationProblem(petab_prob)

OptimizationProblem. In-place: true
u0: ComponentVector{Float64}(log10_c1 = 1.0, log10_c2 = 2.6989704386302837, log10_S0 = 2.6989704386302837, log10_sigma = 2.6989704386302837)

Given a start-guess x0 (such as a single vector generated with get_startguesses(petab_prob, 1)), we can then estimate the parameters using, for example, Optim.jl's ParticleSwarm() method, with:

using OptimizationOptimJL
opt_prob.u0 .= x0
res = solve(opt_prob, Optim.ParticleSwarm())

retcode: Failure
u: ComponentVector{Float64}(log10_c1 = 1.9645945812029193, log10_c2 = 3.0, log10_S0 = 1.9998090325686793, log10_sigma = 0.0629146867411708)

which returns an OptimizationSolution. For more information on options and how to interact with OptimizationSolution, see the Optimization.jl documentation.

References

[2]: A. Raue, M. Schilling, J. Bachmann, A. Matteson, M. Schelke, D. Kaschek, S. Hug, C. Kreutz, B. D. Harms, F. J. Theis and others. Lessons learned from quantitative dynamical modeling in systems biology. PLoS ONE 8, e74335 (2013).
[3]: H. Hass, C. Loos, E. Raimundez-Alvarez, J. Timmer, J. Hasenauer and C. Kreutz. Benchmark problems for dynamic modeling of intracellular processes. Bioinformatics 35, 3073–3082 (2019).
[9]: A. F. Villaverde, D. Pathirana, F. Fröhlich, J. Hasenauer and J. R. Banga. A protocol for dynamic model calibration. Briefings in bioinformatics 23, bbab387 (2022).