Tools for generating synthetic data

Functions for creating synthetic proxy signals/stratigraphic observations and evaluating model performance for synthetic tests.

`make_excursion`	Function for generating a synthetic proxy signal that contains a number of user-specified excursions.
`synthetic_sections`	Function for generating synthetic proxy observations and age constraints using a predefined proxy signal.
`synthetic_observations_from_prior`	Given age constraints for a set of stratigraphic sections in `ages_df`, generate synthetic proxy observations by sampling the model prior.
`synthetic_signal_from_prior`	Draws synthetic signals from the model prior, and returns the signal conditioned over the points in `ages`.
`quantify_signal_recovery`	Calculates the likelihood of the true proxy signal (for synthetic tests, where the true signal is known) conditioned on the posterior (default) or prior proxy signal inference.
`sample_age_recovery`	Calculates the likelihood of the true sample ages (for synthetic tests, where the true age of each sample is known) given draws from the posterior (default) or prior.
`sample_age_residuals`	Calculates the residual (for each draw) between the true age and the posterior (default) or prior age of each sample.
`synthetic_signal_to_df`	Helper function for generating artificial sample and age data using `synthetic_sections()`.

stratmc.synthetics.make_excursion(time, amplitude, baseline=0, rising_time=None, rate_offset=True, excursion_duration=None, min_duration=1, smooth=False, smoothing_factor=10, seed=None)

Function for generating a synthetic proxy signal that contains a number of user-specified excursions.

Parameters:

time: numpy.array(float): Time vector over which to generate proxy signal.
amplitude: float, list(float), or numpy.array(float): Amplitude of excursion; pass a list or array to generate multiple excursions.
baseline: float, optional: Baseline proxy value. Defaults to 0.
rising_time: float, list(float), or numpy.array(float), optional: Fraction of excursion duration spent on the rising limb (linear increase/decrease toward peak). Must be between 0 and 1. If not provided, randomly generated if rate_offset is True and set to 0.5 if rate_offset is False. Pass a list to specify different rising times for each excursion.
rate_offset: bool, optional: If False, the rising and falling limbs of each excursion have equal duration. If True, the fraction of the excursion duration spent on the rising limb is set by rising_time (or randomly generated if rising_time is not provided). Defaults to True.
excursion_duration: float, list(float), or numpy.array(float), optional: Duration of excursion; pass a list or array to generate multiple excursions. Random if not provided.
min_duration: float, optional: Minimum excursion duration if excursion_duration is not provided. Defaults to 1.
smooth: bool, optional: Whether to smooth excursion peaks. Defaults to False.
smoothing_factor: float, optional: Smoothing factor if smooth is True; higher values produce smoother signals. Defaults to 10.
seed: int, optional: Random seed used to generate signal.

Returns:

interp_proxy: np.array: Proxy signal interpolated to the points in the time vector.
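
The role of rising_time can be pictured with a single piecewise-linear excursion. The following is a minimal NumPy sketch, not the stratmc implementation: the helper name `triangular_excursion` and its `start`, `duration`, and `rising_fraction` arguments are hypothetical, and smoothing and random durations are left out.

```python
import numpy as np


def triangular_excursion(time, start, duration, amplitude,
                         rising_fraction=0.5, baseline=0.0):
    """Piecewise-linear excursion: baseline -> peak -> baseline.

    Hypothetical helper for illustration only; not part of stratmc.
    """
    if not 0.0 < rising_fraction < 1.0:
        raise ValueError("rising_fraction must be between 0 and 1")
    peak_time = start + rising_fraction * duration
    end = start + duration
    # np.interp holds the first/last value outside [start, end], so the
    # signal stays at the baseline before and after the excursion.
    knot_times = [start, peak_time, end]
    knot_values = [baseline, baseline + amplitude, baseline]
    return np.interp(time, knot_times, knot_values)


time = np.linspace(0.0, 100.0, 101)
# A negative excursion that spends 25% of its duration approaching the peak.
signal = triangular_excursion(time, start=20.0, duration=40.0,
                              amplitude=-5.0, rising_fraction=0.25)
```

With `rising_fraction=0.5` the two limbs have equal duration, which corresponds to the rate_offset=False case described above.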

stratmc.synthetics.quantify_signal_recovery(full_trace, true_signal, proxy='d13c', mode='posterior')

Calculates the likelihood of the true proxy signal (for synthetic tests, where the true signal is known) conditioned on the posterior (default) or prior proxy signal inference. The likelihood is evaluated at each age (the posterior signal and the true signal must be evaluated at the same ages). Provides a measure of signal recovery.

Parameters:

full_trace: arviz.InferenceData or list(arviz.InferenceData): An arviz.InferenceData object containing the full set of prior and posterior samples from get_trace() in stratmc.inference. If passed as a list, the posterior draws for all traces will be combined when calculating posterior_likelihood.
true_signal: np.array: True values for the proxy signal, evaluated at the same ages as the posterior signal in full_trace.
proxy: str, optional: Tracer signal to evaluate. Defaults to ‘d13c’.
mode: str, optional: Whether to use the posterior or prior to calculate signal recovery. Defaults to ‘posterior’.

Returns:

posterior_likelihood: np.array: Array of posterior likelihoods (evaluated at each age).
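
To make the idea of a per-age likelihood concrete, here is a minimal NumPy sketch. It assumes the draws at each age are summarized by a normal distribution; the source does not say which density estimate stratmc uses, so both that choice and the helper name `gaussian_likelihood_per_age` are assumptions for illustration.

```python
import numpy as np


def gaussian_likelihood_per_age(draws, true_signal):
    """Density of the true signal under a normal fit to the draws at each age.

    draws: array with shape (number of draws, number of ages).
    true_signal: array with shape (number of ages,).
    Hypothetical helper; the normal approximation is an assumption.
    """
    draws = np.asarray(draws, dtype=float)
    true_signal = np.asarray(true_signal, dtype=float)
    mu = draws.mean(axis=0)
    sigma = draws.std(axis=0, ddof=1)
    z = (true_signal - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))


rng = np.random.default_rng(0)
ages = np.linspace(0.0, 10.0, 5)
true_signal = np.sin(ages)
# Stand-in for posterior draws: the true signal plus Gaussian scatter (sd 0.5).
draws = true_signal + rng.normal(0.0, 0.5, size=(2000, ages.size))
likelihood = gaussian_likelihood_per_age(draws, true_signal)
```

Higher values indicate that the inferred signal places more probability density on the true value at that age.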

stratmc.synthetics.sample_age_recovery(full_trace, sample_df, sections=None, mode='posterior')

Calculates the likelihood of the true sample ages (for synthetic tests, where the true age of each sample is known) given draws from the posterior (default) or prior. Provides a measure of age model recovery.

Parameters:

full_trace: arviz.InferenceData or list(arviz.InferenceData): An arviz.InferenceData object containing the full set of prior and posterior samples from get_trace() in stratmc.inference. If passed as a list, the posterior draws for all traces will be combined when calculating posterior_likelihood.
sample_df: pandas.DataFrame: pandas.DataFrame containing proxy data for synthetic sections.
sections: list(str) or numpy.array(str), optional: List of sections to evaluate. Defaults to all sections in sample_df.
mode: str, optional: Whether to use the posterior or prior age models. Defaults to ‘posterior’.

Returns:

posterior_likelihood: dict{float} or np.array(float): Posterior likelihoods for the true age of each sample. Returned as an array if only one section is evaluated, or a dictionary of arrays if multiple sections are evaluated.
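
Because the return type depends on how many sections are evaluated, downstream code may prefer a uniform container. A minimal sketch, assuming the dictionary form is keyed by section name (the keys are not stated above); `to_section_dict` is a hypothetical helper, not part of stratmc.

```python
import numpy as np


def to_section_dict(result, section_name):
    """Return a result as a dict of arrays keyed by section.

    Hypothetical helper: wraps a bare array (single-section case) in a
    dictionary so callers can treat both return types the same way.
    """
    if isinstance(result, dict):
        return {name: np.asarray(values) for name, values in result.items()}
    return {section_name: np.asarray(result)}


single = to_section_dict(np.array([0.2, 0.4]), "section_A")
multiple = to_section_dict({"section_A": [0.2, 0.4], "section_B": [0.1]}, "unused")
```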

stratmc.synthetics.sample_age_residuals(full_trace, sample_df, sections=None, mode='posterior')

Calculates the residual (for each draw) between the true age and the posterior (default) or prior age of each sample.

Parameters:

full_trace: arviz.InferenceData or list(arviz.InferenceData): An arviz.InferenceData object containing the full set of prior and posterior samples from get_trace() in stratmc.inference. If passed as a list, the posterior draws for all traces will be combined when calculating age_residuals.
sample_df: pandas.DataFrame: pandas.DataFrame containing proxy data for synthetic sections.
sections: list(str) or numpy.array(str), optional: List of sections to evaluate. Defaults to all sections in sample_df.
mode: str, optional: Whether to use the posterior or prior age models. Defaults to ‘posterior’.

Returns:

age_residuals: np.array or dict{np.array}: Sample age residuals; shape is (number of samples, number of posterior draws). Returned as an array if only one section is evaluated, or a dictionary of arrays if multiple sections are evaluated.
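
The (number of samples, number of draws) shape follows from broadcasting the true ages against the age draws. A minimal NumPy sketch, assuming the residual is defined as draw minus true age (the sign convention is not stated above) and using a hypothetical helper name:

```python
import numpy as np


def age_residuals(true_ages, age_draws):
    """Residual between each age draw and the true age of each sample.

    true_ages: shape (number of samples,).
    age_draws: shape (number of samples, number of draws).
    The sign convention (draw minus truth) is an assumption of this sketch.
    """
    true_ages = np.asarray(true_ages, dtype=float)
    age_draws = np.asarray(age_draws, dtype=float)
    # Add a trailing axis so each sample's true age is subtracted from
    # every draw for that sample.
    return age_draws - true_ages[:, np.newaxis]


true_ages = np.array([100.0, 105.0, 110.0])
age_draws = np.array([[101.0, 99.0],
                      [105.5, 104.0],
                      [112.0, 110.0]])
residuals = age_residuals(true_ages, age_draws)
```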

stratmc.synthetics.synthetic_observations_from_prior(age_vector, ages_df, sample_heights=None, uniform_heights=False, samples_per_section=20, proxies=['d13c'], proxy_std=0.1, seed=None, ls_dist='Wald', ls_min=0, ls_mu=20, ls_lambda=50, ls_sigma=50, var_sigma=10, white_noise_sigma=0.1, gp_mean_mu=0, gp_mean_sigma=10, approximate=False, hsgp_m=15, hsgp_c=1.3, offset_type='section', offset_prior='Laplace', offset_alpha=0, offset_beta=1, offset_sigma=1, offset_mu=0, offset_b=2, noise_type='section', noise_prior='HalfCauchy', noise_beta=1, noise_sigma=1, noise_nu=1, jitter=0.001, **kwargs)

Given age constraints for a set of stratigraphic sections in ages_df, generate synthetic proxy observations by sampling the model prior. Accepts all arguments that can be passed to build_model() in stratmc.model.

Parameters:

age_vector: np.array(float): Vector of ages at which to evaluate synthetic proxy signal(s).
ages_df: pandas.DataFrame: pandas.DataFrame containing age constraints for synthetic sections.
sample_heights: dict{list(float) or numpy.array(float)}, optional: Sample heights for each stratigraphic section in ages_df; must be a dictionary with section names as keys. Defaults to None, which results in either uniformly spaced or randomly spaced sample heights (depending on the uniform_heights argument).
uniform_heights: bool, optional: Whether to generate uniformly spaced (set to True) or randomly spaced (set to False) sample heights if dictionary of sample_heights not provided. Defaults to False (randomly spaced samples).
samples_per_section: int or dict(int), optional: Number of samples per section to generate if sample_heights not provided; either an integer (if the same for all sections) or a dictionary with section names as keys. Defaults to 20.
proxies: list(str), optional: List of proxies to generate synthetic observations for. Defaults to ['d13c'].
proxy_std: float or dict(float), optional: Measurement uncertainty for each proxy; pass a dictionary of floats with the elements of proxies as keys to use a different value for each proxy, or a single float to use the same value for all proxies. Defaults to 0.1.
seed: int, optional: Seed to use while generating synthetic observations.

Returns:

signals: dict(float): Proxy signals drawn from the model prior (evaluated at the points in age_vector) and used to generate the synthetic observations; dictionary keys are proxies.
sample_df: pandas.DataFrame: pandas.DataFrame containing proxy data for synthetic stratigraphic sections.
prior: arviz.InferenceData: An arviz.InferenceData object containing the prior draw from the model used to generate synthetic observations.
model: pymc.Model: pymc.model.core.Model object used to generate synthetic observations.

stratmc.synthetics.synthetic_sections(true_time, true_proxy, num_sections, num_samples, max_section_thickness, proxies=['d13c'], noise=False, noise_amp=0.1, min_constraints=2, max_constraints=3, seed=None, **kwargs)

Function for generating synthetic proxy observations and age constraints using a predefined proxy signal.

Parameters:

true_time: numpy.array(float): True time vector for input signal.
true_proxy: numpy.array(float) or dict{numpy.array(float)}: True proxy vector for input signal. If generating synthetic data for multiple proxies, pass as a dictionary with proxy names as keys.
num_sections: int: Number of synthetic sections to generate.
num_samples: int: Number of samples per synthetic section.
max_section_thickness: float: Maximum thickness of synthetic sections.
proxies: str or list(str), optional: Column name(s) for synthetic proxy observations in sample_df. Defaults to ['d13c'].
noise: bool, optional: Whether to add white noise to proxy observations. Defaults to False.
noise_amp: float or dict{float}, optional: Amplitude of white noise added to proxy observations (if noise is True). To specify a different noise amplitude for each proxy, pass as a dictionary with proxy names as keys. Defaults to 0.1.
min_constraints: int, optional: Minimum number of age constraints per synthetic section (must be at least 2). Defaults to 2.
max_constraints: int, optional: Maximum number of age constraints per synthetic section. Defaults to 3.
seed: int, optional: Random seed used to generate synthetic sections.

Returns:

sample_df: pandas.DataFrame: pandas.DataFrame containing proxy data for synthetic sections.
ages_df: pandas.DataFrame: pandas.DataFrame containing age constraints for synthetic sections.
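
The core step, sampling a known signal at a set of ages and optionally adding white noise, can be sketched in NumPy. This is an illustration under stated assumptions rather than the stratmc implementation: `sample_section` is a hypothetical helper, true_time is assumed to be increasing, noise_amp is treated as the standard deviation of Gaussian noise, and the generation of section heights and age constraints is omitted.

```python
import numpy as np


def sample_section(true_time, true_proxy, num_samples, noise_amp=0.0, seed=None):
    """Sample a known proxy signal at random ages within its time range.

    Hypothetical helper: returns (ages, proxy values), ages sorted ascending.
    """
    rng = np.random.default_rng(seed)
    # Random sample ages; np.interp requires true_time to be increasing.
    ages = np.sort(rng.uniform(true_time.min(), true_time.max(), size=num_samples))
    proxy = np.interp(ages, true_time, true_proxy)
    if noise_amp > 0:
        # White noise, assumed Gaussian with standard deviation noise_amp.
        proxy = proxy + rng.normal(0.0, noise_amp, size=num_samples)
    return ages, proxy


true_time = np.linspace(0.0, 100.0, 201)
true_proxy = 3.0 * np.sin(true_time / 15.0)
ages, proxy = sample_section(true_time, true_proxy, num_samples=20,
                             noise_amp=0.1, seed=42)
```

Fixing the seed makes the synthetic section reproducible, which matters when comparing inference runs against the same known signal.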

stratmc.synthetics.synthetic_signal_from_prior(ages, num_signals=100, ls_dist='Wald', ls_min=0, ls_mu=20, ls_lambda=50, ls_sigma=50, var_sigma=10, gp_mean_mu=0, gp_mean_sigma=5, seed=None)

Draws synthetic signals from the model prior, and returns the signal conditioned over the points in ages. To generate both signals and synthetic stratigraphic sections, instead use synthetic_observations_from_prior().

Parameters:

ages: numpy.array(float): Array of ages over which to condition the signal.
num_signals: int, optional: Number of signals to draw from prior. Defaults to 100.
ls_dist: str, optional: Prior distribution for the lengthscale hyperparameter of the exponentiated quadratic covariance kernel (pymc.gp.cov.ExpQuad); set to Wald (pymc.Wald) or HalfNormal (pymc.HalfNormal). Defaults to Wald with mu = 20 and lambda = 50; to change mu and lambda, pass the ls_mu and ls_lambda parameters. For HalfNormal, the scale parameter defaults to sigma = 50; change it by passing ls_sigma.
ls_min: float, optional: Minimum value for the lengthscale hyperparameter of the pymc.gp.cov.ExpQuad covariance kernel; shifts the lengthscale prior by ls_min. Defaults to 0.
ls_mu: float, optional: Mean (mu) of the pymc.gp.cov.ExpQuad lengthscale prior if ls_dist = `Wald`. Defaults to 20.
ls_lambda: float, optional: Relative precision (lam) of the pymc.gp.cov.ExpQuad lengthscale hyperparameter prior if ls_dist = `Wald`. Defaults to 50.
ls_sigma: float, optional: Scale parameter (sigma) of the pymc.gp.cov.ExpQuad lengthscale hyperparameter prior if ls_dist = `HalfNormal`. Defaults to 50.
var_sigma: float, optional: Scale parameter (sigma) of the covariance kernel variance hyperparameter prior, which is a pymc.HalfNormal distribution. Defaults to 10.
gp_mean_mu: float, optional: Mean (mu) of the GP mean function prior, which is a pymc.Normal distribution. Defaults to 0.
gp_mean_sigma: float, optional: Standard deviation (sigma) of the GP mean function prior, which is a pymc.Normal distribution. Defaults to 5.
seed: int, optional: Random seed used to generate signals.

Returns:

signal: numpy.ndarray(float): Array with shape (number of ages, number of signals) containing the n = num_signals synthetic signals drawn from the prior.
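
The priors described above can be combined into a NumPy-only sketch of a Gaussian process prior draw. This does not go through PyMC and is not the stratmc code path. The helper name `draw_gp_prior_signals`, the constant mean function, the use of the HalfNormal draw directly as the kernel variance, and the `jitter` term are all assumptions made for illustration.

```python
import numpy as np


def draw_gp_prior_signals(ages, num_signals=100, ls_min=0.0, ls_mu=20.0,
                          ls_lambda=50.0, var_sigma=10.0, gp_mean_mu=0.0,
                          gp_mean_sigma=5.0, jitter=1e-6, seed=None):
    """Draw signals from a GP prior with an exponentiated quadratic kernel.

    Hypothetical NumPy illustration; returns shape (len(ages), num_signals).
    """
    rng = np.random.default_rng(seed)
    ages = np.asarray(ages, dtype=float)
    sq_dist = (ages[:, None] - ages[None, :]) ** 2
    signals = np.empty((ages.size, num_signals))
    for i in range(num_signals):
        # One set of hyperparameters per signal.
        lengthscale = ls_min + rng.wald(ls_mu, ls_lambda)  # shifted Wald prior
        variance = abs(rng.normal(0.0, var_sigma))         # HalfNormal prior
        mean = rng.normal(gp_mean_mu, gp_mean_sigma)       # Normal prior (constant mean)
        # ExpQuad correlation: exp(-d^2 / (2 * lengthscale^2)); the jitter
        # on the diagonal keeps the Cholesky factorization stable.
        corr = np.exp(-0.5 * sq_dist / lengthscale**2) + jitter * np.eye(ages.size)
        chol = np.linalg.cholesky(corr)
        signals[:, i] = mean + np.sqrt(variance) * (chol @ rng.standard_normal(ages.size))
    return signals


ages = np.linspace(0.0, 100.0, 50)
signals = draw_gp_prior_signals(ages, num_signals=10, seed=1)
```

Each column is one signal evaluated at every entry of `ages`, matching the shape convention of the return value described above.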

stratmc.synthetics.synthetic_signal_to_df(proxy_vec, heights, section_ages, section_names, ages, age_std, age_heights, age_section_names, proxies=['d13c'])

Helper function for generating artificial sample and age data using synthetic_sections().

Parameters:

proxy_vec: np.array(float) or dict{np.array(float)}: Array of proxy observations. Pass as a dictionary if more than one proxy.
heights: np.array(float): Array of heights corresponding to proxy observations in proxy_vec.
section_ages: np.array(float): Array of ages corresponding to proxy observations in proxy_vec.
section_names: np.array(str): Array of section names corresponding to proxy observations in proxy_vec.
ages: np.array(float): Array of age constraints.
age_std: np.array(float): Array of uncertainties for each age constraint in ages.
age_heights: np.array(float): Array of heights for each age constraint in ages.
age_section_names: np.array(str): Array of section names corresponding to age constraints in ages.
proxies: str or list(str), optional: Name(s) of proxies. Defaults to ['d13c'].

Returns:

sample_df: pandas.DataFrame: pandas.DataFrame containing proxy data for synthetic sections.
ages_df: pandas.DataFrame: pandas.DataFrame containing age constraints for synthetic sections.