Loading and processing data

Functions for importing and processing proxy and age constraint data.

`load_data`	Import and pre-process proxy data and age constraints from .csv files formatted according to the Data table formatting guidelines.
`combine_data`	Helper function for merging `pandas.DataFrame` objects containing proxy observations or age constraints.
`downsample`	Subsample a set of proxy observations.
`load_object`	Custom load command for pickle (.pkl) object (variables can be saved as .pkl files with `save_object()`).
`load_trace`	Custom load command for NetCDF file containing a trace (`arviz.InferenceData` object saved with `save_trace()`).
`save_object`	Save variable as a pickle (.pkl) object.
`save_trace`	Save trace (`arviz.InferenceData` object) as a NetCDF file.
`combine_traces`	Helper function for combining multiple `arviz.InferenceData` objects (saved as NetCDF files) that contain prior and posterior samples for the same inference model (sampled with `get_trace()` in `stratmc.inference`).
`drop_chains`	Remove a subset of chains from an `arviz.InferenceData` object.
`thin_trace`	Remove a subset of draws from an `arviz.InferenceData` object.
`accumulation_rate`	Calculate apparent sediment accumulation rate between successive samples (if `method = 'successive'`) or every possible sample pairing (`method = 'all'`).
`clean_data`	Helper function for cleaning sample data before running an inversion.
`depth_to_height`	Helper function for converting depth in core to height in section.
`combine_duplicates`	Helper function for combining multiple proxy measurements from the same stratigraphic horizon.
`get_boundaries`	Helper function for `downsample()`.
`remove_extra_bounds`	Helper function for `downsample()`; removes duplicate or extraneous boundaries from list of candidate cluster boundaries.

stratmc.data.accumulation_rate(full_trace, sample_df, ages_df, method='all', age_model='posterior', include_age_constraints=True, **kwargs)

Calculate apparent sediment accumulation rate between successive samples (if method = 'successive') or every possible sample pairing (method = 'all').

Note that if method = 'all', rate is returned in mm/year, and duration is returned in years. If method = 'successive', rate is returned in m/Myr, and duration is returned in Myr. Input data are assumed to have units of meters and millions of years. Used as input to sadler_plot() and accumulation_rate_stratigraphy() in stratmc.plotting.

Parameters:

full_trace: arviz.InferenceData: An arviz.InferenceData object containing the full set of prior and posterior samples from get_trace() in stratmc.inference.
sample_df: pandas.DataFrame: pandas.DataFrame containing all proxy data.
ages_df: pandas.DataFrame: pandas.DataFrame containing age constraints from all sections.
method: str, optional: Whether to calculate accumulation rates between every possible sample pairing (‘all’), or between successive samples (‘successive’); defaults to ‘all’.
age_model: str, optional: Whether to calculate accumulation rates using the posterior or prior age model for each section; defaults to ‘posterior’.
include_age_constraints: bool, optional: Whether to include radiometric age constraints in accumulation rate calculations; defaults to True.
sections: list(str) or numpy.array(str), optional: List of sections to include. Defaults to all sections in sample_df.

Returns:

rate_df: pandas.DataFrame: pandas.DataFrame containing sediment accumulation rates and associated durations.
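
The difference between the two pairing modes, and the unit conversion between m/Myr and mm/year, can be sketched in plain Python. This is only an illustration under stated assumptions: `apparent_rates` and `m_per_myr_to_mm_per_yr` are hypothetical helpers, not part of stratmc, and they work on a single set of heights (meters) and ages (Ma) rather than on a full trace.

```python
from itertools import combinations


def apparent_rates(heights_m, ages_myr, method="all"):
    """Hypothetical helper (not part of stratmc): apparent accumulation rates.

    Returns a list of (rate in m/Myr, duration in Myr) tuples.
    """
    # Order sample indices from stratigraphically lowest to highest.
    order = sorted(range(len(heights_m)), key=lambda i: heights_m[i])
    if method == "successive":
        pairs = list(zip(order[:-1], order[1:]))
    elif method == "all":
        pairs = list(combinations(order, 2))
    else:
        raise ValueError("method must be 'all' or 'successive'")

    rates = []
    for lower, upper in pairs:
        thickness_m = heights_m[upper] - heights_m[lower]
        duration_myr = ages_myr[lower] - ages_myr[upper]  # lower sample is older
        if duration_myr <= 0:
            continue  # skip pairs with no resolvable time difference
        rates.append((thickness_m / duration_myr, duration_myr))
    return rates


def m_per_myr_to_mm_per_yr(rate):
    """Convert m/Myr to mm/year: 1 m/Myr = 1000 mm / 1e6 yr = 0.001 mm/yr."""
    return rate * 1e-3


heights = [0.0, 10.0, 30.0]    # meters
ages = [540.0, 539.0, 538.0]   # Ma
successive = apparent_rates(heights, ages, method="successive")
all_pairs = apparent_rates(heights, ages, method="all")
```

With three samples, 'successive' yields two rates and 'all' yields three, because 'all' also pairs the lowest sample with the highest.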

stratmc.data.clean_data(sample_df, ages_df, proxies, sections)

Helper function for cleaning sample data before running an inversion. Sets Exclude? to True for samples with no relevant proxy observations, removes sections where all samples have been excluded, and drops excluded age constraints.

Parameters:

sample_df: pandas.DataFrame: pandas.DataFrame containing proxy data for all sections.
ages_df: pandas.DataFrame: pandas.DataFrame containing age constraints for all sections.
proxies: str or list(str): Proxies to include in the inference.
sections: list(str) or numpy.array(str): List of sections to include in the inference (as named in sample_df and ages_df).

Returns:

sample_df: pandas.DataFrame: pandas.DataFrame containing cleaned proxy data for all sections.
ages_df: pandas.DataFrame: pandas.DataFrame containing cleaned age constraint data for all sections.
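
The sample-cleaning steps described above can be sketched on plain dictionaries. This is a minimal sketch, not stratmc's implementation: `clean_samples` is a hypothetical helper, rows are dictionaries rather than DataFrame rows, and a missing proxy value is assumed to be stored as `None`.

```python
def clean_samples(samples, proxies, sections):
    """Hypothetical stand-in for the sample-cleaning steps (not stratmc's code)."""
    cleaned = []
    for row in samples:
        if row["section"] not in sections:
            continue
        row = dict(row)  # copy so the caller's data are not modified
        # Exclude samples that have no observation for any requested proxy.
        if not any(row.get(proxy) is not None for proxy in proxies):
            row["Exclude?"] = True
        cleaned.append(row)

    # Keep only sections that still contain at least one included sample.
    active = {row["section"] for row in cleaned if not row["Exclude?"]}
    return [row for row in cleaned if row["section"] in active]


samples = [
    {"section": "A", "height": 0.0, "d13c": 1.2, "Exclude?": False},
    {"section": "A", "height": 1.0, "d13c": None, "Exclude?": False},
    {"section": "B", "height": 0.0, "d13c": None, "Exclude?": False},
]
result = clean_samples(samples, proxies=["d13c"], sections=["A", "B"])
```

Section B is removed because its only sample has no d13c value, while the second sample in section A is kept but marked as excluded.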

stratmc.data.combine_data(dataframes)

Helper function for merging pandas.DataFrame objects containing proxy observations or age constraints. Data are merged using the section and height columns.

Parameters:

dataframes: list(pandas.DataFrame): List of pandas.DataFrame objects to merge.

Returns:

merged_data: pandas.DataFrame: pandas.DataFrame containing merged data.

stratmc.data.combine_duplicates(sample_df, proxies, proxy_sigma_default=0.1, combine_no_superposition=False)

Helper function for combining multiple proxy measurements from the same stratigraphic horizon. For each horizon with multiple proxy values, replaces the proxy value with the mean, and replaces the standard deviation with the combined uncertainty (proxy_std values summed in quadrature) for all measurements. The standard deviation of the population of proxy values for each horizon is stored in the proxy_population_std column of sample_df (in build_model(), the uncertainty of each proxy observation is modeled as the proxy_std and proxy_population_std values summed in quadrature).

Parameters:

sample_df: pandas.DataFrame: pandas.DataFrame containing proxy data for all sections.
proxies: list(str): List of proxies to include in the inference.
proxy_sigma_default: float or dict{float}, optional: Measurement uncertainty (\(1\sigma\)) to use for proxy observations if not specified in proxy_std column of sample_df. To set a different value for each proxy, pass a dictionary with proxy names as keys. Defaults to 0.1.
combine_no_superposition: bool, optional: Whether to combine samples without superposition information by averaging their proxy values; defaults to False.

Returns:

sample_df: pandas.DataFrame: pandas.DataFrame containing proxy data with duplicates combined.
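
The arithmetic described above for a single horizon can be written out as a short worked example. This is a sketch under assumptions: `combine_horizon` is a hypothetical helper, not a stratmc function, and the population standard deviation is computed without a degrees-of-freedom correction, which may differ from the convention stratmc uses.

```python
import math
import statistics


def combine_horizon(values, stds):
    """Hypothetical helper (not part of stratmc) for one stratigraphic horizon."""
    mean = statistics.fmean(values)
    # Measurement uncertainties (proxy_std) summed in quadrature.
    combined_std = math.sqrt(sum(s ** 2 for s in stds))
    # Spread of the replicate values (assumed convention: no ddof correction).
    population_std = statistics.pstdev(values)
    # Total uncertainty: the two terms summed in quadrature, as described for build_model().
    total_std = math.hypot(combined_std, population_std)
    return mean, combined_std, population_std, total_std


mean, combined_std, population_std, total_std = combine_horizon(
    values=[1.0, 3.0], stds=[0.3, 0.4]
)
```

For these two measurements the mean is 2.0, the combined measurement uncertainty is 0.5, the population standard deviation is 1.0, and the total uncertainty is about 1.12.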

stratmc.data.combine_traces(trace_list)

Helper function for combining multiple arviz.InferenceData objects (saved as NetCDF files) that contain prior and posterior samples for the same inference model (sampled with get_trace() in stratmc.inference). The arviz.InferenceData objects are concatenated along the chain dimension such that if two traces with 8 chains each are concatenated, the new combined trace will have 16 chains.

Parameters:

trace_list: list(str): List of paths to arviz.InferenceData objects (saved as NetCDF files) to be merged.

Returns:

combined_trace: arviz.InferenceData: New arviz.InferenceData object containing the prior and posterior draws for all traces in trace_list.
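
Concatenation along the chain dimension can be illustrated with nested lists standing in for traces. This is a toy sketch: `concat_chains` is a hypothetical helper, and each "trace" here is just a list of chains, where each chain is a list of draws. In arviz itself, `arviz.concat(..., dim="chain")` performs the analogous operation on `InferenceData` objects; whether stratmc uses it internally is not stated here.

```python
def concat_chains(traces):
    """Hypothetical stand-in: stack the chains of several traces end to end."""
    n_draws = {len(chain) for trace in traces for chain in trace}
    if len(n_draws) > 1:
        raise ValueError("all chains must have the same number of draws")
    combined = []
    for trace in traces:
        combined.extend(trace)  # chains are appended; draws are untouched
    return combined


# Two toy traces with 8 chains of 100 draws each.
trace_a = [[0.1 * draw for draw in range(100)] for _ in range(8)]
trace_b = [[0.2 * draw for draw in range(100)] for _ in range(8)]
combined = concat_chains([trace_a, trace_b])
```

The combined object has 16 chains, each still holding 100 draws.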

stratmc.data.depth_to_height(sample_df, ages_df)

Helper function for converting depth in core to height in section.

Parameters:

sample_df: pandas.DataFrame: pandas.DataFrame containing proxy data for all sections.
ages_df: pandas.DataFrame: pandas.DataFrame containing age constraints for all sections.

Returns:

sample_df: pandas.DataFrame: pandas.DataFrame containing proxy data for all sections, with depth in core converted to height in section.
ages_df: pandas.DataFrame: pandas.DataFrame containing age constraints for all sections, with depth in core converted to height in section.
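
One way to picture the conversion is to measure height upward from the deepest point recorded in each section, using the same datum for samples and age constraints. The sketch below assumes exactly that convention, which may not match stratmc's; `depths_to_heights` and the `depth` key are hypothetical names used only for illustration.

```python
def depths_to_heights(samples, ages):
    """Hypothetical helper: height = deepest depth in the section - depth."""
    # Find the deepest point per section across BOTH tables, so that samples
    # and age constraints share one datum.
    deepest = {}
    for row in samples + ages:
        section = row["section"]
        deepest[section] = max(deepest.get(section, row["depth"]), row["depth"])

    def convert(rows):
        return [
            {**row, "height": deepest[row["section"]] - row["depth"]}
            for row in rows
        ]

    return convert(samples), convert(ages)


samples = [
    {"section": "core1", "depth": 5.0},
    {"section": "core1", "depth": 20.0},
]
ages = [{"section": "core1", "depth": 25.0}]
samples_h, ages_h = depths_to_heights(samples, ages)
```

Both tables are passed together because converting them separately could assign different zero levels to samples and age constraints from the same core.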

stratmc.data.downsample(sample_df, ages_df, N=5000, likelihood_ratio_min=0.5, proxy='d13c', keep='best', keep_seed=None, resample_with_lowest_n=True, flexible_n=True, best_criteria='corr_coef', split_environments=True, **kwargs)

Subsample a set of proxy observations. Calculates the likelihood of the original stratigraphic signal given the subsampled signal and uncertainty in the data. Returns the solution that meets the mean likelihood ratio minimum with the lowest number of downsampled data points. See input parameter descriptions for additional details.

Parameters:

sample_df: pandas.DataFrame: pandas.DataFrame containing proxy data for all sections.
ages_df: pandas.DataFrame: pandas.DataFrame containing age constraints for all sections.
N: int: Number of random sample groupings to test. Defaults to 5,000.
likelihood_ratio_min: float or dict{float}, optional: Minimum acceptable likelihood ratio. For each section, the algorithm selects the smallest downsampled data set that meets this threshold. If multiple solutions with this minimum number of data points exist, then the solution with the highest correlation coefficient is selected if keep is ‘best’, while a random one of these solutions is selected if keep is ‘random’. Must be in [0, 1]; defaults to 0.5. Pass as a dictionary to specify a different value for each section.
keep: str: If there are multiple solutions that satisfy likelihood_ratio_min using the minimum possible number of data points, whether to return the best one of these solutions (‘best’), or a random solution (‘random’). Defaults to ‘best’.
flexible_n: bool: Whether to consider solutions with 1 more data point than the minimum. Defaults to True.
resample_with_lowest_n: bool: Whether to generate another N random solutions with the minimum number of data points required to satisfy likelihood_ratio_min (or one more than the minimum number of data points, if flexible_n = True). Improves exploration of the solution space. Defaults to True.
best_criteria: str: Which metric to use to identify the best solution among the candidate solutions that meet or exceed likelihood_ratio_min (if keep is ‘best’). Either ‘likelihood_ratio’ (mean likelihood ratio) or ‘corr_coef’ (maximum Pearson correlation coefficient); defaults to ‘corr_coef’.
split_environments: bool: Whether to insert breaks between different depositional environments (using ‘Depositional Environment’ column in sample_df). Defaults to True.
proxy: str, optional: Proxy to downsample. Defaults to ‘d13c’.
sections: list(str) or numpy.array(str), optional: List of sections to downsample. Defaults to all sections in sample_df.

Returns:

downsampled_data: pandas.DataFrame: pandas.DataFrame containing downsampled proxy data. All samples are still included in the DataFrame, but samples that were excluded during downsampling are marked Exclude? = True.
solution_likelihood_ratios: dict: Dictionary with the mean likelihood ratio for chosen solutions; keys are section names.
solution_corr_coefs: dict: Dictionary with the correlation coefficients for chosen solutions; keys are section names.
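
How the selection parameters interact can be sketched for one section, given candidate solutions that have already been scored. This is an interpretation of the parameter descriptions above, not stratmc's code: `pick_solution` and the candidate dictionaries (with keys `n`, `likelihood_ratio`, and `corr_coef`) are hypothetical, and the scoring step itself is not shown.

```python
import random


def pick_solution(candidates, likelihood_ratio_min=0.5, keep="best",
                  flexible_n=True, best_criteria="corr_coef", seed=None):
    """Hypothetical helper: choose one downsampled solution for a section."""
    # 1. Discard candidates below the likelihood ratio threshold.
    ok = [c for c in candidates if c["likelihood_ratio"] >= likelihood_ratio_min]
    if not ok:
        return None
    # 2. Keep the smallest solutions (optionally allowing one extra point).
    n_min = min(c["n"] for c in ok)
    n_max = n_min + 1 if flexible_n else n_min
    pool = [c for c in ok if c["n"] <= n_max]
    # 3. Choose among the remaining solutions.
    if keep == "random":
        return random.Random(seed).choice(pool)
    return max(pool, key=lambda c: c[best_criteria])


candidates = [
    {"n": 5, "likelihood_ratio": 0.40, "corr_coef": 0.99},
    {"n": 6, "likelihood_ratio": 0.60, "corr_coef": 0.80},
    {"n": 6, "likelihood_ratio": 0.70, "corr_coef": 0.75},
    {"n": 7, "likelihood_ratio": 0.90, "corr_coef": 0.95},
    {"n": 9, "likelihood_ratio": 0.95, "corr_coef": 0.99},
]
default_pick = pick_solution(candidates)
strict_pick = pick_solution(candidates, flexible_n=False)
lr_pick = pick_solution(candidates, flexible_n=False,
                        best_criteria="likelihood_ratio")
```

In this example the 5-point candidate fails the threshold, so the minimum is 6 points. With flexible_n the 7-point solution wins on correlation coefficient; without it, the choice is restricted to the two 6-point solutions.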

stratmc.data.drop_chains(full_trace, chains)

Remove a subset of chains from an arviz.InferenceData object.

Parameters:

full_trace: arviz.InferenceData: An arviz.InferenceData object containing the full set of prior and posterior samples from get_trace() in stratmc.inference.
chains: list or np.array of int: Indices of chains to remove from full_trace.

Returns:

full_trace_clean: arviz.InferenceData: Copy of full_trace without the chains specified in chains.

stratmc.data.get_boundaries(sample_df, ages_df, proxy, section, environment=True, depositional_ages=True, superposition=True)

Helper function for downsample(). Returns list of height boundaries where the target section must be split into different groups. By default, inserts breaks between samples from different depositional environments, around groups of samples with the same depositional age, and around groups of samples without superposition information.

Parameters:

sample_df: pandas.DataFrame: pandas.DataFrame containing all proxy data.
ages_df: pandas.DataFrame: pandas.DataFrame containing age constraints from all sections.
proxy: str: Name of proxy to be downsampled.
section: str: Name of target section.
environment: bool: Whether to insert breaks between different depositional environments (requires ‘Depositional Environment’ column in sample_df). Defaults to True.
depositional_ages: bool: Whether to insert breaks around groups of samples with the same depositional age constraint. Defaults to True.
superposition: bool: Whether to insert breaks around groups of samples without superposition information. Defaults to True.

Returns:

boundary_heights: numpy.array: Array containing required cluster boundaries.

stratmc.data.load_data(sample_file, ages_file, proxies=['d13c'], proxy_sigma_default=0.1, drop_excluded_samples=True, drop_excluded_ages=True, combine_no_superposition=False)

Import and pre-process proxy data and age constraints from .csv files formatted according to the Data table formatting guidelines. To combine data from different .csv files, load each file separately and then combine the DataFrames with combine_data().

By default, samples marked Exclude? = True will be dropped from the data table. If sample_file.csv includes multiple proxy observations from the same stratigraphic horizon (for a given proxy), then all measurements marked Exclude? = False and superposition? = True will be combined using combine_duplicates(). Samples marked superposition? = False will remain separate, and their order will be randomized within the inference model. These default behaviors can be modified by passing the drop_excluded_samples and combine_no_superposition arguments.

Parameters:

sample_file: str: Path to .csv file containing proxy data for all sections (without ‘.csv’ extension).
ages_file: str: Path to .csv file containing age constraints for all sections (without ‘.csv’ extension).
proxies: str or list(str), optional: Proxy names (must match column headers in sample_file.csv); defaults to ‘d13c’.
proxy_sigma_default: float or dict{float}, optional: Measurement uncertainty (\(1\sigma\)) to use for proxy observations if not specified in proxy_std column of sample_df. To set a different value for each proxy, pass a dictionary with proxy names as keys. Defaults to 0.1.
drop_excluded_samples: bool, optional: Whether to remove samples with Exclude? = True from the sample_df; defaults to True. If excluded samples are not dropped, their ages will be passively tracked within the inference model (but they will not be considered during the proxy signal reconstruction).
drop_excluded_ages: bool, optional: Whether to remove ages with Exclude? = True from the ages_df; defaults to True.
combine_no_superposition: bool, optional: Whether to combine samples without superposition information by averaging their proxy values; defaults to False.

Returns:

sample_df: pandas.DataFrame: pandas.DataFrame containing proxy data for all sections.
ages_df: pandas.DataFrame: pandas.DataFrame containing age constraints for all sections.

stratmc.data.load_object(path)

Custom load command for pickle (.pkl) object (variables can be saved as .pkl files with save_object()).

Parameters:

path: str: Path to saved .pkl file (without the ‘.pkl’ extension).

Returns:

var: Variable saved in path.

stratmc.data.load_trace(path)

Custom load command for NetCDF file containing a trace (arviz.InferenceData object saved with save_trace()).

Parameters:

path: str: Path to saved NetCDF file (without the ‘.nc’ extension).

Returns:

trace: arviz.InferenceData: Trace saved as NetCDF file.

stratmc.data.remove_extra_bounds(heights, boundaries)

Helper function for downsample(); removes duplicate or extraneous boundaries from list of candidate cluster boundaries.

Parameters:

heights: numpy.array: Array containing heights for samples in group.
boundaries: numpy.array: Array containing heights of candidate boundaries between groups.

Returns:

centroid: np.array: Array containing centroid coordinates: [proxy_center, height_center]

stratmc.data.save_object(var, path)

Save variable as a pickle (.pkl) object.

Parameters:

var: Variable to be saved.
path: str: Location (including the file name, without ‘.pkl’ extension) to save var.
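
The path convention shared by save_object() and load_object(), where the path is given without the '.pkl' extension, can be sketched with the standard pickle module. The helpers `save_pickle` and `load_pickle` and the file name `model_settings` are hypothetical. They mimic the documented convention but are not stratmc's implementation.

```python
import os
import pickle
import tempfile


def save_pickle(var, path):
    """Hypothetical helper: write var to '<path>.pkl'."""
    with open(path + ".pkl", "wb") as f:
        pickle.dump(var, f)


def load_pickle(path):
    """Hypothetical helper: read the variable stored in '<path>.pkl'."""
    with open(path + ".pkl", "rb") as f:
        return pickle.load(f)


settings = {"proxies": ["d13c"], "chains": 8}
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "model_settings")  # note: no extension
    save_pickle(settings, base)
    file_exists = os.path.exists(base + ".pkl")
    restored = load_pickle(base)
```

Because the extension is appended by the helpers, the same extension-free path is used for both saving and loading.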

stratmc.data.save_trace(trace, path)

Save trace (arviz.InferenceData object) as a NetCDF file.

Parameters:

trace: arviz.InferenceData: An arviz.InferenceData object containing the full set of prior and posterior samples from build_model() in stratmc.model (the output of get_trace() in stratmc.inference).
path: str: Location (including the file name, without ‘.nc’ extension) to save trace.

stratmc.data.thin_trace(full_trace, drop_freq=2)

Remove a subset of draws from an arviz.InferenceData object. Only applies to groups associated with the posterior (the prior draws will not be affected).

Parameters:

full_trace: arviz.InferenceData: An arviz.InferenceData object containing the full set of prior and posterior samples from get_trace() in stratmc.inference.
drop_freq: int: Frequency of draw removal. For example, 2 will remove every other draw, while 4 will remove every fourth draw.

Returns:

thinned_trace: arviz.InferenceData: Thinned version of full_trace.
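
The meaning of drop_freq can be illustrated on a plain list of draws. This is a sketch only: `thin_draws` is a hypothetical helper, and which draw within each block of drop_freq draws is removed (here, the last one) is an assumption.

```python
def thin_draws(draws, drop_freq=2):
    """Hypothetical helper: drop every drop_freq-th draw from a sequence."""
    if drop_freq < 2:
        raise ValueError("drop_freq must be at least 2")
    # Positions are counted from 1, so drop_freq=2 removes the 2nd, 4th, ... draw.
    return [d for i, d in enumerate(draws) if (i + 1) % drop_freq != 0]


draws = list(range(8))
every_other = thin_draws(draws, drop_freq=2)
every_fourth = thin_draws(draws, drop_freq=4)
```

Note that a larger drop_freq removes fewer draws: drop_freq=2 keeps half of the draws, while drop_freq=4 keeps three quarters of them.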
