Loading and processing data
Functions for importing and processing proxy and age constraint data.
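load_data() | Import and pre-process proxy data and age constraints from .csv files formatted according to the Data table formatting guidelines.
combine_data() | Helper function for merging pandas.DataFrame objects containing proxy observations or age constraints.
downsample() | Subsample a set of proxy observations.
load_object() | Custom load command for pickle (.pkl) object (variables can be saved as .pkl files with save_object()).
load_trace() | Custom load command for NetCDF file containing a trace (arviz.InferenceData object saved with save_trace()).
save_object() | Save variable as a pickle (.pkl) object.
save_trace() | Save trace (arviz.InferenceData object) as a NetCDF file.
combine_traces() | Helper function for combining multiple arviz.InferenceData objects (saved as NetCDF files) that contain prior and posterior samples for the same inference model.
drop_chains() | Remove a subset of chains from an arviz.InferenceData object.
thin_trace() | Remove a subset of draws from an arviz.InferenceData object.
accumulation_rate() | Calculate apparent sediment accumulation rate between successive samples (if method = 'successive') or every possible sample pairing (method = 'all').
clean_data() | Helper function for cleaning sample data before running an inversion.
depth_to_height() | Helper function for converting depth in core to height in section.
combine_duplicates() | Helper function for combining multiple proxy measurements from the same stratigraphic horizon.
get_boundaries() | Helper function for downsample().
remove_extra_bounds() | Helper function for downsample(); removes duplicate or extraneous boundaries from list of candidate cluster boundaries.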
- stratmc.data.accumulation_rate(full_trace, sample_df, ages_df, method='all', age_model='posterior', include_age_constraints=True, **kwargs)
Calculate apparent sediment accumulation rate between successive samples (if method = 'successive') or every possible sample pairing (method = 'all'). Note that if method = 'all', rate is returned in mm/year, and duration is returned in years. If method = 'successive', rate is returned in m/Myr, and duration is returned in Myr. Input data are assumed to have units of meters and millions of years. Used as input to sadler_plot() and accumulation_rate_stratigraphy() in stratmc.plotting.
- Parameters:
- full_trace: arviz.InferenceData
An arviz.InferenceData object containing the full set of prior and posterior samples from get_trace() in stratmc.inference.
- sample_df: pandas.DataFrame
pandas.DataFrame containing all proxy data.
- ages_df: pandas.DataFrame
pandas.DataFrame containing age constraints from all sections.
- method: str, optional
Whether to calculate accumulation rates between every possible sample pairing (‘all’), or between successive samples (‘successive’); defaults to ‘all’.
- age_model: str, optional
Whether to calculate accumulation rates using the posterior or prior age model for each section; defaults to ‘posterior’.
- include_age_constraints: bool, optional
Whether to include radiometric age constraints in accumulation rate calculations; defaults to True.
- sections: list(str) or numpy.array(str), optional
List of sections to include. Defaults to all sections in sample_df.
- Returns:
- rate_df: pandas.DataFrame
pandas.DataFrame containing sediment accumulation rates and associated durations.
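The difference between the two pairing methods and their output units can be illustrated with a minimal, standard-library sketch. This is not the stratmc implementation: the function name apparent_rates, the plain-list inputs, and the choice to skip pairs with non-positive duration are assumptions made for this example. It takes heights in meters and ages in millions of years (older at the base) and mirrors the unit conventions described above.

```python
from itertools import combinations


def apparent_rates(heights_m, ages_ma, method="all"):
    """Return (rate, duration) tuples for pairs of dated samples.

    Hypothetical helper for illustration only (not part of stratmc).
    method='successive': rate in m/Myr, duration in Myr.
    method='all': rate in mm/year, duration in years.
    """
    # Order samples stratigraphically, lowest height first.
    samples = sorted(zip(heights_m, ages_ma))

    if method == "successive":
        pairs = zip(samples[:-1], samples[1:])
    elif method == "all":
        pairs = combinations(samples, 2)
    else:
        raise ValueError("method must be 'all' or 'successive'")

    results = []
    for (h_low, age_low), (h_high, age_high) in pairs:
        thickness_m = h_high - h_low
        duration_myr = age_low - age_high  # ages decrease up-section
        if duration_myr <= 0:
            continue  # assumption: skip pairs with no resolvable duration
        rate_m_per_myr = thickness_m / duration_myr
        if method == "all":
            # 1 m/Myr = 1000 mm per 1e6 years = 0.001 mm/year
            results.append((rate_m_per_myr * 1e-3, duration_myr * 1e6))
        else:
            results.append((rate_m_per_myr, duration_myr))
    return results


successive = apparent_rates([0.0, 10.0, 30.0], [100.0, 99.0, 95.0], "successive")
all_pairs = apparent_rates([0.0, 10.0, 30.0], [100.0, 99.0, 95.0], "all")
```

With three samples, 'successive' yields two pairs while 'all' yields three, which is why the 'all' output is the natural input for a rate-versus-duration (Sadler-style) plot.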
- stratmc.data.clean_data(sample_df, ages_df, proxies, sections)
Helper function for cleaning sample data before running an inversion. Sets Exclude? to True for samples with no relevant proxy observations, removes sections where all samples have been excluded, and drops excluded age constraints.
- Parameters:
- sample_df: pandas.DataFrame
pandas.DataFrame containing proxy data for all sections.
- ages_df: pandas.DataFrame
pandas.DataFrame containing age constraints for all sections.
- proxies: str or list(str)
Proxies to include in the inference.
- sections: list(str) or numpy.array(str)
List of sections to include in the inference (as named in sample_df and ages_df).
- Returns:
- sample_df: pandas.DataFrame
pandas.DataFrame containing cleaned proxy data for all sections.
- ages_df: pandas.DataFrame
pandas.DataFrame containing cleaned age constraint data for all sections.
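The three cleaning steps described above can be sketched on plain lists of dictionaries. This is an illustration under stated assumptions, not the library's code: the function name clean_rows is hypothetical, rows stand in for DataFrame records, missing proxy values are represented by None, and restricting the output to the requested sections (and dropping age constraints from removed sections) is an assumption about the behavior.

```python
def clean_rows(sample_rows, age_rows, proxies, sections):
    """Hypothetical row-based sketch of the cleaning steps (not stratmc code)."""
    if isinstance(proxies, str):
        proxies = [proxies]

    # Step 1: mark samples with no relevant proxy observation as excluded.
    samples = []
    for row in sample_rows:
        if row["section"] not in sections:
            continue  # assumption: only requested sections are kept
        row = dict(row)  # copy so the input is not modified
        has_observation = any(row.get(p) is not None for p in proxies)
        if not has_observation:
            row["Exclude?"] = True
        samples.append(row)

    # Step 2: remove sections in which every sample is excluded.
    active = {r["section"] for r in samples if not r.get("Exclude?", False)}
    samples = [r for r in samples if r["section"] in active]

    # Step 3: drop excluded age constraints (and those from removed sections).
    ages = [
        dict(r)
        for r in age_rows
        if r["section"] in active and not r.get("Exclude?", False)
    ]
    return samples, ages


example_samples = [
    {"section": "A", "height": 1.0, "d13c": 2.1, "Exclude?": False},
    {"section": "A", "height": 2.0, "d13c": None, "Exclude?": False},
    {"section": "B", "height": 1.0, "d13c": None, "Exclude?": False},
]
example_ages = [
    {"section": "A", "height": 0.0, "age": 100.0, "Exclude?": False},
    {"section": "A", "height": 3.0, "age": 95.0, "Exclude?": True},
    {"section": "B", "height": 0.0, "age": 90.0, "Exclude?": False},
]
cleaned_samples, cleaned_ages = clean_rows(
    example_samples, example_ages, "d13c", ["A", "B"]
)
```

In this example, section B disappears because its only sample has no d13c value, and the excluded age constraint in section A is dropped.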
- stratmc.data.combine_data(dataframes)
Helper function for merging pandas.DataFrame objects containing proxy observations or age constraints. Data are merged using the section and height columns.
- Parameters:
- dataframes: list(pandas.DataFrame)
List of pandas.DataFrame objects to merge.
- Returns:
- merged_data: pandas.DataFrame
pandas.DataFrame containing merged data.
- stratmc.data.combine_duplicates(sample_df, proxies, proxy_sigma_default=0.1, combine_no_superposition=False)
Helper function for combining multiple proxy measurements from the same stratigraphic horizon. For each horizon with multiple proxy values, replaces the proxy value with the mean, and replaces the standard deviation with the combined uncertainty (proxy_std values summed in quadrature) for all measurements. The standard deviation of the population of proxy values for each horizon is stored in the proxy_population_std column of sample_df (in build_model(), the uncertainty of each proxy observation is modeled as the proxy_std and proxy_population_std values summed in quadrature).
- Parameters:
- sample_df: pandas.DataFrame
pandas.DataFrame containing proxy data for all sections.
- proxies: list(str)
List of proxies to include in the inference.
- proxy_sigma_default: float or dict{float}, optional
Measurement uncertainty (\(1\sigma\)) to use for proxy observations if not specified in proxy_std column of sample_df. To set a different value for each proxy, pass a dictionary with proxy names as keys. Defaults to 0.1.
- combine_no_superposition: bool, optional
Whether to combine samples without superposition information by averaging their proxy values; defaults to False.
- Returns:
- sample_df: pandas.DataFrame
pandas.DataFrame containing proxy data with duplicates combined.
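The arithmetic described above for a single horizon can be written out as a short sketch. It is an illustration rather than the library's code: the function name combine_horizon is hypothetical, "summed in quadrature" is taken literally as the square root of the sum of squares, and the use of the population (rather than sample) standard deviation is an assumption.

```python
import math
import statistics


def combine_horizon(values, stds):
    """Combine repeated proxy measurements from one horizon (illustrative only)."""
    # Replace the proxy value with the mean of all measurements.
    mean_value = statistics.fmean(values)
    # Combined measurement uncertainty: proxy_std values summed in quadrature.
    combined_std = math.sqrt(sum(s ** 2 for s in stds))
    # Spread of the measurements themselves (assumption: population std).
    population_std = statistics.pstdev(values)
    # Total uncertainty as described for build_model(): both terms in quadrature.
    total_std = math.hypot(combined_std, population_std)
    return mean_value, combined_std, population_std, total_std


mean_value, combined_std, population_std, total_std = combine_horizon(
    [1.0, 3.0], [0.3, 0.4]
)
```

Keeping the population spread separate from the measurement uncertainty means that horizons with scattered replicate values carry a larger total uncertainty than horizons with tightly clustered replicates, even when the analytical precision is the same.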
- stratmc.data.combine_traces(trace_list)
Helper function for combining multiple arviz.InferenceData objects (saved as NetCDF files) that contain prior and posterior samples for the same inference model (sampled with get_trace() in stratmc.inference). The arviz.InferenceData objects are concatenated along the chain dimension such that if two traces with 8 chains each are concatenated, the new combined trace will have 16 chains.
- Parameters:
- trace_list: list(str)
List of paths to arviz.InferenceData objects (saved as NetCDF files) to be merged.
- Returns:
- combined_trace: arviz.InferenceData
New arviz.InferenceData object containing the prior and posterior draws for all traces in trace_list.
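What concatenation along the chain dimension means can be shown with a toy stand-in for a trace. The sketch below is not how stratmc or ArviZ store samples: each "trace" is a hypothetical dictionary mapping a variable name to a list of chains, where each chain is a list of draws, and the function name concat_chains is an assumption made for this example.

```python
def concat_chains(traces):
    """Concatenate toy traces along the chain axis (illustrative only)."""
    combined = {}
    for trace in traces:
        for name, chains in trace.items():
            # Appending chains leaves the number of draws per chain unchanged.
            combined.setdefault(name, []).extend(chains)
    return combined


# Two toy traces, each with 8 chains of 3 draws for one variable.
trace_a = {"age": [[float(c)] * 3 for c in range(8)]}
trace_b = {"age": [[float(c) + 100.0] * 3 for c in range(8)]}

combined = concat_chains([trace_a, trace_b])
n_chains = len(combined["age"])
n_draws = len(combined["age"][0])
```

As in the description above, two inputs with 8 chains each produce a combined object with 16 chains, while the number of draws per chain stays the same.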
- stratmc.data.depth_to_height(sample_df, ages_df)
Helper function for converting depth in core to height in section.
- Parameters:
- sample_df: pandas.DataFrame
pandas.DataFrame containing proxy data for all sections.
- ages_df: pandas.DataFrame
pandas.DataFrame containing age constraints for all sections.
- Returns:
- sample_df: pandas.DataFrame
pandas.DataFrame containing proxy data for all sections, with depth in core converted to height in section.
- ages_df: pandas.DataFrame
pandas.DataFrame containing age constraints for all sections, with depth in core converted to height in section.
- stratmc.data.downsample(sample_df, ages_df, N=5000, likelihood_ratio_min=0.5, proxy='d13c', keep='best', keep_seed=None, resample_with_lowest_n=True, flexible_n=True, best_criteria='corr_coef', split_environments=True, **kwargs)
Subsample a set of proxy observations. Calculates the likelihood of the original stratigraphic signal given the subsampled signal and uncertainty in the data. Returns the solution that meets the mean likelihood ratio minimum with the lowest number of downsampled data points. See input parameter descriptions for additional details.
- Parameters:
- sample_df: pandas.DataFrame
pandas.DataFrame containing proxy data for all sections.
- ages_df: pandas.DataFrame
pandas.DataFrame containing age constraints for all sections.
- N: int
Number of random sample groupings to test. Defaults to 5,000.
- likelihood_ratio_min: float or dict{float}, optional
Minimum acceptable likelihood ratio. For each section, the algorithm selects the smallest downsampled data set that meets this threshold. If multiple solutions with this minimum number of data points exist, then the solution with the highest correlation coefficient is selected if keep is ‘best’, while a random one of these solutions is selected if keep is ‘random’. Must be in [0, 1]; defaults to 0.5. Pass as a dictionary to specify a different value for each section.
- keep: str
If there are multiple solutions that satisfy likelihood_ratio_min using the minimum possible number of data points, whether to return the best one of these solutions (‘best’), or a random solution (‘random’). Defaults to ‘best’.
- flexible_n: bool
Whether to consider solutions with 1 more data point than the minimum. Defaults to True.
- resample_with_lowest_n: bool
Whether to generate another N random solutions with the minimum number of data points required to satisfy likelihood_ratio_min (or one more than the minimum number of data points, if flexible_n = True). Improves exploration of the solution space. Defaults to True.
- best_criteria: str
Which metric to use to identify the best solution among the candidate solutions that meet or exceed likelihood_ratio_min (if keep is ‘best’). Either ‘likelihood_ratio’ (mean likelihood ratio) or ‘corr_coef’ (maximum Pearson correlation coefficient); defaults to ‘corr_coef’.
- split_environments: bool
Whether to insert breaks between different depositional environments (using ‘Depositional Environment’ column in sample_df). Defaults to True.
- proxy: str, optional
Proxy to downsample. Defaults to ‘d13c’.
- sections: list(str) or numpy.array(str), optional
List of sections to downsample. Defaults to all sections in sample_df.
- Returns:
- downsampled_data: pandas.DataFrame
pandas.DataFrame containing downsampled proxy data. All samples are still included in the DataFrame, but samples that were excluded during downsampling are marked Exclude? = True.
- solution_likelihood_ratios: dict
Dictionary with the mean likelihood ratio for chosen solutions; keys are section names.
- solution_corr_coefs: dict
Dictionary with the correlation coefficients for chosen solutions; keys are section names.
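The selection rule described for likelihood_ratio_min and keep can be sketched as follows. This covers only the final choice among already-scored candidates, not the likelihood calculation, and it ignores flexible_n and best_criteria. The function name select_solution and the candidate dictionaries (with keys n, likelihood_ratio, and corr_coef) are hypothetical and used purely for illustration.

```python
import random


def select_solution(candidates, likelihood_ratio_min=0.5, keep="best", seed=None):
    """Pick one candidate downsampling solution (illustrative sketch only)."""
    # Keep only candidates that meet the likelihood ratio threshold.
    passing = [c for c in candidates if c["likelihood_ratio"] >= likelihood_ratio_min]
    if not passing:
        return None

    # Among those, find the solutions with the fewest data points.
    n_min = min(c["n"] for c in passing)
    smallest = [c for c in passing if c["n"] == n_min]

    if keep == "best":
        # Tie-break on the highest correlation coefficient.
        return max(smallest, key=lambda c: c["corr_coef"])
    if keep == "random":
        # Tie-break at random (seeded for reproducibility).
        return random.Random(seed).choice(smallest)
    raise ValueError("keep must be 'best' or 'random'")


candidates = [
    {"n": 12, "likelihood_ratio": 0.90, "corr_coef": 0.99},
    {"n": 8, "likelihood_ratio": 0.55, "corr_coef": 0.80},
    {"n": 8, "likelihood_ratio": 0.60, "corr_coef": 0.92},
    {"n": 6, "likelihood_ratio": 0.30, "corr_coef": 0.70},
]
best = select_solution(candidates, likelihood_ratio_min=0.5, keep="best")
random_pick = select_solution(candidates, likelihood_ratio_min=0.5, keep="random", seed=0)
```

The 6-point candidate is rejected because it falls below the threshold, so the choice is made between the two 8-point candidates; the 12-point candidate is never selected even though it scores highest.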
- stratmc.data.drop_chains(full_trace, chains)
Remove a subset of chains from an arviz.InferenceData object.
- Parameters:
- full_trace: arviz.InferenceData
An arviz.InferenceData object containing the full set of prior and posterior samples from get_trace() in stratmc.inference.
- chains: list or np.array of int
Indices of chains to remove from full_trace.
- Returns:
- full_trace_clean: arviz.InferenceData
Copy of full_trace without the chains specified in chains.
- stratmc.data.get_boundaries(sample_df, ages_df, proxy, section, environment=True, depositional_ages=True, superposition=True)
Helper function for downsample(). Returns list of height boundaries where the target section must be split into different groups. By default, inserts breaks between samples from different depositional environments, around groups of samples with the same depositional age, and around groups of samples without superposition information.
- Parameters:
- sample_df: pandas.DataFrame
pandas.DataFrame containing all proxy data.
- ages_df: pandas.DataFrame
pandas.DataFrame containing age constraints from all sections.
- proxy: str
Name of proxy to be downsampled.
- section: str
Name of target section.
- environment: bool
Whether to insert breaks between different depositional environments (requires ‘Depositional Environment’ column in sample_df). Defaults to True.
- depositional_ages: bool
Whether to insert breaks around groups of samples with the same depositional age constraint. Defaults to True.
- superposition: bool
Whether to insert breaks around groups of samples without superposition information. Defaults to True.
- Returns:
- boundary_heights: numpy.array
Array containing required boundaries.
- stratmc.data.load_data(sample_file, ages_file, proxies=['d13c'], proxy_sigma_default=0.1, drop_excluded_samples=True, drop_excluded_ages=True, combine_no_superposition=False)
Import and pre-process proxy data and age constraints from .csv files formatted according to the Data table formatting guidelines. To combine data from different .csv files, load each file separately and then combine the DataFrames with combine_data().
By default, samples marked Exclude? = True will be dropped from the data table. If sample_file.csv includes multiple proxy observations from the same stratigraphic horizon (for a given proxy), then all measurements marked Exclude? = False and superposition? = True will be combined using combine_duplicates(). Samples marked superposition? = False will remain separate, and their order will be randomized within the inference model. These default behaviors can be modified by passing the drop_excluded_samples and combine_no_superposition arguments.
- Parameters:
- sample_file: str
Path to .csv file containing proxy data for all sections (without ‘.csv’ extension).
- ages_file: str
Path to .csv file containing age constraints for all sections (without ‘.csv’ extension).
- proxies: str or list(str), optional
Proxy names (must match column headers in sample_file.csv); defaults to ‘d13c’.
- proxy_sigma_default: float or dict{float}, optional
Measurement uncertainty (\(1\sigma\)) to use for proxy observations if not specified in proxy_std column of sample_df. To set a different value for each proxy, pass a dictionary with proxy names as keys. Defaults to 0.1.
- drop_excluded_samples: bool, optional
Whether to remove samples with Exclude? = True from the sample_df; defaults to True. If excluded samples are not dropped, their ages will be passively tracked within the inference model (but they will not be considered during the proxy signal reconstruction).
- drop_excluded_ages: bool, optional
Whether to remove ages with Exclude? = True from the ages_df; defaults to True.
- combine_no_superposition: bool, optional
Whether to combine samples without superposition information by averaging their proxy values; defaults to False.
- Returns:
- sample_df: pandas.DataFrame
pandas.DataFrame containing proxy data for all sections.
- ages_df: pandas.DataFrame
pandas.DataFrame containing age constraints for all sections.
- stratmc.data.load_object(path)
Custom load command for pickle (.pkl) object (variables can be saved as .pkl files with save_object()).
- Parameters:
- path: str
Path to saved .pkl file (without the ‘.pkl’ extension).
- Returns:
- var:
Variable saved in path.
- stratmc.data.load_trace(path)
Custom load command for NetCDF file containing a trace (arviz.InferenceData object saved with save_trace()).
- Parameters:
- path: str
Path to saved NetCDF file (without the ‘.nc’ extension).
- Returns:
- trace: arviz.InferenceData
Trace saved as NetCDF file.
- stratmc.data.remove_extra_bounds(heights, boundaries)
Helper function for downsample(); removes duplicate or extraneous boundaries from list of candidate cluster boundaries.
- Parameters:
- proxy: numpy.array
array containing proxy values for samples in group
- height: pandas.DataFrame
array containing heights for samples in group
- bounds: np.array
array containing heights of boundaries between groups
- Returns:
- centroid: np.array
Array containing centroid coordinates: [proxy_center, height_center]
- stratmc.data.save_object(var, path)
Save variable as a pickle (.pkl) object.
- Parameters:
- var:
Variable to be saved.
- path: str
Location (including the file name, without ‘.pkl’ extension) to save var.
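The convention of passing a path without the file extension can be illustrated with the standard pickle module. The helpers below, save_pickle and load_pickle, are hypothetical stand-ins written for this example; they are not the stratmc functions, and appending ‘.pkl’ inside the helper is an assumption based on the parameter descriptions above.

```python
import os
import pickle
import tempfile


def save_pickle(var, path):
    """Write var to '<path>.pkl' (hypothetical stand-in for a save helper)."""
    with open(path + ".pkl", "wb") as handle:
        pickle.dump(var, handle)


def load_pickle(path):
    """Read the object stored at '<path>.pkl' (hypothetical stand-in)."""
    with open(path + ".pkl", "rb") as handle:
        return pickle.load(handle)


# Round trip in a temporary directory; note that the extension is never typed.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "example_settings")
    original = {"proxies": ["d13c"], "sections": ["A", "B"]}
    save_pickle(original, target)
    file_exists = os.path.exists(target + ".pkl")
    restored = load_pickle(target)
```

Because the same extension-free path is used for saving and loading, the two calls stay symmetric and the extension only needs to be defined in one place.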
- stratmc.data.save_trace(trace, path)
Save trace (arviz.InferenceData object) as a NetCDF file.
- Parameters:
- trace: arviz.InferenceData
An arviz.InferenceData object containing the full set of prior and posterior samples from build_model() in stratmc.model (the output of get_trace() in stratmc.inference).
- path: str
Location (including the file name, without ‘.nc’ extension) to save trace.
- stratmc.data.thin_trace(full_trace, drop_freq=2)
Remove a subset of draws from an arviz.InferenceData object. Only applies to groups associated with the posterior (the prior draws will not be affected).
- Parameters:
- full_trace: arviz.InferenceData
An arviz.InferenceData object containing the full set of prior and posterior samples from get_trace() in stratmc.inference.
- drop_freq: int
Frequency of draw removal. For example, 2 will remove every other draw, while 4 will remove every fourth draw.
- Returns:
- thinned_trace: arviz.InferenceData
Thinned version of full_trace.
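The effect of drop_freq can be illustrated on a plain list of draws. This is a sketch, not the library's code: the function name thin_draws is hypothetical, and which draw in each group of drop_freq is removed (here, the last one, counting from 1) is an assumption, since the description above only specifies the removal frequency.

```python
def thin_draws(draws, drop_freq=2):
    """Remove every drop_freq-th draw from a sequence (illustrative only)."""
    if drop_freq < 2:
        raise ValueError("drop_freq must be at least 2")
    # Assumption: positions are counted from 1, and positions that are
    # multiples of drop_freq are the ones removed.
    return [d for position, d in enumerate(draws, start=1) if position % drop_freq != 0]


draws = list(range(8))
every_other_removed = thin_draws(draws, drop_freq=2)
every_fourth_removed = thin_draws(draws, drop_freq=4)
```

Note that a larger drop_freq removes fewer draws: drop_freq=2 keeps half of the draws, while drop_freq=4 keeps three quarters.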