Empirical distributions and PMFs (probability mass functions)

The apop_pmf model wraps an apop_data set so it can be read as an empirical model, with a likelihood function (equal to the associated weight for observed values and zero for unobserved values), a random number generator (which simply makes weighted random draws from the data), and so on. Setting it up is a model estimation from data like any other, done via apop_estimate(`your_data`, apop_pmf).
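To make the likelihood and draw behavior concrete, here is a minimal standalone sketch in plain C. It is not Apophenia's implementation: `pmf_prob` and `pmf_draw` are hypothetical names, the data is restricted to a single column, and dividing by the total weight (so that unnormalized weights still work) is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Apophenia: the probability the PMF
   assigns to x. Observed values get their share of the total weight;
   unobserved values get zero. */
double pmf_prob(const double *values, const double *weights, size_t n, double x) {
    assert(n > 0);
    double total = 0, hit = 0;
    for (size_t i = 0; i < n; i++) {
        total += weights[i];
        if (values[i] == x) hit += weights[i];  /* duplicate rows add up */
    }
    return hit / total;
}

/* Hypothetical helper: one weighted draw from the data. The caller supplies
   u, a uniform variate in [0, 1), so the sketch is independent of any RNG. */
double pmf_draw(const double *values, const double *weights, size_t n, double u) {
    assert(n > 0 && u >= 0 && u < 1);
    double total = 0;
    for (size_t i = 0; i < n; i++) total += weights[i];
    double target = u * total, running = 0;
    for (size_t i = 0; i < n; i++) {
        running += weights[i];
        if (target < running) return values[i];
    }
    return values[n - 1];  /* guard against floating-point rounding */
}
```

With values {1, 2, 3} and weights {1, 1, 2}, the value 3 carries half of the mass, and any u in [0.5, 1) draws it.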

You have the option of cleaning up the data before turning it into a PMF. For example...

apop_data_pmf_compress(your_data); //remove duplicates

apop_data_sort(your_data);

apop_vector_normalize(your_data->weights); //weights sum to one

These are largely optional.

- The CDF is calculated based on the percent of the weights between the zeroth row of the PMF and the row specified. This generally makes more sense after apop_data_sort.
- Compression produces a corresponding improvement in efficiency when first calculating CDFs, but is otherwise not necessary.
- Sorting or normalizing is not necessary for making draws or getting a likelihood or log likelihood.
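The CDF rule in the first item can be sketched in a few lines of plain C. This is an illustration only, not the library's code: `pmf_cdf_at_row` is a hypothetical name, and treating the specified row as included in the sum is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Apophenia: the share of total weight
   held by rows 0 through `row`, inclusive (the inclusive bound is an
   assumption of this sketch). */
double pmf_cdf_at_row(const double *weights, size_t n, size_t row) {
    assert(n > 0 && row < n);
    double upto = 0, total = 0;
    for (size_t i = 0; i < n; i++) {
        total += weights[i];
        if (i <= row) upto += weights[i];
    }
    return upto / total;
}
```

Because the sum runs in row order, the result is only a CDF in the usual sense if the rows are sorted, which is why sorting first generally makes more sense.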

It is the `weights` vector that holds the density represented by each row; the rest of the row represents the coordinates of that density. If the input data set has no `weights` segment, then I assume that all rows have equal weight.

For a PMF model, the `parameters` are `NULL`, and the `data` itself is used for calculation. Therefore, modifying the data post-estimation can break some internal settings set during estimation. If you modify the data, throw away any existing PMFs (via apop_model_free) and re-estimate a new one.

Using apop_data_pmf_compress puts the data into one bin for each unique value in the data set. You may instead want bins of fixed width, in the style of a histogram, which you can get via apop_data_to_bins. It requires a bin specification. If you send a `NULL` binspec, then the offset is zero and the bin size is big enough to ensure that there are bins from minimum to maximum. The binspec will be added as a page to the data set, named `"<binspec>"`. See the apop_data_to_bins documentation on how to write a custom bin spec.
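As an illustration of what fixed-width binning means, here is a standalone plain-C sketch. It does not reproduce how apop_data_to_bins chooses or stores its bins: `bin_counts` is a hypothetical name, and the half-open bin convention and the dropping of out-of-range observations are assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Apophenia: count observations per
   fixed-width bin. Bin k covers [offset + k*width, offset + (k+1)*width).
   Observations outside the bin_ct bins are ignored. */
void bin_counts(const double *x, size_t n, double offset, double width,
                double *counts, size_t bin_ct) {
    assert(width > 0 && bin_ct > 0);
    for (size_t k = 0; k < bin_ct; k++) counts[k] = 0;
    for (size_t i = 0; i < n; i++) {
        double pos = (x[i] - offset) / width;
        if (pos < 0 || pos >= (double)bin_ct) continue;
        counts[(size_t)pos] += 1;  /* truncation equals floor for pos >= 0 */
    }
}
```

The resulting counts play the role of the `weights` vector, with one row per bin instead of one row per unique value.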

There are a few ways of testing the claim that one distribution equals another, typically an empirical PMF versus a smooth theoretical distribution. Whichever test you use, you will need two distributions based on the same binspec.

For example, if you do not have a prior binspec in mind, then you can use the one generated by the first call to the histogram binning function to make sure that the second data set is in sync:

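apop_data_to_bins(first_set, NULL); //binspec is generated and stored as the "<binspec>" page

apop_data_to_bins(second_set, apop_data_get_page(first_set, "<binspec>")); //reuse the same bins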

You can use apop_test_kolmogorov or apop_histograms_test_goodness_of_fit to generate the appropriate statistics from the pairs of bins.
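For intuition about what a bin-by-bin comparison measures, the sketch below computes a Kolmogorov-Smirnov-style statistic from two weight vectors that share the same bins. It is plain C and only an illustration: `ks_stat_binned` is a hypothetical name, the sketch returns the statistic alone with no p-value, and apop_test_kolmogorov may proceed differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Apophenia: the largest absolute gap
   between the two cumulative distributions, evaluated at each bin.
   p and q hold the weights of the same n bins; neither needs to be
   normalized beforehand. */
double ks_stat_binned(const double *p, const double *q, size_t n) {
    assert(n > 0);
    double psum = 0, qsum = 0;
    for (size_t i = 0; i < n; i++) {
        psum += p[i];
        qsum += q[i];
    }
    assert(psum > 0 && qsum > 0);
    double cp = 0, cq = 0, max_gap = 0;
    for (size_t i = 0; i < n; i++) {
        cp += p[i] / psum;  /* running CDF of the first distribution */
        cq += q[i] / qsum;  /* running CDF of the second distribution */
        double gap = cp > cq ? cp - cq : cq - cp;
        if (gap > max_gap) max_gap = gap;
    }
    return max_gap;
}
```

The comparison is only meaningful when bin i of `p` and bin i of `q` cover the same range, which is the reason for sharing one binspec.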

Kernel density estimation will produce a smoothed PDF. See apop_kernel_density for details. Or, use apop_vector_moving_average for a simpler smoothing method.
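The moving-average idea can be sketched in plain C as follows. This is not the library routine: `moving_average` is a hypothetical name, and the odd-length window and the choice to emit only positions where the full window fits are assumptions; apop_vector_moving_average may handle edges differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Apophenia: simple moving average with
   an odd window. Writes n - window + 1 smoothed values to out and returns
   that count; positions where the window would run off either end are
   omitted. */
size_t moving_average(const double *in, size_t n, size_t window, double *out) {
    assert(window % 2 == 1 && window <= n);
    size_t out_n = n - window + 1;
    for (size_t i = 0; i < out_n; i++) {
        double sum = 0;
        for (size_t j = 0; j < window; j++) sum += in[i + j];
        out[i] = sum / (double)window;
    }
    return out_n;
}
```

If the smoothed values are used as weights, they generally need to be renormalized, since dropping the edges changes the total.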

- apop_data_pmf_compress() : merge together redundant rows in a data set before calling apop_estimate(`your_data`, apop_pmf); optional.
- apop_vector_moving_average() : smooth a vector (e.g., `your_pmf->data->weights`) via moving average.
- apop_histograms_test_goodness_of_fit() : goodness-of-fit via chi-squared statistic
- apop_test_kolmogorov() : goodness-of-fit via Kolmogorov-Smirnov statistic
- apop_kl_divergence() : measure the information loss from one (typically empirical) distribution to another distribution.