Apophenia: Models

Enumerations

enum  apop_bernoulli {
  Name, Input format, Post-estimate data, Post-estimate parameters,
  Post-estimate info, RNG
}
 
enum  apop_beta {
  Name, Input format, Parameter format, Post-estimate data,
  Post-estimate info, RNG
}
 
enum  apop_binomial { Input format, Parameter format, Post-estimate data, RNG }
 
enum  apop_coordinate_transform { Name, Input format, Post-estimate data, Settings }
 
enum  apop_cross {
  Name, Input format, Parameter format, Post-estimate data,
  Settings
}
 
enum  apop_dconstrain {
  Name, Input format, Parameter format, Post-estimate data,
  RNG, Settings, Examples
}
 
enum  apop_dirichlet {
  Name, Input format, Parameter format, Post-estimate data,
  RNG
}
 
enum  apop_exponential {
  Name, Input format, Parameter format, Post-estimate data,
  RNG, CDF
}
 
enum  apop_gamma {
  Name, Input format, Parameter format, Post-estimate data,
  RNG
}
 
enum  apop_improper_uniform {
  Name, Input format, Parameter format, Post-estimate data,
  Post-estimate parameters, RNG, CDF
}
 
enum  apop_iv {
  Name, Input format, Prep routine, Parameter format,
  Post-estimate data, Examples
}
 
enum  apop_kernel_density {
  Name, Input format, Parameter format, Post-estimate data,
  RNG, CDF, Settings, Examples
}
 
enum  apop_loess {
  Name, Input format, Parameter format, Post-estimate data,
  Post-estimate parameters, Predict
}
 
enum  apop_logit {
  Name, Input format, Prep routine, Parameter format,
  Post-estimate data, RNG
}
 
enum  apop_lognormal {
  Name, Input format, Parameter format, Post-estimate data,
  Post-estimate info, RNG
}
 
enum  apop_mixture {
  Name, Input format, Parameter format, Post-estimate data,
  RNG, Settings, Examples
}
 
enum  apop_multinomial {
  Name, Input format, Parameter format, Post-estimate data,
  Post-estimate info, RNG
}
 
enum  apop_multivariate_normal {
  Name, Input format, Parameter format, Post-estimate data,
  Post-estimate info, RNG
}
 
enum  apop_normal {
  Name, Input format, Parameter format, Post-estimate data,
  Post-estimate info, Predict, RNG, Settings
}
 
enum  apop_ols {
  Name, Input format, Prep routine, Parameter format,
  Post-estimate data, Post-estimate parameter model, Post-estimate info, RNG,
  Examples
}
 
enum  apop_pmf {
  Name, Input format, Parameter format, Post-estimate data,
  Post-estimate parameters, RNG, CDF, Settings
}
 
enum  apop_poisson {
  Name, Input format, Parameter format, Post-estimate data,
  Post-estimate parameters, Post-estimate info, RNG
}
 
enum  apop_probit {
  Name, Input format, Prep routine, Parameter format,
  Post-estimate data, RNG
}
 
enum  apop_t_distribution { Name, Input format, Parameter format, Post-estimate data }
 
enum  apop_uniform {
  Name, Input format, Parameter format, Post-estimate data,
  Post-estimate info
}
 
enum  apop_yule {
  Name, Input format, Parameter format, Post-estimate data,
  RNG, Settings
}
 
enum  apop_zipf {
  Name, Input format, Parameter format, Post-estimate data,
  RNG, Settings
}
 

Detailed Description

This section is a detailed description of the stock models that ship with Apophenia. It is a reference. For an explanation of what to do with an apop_model, see Models.

The primary questions one has about a model in practice are what format the input data should take and what to expect of an estimated output.

Generally, the input data consists of an apop_data set where each row is a single observation. Details beyond that are listed below.

The outputs from running apop_estimate to produce a fitted model are generally found in one of three places: the vector of the output parameter set, its matrix, or a new settings group. The basic intuition is that if the parameters are always a short list of scalars, they are in the vector; if there exists a situation where they could take matrix form, the parameters will be in the matrix; and if they require more structure than that, they will be in a settings group.

If the basic structure of the apop_data set is unfamiliar to you, see Data sets, which discusses the basic means of getting data out of a struct. For example, the estimated apop_normal distribution has the mean in position zero of its parameter vector and the standard deviation in position one, so they could be extracted as follows:

apop_data *d = apop_text_to_data("sample data from before");
apop_model *out = apop_estimate(d, apop_normal);
double mu = apop_data_get(out->parameters, 0);
double sigma = apop_data_get(out->parameters, 1);
//What is the p-value of a test whose null hypothesis is that μ=3.3?
printf("pval=%g\n", apop_test(3.3, "normal", mu, sigma));

See Models for discussion of how to pull settings groups using Apop_settings_get (for one item) or apop_settings_get_group (for a full settings group).
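For example, here is a minimal sketch of both calls; the model name est and the apop_mle group are illustrative, and any attached settings group works the same way:

double tol = Apop_settings_get(est, apop_mle, tolerance);          //one item
apop_mle_settings *mle = apop_settings_get_group(est, apop_mle);   //the full group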

Model Documentation

apop_model apop_bernoulli

The Bernoulli model: A single random draw with probability $p$.

Name 

Bernoulli distribution

Input format 

The Bernoulli parameter $p$ is the percentage of non-zero values in the matrix. Its variance is $p(1-p)$.

Post-estimate data 

Unchanged.

Post-estimate parameters 

$p$ is the only element in the vector (e.g., get its value via double p = apop_data_get(outmodel->parameters);). A <Covariance> page has the variance of $p$ in the (0,0)th element of the matrix.
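For instance, a minimal sketch, assuming a column of zeros and ones already read into a data set d:

apop_model *est = apop_estimate(d, apop_bernoulli);
double p     = apop_data_get(est->parameters);                                        //the lone parameter
double var_p = apop_data_get(est->parameters, .page="<Covariance>", .row=0, .col=0);  //its variance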

Post-estimate info 

Reports log likelihood.

RNG 

Returns zero or one.

apop_model apop_beta

The beta distribution has two parameters and is restricted to data between zero and one. You may also find apop_beta_from_mean_var to be useful.
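For instance, a minimal sketch of that helper; the mean and variance values are purely illustrative:

apop_model *b = apop_beta_from_mean_var(.25, .005); //α and β are derived from the given mean and variance
apop_model_print(b, stdout);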

Name 

Beta distribution

Input format 

Any arrangement of scalar values.

Parameter format 

A vector, v[0]= $\alpha$; v[1]= $\beta$

Post-estimate data 

Unchanged.

Post-estimate info 

Reports log likelihood.

RNG 

Produces a scalar $\in[0,1]$.

apop_model apop_binomial

The multi-draw generalization of the Bernoulli, or the two-bin special case of the Multinomial distribution.

It is implemented as an alias of the apop_multinomial model, except that it has an explicit CDF, we know it has two parameters, and its draw method returns a scalar. I.e., .vsize==2 and .dsize==1.

Input format 

Each row of the matrix is one observation, consisting of two elements. The number of draws of type zero (sometimes read as `misses' or `failures') is in column zero; the number of draws of type one (`hits', `successes') is in column one.

Parameter format 

A vector, v[0]= $n$; v[1]= $p_1$. Thus, $p_0$ isn't written down; see apop_multinomial for further discussion. If you input $v[1]>1$ and apop_opts.verbose >=1, the log likelihood function will throw a warning. Post-estimate, the parameters will have a <Covariance> page with the covariance matrix for the $p$s ($n$ effectively has no variance).

Post-estimate data 

Unchanged.

RNG 

The RNG returns a single number representing the success count, not a vector of length two giving both the failure bin and success bin. This is notable because it differs from the input data format, but it tends to be what people expect from a Binomial RNG. For draws with both dimensions (or situations where draws are fed back into the model), use an apop_multinomial model with .vsize =2.

apop_model apop_coordinate_transform

Apply a coordinate transformation of the data to produce a distribution over the transformed data space. This is sometimes called a Jacobian transformation.

Here is an example that replicates the Lognormal distribution. Note the use of apop_model_coordinate_transform to set up a model with the given settings.

/* A Lognormal distribution is a transform of the Normal distribution, where
the data space of the Normal is exponentiated. Thus, to get back to the original data space, take the log of the current data.
*/
#include <apop.h>
#define Diff(a, b) assert(fabs((a)-(b)) < 1e-2);
//Use this function to produce test data below.
apop_data *draw_exponentiated_normal(double mu, double sigma, double draws){
apop_model *n01 = apop_model_set_parameters(apop_normal, mu, sigma);
apop_data *d = apop_data_alloc(draws);
gsl_rng *r = apop_rng_alloc(13);
for (int i=0; i< draws; i++) apop_draw(gsl_vector_ptr(d->vector,i), r, n01);
apop_vector_exp(d->vector);
return d;
}
// The transformed-to-base function and its derivative for the Jacobian:
apop_data *rev(apop_data *in){ return apop_map(in, .fn_d=log, .part='a'); }
/*The derivative of the transformed-to-base function. */
double inv(double in){return 1./in;}
double rev_j(apop_data *in){ return fabs(apop_map_sum(in, .fn_d=inv, .part='a')); }
int main(){
apop_model *ct = apop_model_coordinate_transform(
.transformed_to_base= rev, .jacobian_to_base=rev_j,
.base_model=apop_normal);
//Apop_model_add_group(ct, apop_parts_wanted);//Speed up the MLE.
//make fake data
double mu=2, sigma=1;
apop_data *d = draw_exponentiated_normal(mu, sigma, 2e5);
//If we correctly replicated a Lognormal, mu and sigma will be right:
apop_model *est = apop_estimate(d, ct);
Diff(apop_data_get(est->parameters, 0), mu);
Diff(apop_data_get(est->parameters, 1), sigma);
/*The K-L divergence between our Lognormal and the stock Lognormal
should be small. Try it with both the original params and the estimated ones. */
apop_model *ln = apop_model_set_parameters(apop_lognormal, mu, sigma);
apop_model *ln2 = apop_model_copy(apop_lognormal);
ln2->parameters = est->parameters;
Diff(apop_kl_divergence(ln, ln2,.draw_ct=1000), 0);
Diff(apop_kl_divergence(ln, est,.draw_ct=1000), 0);
}
Name 

Fill me

Input format 

The input data is sent to the first model, so use the input format for that model.

Post-estimate data 

Unchanged.

Settings 

apop_coordinate_transform_settings

apop_model apop_cross

A cross product of models. Generate via apop_model_cross .

For the case when you need to bundle two uncorrelated models into one larger model. For example, the prior for a multivariate normal (whose parameters are a vector of means and a covariance matrix) is a Multivariate Normal-Wishart pair.

Name 

Cross product of models

Input format 

There are two means of handling the input format. If the settings group attached to the model has a non-NULL splitpage element, then append the second data set as an additional page to the first data set, and name the second set with the name you listed in splitpage; see the sketch below.

If splitpage is NULL, then I will send the same data set to both models.
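Here is a minimal sketch of the splitpage mechanism, under the assumption that apop_model_cross accepts component models the same way apop_model_mixture does; the data sets d1 and d2, the page name "d2", and the Normal/Poisson pairing are all illustrative:

//Send d1's main page to the first model and the page named "d2" to the second.
apop_model *cr = apop_model_cross(apop_model_copy(apop_normal), apop_model_copy(apop_poisson));
Apop_settings_add(cr, apop_cross, splitpage, "d2");
apop_data_add_page(d1, d2, "d2");     //append d2 as an extra page of d1
apop_model *est = apop_estimate(d1, cr);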

Parameter format 

Currently NULL; check the sub-models for their parameters.

Post-estimate data 

Unchanged.

Settings 

apop_cross_settings

apop_model apop_dconstrain

A model that constrains the base model to within some data constraint. E.g., truncate $P(d)$ to zero for all $d$ outside of a given constraint. Generate using apop_model_dconstrain .

The log likelihood works by using the base_model log likelihood, and then scaling it based on the part of the base model's density that is within the constraint. If you have an easy means of specifying what that density is, please do, as in the example. If you do not, the log likelihood will calculate it by making draw_ct random draws from the base model and checking whether they are in or out of the constraint. Because this default method is stochastic, there is some loss of precision.

The previous scaling is stored in the apop_dconstrain settings group. Get/set via:

double scale = Apop_settings_get(your_model, apop_dconstrain, scale);
Apop_settings_set(your_model, apop_dconstrain, scale, 0);

If scale is zero, because that is the default or because you set it as above, then I recalculate the scale. If the value of the parameters changed since scale was last calculated, I recalculate. If you made other relevant changes to the scale, then you may need to manually zero out scale so it can be recalculated.

Here is an example (in the Examples section below) that makes a few draws and estimations from data-constrained models. Note the use of apop_model_set_settings to prepare the constrained models.

Name 

Data-constrained model

Input format 

That of the base model.

Parameter format 

That of the base model. In fact, the parameters element is a pointer to the base model parameters, so both are modified simultaneously.

Post-estimate data 

Unchanged.

RNG 

Draw from the base model; if the draw is outside the constraint, throw it out and try again.

Settings 

apop_dconstrain_settings

Examples 
#include <apop.h>
//The constraint function.
double over_zero(apop_data *in, apop_model *m){
return apop_data_get(in) > 0;
}
//The optional scaling function.
double in_bounds(apop_model *m){
double z = 0;
gsl_vector_view vv = gsl_vector_view_array(&z, 1);
return 1- apop_cdf(&((apop_data){.vector=&vv.vector}), m);
}
int main(){
/*Set up a Normal distribution, with data truncated to be nonnegative.
This version doesn't use the in_bounds function above, and so the
default scaling function is used.*/
gsl_rng *r = apop_rng_alloc(213);
apop_model *norm = apop_model_set_parameters(apop_normal, 1.2, 0.8);
apop_model *trunc = apop_model_set_settings(apop_dconstrain,
.base_model=apop_model_copy(norm),
.constraint=over_zero, .draw_ct=5e4, .rng=r);
//make draws. Currently, you need to prep the model first.
apop_prep(NULL, trunc);
apop_data *d = apop_model_draws(trunc, 1e5);
//Estimate the parameters given the just-produced data:
apop_model *est = apop_estimate(d, trunc);
assert(apop_vector_distance(est->parameters->vector, norm->parameters->vector)<1e-1);
//Generate a data set that is truncated at zero using alternate means
apop_data *normald = apop_model_draws(apop_model_set_parameters(apop_normal, 0, 1), 5e4);
for (int i=0; i< normald->matrix->size1; i++){
double *d = apop_data_ptr(normald, i);
if (*d < 0) *d *= -1;
}
//this time, use an unparameterized model, and the in_bounds fn
apop_model *re_trunc = apop_model_set_settings(apop_dconstrain,
.base_model=apop_normal,
.constraint=over_zero, .scaling=in_bounds);
apop_model *re_est = apop_estimate(normald, re_trunc);
assert(apop_vector_distance(re_est->parameters->vector,
apop_vector_fill(gsl_vector_alloc(2), 0, 1))<1e-1);
}
apop_model apop_dirichlet

A multivariate generalization of the Beta distribution.

Name 

Dirichlet distribution

Input format 

Each row of your data matrix is a single observation.

Parameter format 

The estimated parameters are in the output model's parameters->vector. The size of the model is determined by the width of your input data set, so later RNG draws, &c will match in size.

Post-estimate data 

Unchanged.

RNG 

A call to gsl_ran_dirichlet. Output format is identical to the input data format.

apop_model apop_exponential

The Exponential distribution.

$Z(\mu,k) = \sum_k {1\over \mu} e^{-k/\mu} $
$\ln Z(\mu,k) = \sum_k -\ln(\mu) - k/\mu $
$d\ln Z(\mu,k)/d\mu = \sum_k -1/\mu + k/\mu^2 $

Some write the function as: $Z(C,k) = \ln C \cdot C^{-k}$. If you prefer this form, just convert your parameter via $\mu = {1\over \ln C}$ (and convert back from the parameters this function gives you via $C=\exp(1/\mu)$).
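As a one-line sketch of the conversion, where est is a hypothetical estimated apop_exponential model:

double mu = apop_data_get(est->parameters, 0);  //the parameter as reported by this model
double C  = exp(1/mu);                          //the same model expressed in the alternate C form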

Name 

Exponential distribution

Input format 

One scalar observation per row (in the matrix or vector). See also apop_data_rank_compress for means of dealing with one more input data format.

Parameter format 

$\mu$ is in the zeroth element of the vector.

Post-estimate data 

Unchanged.

RNG 

Just a wrapper for gsl_ran_exponential.

CDF 

Returns a scalar: the value of the CDF at the given point.

apop_model apop_gamma

$G(x, a, b) = {1\over \Gamma(a) b^a} x^{a-1} e^{-x/b}$

$\ln G(x, a, b)= -\ln \Gamma(a) - a \ln b + (a-1)\ln(x) - x/b$

$d \ln G/ da = -\psi(a) - \ln b + \ln(x) $ (where $\psi$ is the digamma function, $d\ln\Gamma(a)/da = \psi(a)$)

$d \ln G/ db = -a/b + x/b^2 $

Name 

Gamma distribution

Input format 

A scalar, in the vector or matrix elements of the input apop_data set.

See also apop_data_rank_compress for means of dealing with one more input data format.

Parameter format 

The first two elements of the vector are $a$ and $b$.

Post-estimate data 

Unchanged.

RNG 

A wrapper for gsl_ran_gamma, which returns a scalar.

See the notes for apop_exponential on a popular alternate form.

apop_model apop_improper_uniform

The improper uniform returns $P(x) = 1$ for every value of x, all the time (and thus, log likelihood(x)=0). It has zero parameters.

Name 

Improper uniform distribution

Input format 

Ignored.

Parameter format 

NULL

Post-estimate data 

Unchanged.

Post-estimate parameters 

NULL

RNG 

The draw function makes no sense, and therefore sets the value in *out to NAN, returns 1, and prints a warning if apop_opts.verbose >=1.

CDF 

Half of the distribution is less than every given point, so the CDF always returns 0.5. One could perhaps make an argument that this should really be infinity, but a half is more in the spirit of the distribution's use to represent a lack of information.

apop_model apop_iv

Instrumental variable regression

Operates much like the apop_ols model, but the input parameters also need to have a table of substitutions (like the addition of the .instruments setting in the example below).

Which columns substitute where can be specified in your choice of two ways. The first is to use the vector element of the instrument apop_data set to list the column numbers to be substituted (the dependent variable is zero; the first independent column is one), and to provide one matrix column for each item to substitute.

The second method, if the vector of the instrument apop_data set is NULL, is to use the column names to find the matching columns in the base data to substitute. This is generally more robust and/or convenient.

  • If the instruments data set is NULL or empty, I'll just run OLS.
  • The apop_lm_settings group has a destroy_data setting. If you set that to 'y', I will overwrite the column in place, saving the trouble of copying the entire data set.
Name 

instrumental variables

Input format 

See the discussion on the apop_ols page regarding its prep routine. See above regarding the .instruments element of the attached apop_lm_settings group.

Prep routine 

See the discussion on the apop_ols page regarding its prep routine.

Parameter format 

As per apop_ols

Post-estimate data 

Unchanged.

Examples 
/* Instrumental variables are often used to deal with variables measured with noise, so
this example produces a data set with a column of noisy data, and a separate instrument
measured with greater precision, then sets up and runs an instrumental variable regression.
To guarantee that the base data set has noise and the instrument is cleaner, the
procedure first generates the clean data set, then copies the first column to the
instrument set, then the add_noise function inserts Gaussian noise into the base
data set. Once the base set and the instrument set have been generated, the setup for
the IV consists of adding the relevant names and using Apop_model_add_group to add a
lm (linear model) settings group with an .instrument=instrument_data element.
In fact, the example sets up a sequence of IV regressions, with more noise each
time.
*/
#include <apop.h>
#define Diff(L, R, eps) Apop_stopif(fabs((L)-(R))>=(eps), return, 0, "%g is too different \
from %g (arbitrary limit=%g).", (double)(L), (double)(R), eps);
int datalen =1e4;
//generate a vector that is the original vector + noise
void add_noise(gsl_vector *in, gsl_rng *r, double size){
apop_model *nnoise = apop_model_set_parameters(apop_normal, 0, size);
apop_data *nd = apop_model_draws(nnoise, in->size);
gsl_vector_add(in, Apop_cv(nd, 0));
/*for (int i=0; i< in->size; i++){
double noise;
apop_draw(&noise, r, nnoise);
*gsl_vector_ptr(in, i) += noise;
}*/
apop_model_free(nnoise);
}
void test_for_unbiased_parameter_estimates(apop_model *m, double tolerance){
Diff(apop_data_get(m->parameters, 0, -1), -1.4, tolerance);
Diff(apop_data_get(m->parameters, 1, -1), 2.3, tolerance);
}
int main(){
gsl_rng *r = apop_rng_alloc(234);
apop_data *data = apop_data_alloc(datalen, 2);
for(int i=0; i< datalen; i++){
apop_data_set(data, i, 1, 100*(gsl_rng_uniform(r)-0.5));
apop_data_set(data, i, 0, -1.4 + apop_data_get(data,i,1)*2.3);
}
apop_name_add(data->names, "dependent", 'c');
apop_name_add(data->names, "independent", 'c');
apop_model *oest = apop_estimate(data, apop_ols); //a plain OLS estimate of the original, noise-free data
apop_model_show(oest);
//the data with no noise will be the instrument.
gsl_vector *col1 = Apop_cv(data, 1);
apop_data *instrument_data = apop_data_alloc(data->matrix->size1, 1);
gsl_vector_memcpy(Apop_cv(instrument_data, 0), col1);
apop_name_add(instrument_data->names, "independent", 'c');
Apop_model_add_group(apop_iv, apop_lm, .instruments = instrument_data);
//Now add noise to the base data four times, and estimate four IVs.
int tries = 4;
apop_model *ests[tries];
for (int nscale=0; nscale<tries; nscale++){
add_noise(col1, r, nscale==0 ? 0 : pow(10, nscale-tries));
ests[nscale] = apop_estimate(data, apop_iv);
if (nscale==tries-1){ //print the one with the largest error.
printf("\nnow IV:\n");
apop_model_show(ests[nscale]);
}
}
/* Now test. The parameter estimates are unbiased.
As we add more noise, the covariances expand.
Test that the ratio of one covariance matrix to the next
is less than one, though these are typically very much
smaller than one (as the noise is an order of magnitude
larger in each case), and the ratios will be identical
for each j, k below. */
test_for_unbiased_parameter_estimates(ests[0], 1e-6);
for (int i=1; i<tries; i++){
test_for_unbiased_parameter_estimates(ests[i], 1e-3);
gsl_matrix *cov = apop_data_get_page(ests[i-1]->parameters, "<Covariance>")->matrix;
gsl_matrix *cov2 = apop_data_get_page(ests[i]->parameters, "<Covariance>")->matrix;
gsl_matrix_div_elements(cov, cov2);
for (int j =0; j< 2; j++)
for (int k =0; k< 2; k++)
assert(gsl_matrix_get(cov, j, k) < 1);
}
}
apop_model apop_kernel_density

The kernel density smoothing of a PMF or histogram.

At each point along the histogram, put a distribution (default: Normal(0,1)) on top of the point. Sum all of these distributions to form the output distribution.

Setting up a kernel density consists of setting up a model with the base data and the information about the kernel model around each point. This can be done using the apop_model_set_settings function to get a copy of the base apop_kernel_density model and add a apop_kernel_density_settings group with the appropriate information; see the main function of the example below.

Name 

kernel density estimate

Input format 

One observation per line. Each row in turn will be passed through to the kernel and the optional set_fn function, so follow the format of the base model.

Parameter format 

None

Post-estimate data 

Unchanged.

RNG 

Randomly selects a data point, then randomly draws from that sub-distribution. Returns 0 on success, 1 if unable to pick a sub-distribution (meaning the weights over the distributions are somehow broken), and 2 if unable to draw from the sub-distribution.

CDF 

Sums the CDF to the given point of all the sub-distributions.

Settings 

apop_kernel_density_settings, including:

  • base_data a data set which, if not NULL and base_pmf is NULL, will be converted to an apop_pmf model.
  • base_pmf This is the preferred format for input data. It is the histogram to be smoothed.
  • kernel The kernel to use for smoothing, with all parameters set and a p method. Popular favorites are apop_normal and apop_uniform.
  • set_fn A function that takes in a single observation and the kernel model, and sets the kernel's parameters accordingly. The estimation routine will call this for every point in the data set. Here is the default, which is used if this is NULL. It simply sets the first element of the model's parameter vector to the input number; this is appropriate for a Normal distribution, where we want to center the distribution on each data point in turn.
static void apop_set_first_param(apop_data *in, apop_model *m){
    apop_data_set(m->parameters, .val= apop_data_get(in));
}

See the sample code for a Uniform[0,1] recentered around the first element of the PMF matrix.

Examples 

This example sets up and uses KDEs based on Normal and Uniform distributions.

/* This program draws ten random data points, and then produces two kernel density
estimates: one based on the Normal distribution and one based on the Uniform.
It produces three outputs:
--stderr shows the random draws
--kerneldata is a file written with plot data for both KDEs
--stdout shows instructions to gnuplot, so you can pipe:
./kernel | gnuplot -persist
Most of the code is taken up by the plot() and draw_some_data() functions, which are
straightforward. Notice how plot() pulls the values of the probability distributions
at each point along the scale.
The set_uniform_edges function sets the max and min of a Uniform distribution so that the
given point is at the center of the distribution.
The first KDE uses the defaults, which are based on a Normal distribution with std dev 1;
the second explicitly sets the .kernel and .set_fn for a Uniform.
*/
#include <apop.h>
void set_uniform_edges(apop_data * r, apop_model *unif){
apop_data_set(unif->parameters, 0, -1, r->matrix->data[0]-0.5);
apop_data_set(unif->parameters, 1, -1, r->matrix->data[0]+0.5);
}
void plot(apop_model *k, apop_model *k2){
apop_data *onept = apop_data_alloc(1,1);
FILE *outtab = fopen("kerneldata", "w");
for (double i=0; i<20; i+=0.01){
apop_data_set(onept, .val=i);
fprintf(outtab, "%g %g %g\n", i, apop_p(onept, k), apop_p(onept, k2));
}
fclose(outtab);
printf("plot 'kerneldata' using 1:2\n"
"replot 'kerneldata' using 1:3\n");
}
apop_data *draw_some_data(){
apop_model *uniform_0_20 = apop_model_set_parameters(apop_uniform, 0, 20);
apop_data *d = apop_model_draws(uniform_0_20, 10);
apop_data_print(apop_data_sort(d), .output_pipe=stderr);
return d;
}
int main(){
apop_data *d = draw_some_data();
apop_model *k = apop_estimate(d, apop_kernel_density); //the default: a Normal(x, 1) kernel at each point
apop_model *k2 = apop_model_set_settings(apop_kernel_density,
.base_data=d,
.set_fn = set_uniform_edges,
.kernel = apop_uniform);
plot(k, k2);
}
apop_model apop_loess

Regression via loess smoothing

This uses a somewhat black-box routine, first written by Cleveland, Devlin, Grosse, and Shyu in 1988, to fit a smoothed series of quadratic curves to the input data, thus producing a curve more closely fitting than a simple regression would.

The curve is basically impossible to describe using a short list of parameters, so the representation is in the form of the predicted vector of the expected data set; see below.

From the 1992 manual for the package: ``The method we will use to fit local regression models is called loess, which is short for local regression, and was chosen as the name since a loess is a deposit of fine clay or silt along a river valley, and thus is a surface of sorts. The word comes from the German löss, and is pronounced lo-ess.''

Name 

Loess smoothing

Input format 

The data is basically OLS-like: the first column of the data is the dependent variable to be explained; subsequent variables are the independent explanatory variables. Thus, your input data can either have a dependent vector plus explanatory matrix, or a matrix where the first column is the dependent variable.

Unlike with OLS, I won't move your original data, and I won't add a constant column of ones, because that's not really the loess custom. You can of course set up your data that way if you like.

If your data set has a weights vector, I'll use it.

In any case, all data is copied into the model's apop_loess_settings. The code is primarily FORTRAN code from 1988 converted to C; the data thus has to be converted into a relatively obsolete internal format.

Parameter format 

Unused.

Post-estimate data 

Unchanged.

Post-estimate parameters 

None.

Predict 

Fills in the zeroth column (ignoring and overwriting any data there), and adds an additional page to the input apop_data set named "<Confidence>" with a lower and upper CI for each point.
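A minimal sketch of pulling the predictions; est and newpts are illustrative names, with newpts laid out like the input data:

apop_predict(newpts, est);                                    //fills newpts' column zero with fitted values
apop_data *ci = apop_data_get_page(newpts, "<Confidence>");   //lower and upper CI for each point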

apop_model apop_logit

Apophenia makes no distinction between the bivariate logit and the multinomial logit. This does both.

The likelihood of choosing item $j$ is: $e^{x\beta_j}/ (\sum_i{e^{x\beta_i}})$

so the log likelihood is $x\beta_j - ln(\sum_i{e^{x\beta_i}})$

Name 

Logit

Input format 

This model expects the first column of the data matrix to be zeros, ones, ..., enumerating the factors; to get there, try apop_data_to_factors. If you forget to run it, I'll run it on the first data column for you. The remaining columns are values of the independent variables. Thus, the model will return [(data columns)-1] $\times$ [(option count)-1] parameters. Column names list factors in the dependent variable; row names list the independent variables.

Prep routine 

You will probably want to convert some column of your data into factors, via apop_data_to_factors. If you do, then that adds a page of factors to your data set (and of course adjusts the data itself). If I find a factor page, I will use that info; if not, then I will run apop_data_to_factors on the first column (the vector if there is one, else the first column of the matrix.)

Also, if there is no vector, then I will move the first column of the matrix, and replace that matrix column with a constant column of ones, just like with OLS.

Parameter format 

As above.

Post-estimate data 

Unchanged.

RNG 

Much like the apop_ols RNG, qv. Returns the category drawn.

Here is an artificial example which clarifies the simplest use of the model:

#include <apop.h>
#include <unistd.h>
char *testfile = "logit_test_data";
//generate a fake data set.
//Notice how the first column is the outcome, just as with standard regression.
void write_data(){
FILE *f = fopen(testfile, "w");
fprintf(f, "\
outcome,A, B \n\
0, 0, 0 \n\
1, 1, 1 \n\
1, .7, .5 \n\
1, .7, .3 \n\
1, .3, .7 \n\
\n\
1, .5, .5 \n\
0, .4, .4 \n\
0, .3, .4 \n\
1, .1, .3 \n\
1, .3, .1 ");
fclose(f);
}
int main(){
write_data();
apop_data *d = apop_text_to_data(testfile);
Apop_model_add_group(apop_logit, apop_mle, .tolerance=1e-5);
apop_model *est = apop_estimate(d, apop_logit);
unlink(testfile);
/* Apophenia's test suite checks that this code produces
values close to canned values. As a human, you probably
just want to print the results to the screen. */
apop_model_show(est);
assert(fabs(apop_data_get(est->parameters, .rowname="1")- -1.155026) < 1e-6);
assert(fabs(apop_data_get(est->parameters, .rowname="A")- 4.039903) < 1e-6);
assert(fabs(apop_data_get(est->parameters, .rowname="B")- 1.494694) < 1e-6);
}

Here is an example using data from a U.S. Congressional vote, including one text variable that has to be converted to factors, and one to convert to dummies. A loop then calculates the customary p-values.

// See http://modelingwithdata.org/arch/00000160.htm for context and analysis.
#ifdef Datadir
#define DATADIR Datadir
#else
#define DATADIR "."
#endif
#include <apop.h>
int main(){
//read the data to db, get the desired columns,
//prep the two categorical variables
apop_text_to_db( DATADIR "/" "amash_vote_analysis.csv" , .tabname="amash");
apop_data *d = apop_query_to_mixed_data("mmmtt", "select 0, ideology,log(contribs+10) as contribs, vote, party from amash");
apop_data_to_factors(d); //0th text col -> 0th matrix col
apop_data_to_dummies(d, .col=1, .type='t', .append='y');
//Estimate a logit model, get covariances,
//calculate p values under popular Normality assumptions
Apop_model_add_group(apop_logit, apop_parts_wanted, .covariance='y');
apop_model *out = apop_estimate(d, apop_logit);
apop_model_show(out);
for (int i=0; i< out->parameters->matrix->size1; i++){
printf("%s pval:\t%g\n",out->parameters->names->row[i],
apop_test(apop_data_get(out->parameters, i), "normal", 0, sqrt(apop_data_get(out->parameters->more, i, i))));
}
}
apop_model apop_lognormal

The log likelihood function for lognormal distributions:

$f = \exp\bigl(-(\ln(x)-\mu)^2/(2\sigma^2)\bigr)/ (x\sigma\sqrt{2\pi})$

$\ln f = -(\ln(x)-\mu)^2/(2\sigma^2) - \ln(x) - \ln(\sigma\sqrt{2\pi})$

Name 

Lognormal distribution

Input format 

A scalar in the matrix or vector element of the input apop_data set.

Parameter format 

Zeroth vector element is the mean of the logged data set; first is the standard deviation of the logged data set.

Post-estimate data 

Unchanged.

Post-estimate info 

Reports log likelihood.

RNG 

An Apophenia wrapper for the GSL's Normal RNG, exponentiated.

apop_model apop_mixture

The mixture model transformation: a linear combination of multiple models.

Use apop_model_mixture to produce one of these models. In the examples below, some are generated from unparameterized input models with a form like

apop_model *mf = apop_model_mixture(apop_model_copy(apop_normal), apop_model_copy(apop_normal));
Apop_settings_set(mf, apop_mixture, find_weights, 'y');
Apop_model_add_group(mf, apop_mle, .starting_pt=(double[]){.5, .5, 50, 5, 80, 5},
    .step_size=3, .tolerance=1e-6);
apop_model_show(apop_estimate(dd, mf));

Or, one can skip the estimation and use already-parameterized models as input to apop_model_mixture, e.g.:

apop_model *r_ed = apop_model_mixture(apop_model_set_parameters(apop_normal, 54.6, 5.87),
    apop_model_set_parameters(apop_normal, 80.1, 5.87));
apop_data *wts = apop_data_falloc((2), 0.36, 0.64);
Apop_settings_add(r_ed, apop_mixture, weights, wts->vector);
printf("LL=%g\n", apop_log_likelihood(dd, r_ed));

Notice that the weights vector has to be added after the call to apop_model_mixture. If none is given, then equal weights are assigned to all components of the mixture.

One can think of the estimation in the un-parameterized case as a missing-data problem: each data point originated in one distribution or the other, and if we knew with certainty which data point came from which distribution, then the estimation problem would be trivial: just generate the subsets and call apop_estimate(dataset1, model1), ..., apop_estimate(datasetn, modeln) separately. But the assignment of which element goes where is unknown information, which we guess at using an expectation-maximization (EM) algorithm. The standard algorithm starts with an initial set of parameters for the models, and assigns each data point to its most likely model. It then re-estimates the model parameters using their subsets. The standard algorithm, see e.g. this PDF, repeats until it arrives at an optimum.

Thus, the log likelihood method for this model includes a step that allocates each data point to its most likely model, and calculates the log likelihood of each observation using its most likely model. [It would be a valuable extension to extend this to not-conditionally IID models. Commit 1ac0dd44 in the repository had some notes on this, now removed.] As a side-effect, it calculates the odds of drawing from each model (the vector λ). Following the above-linked paper, the probability for a given observation under the mixture model is its probability under the most likely model weighted by the previously calculated $\lambda$ for the given model.

Apophenia implements the EM algorithm as a constrained optimization(!). The constraint check repositions the vector of weights to that calculated at the last step, then the log likelihood calculates the likelihood as above, including the expected value of the weights vector for the next step. Thus, Apophenia casts the Expectation step as a step repositioning the maximization's constraint and its associated penalties.

Estimations of mixture distributions can be sensitive to initial conditions. You are encouraged to try a sequence of random starting points for your model parameters. Some authors recommend plotting the data and eyeballing a guess as to the model parameters.

  • The default is to take the weight assigned to each distribution as fixed. If you want to use the EM algorithm to move the weights as described above, specify this via
    1 Apop_settings_set(your_model, apop_mixture, find_weights, 'y');
  • A kernel density is a mixture of a large number of homogeneous models, where each is typically centered around a point in your data. For such situations, apop_kernel_density will be easier to use.
Name 

Mixture of models

Input format 

The same data gets sent to each of the component models of the mixture. Each row is an observation, and the estimation routine assumes that models are conditionally IID (i.e., having chosen what component of the mixture the observation comes from, its likelihood can be calculated independently of all other observations).

Parameter format 

The parameters are broken out in a readable form in the settings group, so your best bet is to use those. See the sample code for usage.
The parameters element is a single vector piling up all elements, beginning with the first $n-1$ weights, followed by an apop_data_pack of each model's parameters in sequence. Because all elements are in a single vector, one could run a maximum likelihood search for all components (including the weights) at once. The log_likelihood, estimate, and other methods unpack this vector into its component parts for you.

Post-estimate data 

Unchanged.

RNG 

Uses the weights to select a component model, then makes a draw from that component. The model's dsize (draw size) element is set when you set up the model in the model's prep method (automatically called by apop_estimate, or call it directly) iff all component models have the same dsize.

Settings 

apop_mixture_settings

Examples 

The first example uses a text file faith.data, in the tests directory of the distribution.

#ifdef Datadir
#define DATADIR Datadir
#else
#define DATADIR "."
#endif
#include <apop.h>
/* This replacement for apop_model_print(in) demonstrates retrieval of the useful
settings: the weights (λ) and list of estimated models. It is here only for
demonstration purposes---it is what apop_model_print(your_mix) will do.
*/
void show_mix(apop_model *in){
printf("The weights:\n");
printf("\nThe models:\n");
for (apop_model **m = ms->model_list; *m; m++) //model_list is a NULL-terminated list.
apop_model_print(*m, stdout);
}
int main(){
apop_text_to_db( DATADIR "/" "faith.data", "ff");
apop_data *dd = apop_query_to_data("select waiting from ff");
apop_model *mf = apop_model_mixture(apop_model_copy(apop_normal), apop_model_copy(apop_normal));
Apop_settings_set(mf, apop_mixture, find_weights, 'y');//Use the EM algorithm to search for optimal weights.
/* The process is famously sensitive to starting points. Try many random points, or
eyeball the distribution's plot and guess at the starting values. */
Apop_model_add_group(mf, apop_mle, .starting_pt=(double[]){.5, .5, 50, 5, 80, 5},
.step_size=3, .tolerance=1e-6);
apop_model *mfe = apop_estimate(dd, mf);
apop_model_print(mfe, stdout);
printf("LL=%g\n", apop_log_likelihood(dd, mfe));
printf("\n\nValues calculated in the source paper, for comparison.\n");
apop_model *r_ed = apop_model_mixture(
apop_model_set_parameters(apop_normal, 54.61364, 5.869089),
apop_model_set_parameters(apop_normal, 80.09031, 5.869089));
apop_data *wts = apop_data_falloc((2), 0.3608498, 0.6391502);
Apop_settings_add(r_ed, apop_mixture, weights, wts->vector);
show_mix(r_ed);
printf("LL=%g\n", apop_log_likelihood(dd, r_ed));
}

This example begins with a fixed mixture distribution, and makes assertions about the characteristics of draws from it.

#include <apop.h>
/*
Use apop_model_mixture to generate a hump-filled distribution, then find
the most likely data points and check that they are near the humps.
*/
//Produce a 2-D multivariate normal model with unit covariance and given mean
apop_model *produce_fixed_mvn(double x, double y){
apop_model *out = apop_model_copy(apop_multivariate_normal);
out->parameters = apop_data_falloc((2, 2, 2),
x, 1, 0,
y, 0, 1);
out->dsize = 2;
return out;
}
int main(){
//Build a mixture of three 2-D multivariate normals, each with unit covariance.
apop_model *many_humps = apop_model_mixture(
produce_fixed_mvn(5, 6),
produce_fixed_mvn(-5, -4),
produce_fixed_mvn(0, 1));
apop_prep(NULL, many_humps);
int len = 100000;
apop_data *d = apop_model_draws(many_humps, len);
gsl_vector *first = Apop_cv(d, 0);
printf("mu=%g\n", apop_mean(first));
assert(fabs(apop_mean(first)- 0) < 5e-2);
gsl_vector *second = Apop_cv(d, 1);
printf("mu=%g\n", apop_mean(second));
assert(fabs(apop_mean(second)- 1) < 5e-2);
/* Use the ML imputation routine to search for the input value with the highest
log likelihood. Do the search via simulated annealing. */
apop_data *x = apop_data_alloc(1, 2);
gsl_matrix_set_all(x->matrix, NAN);
apop_opts.stop_on_warning='v';
apop_ml_impute(x, many_humps);
printf("Optimum found at:\n");
apop_data_show(x);
assert(fabs(apop_data_get(x, .col=0)- 0) + fabs(apop_data_get(x, .col=1) - 1) < 1e-2);
}
apop_model apop_multinomial

The $n$–option generalization of the Binomial distribution.

Name 

Multinomial distribution

Input format 

Each row of the matrix is one observation: a set of draws from a single bin. The number of draws of type zero are in column zero, the number of draws of type one in column one, et cetera.

  • You may have a set of several Bernoulli-type draws, which could be summed together to form a single Binomial draw. The apop_data_to_dummies function (using the .keep_first='y' option) may help with this by splitting a single column of numbers into a sequence of columns.
Parameter format 

The parameters are kept in the vector element of the apop_model parameters element. parameters->vector->data[0]==n; parameters->vector->data[1...]==p_1....

The numeraire is bin zero, meaning that $p_0$ is not explicitly listed, but is $p_0=1-\sum_{i=1}^{k-1} p_i$, where $k$ is the number of bins. Conveniently enough, the zeroth element of the parameters vector holds $n$, and so a full probability vector can easily be produced by overwriting that first element. For example:

apop_model *estimated = apop_estimate(your_data, apop_multinomial);
int n = apop_data_get(estimated->parameters);
apop_data_set(estimated->parameters, .val=1 - (apop_sum(estimated->parameters)-n));

And now the parameter vector is a proper list of probabilities.

  • Because an observation is a single row, the number of bins, $k$ is set to equal the length of the first row (counting both vector and matrix elements, as appropriate). The covariance matrix will be $k \times k$.
  • Each row should sum to $N$, the number of draws. The estimation routine doesn't check this, but instead uses the average sum across all rows.
Post-estimate data 

Unchanged.

Post-estimate info 

Reports log likelihood.

RNG 

Returns a single vector of length $k$, the result of an imaginary tossing of $N$ balls into $k$ urns, with the given probabilities.

apop_model apop_multivariate_normal

This is the multivariate generalization of the Normal distribution.

Name 

Multivariate normal distribution

Input format 

Each row of the matrix is an observation.

Parameter format 

An apop_data set whose vector element is the vector of means, and whose matrix is the covariances.

If you had only one dimension, the mean would be a vector of size one, and the covariance matrix a $1\times 1$ matrix. This differs from the setup for apop_normal, which outputs a single vector with $\mu$ in element zero and $\sigma$ in element one.

After estimation, the <Covariance> page gives the covariance matrix of the means.
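For example, here is a sketch of hand-parameterizing a 2-D multivariate normal; the numbers are illustrative, and the layout mirrors the produce_fixed_mvn function in the apop_mixture examples below:

apop_model *mvn = apop_model_copy(apop_multivariate_normal);
mvn->parameters = apop_data_falloc((2, 2, 2), //a 2-vector of means plus a 2 x 2 covariance matrix
                    0, 1, 0,
                    2, 0, 1);
mvn->dsize = 2;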

Post-estimate data 

Unchanged.

Post-estimate info 

Reports log likelihood.

RNG 

From Devroye (1986), p 565.

apop_model apop_normal

You know it, it's your attractor in the limit, it's the Gaussian distribution.

$N(\mu,\sigma^2) = {1 \over \sqrt{2 \pi \sigma^2}} \exp (-(x-\mu)^2 / 2\sigma^2)$

$\ln N(\mu,\sigma^2) = (-(x-\mu)^2 / 2\sigma^2) - \ln (2 \pi \sigma^2)/2 $

$d\ln N(\mu,\sigma^2)/d\mu = (x-\mu) / \sigma^2 $

$d\ln N(\mu,\sigma^2)/d\sigma^2 = ((x-\mu)^2 / 2(\sigma^2)^2) - 1/2\sigma^2 $

See also the apop_multivariate_normal.

Name 

Normal distribution

Input format 

A scalar, in the vector or matrix elements of the input apop_data set.

Parameter format 

Parameter zero (in the vector) is the mean; parameter one is the standard deviation (i.e., the square root of the variance). After estimation, a page is added named <Covariance> with the 2 $\times$ 2 covariance matrix for these two parameters.

Post-estimate data 

Unchanged.

Post-estimate info 

Reports the log likelihood.

Predict 

apop_predict(NULL, estimated_normal_model) returns the expected value. The ->more element holds an apop_data set with the title <Covariance>, whose matrix holds the covariance of the mean.

RNG 

A wrapper for the GSL's Normal RNG.

Settings 

None.

apop_model apop_ols

Ordinary least squares. Weighted least squares is also handled by this model.

Name 

Ordinary Least Squares

Input format 

See the notes on the prep routine.

If you provide weights in your_input_data->weights, then I will use them appropriately. That is, the apop_ols model really implements Weighted Least Squares, but in most cases weights==NULL and the math reduces to the special case of Ordinary Least Squares.

Prep routine 

If your input data has no vector element, then column zero of the matrix is taken to be the dependent variable. This routine moves the dependent variable to the vector, and replaces column zero with a column of all ones, indicating a constant term. This is the norm for OLS, and is probably what you want. The easiest way to generate data for this sort of process is via a query like apop_query_to_matrix("select depvar, independent_var1, independent_var2 from dataset").

If your data has a vector element, then the prep routines won't try to force something to be there. That is, nothing will be moved, and no constant column generated. If you don't want to use a constant column, or your data has already been prepped by an estimation, then this is what you want. See apop_query_to_mixed_data for an easy way to generate a data set like this via queries.

Parameter format 

A vector of OLS coefficients. Coefficient zero refers to the constant column, if any. The output vector will therefore have as many elements as the data matrix has columns (data->matrix->size2).

The estimation routine appends a page to the parameters named <Covariance>, giving the covariance matrix for the estimated parameters (not the data itself). If the predicted values are calculated (and appended to the info page), then a page is appended to the parameters named <Error variance>, giving the variance of the error term.

Post-estimate data 

You can specify whether the data is modified with an apop_lm_settings group. Else, left unchanged.

Post-estimate parameter model 

For the mean, a noncentral $t$ distribution (apop_t_distribution).

Post-estimate info 

Reports log likelihood, and runs apop_estimate_coefficient_of_determination to add $R^2$-type information (SSE, SSR, &c) to the info page.

Residuals: I add a page named <Predicted>, with three columns. The first column is the dependent variable from the input data. Let our model be $ Y = \beta X + \epsilon$. Then the second column is the predicted values: $\beta X$, and the third column is the residuals: $\epsilon$. The third column is therefore always the first minus the second.

Given your estimate est, the zeroth element is one of
apop_data_get(est->info, .page= "Predicted", .row=0, .colname="observed"),
apop_data_get(est->info, .page= "Predicted", .row=0, .colname="predicted") or
apop_data_get(est->info, .page= "Predicted", .row=0, .colname="residual").

RNG 

Linear models are typically only partially defined probability models. For OLS, we know that $P(Y|X\beta) \sim {\cal N}(X\beta, \sigma)$, because this is an assumption about the error process, but we don't know much of anything about the distribution of $X$.

The apop_lm_settings group includes an apop_model element named input_distribution. This is the distribution of the independent/predictor/X columns of the data set.

The default is that input_distribution = apop_improper_uniform , meaning that $P(X)=1$ for all $X$. So $P(Y, X) = P(Y|X)P(X) = P(Y|X)$. This seems to be how many people use linear models: the $X$ values are taken as certain (as with actually observed data) and the only question is the odds of the dependent variable. If that's what you're looking for, just leave the default. This is sufficient for getting log likelihoods under the typical assumption that the observed data has probability one.

But you can't draw from an improper uniform. So if you draw from a linear model with a default input_distribution, then you'll get an error.

Alternatively, you may know something about the distribution of the input data. For example, the data model may simply be a PMF from the actual data:

apop_settings_set(your_model, apop_lm, input_distribution, apop_estimate(inset, apop_pmf));

Now, random draws are taken from the input data, and the dependent variable value calculated via $X\beta+\epsilon$, where $X$ is the drawn value, $\beta$ the previously-estimated parameters and $\epsilon$ is a Normally-distributed random draw. Or change the PMF to any other appropriate distribution, such as a apop_multivariate_normal, or an apop_pmf filled in with more data, or perhaps something from http://en.wikipedia.org/wiki/Errors-in-variables_models , as desired.
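With input_distribution set as in the snippet above, draws proceed in the usual way. A minimal sketch, where est is a hypothetical estimated apop_ols model with that setting already in place:

apop_data *sims = apop_model_draws(est, 100);  //each row: one X drawn from the input distribution plus its simulated dependent value
apop_data_show(sims);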

Examples 

A quick overview opens with a sample program using OLS. For quick reference, here is the program, but see that page for a full discussion.

#ifdef Datadir
#define DATADIR Datadir
#else
#define DATADIR "."
#endif
#include <apop.h>
int main(){
apop_text_to_db(.text_file= DATADIR "/" "data" , .tabname="d");
apop_data *data = apop_query_to_data("select * from d");
apop_model *est = apop_estimate(data, apop_ols); //estimate the OLS model and display it
apop_model_show(est);
}
apop_model apop_pmf

A probability mass function is commonly known as a histogram, or still more commonly, a bar chart. It indicates that at a given coordinate, there is a given mass.

Each row of the PMF's data set holds the coordinates, and the weights vector holds the mass at the given point. This is in contrast to the crosstab format, where the location is simply given by the position of the data point in the grid.

For example, here is a typical crosstab:

        col 0   col 1   col 2
row 0     0       8.1     3.2
row 1     0       0       2.2
row 2     0       7.3     1.2

Here it is as a sparse listing:

dimension 1   dimension 2   value
     0              1        8.1
     0              2        3.2
     1              2        2.2
     2              1        7.3
     2              2        1.2

The apop_pmf internally represents data in this manner, with the dimensions in the matrix, vector, and text element of the data set, and the cell values are held in the weights element (not the vector).

If your data is in a crosstab (with observation coordinates in the matrix element for 2-D data or the vector for 1-D data), then use apop_crosstab_to_db to make the conversion. See also the wiki for another crosstab-to-PMF function.

If your data is already in the sparse listing format (which is probably the case for 3- or more dimensional data), then estimate the model via:

apop_model *my_pmf = apop_estimate(in_data, apop_pmf);
  • If the weights element is NULL, then I assume that all rows of the data set are equally probable.
  • If the weights are present but sum to a not-finite value, the model's error element is set to 'w' when the estimation is run, and a warning printed.
Name 

PDF or sparse matrix

Input format 

One observation per row, with coordinates in the vector, matrix, and/or text, and the density at that point in the weights. If weights==NULL, all observations are equiprobable.

Parameter format 

None. The list of observations and their weights are in the data set, not the parameters.

Post-estimate data 

The data you sent in is linked to (not copied).

Post-estimate parameters 

Still NULL.

RNG 

Return the data in a random row of the PMF's data set. If there is a weights vector, I will use that to make draws; else all rows are equiprobable.

  • If you set draw_index to 'y', e.g.,
Apop_settings_add(your_model, apop_pmf, draw_index, 'y');

then I will return the row number of the draw, not the data in that row. Because apop_draw only returns numeric data, this is the only meaningful way to make draws from text data.

  • The first time you draw from a PMF with uneven weights, I will generate a vector tallying the cumulative mass. Subsequent draws will have no computational overhead. Because the vector is built using the data on the first call to this or the cdf method, do not rearrange or modify the data after the first call. I.e., if you choose to use apop_data_sort or apop_data_pmf_compress on your data, do it before the first draw or CDF calculation.
Exceptions:
  m->error='f': There is zero or NaN density in the CMF. I set the model's error element to 'f' and set out=NAN.
  m->error='a': Allocation error. I set the model's error element to 'a' and set out=NAN. Maybe try apop_data_pmf_compress first?
CDF 

Assuming the data is sorted in a meaningful manner, find the total mass up to a given data point.

That is, a CDF only makes sense if the data space is totally ordered. The sorting you define using apop_data_sort defines that ordering.

  • The input data should have the same number of columns as the data set used to construct the PMF. I use only the first row.
  • If the observation is not found in the data, return zero.
  • The first time you get a CDF from a data set with uneven weights, I will generate a vector tallying the cumulative mass. Subsequent calls will have no computational overhead. Because the vector is built using the data on the first call to this or the draw method, do not rearrange or modify the data after the first call. I.e., if you choose to use apop_data_sort or apop_data_pmf_compress on your data, do it before the first draw or CDF calculation.
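A minimal sketch, where d is a hypothetical data set with scalar observations in its vector element:

apop_data_sort(d);                          //define the total ordering first
apop_model *pmf = apop_estimate(d, apop_pmf);
apop_data *pt = apop_data_falloc((1), 2.2); //a one-row query point; 2.2 is illustrative
printf("mass up to 2.2: %g\n", apop_cdf(pt, pmf));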
Settings 

apop_pmf_settings

apop_model apop_poisson

$p(k) = {\mu^k \over k!} \exp(-\mu). $

Name 

Poisson distribution

Input format 

One scalar observation per row (in the matrix or vector).

Parameter format 

One parameter, the zeroth element of the vector (double mu = apop_data_get(estimated_model->parameters)).

Post-estimate data 

Unchanged.

Post-estimate parameters 

Unless you decline it by adding the apop_parts_wanted_settings group, I will also give you the variance of the parameter, via bootstrap, stored in a page named <Covariance>.

Post-estimate info 

Reports log likelihood.

RNG 

A wrapper for gsl_ran_poisson. Sets a single scalar.

apop_model apop_probit

Apophenia makes no distinction between the Bivariate Probit and the Multinomial Probit. This one does both.

Name 

Probit

Input format 

This model expects the first column of the data matrix to be zeros, ones, ..., enumerating the factors; see the prep routine. The remaining columns are values of the independent variables. Thus, the model will return [(data columns)-1] $\times$ [(option count)-1] parameters. Column names are options; row names are input variables.

Prep routine 

The initial column of data should be a set of factors, set up via apop_data_to_factors. If I find a factor page, I will use that info; if not, then I will run apop_data_to_factors on the left-most column (the vector if there is one, else the first column of the matrix.)

Also, if there is no vector, then I will move the first column of the matrix, and replace that matrix column with a constant column of ones, just like with OLS.

Parameter format 

As above

Post-estimate data 

Unchanged.

RNG 

See apop_ols; this one is similar but produces a category number instead of OLS's continuous draw.

apop_model apop_t_distribution

The t distribution, primarily for descriptive purposes.

If you want to test a hypothesis, you probably don't need this, and should instead use apop_test.

In that world, the $t$ distribution is parameter free. The data are assumed to be normalized to be based on a mean zero, variance one process, you get the degrees of freedom from the size of the data, and the distribution is thus fixed.

For modeling purposes, more could be done. For example, the t-distribution is a favorite proxy for Normal-like situations where there are fat tails relative to the Normal (i.e., high kurtosis). Or, you may just prefer not to take the step of normalizing your data—one could easily rewrite the theorems underlying the t-distribution without the normalizations.

In such a case, the researcher would not want to fix the $df$, because $df$ indicates the fatness of the tails, which has some optimal value given the data. Thus, there are two modes of use for these distributions:

  • Parameterized, testing style: the degrees of freedom are determined from the data, and all necessary normalizations are assumed. Thus, this code—
apop_model *t_for_testing = apop_estimate(data, apop_t_distribution);

—will return exactly the type of $t$-distribution one would use for testing.

  • By removing the estimate method—
    apop_model *spare_t = apop_model_copy(apop_t_distribution);
    spare_t->estimate = NULL;
    apop_model *best_fitting_t = apop_estimate(your_data, spare_t);
    —I will find the best $df$ via maximum likelihood, which may be desirable for finding the best-fitting model for descriptive purposes.
Name 

t distribution

Input format 

Unordered list of scalars in the matrix and/or vector.

Parameter format 

Three scalars in the vector element:
double mu=apop_data_get(estimated_model->parameters, 0)
double sigma=apop_data_get(estimated_model->parameters, 1)
double df=apop_data_get(estimated_model->parameters, 2)

Post-estimate data 

Unchanged.

apop_model apop_uniform

This is the two-parameter version of the Uniform, expressing a uniform distribution over [a, b].

The MLE of this distribution is simply a = min(your data); b = max(your data). Often useful for the RNG, such as when you have a Uniform prior model.
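A minimal sketch, where d is a hypothetical data set of scalars:

apop_model *u = apop_estimate(d, apop_uniform);
printf("a=%g, b=%g\n", apop_data_get(u->parameters, 0),  //the min
                       apop_data_get(u->parameters, 1)); //the max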

Name 

Uniform distribution

Input format 

One scalar observation per row (in the matrix or vector).

Parameter format 

Zeroth vector element is $a$, the min; element one is $b$, the max.

Post-estimate data 

Unchanged.

Post-estimate info 

Reports log likelihood.

apop_model apop_yule

$ Y(k, b) = (b-1) \Gamma(b) \Gamma(k) / \Gamma(k+b) $

$ \ln Y(k, b) = \ln(b-1) + \ln\Gamma(b) + \ln\Gamma(k) - \ln\Gamma(k+b) $

$ d\ln Y/db = 1/(b-1) + \psi(b) - \psi(k+b) $

Name 

Yule distribution

Input format 

One scalar observation per row (in the matrix or vector). See also apop_data_rank_compress for means of dealing with one more input data format.

Parameter format 

One element in the parameter set's vector.

Post-estimate data 

Unchanged.

RNG 

From Devroye (1986), p 553.

Settings 

MLE-type: apop_mle_settings, apop_parts_wanted_settings

apop_model apop_zipf

Wikipedia has notes on the Zipf distribution.

$Z(a) = {1\over \zeta(a)\, i^a} $

$\ln Z(a) = -(\ln(\zeta(a)) + a \ln(i)) $

Name 

Zipf distribution

Input format 

One scalar observation per row (in the matrix or vector). See also apop_data_rank_compress for means of dealing with one more input data format.

Parameter format 

One item in the parameter set's vector.

Post-estimate data 

Unchanged.

RNG 

Returns an ordinal ranking, starting from 1.

From Devroye (1986), Chapter 10, p 551.

Settings 

apop_mle_settings