The apop_data structure represents a data set. It joins together a gsl_vector, a gsl_matrix, an apop_name, and a table of strings. It tries to be lightweight, so you can use it everywhere you would use a gsl_matrix or a gsl_vector.

Here is a diagram showing a sample data set with all of the elements in place. Together, they represent a data set where each row is an observation, which includes both numeric and text values, and where each row/column may be named.

Rowname  | Vector  | Matrix                        | Text            | Weights
         | Outcome | Age  Weight (kg)  Height (cm) | Sex     State   |
"Steven" | 1       | 32   65           175         | Male    Alaska  | 1
"Sandra" | 0       | 41   61           165         | Female  Alabama | 3.2
"Joe"    | 1       | 40   73           181         | Male    Alabama | 2.4

In a regression, the vector would be the dependent variable, and the other columns (after factor-izing the text) the independent variables. Or think of the apop_data set as a partitioned matrix, where the vector is column -1, and the first column of the matrix is column zero. Here is some sample code to print the vector and matrix, starting at column -1 (but you can use apop_data_print to do this).

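for (int j = 0; j < (int)data->matrix->size1; j++){
    printf("%s\t", apop_name_get(data->names, j, 'r'));
    //Cast the size_t bound to int: compared against an unsigned size, i = -1
    //would be converted to a huge value and the loop would never run.
    for (int i = -1; i < (int)data->matrix->size2; i++)
        printf("%g\t", apop_data_get(data, j, i));
    printf("\n");
}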
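
The partitioned-matrix view can be sketched without the library. The following is a minimal illustration in plain C, not Apophenia's implementation: toy_data, toy_get, and toy_row_sum are hypothetical names, and the struct holds only a vector and a row-major matrix that share a row count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for apop_data, holding only the numeric parts. */
typedef struct {
    size_t rows, cols;      /* shared row count; number of matrix columns */
    const double *vector;   /* rows elements: the column addressed as -1 */
    const double *matrix;   /* rows*cols elements in row-major order */
} toy_data;

/* Return the element at (row, col), where col == -1 selects the vector. */
double toy_get(const toy_data *d, size_t row, int col){
    assert(row < d->rows);
    assert(col >= -1 && col < (int)d->cols);
    if (col == -1) return d->vector[row];
    return d->matrix[row * d->cols + (size_t)col];
}

/* Sum one observation across the vector and every matrix column. */
double toy_row_sum(const toy_data *d, size_t row){
    double total = 0;
    /* The bound is cast to int so that the comparison with -1 stays signed. */
    for (int col = -1; col < (int)d->cols; col++)
        total += toy_get(d, row, col);
    return total;
}
```

With the sample data set from the diagram, row 2 ("Joe") would sum the outcome and the three matrix columns: 1 + 40 + 73 + 181.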

Most functions assume that each row represents one observation, so the data vector, data matrix, and text have the same row count: data->vector->size==data->matrix->size1 and data->vector->size==*data->textsize. This means that the apop_name structure doesn't have separate vector_names, row_names, or text_row_names elements: the rownames are assumed to apply for all.

See below for notes on managing the text element and the row/column names.

Pages

The apop_data set includes a more pointer, which will typically be NULL, but may point to another apop_data set. This is intended for a main data set and a second or third page with auxiliary information, such as estimated parameters on the front page and their covariance matrix on page two, or predicted data on the front page and a set of prediction intervals on page two.

The more pointer is not intended for a linked list for millions of data points. In such situations, you can often improve efficiency by restructuring your data to use a single table (perhaps via apop_data_pack and apop_data_unpack).

Most functions, such as apop_data_copy and apop_data_free, will handle all the pages of information. For example, an optimization search over multi-page parameter sets would search the space given by all pages.

Pages may also be appended as output or auxiliary information, such as covariances, and an MLE would not search over these elements. Any page with a name in XML-ish brackets, such as <Covariance>, is considered information about the data, not data itself, and therefore ignored by search routines, missing data routines, et cetera. This is achieved by a rule in apop_data_pack and apop_data_unpack.

Here is a toy example that establishes a baseline data set, adds a page, modifies it, and then later retrieves it.

apop_data *d = apop_data_alloc(10, 10, 10); //the base data set, a 10-item vector + 10x10 matrix
apop_data *a_new_page = apop_data_add_page(d, apop_data_alloc(2,2), "new 2 x 2 page");
gsl_matrix_set_all(a_new_page->matrix, 3);
//later:
apop_data *retrieved = apop_data_get_page(d, "new", 'r'); //'r'=search via regex, not literal match.
apop_data_print(retrieved); //print a 2x2 grid of 3s.
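
To make the page mechanism concrete, here is a rough sketch in plain C of a chain of named pages. It is not the library's code: the toy_ names are hypothetical, a substring match stands in for the regex search, and the test for an information page (a name that starts with < and ends with >) is an assumption based on the description above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a chain of apop_data pages. */
typedef struct toy_page {
    const char *name;
    struct toy_page *more;   /* next page, or NULL at the end of the chain */
} toy_page;

/* Return the first page whose name contains fragment, or NULL if none does.
   A substring match stands in for the regex search used in the text. */
toy_page *toy_get_page(toy_page *first, const char *fragment){
    for (toy_page *p = first; p; p = p->more)
        if (p->name && strstr(p->name, fragment)) return p;
    return NULL;
}

/* Assumed rule: a name wrapped in <...> marks information about the data. */
int toy_is_info_page(const toy_page *p){
    size_t len = p->name ? strlen(p->name) : 0;
    return len >= 2 && p->name[0] == '<' && p->name[len-1] == '>';
}

/* Count the pages that a search or missing-data routine would visit. */
size_t toy_count_data_pages(const toy_page *first){
    size_t ct = 0;
    for (const toy_page *p = first; p; p = p->more)
        if (!toy_is_info_page(p)) ct++;
    return ct;
}
```

For a chain of three pages named "base", "new 2 x 2 page", and "<Covariance>", a lookup for "new" would find the second page, and only the first two would count as data.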

Functions for using apop_data sets

There are a great many functions to collate, copy, merge, sort, prune, and otherwise manipulate the apop_data structure and its components.

apop_data_add_named_elmt
apop_data_copy
apop_data_fill
apop_data_memcpy
apop_data_pack
apop_data_rm_columns
apop_data_sort
apop_data_split
apop_data_stack
apop_data_transpose : transpose matrices (square or not) and text grids
apop_data_unpack
apop_matrix_copy
apop_matrix_realloc
apop_matrix_stack
apop_text_set
apop_text_paste
apop_text_to_data
apop_vector_copy
apop_vector_fill
apop_vector_stack
apop_vector_realloc
apop_vector_unique_elements

Apophenia builds upon the GSL, and replicating the GSL's documentation here would be redundant. Still, here are prototypes for a few common functions. The GSL's naming scheme is very consistent, so a reminder of the function name may be enough to indicate how each is used.

gsl_matrix_swap_rows (gsl_matrix * m, size_t i, size_t j)
gsl_matrix_swap_columns (gsl_matrix * m, size_t i, size_t j)
gsl_matrix_swap_rowcol (gsl_matrix * m, size_t i, size_t j)
gsl_matrix_transpose_memcpy (gsl_matrix * dest, const gsl_matrix * src)
gsl_matrix_transpose (gsl_matrix * m) : square matrices only
gsl_matrix_set_all (gsl_matrix * m, double x)
gsl_matrix_set_zero (gsl_matrix * m)
gsl_matrix_set_identity (gsl_matrix * m)
gsl_matrix_memcpy (gsl_matrix * dest, const gsl_matrix * src)
void gsl_vector_set_all (gsl_vector * v, double x)
void gsl_vector_set_zero (gsl_vector * v)
int gsl_vector_set_basis (gsl_vector * v, size_t i): set all elements to zero, but set item i to one.
gsl_vector_reverse (gsl_vector * v): reverse the order of your vector's elements
gsl_vector_ptr and gsl_matrix_ptr. To increment an element in a vector use, e.g., *gsl_vector_ptr(v, 7) += 3; or (*gsl_vector_ptr(v, 7))++.
gsl_vector_memcpy (gsl_vector * dest, const gsl_vector * src)

Reading from text files

The apop_text_to_data() function takes in the name of a text file with a grid of data in (comma|tab|pipe|whatever)-delimited format and reads it into a matrix. If there are names in the text file, they are copied into the data set. See Input text file formatting for the full range and details of what can be read in.

If you have any columns of text, then you will need to read in via the database: use apop_text_to_db() to convert your text file to a database table, do any database-appropriate cleaning of the input data, then use apop_query_to_data() or apop_query_to_mixed_data() to pull the data to an apop_data set.

Input text file formatting

Alloc/free

You may not need to use these functions often, given that apop_query_to_data, apop_text_to_data, and many transformation functions will auto-allocate apop_data sets for you.

The apop_data_alloc function allocates a vector, a matrix, or both. After this call, the structure will have blank names, NULL text element, and NULL weights. See Name handling for discussion of filling the names. Use apop_text_alloc to allocate the text grid. The weights are a simple gsl_vector, so allocate a 100-unit weights vector via allocated_data_set->weights = gsl_vector_alloc(100).

Examples of use can be found throughout the documentation; for example, see A quick overview.

apop_data_alloc
apop_data_calloc
apop_data_free
apop_text_alloc : allocate or resize the text part of an apop_data set.
apop_text_free

Using views

There are several macros for the common task of viewing a single row or column of an apop_data set.

apop_data *d = apop_query_to_data("select obs1, obs2, obs3 from a_table");
//Get a column using its name. Note that the generated view, ov, is the
//last item named in the call to the macro.
Apop_col_t(d, "obs1", ov);
double obs1_sum = apop_vector_sum(ov);
//Get row zero of the data set's matrix as a vector; get its sum
double row_zero_sum = apop_vector_sum(Apop_rv(d, 0));
//Get a row or rows as a standalone one-row apop_data set
apop_data_print(Apop_r(d, 0));
//ten rows starting at row 3:
apop_data *d10 = Apop_rs(d, 3, 10);
apop_data_print(d10);
//Column zero's sum
gsl_vector *cv = Apop_cv(d, 0);
double col_zero_sum = apop_vector_sum(cv);
//or on one line:
double col_zero_sum_again = apop_vector_sum(Apop_cv(d, 0));
//Pull a 10x5 submatrix, whose origin element is the (2,3)rd
//element of the parent data set's matrix
double sub_sum = apop_matrix_sum(Apop_subm(d, 2,3, 10,5));

Because these macros can be used as arguments to a function, these macros have abbreviated names to save line space.

Apop_r : get row as one-observation apop_data set
Apop_c : get column as apop_data set
Apop_cv : get column as gsl_vector
Apop_rv : get row as gsl_vector
Apop_cs : get columns as apop_data set
Apop_rs : get rows as apop_data set
Apop_mcv : matrix column as vector
Apop_mrv : matrix row as vector
Apop_subm : get submatrix of a gsl_matrix

A second set of macros has a slightly different syntax, taking the name of the object to be declared as the last argument. These cannot be used as expressions such as function arguments.

The view is an automatic variable, not a pointer, and therefore disappears at the end of the scope in which it is declared. If you want to retain the data after the function exits, copy it to another vector:

return apop_vector_copy(Apop_rv(d, 2)); //return a gsl_vector copy of row 2

Curly braces always delimit scope, not just at the end of a function. When program evaluation exits a given block, all variables in that block are erased. Here is some sample code that won't work:

apop_data *outdata;
if (get_odd){
    outdata = Apop_r(data, 1);
} else {
    outdata = Apop_r(data, 0);
}
apop_data_print(outdata); //breaks: outdata points to out-of-scope variables.

For this if/then statement, there are two sets of local variables generated: one for the if block, and one for the else block. By the last line, neither exists. You can get around the problem here by making sure to not put the macro declaring new variables in a block. E.g.:

apop_data *outdata = Apop_r(data, get_odd ? 1 : 0);

apop_data_print(outdata);
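
The scoping rule can be reproduced in plain C with a compound literal, which has automatic storage tied to its enclosing block. The sketch below assumes the views behave this way; toy_view and the Toy_row macro are hypothetical and are not Apophenia's macros. It shows the working pattern, where the view is declared in the function's outermost block.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view type: a window onto one row of a row-major grid. */
typedef struct {
    const double *start;
    size_t len;
} toy_view;

/* Expands to a pointer to a compound literal, which lives until the end of
   the block where the macro is used. */
#define Toy_row(grid, ncols, r) \
    (&(toy_view){ (grid) + (size_t)(r) * (ncols), (ncols) })

double toy_view_sum(const toy_view *v){
    double total = 0;
    for (size_t i = 0; i < v->len; i++) total += v->start[i];
    return total;
}

/* The view is created in the function's outermost block, so it is still
   valid when it is summed on the next line. */
double toy_sum_chosen_row(const double *grid, size_t ncols, int get_odd){
    assert(grid && ncols > 0);
    const toy_view *row = Toy_row(grid, ncols, get_odd ? 1 : 0);
    return toy_view_sum(row);
}
```

Had Toy_row been used inside the braces of an if or else branch, the pointer would dangle as soon as that branch ended, which is the failure described above.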

Set/get

First, some examples:

apop_data *d = apop_data_alloc(10, 10, 10);
apop_name_add(d->names, "Zeroth row", 'r');
apop_name_add(d->names, "Zeroth col", 'c');
//set cell at row=8 col=0 to value=27
apop_data_set(d, 8, 0, .val=27);
assert(apop_data_get(d, 8, .colname="Zeroth") == 27);
double *x = apop_data_ptr(d, .col=7, .rowname="Zeroth");
*x = 270;
assert(apop_data_get(d, 0, 7) == 270);
// This is invalid---the value doesn't follow the colname. Use .val=5.
// apop_data_set(d, .row = 3, .colname="Column 8", 5);  
// This is OK, to set (3, 8) to 5:
apop_data_set(d, 3, 8, 5);
//apop_data set holding a scalar:
apop_data *s = apop_data_alloc(1);
apop_data_set(s, .val=12);
assert(apop_data_get(s) == 12);
//apop_data set holding a vector:
apop_data *v = apop_data_alloc(12);
for (int i=0; i< 12; i++) apop_data_set(v, i, .val=i*10);
assert(apop_data_get(v,3) == 30);
//This is a common form from pulling from a list of named scalars, 
//produced via apop_data_add_named_elmt
double AIC = apop_data_get(your_model->info, .rowname="AIC");

The versions that take a column/row name use apop_name_find for the search; see notes there on the name matching rules.
For those that take a column number, column -1 is the vector element.
For those that take a column name, I will search the vector last—if I don't find the name among the matrix columns, but the name matches the vector name, I return column -1.
If you give me both a .row and a .rowname, I go with the name; similarly for .col and .colname.
You can give the name of a page, e.g.
double AIC = apop_data_get(data, .rowname="AIC", .col=-1, .page="<Info>");

Numeric values default to zero, which is how the examples above that treated the apop_data set as a vector or scalar could do so relatively gracefully. So apop_data_get(dataset, 1) gets item (1, 0) from the matrix element of dataset. But as a do-what-I-mean exception, if there is no matrix element but there is a vector, then this form will get vector element 1. Relying on this DWIM exception is useful iff you can guarantee that a data set will have only a vector or a matrix but not both. Otherwise, be explicit and use apop_data_get(dataset, 1, -1).
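
One way to get optional, named arguments that default to zero in C is to bundle the arguments in a struct and fill it through a variadic macro. The sketch below illustrates that general technique together with the do-what-I-mean fallback; it is an assumption-laden illustration, not the library's source, and every toy_ name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for apop_data: either part may be NULL. */
typedef struct {
    size_t rows, cols;
    const double *vector;
    const double *matrix;   /* row-major */
} toy_data;

/* Arguments bundled in a struct, so that unnamed fields default to zero. */
typedef struct {
    const toy_data *data;
    int row, col;
} toy_get_args;

double toy_get_base(toy_get_args a){
    const toy_data *d = a.data;
    assert(d && a.row >= 0 && (size_t)a.row < d->rows);
    /* Do-what-I-mean: with no matrix, column zero falls back to the vector. */
    int use_vector = a.col == -1 || (a.col == 0 && !d->matrix && d->vector);
    if (use_vector){
        assert(d->vector);
        return d->vector[a.row];
    }
    assert(d->matrix && a.col >= 0 && (size_t)a.col < d->cols);
    return d->matrix[(size_t)a.row * d->cols + (size_t)a.col];
}

/* Positional arguments fill data, row, col in order; named ones may follow. */
#define toy_get(...) toy_get_base((toy_get_args){__VA_ARGS__})
```

Under these assumptions, toy_get(&d, 1) reads matrix element (1, 0) when a matrix is present and vector element 1 when only a vector is present, while toy_get(&d, 1, -1) always reads the vector.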

The apop_data_ptr function follows the lead of gsl_vector_ptr and gsl_matrix_ptr, and like those functions, returns a pointer to the appropriate double. For example, to increment the (3,7)th element of an apop_data set:

(*apop_data_ptr(dataset, 3, 7))++;

apop_data_get
apop_data_set
apop_data_ptr : returns a pointer to the element.
apop_data_get_page : retrieve a named page from a data set. If you only need a few items, you can specify a page name to apop_data_(get|set|ptr).
See also:

double gsl_matrix_get (const gsl_matrix * m, size_t i, size_t j)
double gsl_vector_get (const gsl_vector * v, size_t i)
void gsl_matrix_set (gsl_matrix * m, size_t i, size_t j, double x)
void gsl_vector_set (gsl_vector * v, size_t i, double x)
double * gsl_matrix_ptr (gsl_matrix * m, size_t i, size_t j)
double * gsl_vector_ptr (gsl_vector * v, size_t i)
const double * gsl_matrix_const_ptr (const gsl_matrix * m, size_t i, size_t j)
const double * gsl_vector_const_ptr (const gsl_vector * v, size_t i)
gsl_matrix_get_row (gsl_vector * v, const gsl_matrix * m, size_t i)
gsl_matrix_get_col (gsl_vector * v, const gsl_matrix * m, size_t j)
gsl_matrix_set_row (gsl_matrix * m, size_t i, const gsl_vector * v)
gsl_matrix_set_col (gsl_matrix * m, size_t j, const gsl_vector * v)

Map/apply

These functions allow you to send each element of a vector or matrix to a function, either producing a new matrix (map) or transforming the original (apply). The ..._sum functions return the sum of the mapped output.

There are two types, which were developed at different times. The apop_map and apop_map_sum functions use variadic function inputs to cover a lot of different types of process depending on the inputs. Other functions with types in their names, like apop_matrix_map and apop_vector_apply, may be easier to use in some cases. They use the same routines internally, so use whichever type is convenient.

You can do many things quickly with these functions.

Get the sum of squares of a vector's elements:

//given apop_data *dataset and gsl_vector *v:
double sum_of_squares = apop_map_sum(dataset, gsl_pow_2);
double sum_of_squares_v = apop_vector_map_sum(v, gsl_pow_2);

Create an index vector [ $0, 1, 2, ...$ ].

double index(double in, int index){return index;}

apop_data *d = apop_map(apop_data_alloc(100), .fn_di=index, .inplace='y');

Given your log likelihood function, which acts on an apop_data set with only one row, and a data set where each row of the matrix is an observation, find the total log likelihood via:

static double your_log_likelihood_fn(apop_data * in)
     {[your math goes here]}
double total_ll = apop_map_sum(dataset, .fn_r=your_log_likelihood_fn);

How many missing elements are there in your data matrix?

static double nan_check(const double in){ return isnan(in);}
int missing_ct = apop_map_sum(in, nan_check, .part='m');

Get the mean of the not-NaN elements of a data set:

static double no_nan_val(const double in){ return isnan(in)? 0 : in;}
static double not_nan_check(const double in){ return !isnan(in);}
static double apop_mean_no_nans(apop_data *in){
    return apop_map_sum(in, no_nan_val)/apop_map_sum(in, not_nan_check);
}
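
For readers who want to see the map-then-sum idea without the library, here is a minimal sketch over a plain array of doubles. toy_map_sum is a hypothetical stand-in that only handles the element-by-element case, not any of the row, column, or parameter forms that apop_map covers.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical stand-in for apop_map_sum over a plain array: apply fn to
   every element and add up the results. */
double toy_map_sum(const double *in, size_t n, double (*fn)(double)){
    double total = 0;
    for (size_t i = 0; i < n; i++) total += fn(in[i]);
    return total;
}

static double no_nan_val(double in){ return isnan(in) ? 0 : in; }
static double not_nan_check(double in){ return !isnan(in); }

/* Mean of the non-NaN elements; the result is NaN if every element is NaN. */
double toy_mean_no_nans(const double *in, size_t n){
    return toy_map_sum(in, n, no_nan_val) / toy_map_sum(in, n, not_nan_check);
}

/* Count of NaN elements: the total count minus the non-NaN count. */
size_t toy_missing_ct(const double *in, size_t n){
    return n - (size_t)toy_map_sum(in, n, not_nan_check);
}
```

For the array {1, NAN, 3, NAN, 8}, the two sums are 12 and 3, so the mean of the non-missing elements is 4 and the missing count is 2.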

The following program randomly generates a data set where each row is a list of numbers with a different mean. It then finds the $t$ statistic for each row, and the confidence with which we reject the claim that the statistic is less than or equal to zero.

Notice how the older apop_vector_apply uses file-global variables to pass information into the functions, while the apop_map uses a pointer to send parameters to the functions.

#include <apop.h>
double row_offset;
void offset_rng(double *v){*v = gsl_rng_uniform(apop_rng_get_thread()) + row_offset;}
double find_tstat(gsl_vector *in){ return apop_mean(in)/sqrt(apop_var(in));}
double conf(double in, void *df){ return gsl_cdf_tdist_P(in, *(int *)df);}
//apop_vector_mean is a macro, so we can't point a pointer to it.
double mu(gsl_vector *in){ return apop_vector_mean(in);}
int main(){
    apop_data *d = apop_data_alloc(10, 100);
    gsl_rng *r = apop_rng_alloc(3242);
    for (int i=0; i< 10; i++){
        row_offset = gsl_rng_uniform(r)*2 -1; //declared and used above.
        apop_vector_apply(Apop_rv(d, i), offset_rng);
    }
    int df = d->matrix->size2-1;
    apop_data *means = apop_map(d, .fn_v = mu, .part ='r');
    apop_data *tstats = apop_map(d, .fn_v = find_tstat, .part ='r');
    apop_data *confidences = apop_map(tstats, .fn_dp = conf, .param = &df);
    printf("means:\n"); apop_data_show(means);
    printf("\nt stats:\n"); apop_data_show(tstats);
    printf("\nconfidences:\n"); apop_data_show(confidences);
    //Some sanity checks, for Apophenia's test suite.
    for (int i=0; i< 10; i++){
        //sign of mean == sign of t stat.
        assert(apop_data_get(means, i, -1) * apop_data_get(tstats, i, -1) >=0);
        //inverse of P-value should be the t statistic.
        assert(fabs(gsl_cdf_tdist_Pinv(apop_data_get(confidences, i, -1), 99) 
                    - apop_data_get(tstats, i, -1)) < 1e-5);
    }
}

One more toy example demonstrating the use of apop_map and apop_map_sum :

#include <apop.h>
/* This sample code sets the elements of a data set's vector to one
   if the index is even.  Then, via the weights vector, it adds up
   the even indices.
   There is really no need to use the weights vector; this code
   snippet is an element of Apophenia's test suite, and goes the long
   way to test that the weights are correctly handled. */
double set_vector_to_even(apop_data * r, int index){
    apop_data_set(r, 0, -1, .val=1-(index %2));
    return 0;
}
double set_weight_to_index(apop_data * r, int index){ 
    gsl_vector_set(r->weights, 0, index); 
    return 0;
}
double weight_given_even(apop_data *r){ 
    return gsl_vector_get(r->vector, 0) ? gsl_vector_get(r->weights, 0) : 0; 
}
int main(){
    apop_data *d = apop_data_alloc(100);
    d->weights = gsl_vector_alloc(100);
    apop_map(d, .fn_ri=set_vector_to_even, .inplace='v'); //'v'=void. Throw out return values.
    apop_map(d, .fn_ri=set_weight_to_index, .inplace='v');
    double sum = apop_map_sum(d, .fn_r = weight_given_even);
    assert(sum == 49*25*2);
}
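
The value in the final assertion follows from the data: the vector is one at the even indices, the weight at each index equals the index, and so the sum adds up the even indices from 0 to 98. As a worked check:

```latex
\sum_{k=0}^{49} 2k \;=\; 2 \cdot \frac{49 \cdot 50}{2} \;=\; 49 \cdot 50 \;=\; 49 \cdot 25 \cdot 2 \;=\; 2450
```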

If the number of threads is greater than one, then the matrix will be broken into chunks and each sent to a different thread. Notice that the GSL is generally threadsafe, and SQLite is threadsafe conditional on several commonsense caveats that you'll find in the SQLite documentation. See apop_rng_get_thread() to use the GSL's RNGs in a threaded environment.

The ...sum functions are convenience functions that call ...map and then add up the contents. Thus, you will need to have adequate memory for the allocation of the temp matrix/vector.

Basic Math

apop_vector_exp : exponentiate every element of a vector
apop_vector_log : take the natural log of every element of a vector
apop_vector_log10 : take the log (base 10) of every element of a vector
apop_vector_distance : find the distance between two vectors via various metrics
apop_vector_normalize : scale/shift a vector to have mean zero, sum to one, have a range of exactly $[0, 1]$, et cetera
apop_vector_entropy : calculate the entropy of a vector of frequencies or probabilities

Matrix math

apop_dot : matrix $\cdot$ matrix, matrix $\cdot$ vector, or vector $\cdot$ matrix
apop_matrix_determinant
apop_matrix_inverse
apop_det_and_inv : find determinant and inverse at the same time

See the GSL documentation for myriad further options.

Summary stats

Moments

For most of these, you can add a weights vector for weighted mean/var/cov/..., such as apop_vector_mean(d->vector, .weights=d->weights)

apop_mean : the first three with short names operate on a vector.
apop_sum
apop_var
apop_matrix_sum
apop_data_correlation
apop_data_covariance
apop_data_summarize
apop_matrix_mean
apop_matrix_mean_and_var
apop_vector_correlation
apop_vector_cov
apop_vector_kurtosis
apop_vector_kurtosis_pop
apop_vector_mean
apop_vector_skew
apop_vector_skew_pop
apop_vector_sum
apop_vector_var
apop_vector_var_m
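
As a reminder of what a weighted moment computes, here is a small sketch of a weighted mean in plain C. It assumes the usual definition, the weighted sum of the values divided by the sum of the weights; toy_weighted_mean is a hypothetical name and makes no claim about how the library's functions are implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Weighted mean: sum(w_i * x_i) / sum(w_i). A NULL weights pointer means
   every observation has weight one. */
double toy_weighted_mean(const double *x, const double *weights, size_t n){
    assert(x && n > 0);
    double weighted_total = 0, weight_total = 0;
    for (size_t i = 0; i < n; i++){
        double w = weights ? weights[i] : 1;
        weighted_total += w * x[i];
        weight_total += w;
    }
    return weighted_total / weight_total;
}
```

With the Outcome vector {1, 0, 1} and weights {1, 3.2, 2.4} from the sample data set, this gives 3.4/6.6, against an unweighted mean of 2/3.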

Conversion among types

There are no functions provided to convert from apop_data to the constituent elements, because you don't need a function.

If you need an individual element, you can use its pointer to retrieve it:

apop_data *d = apop_query_to_mixed_data("vmmw", "select result, age, "
                                     "income, replicate_weight from data");
double avg_result = apop_vector_mean(d->vector, .weights=d->weights);

In the other direction, you can use compound literals to wrap an apop_data struct around a loose vector or matrix:

//Given:
gsl_vector *v;
gsl_matrix *m;
// Then this form wraps the elements into automatically-allocated apop_data structs.
apop_data *dv = &(apop_data){.vector=v}; 
apop_data *dm = &(apop_data){.matrix=m};
apop_data *v_dot_m = apop_dot(dv, dm);
//Here is a macro to hide C's ugliness:
#define As_data(...) (&(apop_data){__VA_ARGS__})
apop_data *v_dot_m2 = apop_dot(As_data(.vector=v), As_data(.matrix=m));
//The wrapped object is an automatically-allocated structure pointing to the
//original data. If it needs to persist or be separate from the original,
//make a copy:
apop_data *dm_copy = apop_data_copy(As_data(.vector=v, .matrix=m));

apop_array_to_vector : double* $\to$ gsl_vector
apop_data_fill : double* $\to$ apop_data
apop_data_falloc : macro to allocate and fill an apop_data set
apop_text_to_data : delimited text file $\to$ apop_data
apop_text_to_db : delimited text file $\to$ database table
apop_vector_to_matrix

Name handling

If you generate your data set via apop_text_to_data or from the database via apop_query_to_data (or apop_query_to_text or apop_query_to_mixed_data), then column names appear as expected. Set apop_opts.db_name_column to the name of a column in your query result to use that column's values as row names.

Sample uses, given apop_data set d:

int row_name_count = d->names->rowct;
int col_name_count = d->names->colct;
int text_name_count = d->names->textct;
//Manually add names in sequence:
apop_name_add(d->names, "the vector", 'v');
apop_name_add(d->names, "row 0", 'r');
apop_name_add(d->names, "row 1", 'r');
apop_name_add(d->names, "row 2", 'r');
apop_name_add(d->names, "numeric column 0", 'c');
apop_name_add(d->names, "text column 0", 't');
apop_name_add(d->names, "The name of the data set.", 'h');
//or append several names at once
apop_data_add_names(d, 'c', "numeric column 1", "numeric column 2", "numeric column 3");
//point to element i from the row/col/text names:
char *rowname_i = d->names->row[i];
char *colname_i = d->names->col[i];
char *textname_i = d->names->text[i];
//The vector also has a name:
char *vname = d->names->vector;

apop_name_add : add one name
apop_data_add_names : add a sequence of names at once
apop_name_stack : copy the contents of one name list to another
apop_name_find : find the row/col number for a given name.
apop_name_print : print the apop_name struct, for diagnostic purposes.

Text data

The apop_data set includes a grid of strings, named text, for holding text data.

Text should be encoded in UTF-8. ASCII is a subset of UTF-8, so that's OK too.

There are a few simple forms for handling the text element of an apop_data set.

Use apop_text_alloc to allocate the block of text. It is actually a realloc function, which you can use to resize an existing block without leaks. See the example below.
Use apop_text_set to write text elements. It replaces any existing text in the given slot without memory leaks.
The number of rows of text data in tdata is tdata->textsize[0]; the number of columns is tdata->textsize[1].
Refer to individual elements using the usual 2-D array notation, tdata->text[row][col].
x[0] can always be written as *x, which may save some typing. The number of rows is *tdata->textsize. If you have a single column of text data (i.e., all data is in column zero), then item i is *tdata->text[i]. If you know you have exactly one cell of text, then its value is **tdata->text.
After apop_text_alloc, all elements are the empty string "", which you can check via
if (!strlen(dataset->text[i][j])) printf("<blank>");
//or
if (!*dataset->text[i][j]) printf("<blank>");
For the sake of efficiency when dealing with large, sparse data sets, all blank cells point to the same static empty string, meaning that freeing cells must be done with care. Your best bet is to rely on apop_text_set, apop_text_alloc, and apop_text_free to do the memory management for you.

Here is a sample program that uses these forms, plus a few text-handling functions.

#include <apop.h>
int main(){
    apop_query("create table data (name, city, state);"
            "insert into data values ('Mike Mills', 'Rockville', 'MD');"
            "insert into data values ('Bill Berry', 'Athens', 'GA');"
            "insert into data values ('Michael Stipe', 'Decatur', 'GA');");
    apop_data *tdata = apop_query_to_text("select name, city, state from data");
    printf("Customer #1: %s\n\n", *tdata->text[0]);
    printf("The data, via apop_data_print:\n");
    apop_data_print(tdata);
    //the text alloc can be used as a text realloc:
    apop_text_alloc(tdata, 1+tdata->textsize[0], tdata->textsize[1]);
    apop_text_set(tdata, *tdata->textsize-1, 0, "Peter Buck");
    apop_text_set(tdata, *tdata->textsize-1, 1, "Berkeley");
    apop_text_set(tdata, *tdata->textsize-1, 2, "CA");
    printf("\n\nAugmented data, printed via for loop:\n");
    for (int i=0; i< tdata->textsize[0]; i++){
        for (int j=0; j< tdata->textsize[1]; j++)
            printf("%s\t", tdata->text[i][j]);
        printf("\n");
    }
    apop_data *states = apop_text_unique_elements(tdata, 2);
    char *states_as_list = apop_text_paste(states, .between=", ");
    printf("\n States covered: %s\n", states_as_list);
}
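
The shared-blank convention from the list of text-handling forms explains why cells should not be freed by hand. The following sketch in plain C shows one way such a grid could be managed; the toy_ functions are hypothetical and only mirror the behavior described for apop_text_alloc, apop_text_set, and apop_text_free, without the resizing feature.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* All blank cells point here, so a blank must never be passed to free(). */
static char toy_blank[] = "";

/* Hypothetical stand-in for the text grid: rows*cols cells, row-major. */
typedef struct {
    char **cells;
    size_t rows, cols;
} toy_text;

/* Allocate a grid whose cells all start as the shared blank string.
   Returns NULL if the allocation fails. */
toy_text *toy_text_alloc(size_t rows, size_t cols){
    toy_text *t = malloc(sizeof *t);
    if (!t) return NULL;
    t->cells = malloc(rows * cols * sizeof *t->cells);
    if (!t->cells){ free(t); return NULL; }
    t->rows = rows;
    t->cols = cols;
    for (size_t i = 0; i < rows * cols; i++) t->cells[i] = toy_blank;
    return t;
}

/* Replace one cell with a copy of str, releasing the old copy if it was
   not the shared blank. Returns 0 on success, -1 on allocation failure. */
int toy_text_set(toy_text *t, size_t row, size_t col, const char *str){
    assert(t && row < t->rows && col < t->cols);
    char **cell = &t->cells[row * t->cols + col];
    char *copy = toy_blank;
    if (str && *str){
        copy = malloc(strlen(str) + 1);
        if (!copy) return -1;
        strcpy(copy, str);
    }
    if (*cell != toy_blank) free(*cell);
    *cell = copy;
    return 0;
}

const char *toy_text_get(const toy_text *t, size_t row, size_t col){
    assert(t && row < t->rows && col < t->cols);
    return t->cells[row * t->cols + col];
}

/* Free every cell that owns its string, then the grid itself. */
void toy_text_free(toy_text *t){
    if (!t) return;
    for (size_t i = 0; i < t->rows * t->cols; i++)
        if (t->cells[i] != toy_blank) free(t->cells[i]);
    free(t->cells);
    free(t);
}
```

Comparing each pointer against the shared blank before calling free() is the care that the text refers to; routing every write through one setter keeps that check in a single place.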

apop_data_transpose() : also transposes the text data. Say that you use dataset = apop_query_to_text("select onecolumn from data"); then you have a sequence of strings, dataset->text[0][0], dataset->text[1][0], .... After apop_data *dt = apop_data_transpose(dataset), you will have a single list of strings, dt->text[0], which is often useful as input to list-of-strings handling functions.

apop_query_to_text
apop_text_alloc : allocate or resize the text part of an apop_data set.
apop_text_set : replace a single cell of the text grid with new text.
apop_text_paste : convert a table of strings into one long string.
apop_text_unique_elements : get a sorted list of unique elements for one column of text.
apop_text_free : you may never need this, because apop_data_free calls it.
apop_regex : friendlier front-end for POSIX-standard regular expression searching; pulls matches into an apop_data set.

Generating factors

Factor is jargon for a numbered category. Number-crunching programs prefer integers over text, so we need a function to produce a one-to-one mapping from text categories into numeric factors.

A dummy is a variable that is either one or zero, depending on membership in a given group. Some methods (typically when the variable is an input or independent variable in a regression) prefer dummies; some methods (typically for outcome or dependent variables) prefer factors. The functions that generate factors and dummies will add an informational page to your apop_data set with a name like <categories for your_column> listing the conversion from the artificial numeric factor to the original data. Use apop_data_get_factor_names to get a pointer to that page.

You can use the factor table to translate from numeric categories back to text (though you probably have the original text column in your data anyway).

Having the factor list in an auxiliary table makes it easy to ensure that multiple apop_data sets use the same single categorization scheme. Generate factors in the first set, then copy the factor list to the second, then run apop_data_to_factors on the second:

apop_data_to_factors(d1);
d2->more = apop_data_copy(apop_data_get_factor_names(d1));
apop_data_to_factors(d2);

See the documentation for apop_logit for a sample linear model using a factor dependent variable and dummy independent variable.
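
To illustrate why a shared factor table keeps two data sets consistent, here is a small sketch in plain C. The toy_ names are hypothetical, and the sketch numbers categories in order of first appearance, which is an assumption made for brevity and may differ from the ordering the library uses.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CATEGORIES 16

/* Hypothetical stand-in for the <categories for ...> page: entry k holds
   the text label that factor k stands for. */
typedef struct {
    const char *labels[TOY_MAX_CATEGORIES];
    size_t count;
} toy_factor_table;

/* Return the factor for label, adding it to the table if it is new.
   Returns -1 if the table is full. */
int toy_factor_of(toy_factor_table *table, const char *label){
    for (size_t k = 0; k < table->count; k++)
        if (!strcmp(table->labels[k], label)) return (int)k;
    if (table->count == TOY_MAX_CATEGORIES) return -1;
    table->labels[table->count] = label;
    return (int)table->count++;
}

/* Convert a column of text to factors, using and extending table. */
void toy_to_factors(toy_factor_table *table, const char **column,
                    size_t n, int *factors_out){
    for (size_t i = 0; i < n; i++)
        factors_out[i] = toy_factor_of(table, column[i]);
}
```

Running toy_to_factors on a first column {"Male", "Female", "Male"} and then, with the same table, on a second column {"Female", "Female", "Male"} assigns "Female" the same number in both, which is the point of copying the factor page from d1 to d2 before converting d2.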