This reference section describes the assumptions made by apop_text_to_db and apop_text_to_data.

Each row of the file will be converted to one record in the database or one row in the matrix. Values on one row are separated by delimiters. Fixed-width input is also OK; see below.

By default, the delimiters are set to "|,\t", meaning that a pipe, comma, or tab will delimit separate entries. To change the default, use an argument to apop_text_to_db or apop_text_to_data like .delimiters=" \t" or .delimiters="|".

The input text file must be encoded as UTF-8 or traditional ASCII. Delimiters must be ASCII characters. If your data is in another encoding, try the POSIX-standard iconv program to convert the data to UTF-8.

The character after a backslash is read as a normal character, even if it is a delimiter, #, or ".
If a field contains several such special characters, surround it by "s. The surrounding marks are stripped and the text read verbatim.
Text does not need to be delimited by quotes (unless there are special characters). If a text field is quote-delimited, I'll strip them. E.g., "Males, 30-40", is an OK column name, as is "Males named \"Joe\"".
Everything after an unprotected # is taken to be comments and ignored.
Blank lines (empty or consisting only of white space) are also ignored.
If you are reading into the gsl_matrix element of an apop_data set, all text fields are taken as zeros. You will be warned of such substitutions unless you set apop_opts.verbose=0 beforehand. For mixed text/numeric data, try using apop_text_to_db and then apop_query_to_mixed_data.
There are often two delimiters in a row, e.g., "23, 32,, 12". When it's two commas like this, the user typically means that there is a missing value and the system should insert a NaN; when it is two tabs in a row, this is typically just a formatting glitch. Thus, if there are multiple delimiters in a row, I check whether the second (and each subsequent one) is a space or a tab; if it is, then it is ignored, and if it is any other delimiter (including the end of the line) then a NaN is inserted.
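
The repeated-delimiter rule above can be sketched in plain C. This is only an illustration of the rule as stated, not Apophenia's parser: parse_numeric_line is a hypothetical helper, it ignores quoting, backslashes, and # comments, and it assumes no delimiter is a character that could appear inside a number.

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of Apophenia: read the numeric fields of
   one line into out[] and return how many values were written (at most max).
   An empty field becomes NaN unless the "extra" delimiter is a space or tab. */
size_t parse_numeric_line(const char *line, const char *delims,
                          double *out, size_t max)
{
    assert(line != NULL && delims != NULL && out != NULL);
    size_t n = 0;
    const char *p = line;

    /* Blank lines hold no fields at all. */
    if (p[strspn(p, " \t")] == '\0') return 0;

    while (n < max) {
        /* Spaces and tabs before a field are skipped, so a space or tab
           that follows a delimiter never opens a new, empty field. */
        p += strspn(p, " \t");

        /* Another delimiter, or the end of the line, where a value was
           expected: record a missing value. */
        if (*p == '\0' || strchr(delims, *p) != NULL) {
            out[n++] = NAN;
            if (*p == '\0') break;
            p++;
            continue;
        }

        /* strtod returns 0 for text it cannot parse, which mirrors the
           "text fields are taken as zeros" behavior for matrices. */
        out[n++] = strtod(p, NULL);
        p += strcspn(p, delims);   /* jump to the next delimiter or the end */
        if (*p == '\0') break;
        p++;                       /* consume the delimiter */
    }
    return n;
}
```

With the default delimiters "|,\t", this sketch reads "23, 32,, 12" as four values with a NaN in the third position, and reads a line with two consecutive tabs between two numbers as just two values.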

If this rule doesn't work for your situation, you can explicitly insert a note that there is a missing data point. E.g., try:

perl -pi.bak -e 's/,,/,NaN,/g' data_file

If your file marks missing data with a specific string, you will need to set apop_opts.nan_string to text that matches the given format. E.g.,

//Apophenia's default NaN string, matching NaN, nan, or NAN, but not Nancy:
apop_opts.nan_string = "NaN";
//Popular alternatives:
apop_opts.nan_string = "Missing";
apop_opts.nan_string = ".";
//Or, turn off nan-string checking entirely with:
apop_opts.nan_string = NULL;

SQLite stores these NaN-type values internally as NULL; that means that functions like apop_query_to_data will convert both your nan_string string and NULL to NaN.
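
To make the "NaN, nan, or NAN, but not Nancy" behavior concrete, here is a minimal sketch that reads it as a case-insensitive comparison of the whole field. That reading is an assumption drawn only from the comment above, and matches_nan_string is a hypothetical name, not an Apophenia function.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the whole field equals nan_string, ignoring
   case. A NULL nan_string means that checking is turned off. */
bool matches_nan_string(const char *field, const char *nan_string)
{
    if (nan_string == NULL) return false;
    assert(field != NULL);

    while (*field != '\0' && *nan_string != '\0') {
        if (tolower((unsigned char)*field) !=
            tolower((unsigned char)*nan_string))
            return false;
        field++;
        nan_string++;
    }
    /* Both strings must end together, so "Nancy" does not match "NaN". */
    return *field == '\0' && *nan_string == '\0';
}
```

Under this reading, a field of "nan" or "NAN" matches the default string "NaN", while "Nancy" does not, and a file that uses "." for missing values needs the nan_string set to ".".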

The system uses the standards for C's atof() function for floating-point numbers: INFINITY, -INFINITY, and NaN work as expected.
If there are row names and column names, then the input will not be perfectly square: the line of column names should have no entry above the column of row names. That is, for a 100x100 data set with row and column names, there are 100 names in the top row, and 101 entries in each subsequent row (name plus 100 data points).
White space before or after a field is ignored. So 1, 2,3, 4 , 5, " six ",7 is equivalent to 1,2,3,4,5," six ",7.
NUL characters ('\0') are treated as white space, so if your fields have NULs as padding, you should have no problem. A NUL inside a string terminates the string, as it always does in C.
Fixed-width formats are supported (for plain ASCII encoding only), but you have to provide a list of field ending positions. For example, given
NUMLEOL
123AABB
456CCDD
and .field_ends=(int[]){3, 5, 7}, we have three columns, named NUM, LE, and OL. The names can be read from the first row by setting .has_col_names='y'.
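
The way a list of field ending positions carves up a line can be sketched as follows. This assumes the positions are cumulative character counts, which is what the example above implies (3, 5, 7 giving widths of 3, 2, and 2); fixed_width_field is a hypothetical helper for illustration, not an Apophenia function.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copy field number idx (counting from 0) of a
   fixed-width ASCII line into buf. ends[] holds cumulative ending positions,
   as in the .field_ends example. Returns 0 on success, -1 on a bad index or
   a buffer that is too small. */
int fixed_width_field(const char *line, const int *ends, size_t n_fields,
                      size_t idx, char *buf, size_t buf_len)
{
    assert(line != NULL && ends != NULL && buf != NULL);
    if (idx >= n_fields) return -1;

    /* A field starts where the previous one ended; the first starts at 0. */
    size_t start = (idx == 0) ? 0 : (size_t)ends[idx - 1];
    size_t end = (size_t)ends[idx];

    /* Short lines yield truncated or empty fields rather than overruns. */
    size_t line_len = strlen(line);
    if (end > line_len) end = line_len;
    if (start > end) start = end;

    size_t len = end - start;
    if (len + 1 > buf_len) return -1;
    memcpy(buf, line + start, len);
    buf[len] = '\0';
    return 0;
}
```

Applied to the sample rows with ends {3, 5, 7}, the header line splits into NUM, LE, and OL, and the line 123AABB splits into 123, AA, and BB.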