Patterns in static

Apophenia

Input text file formatting

This reference section describes the assumptions made by apop_text_to_db and apop_text_to_data.

Each row of the file will be converted to one record in the database or one row in the matrix. Values on one row are separated by delimiters. Fixed-width input is also OK; see below.

By default, the delimiters are set to "|,\t", meaning that a pipe, comma, or tab will delimit separate entries. To change the default, use an argument to apop_text_to_db or apop_text_to_data like .delimiters=" \t" or .delimiters="|".

The input text file must be UTF-8 or traditional ASCII encoding. Delimiters must be ASCII characters. If your data is in another encoding, try the POSIX-standard iconv program to filter the data to UTF-8.

  • The character after a backslash is read as a normal character, even if it is a delimiter, #, or ".
  • If a field contains several such special characters, surround it with "s. The surrounding marks are stripped and the text is read verbatim.
  • Text does not need to be delimited by quotes (unless there are special characters). If a text field is quote-delimited, I'll strip the quotes. E.g., "Males, 30-40", is an OK column name, as is "Males named \"Joe\"".
  • Everything after an unprotected # is taken to be comments and ignored.
  • Blank lines (empty or consisting only of white space) are also ignored.
  • If you are reading into the gsl_matrix element of an apop_data set, all text fields are taken as zeros. You will be warned of such substitutions unless you set apop_opts.verbose=0 beforehand. For mixed text/numeric data, try using apop_text_to_db and then apop_query_to_mixed_data.
  • There are often two delimiters in a row, e.g., "23, 32,, 12". When it's two commas like this, the user typically means that there is a missing value and the system should insert a NaN; when it is two tabs in a row, this is typically just a formatting glitch. Thus, if there are multiple delimiters in a row, I check whether the second (and each subsequent one) is a space or a tab; if it is, then it is ignored, and if it is any other delimiter (including the end of the line) then a NaN is inserted.
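The repeated-delimiter rule in the last item can be sketched in a few lines of C. This is only an illustration of the rule as described, not Apophenia's own parser: the function name split_numeric and its signature are made up for this example, and it ignores quoting, backslashes, comments, and nan_string handling.

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: nonzero if c is one of the delimiter characters. */
static int is_delim(char c, const char *delims) {
    return c != '\0' && strchr(delims, c) != NULL;
}

/* Hypothetical sketch: split one line into doubles, writing at most max_out
   values to out and returning the number written.
   A space or tab next to another delimiter is ignored; any other repeated
   delimiter, or a delimiter followed by the end of the line, yields a NaN. */
size_t split_numeric(const char *line, const char *delims,
                     double *out, size_t max_out) {
    size_t n = 0;
    const char *p = line + strspn(line, " \t");
    if (*p == '\0') return 0;            /* blank lines are ignored */

    while (n < max_out) {
        p += strspn(p, " \t");           /* blanks before a field are skipped */
        if (*p == '\0' || is_delim(*p, delims)) {
            out[n++] = NAN;              /* nothing between two delimiters */
        } else {
            char buf[64];
            size_t len = strcspn(p, delims);
            size_t copy = len < sizeof buf - 1 ? len : sizeof buf - 1;
            memcpy(buf, p, copy);
            buf[copy] = '\0';
            out[n++] = strtod(buf, NULL);   /* non-numeric text becomes 0 */
            p += len;
        }
        p += strspn(p, " \t");           /* blanks after a field are skipped */
        if (*p == '\0') break;
        if (is_delim(*p, delims)) p++;   /* consume one non-blank delimiter */
    }
    return n;
}
```

With the default delimiters "|,\t", this sketch reads "23, 32,, 12" as four values with a NaN in the third position, while two tabs in a row produce no extra field.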

If this rule doesn't work for your situation, you can explicitly insert a note that there is a missing data point. E.g., try:

perl -pi.bak -e 's/,,/,NaN,/g' data_file

If your file uses a particular string to mark missing data, you will need to set apop_opts.nan_string to text that matches it. E.g.,

//Apophenia's default NaN string, matching NaN, nan, or NAN, but not Nancy:
apop_opts.nan_string = "NaN";
//Popular alternatives:
apop_opts.nan_string = "Missing";
apop_opts.nan_string = ".";
//Or, turn off nan-string checking entirely with:
apop_opts.nan_string = NULL;

SQLite stores these NaN-type values internally as NULL; that means that functions like apop_query_to_data will convert both your nan_string string and NULL to NaN.
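The comment above says the default string matches NaN, nan, or NAN, but not Nancy. One reading of that is a case-insensitive comparison against the whole field. The sketch below shows that reading; the function name matches_nan_string is hypothetical and the code is an assumption about the behavior, not the library's implementation.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical sketch: returns 1 if field equals nan_string, ignoring case,
   and 0 otherwise. A NULL nan_string means checking is turned off. */
int matches_nan_string(const char *field, const char *nan_string) {
    if (nan_string == NULL) return 0;
    while (*field != '\0' && *nan_string != '\0') {
        if (tolower((unsigned char)*field) !=
            tolower((unsigned char)*nan_string))
            return 0;
        field++;
        nan_string++;
    }
    /* Both strings must end together, so "Nancy" does not match "NaN". */
    return *field == '\0' && *nan_string == '\0';
}
```

Under this reading, a field of "nan" or "NAN" matches the default "NaN", "Nancy" does not, and nothing matches once nan_string is NULL.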

  • The system uses the standards for C's atof() function for floating-point numbers: INFINITY, -INFINITY, and NaN work as expected.
  • If there are row names and column names, then the input will not be perfectly square: the row of column names should have no entry above the column of row names. That is, for a 100x100 data set with row and column names, there are 100 names in the top row, and 101 entries in each subsequent row (a name plus 100 data points).
  • White space before or after a field is ignored. So 1, 2,3, 4 , 5, " six ",7 is equivalent to 1,2,3,4,5," six ",7.
  • NUL characters ('\0') are treated as white space, so if your fields have NULs as padding, you should have no problem. A NUL inside a string terminates the string, as it always does in C.
  • Fixed-width formats are supported (for plain ASCII encoding only), but you have to provide a list of field ending positions. For example, given
    NUMLEOL
    123AABB
    456CCDD
    and .field_ends=(int[]){3, 5, 7}, we have three columns, named NUM, LE, and OL. The names can be read from the first row by setting .has_col_names='y'.
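To make the meaning of the field ending positions concrete, here is a minimal sketch that cuts one fixed-width line into fields. It assumes each entry of field_ends is the offset just past the last character of a field, which is consistent with the {3, 5, 7} example above. The function fixed_width_field is hypothetical and is not part of Apophenia.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: copy field number idx (0-based) of a fixed-width line
   into out, given the list of field ending positions.
   Returns the field length, or -1 on a bad index or empty buffer. */
int fixed_width_field(const char *line, const int *field_ends, size_t n_fields,
                      size_t idx, char *out, size_t out_size) {
    if (idx >= n_fields || out_size == 0) return -1;

    size_t line_len = strlen(line);
    /* A field starts where the previous one ended; the first starts at 0. */
    size_t start = idx == 0 ? 0 : (size_t)field_ends[idx - 1];
    size_t end = (size_t)field_ends[idx];

    if (end > line_len) end = line_len;       /* short line: take what is there */
    if (start > end) start = end;

    size_t len = end - start;
    if (len >= out_size) len = out_size - 1;  /* truncate rather than overflow */
    memcpy(out, line + start, len);
    out[len] = '\0';
    return (int)len;
}
```

Applied to the sample above with ending positions 3, 5, and 7, the header line NUMLEOL splits into NUM, LE, and OL, and the line 123AABB splits into 123, AA, and BB.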