Apophenia is an open statistical library for working with data sets and statistical models. It provides functions on the same level as those of the typical stats package (such as OLS, probit, or singular value decomposition) but gives the user more flexibility to be creative in model-building. The core functions are written in C, but experience has shown them to be easy to bind to in Python/Julia/Perl/Ruby/&c.
It is written to scale well, to comfortably work with gigabyte data sets, million-step simulations, or computationally-intensive agent-based models. If you have tried using other open source tools for computationally demanding work and found that those tools weren't up to the task, then Apophenia is the library for you.
The library has been growing and improving since 2005, and has been downloaded over 10,000 times. To date, it has over two hundred functions to facilitate scientific computing, such as:
- OLS and family, discrete choice models like probit and logit, kernel density estimators, and other common models
- database querying and maintenance utilities
- moments, percentiles, and other basic stats utilities
- t-tests, F-tests, et cetera
- several optimization methods, available for your own new models
- It does not re-implement basic matrix operations or build yet another database engine. Instead, it builds upon the excellent GNU Scientific and SQLite libraries. MySQL is also supported.
For the full list, click the index link from the header.
Most users will just want to download the latest packaged version, linked from the "Download Apophenia here" header.
Those who would like to work on a cutting-edge copy of the source code can get the latest version by cutting and pasting the following onto the command line. If you follow this route, be sure to read the development README in the Apophenia directory this command will create.
To start off, have a look at this Gentle Introduction to the library.
The outline gives a more detailed narrative.
The index lists every function in the library, with detailed reference information. Notice that the header to every page has a link to the outline and the index.
To really go in depth, download or pick up a copy of Modeling with Data, which discusses general methods for doing statistics in C with the GSL and SQLite, as well as Apophenia itself. A Cross-paradigm Modeling Framework (PDF) discusses some of the theoretical structures underlying the library.
There is a wiki with some convenience functions, tips, and so on.
Much of what Apophenia does can be done in any typical statistics package. The apop_data element is much like an R data frame, for example, and there is nothing special about being able to invert a matrix or take the product of two matrices with a single function call (apop_matrix_inverse and apop_dot, respectively). Even more advanced features like Loess smoothing (apop_loess) and the Fisher Exact Test (apop_test_fisher_exact) are not especially Apophenia-specific. But here are some things that are noteworthy.
- The text file parser is flexible and effective. Such data files are typically called "CSV files", meaning comma-separated values, but the delimiter can be anything (or even some mix of things), and there is no requirement that text have "special delimiters". There can even be no delimiters at all, as in the case of fixed-width files. Missing data can be specified by a simple blank or a marker of your choosing (e.g., apop_opts.nan_string = "N/A";). If you are a heavy SQLite user, Apophenia may be useful to you simply for its apop_text_to_db function.
- The maximum likelihood system combines a lot of different subsystems into one form: it will do a few flavors of conjugate gradient search, Nelder-Mead Simplex, Newton's Method, or Simulated Annealing. You pick the method by a setting attached to your model. If you want to use a method that requires derivatives and you don't have a closed-form derivative, the ML subsystem will estimate a numerical gradient for you. If you would like to do EM-style maximization (all but the first parameter are fixed, that parameter is optimized, then all but the second parameter are fixed, that parameter is optimized, ..., looping through dimensions until the change in objective across cycles is less than eps), just set Apop_settings_add_group(your_model, apop_mle, .dim_cycle_tolerance=eps).
- The Iterative Proportional Fitting algorithm, apop_rake, is best-in-breed, designed to handle large, sparse matrices.
- As well as the apop_data structure, Apophenia is built around a model object, the apop_model. This allows for consistent treatment of distributions, regressions, simulations, machine learning models, and who knows what other sorts of models you can dream up. By transforming and combining existing models, it is easy to build complex models from simple sub-models.
- For example, the apop_update function does Bayesian updating on any two well-formed models. If they are on the table of conjugates, that is correctly handled, and if they are not, Gibbs sampling produces an empirical distribution. The output is yet another model, from which you can make random draws, or which you can use as a prior for another round of Bayesian updating.
- Of course, it's a C library, meaning that you can build applications using Apophenia for the data-processing back-end of your program. For example, it is currently used in production for certain aspects of processing for the U.S. Census Bureau's American Community Survey.
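The dimension-cycling maximization mentioned in the list above (optimize one parameter with the others held fixed, then the next, looping until the objective changes by less than eps across a cycle) can be sketched in plain C. This is a minimal standalone illustration, not Apophenia's implementation: the names dim_cycle_maximize, maximize_one_dim, and example_objective are hypothetical, the one-dimensional step is a golden-section search over a fixed interval, and the objective is assumed to be unimodal along each coordinate.

```c
#include <math.h>
#include <stddef.h>

/* Objective to maximize, evaluated at the point x of length n. */
typedef double (*objective_fn)(const double *x, size_t n);

/* Hypothetical example objective: a concave quadratic in two coupled
   parameters, with its maximum at (8/3, -10/3). */
double example_objective(const double *x, size_t n){
    (void)n;  /* this example always uses two parameters */
    return -(x[0]-1)*(x[0]-1) - (x[1]+2)*(x[1]+2) - x[0]*x[1];
}

/* Maximize f along coordinate d over [lo, hi] by golden-section search,
   holding every other coordinate of x fixed. */
static void maximize_one_dim(objective_fn f, double *x, size_t n, size_t d,
                             double lo, double hi, double tol){
    const double r = (sqrt(5.0) - 1.0) / 2.0;  /* inverse golden ratio */
    double a = lo, b = hi;
    while (b - a > tol){
        double c = b - r*(b - a);  /* lower interior point */
        double e = a + r*(b - a);  /* upper interior point */
        x[d] = c; double fc = f(x, n);
        x[d] = e; double fe = f(x, n);
        if (fc > fe) b = e;        /* the maximum lies in [a, e] */
        else         a = c;        /* the maximum lies in [c, b] */
    }
    x[d] = (a + b) / 2;
}

/* Cycle through the dimensions until the objective changes by less than
   eps over one full cycle. Returns the number of cycles used, or -1 if
   max_cycles passed without meeting the tolerance. */
int dim_cycle_maximize(objective_fn f, double *x, size_t n,
                       double lo, double hi, double eps, int max_cycles){
    double last = f(x, n);
    for (int cycle = 1; cycle <= max_cycles; cycle++){
        for (size_t d = 0; d < n; d++)
            maximize_one_dim(f, x, n, d, lo, hi, 1e-9);
        double now = f(x, n);
        if (fabs(now - last) < eps) return cycle;
        last = now;
    }
    return -1;
}
```

Starting from x = {0, 0} with bounds [-10, 10], a call to dim_cycle_maximize(example_objective, x, 2, -10, 10, 1e-10, 100) moves x toward the maximum at (8/3, -10/3). Because the two parameters are coupled, a single pass is not enough, which is why the outer loop keeps cycling until the tolerance is met.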
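The conjugate case of Bayesian updating is simple enough to write out by hand, which may help show why the output of an update can serve as the prior for the next round. The sketch below uses the textbook Beta prior with binomial data; it is a standalone illustration that does not call apop_update, and the names beta_params, beta_binomial_update, and beta_mean are hypothetical.

```c
/* Parameters of a Beta(alpha, beta) distribution. */
typedef struct { double alpha, beta; } beta_params;

/* Conjugate update: a Beta prior combined with binomial data (counts of
   successes and failures) gives a Beta posterior whose parameters are
   the prior parameters plus the observed counts. */
beta_params beta_binomial_update(beta_params prior,
                                 unsigned successes, unsigned failures){
    return (beta_params){ prior.alpha + successes, prior.beta + failures };
}

/* Mean of a Beta(alpha, beta) distribution. */
double beta_mean(beta_params p){
    return p.alpha / (p.alpha + p.beta);
}
```

The posterior has the same type as the prior, so a second batch of data is handled by passing the first posterior back in as the new prior. For a uniform Beta(1, 1) prior and 7 successes out of 10 trials, the posterior is Beta(8, 4), with mean 2/3.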
There are several ways to contribute to the project:
- Develop a new model object.
- Contribute your favorite statistical routine.
- Package Apophenia as an RPM, apt, Portage, or Cygwin package.
- Report bugs or suggest features.
- Write bindings for your preferred language, which may just mean modifying the existing SWIG interface file.
If you're interested, write to the maintainer (Ben Klemens), or join the GitHub project.