4 April 13.

Emulating others and fat-tailed distributions

[PDF version]

I finally posted my paper on leptokurtic (fat-tailed) outcomes to arXiv. I started writing it as part of my dissertation, so it took over a decade to put out. I'm still not entirely happy with it--it's two subpapers that don't meet as well as I want them to:

  • The first subpaper gathers the literature in finance about how returns are leptokurtic and that people in finance have real incentives to emulate each other.
  • The second subpaper writes down a simple model where agents get direct benefit from emulating each other--their utility function is increasing in a private valuation of an action and in the percentage of others who are acting--and finds that leptokurtic distributions result.

I've always thought that emulation is such a blatantly clear facet of human behavior, and was struck by how the economics literature largely ignores it. Your typical econ textbook starts off by saying that agents have a utility function which we will take as the given starting point, making as few assumptions as possible about what's in that function. If people have a strong preference for giving away their money, the textbook advises that our only job as economists is to model the implications of that preference. But then we go to the journals, and we find a rather narrow range of utility functions assumed. They have to be out there somewhere, but I never found a paper in a mainstream econ journal that was willing to assume that agents get direct benefit from emulating others.

Nonetheless, emulative behavior is fundamental to being human: mothers emulate their babies, who emulate their mothers; con men and magicians devise tricks that depend on how it can sometimes take effort to not emulate another person; cliques of people who grow to behave similarly are abundant, and not just in high school; people pay good money for fashion magazines that instruct the reader on how to dress like everybody else--even what underwear is fashionable.

We can sometimes explain away certain aspects of emulative behavior as really being something more direct. For example, the actions of others imparts information, and gathering information is expensive. So if somebody else buys product A , that should provide information to you that product A is better. Of course, this works iff the other party has information, and isn't just imitating somebody else or making a random draw. There are network externalities, wherein a product you buy is worth more when there are others on the same network or using the same standard, which produce a narrowly-rational motivation to emulate others.

Depending on the assumptions you want to make, the outcomes from the emulation-for-information or emulation-for-network story can be equivalent to the outcomes from the direct emulation story. The only difference is that those stories are special cases, and so applicable to fewer situations.

The broad intended lesson of my arXiv paper is that when you see a fat-tailed distribution, or a widening gap between the winners and losers, then a social emulation story should pop into your mind. Emulation causes a bit more push to the extreme outcomes--if, without any emulation, private tastes would lead to 65% taking some action, we can expect that if people have some slight preference to emulate the majority, then the same private tastes would lead to maybe 67% taking the action. Apply this to every level of action, and you've transformed your private-tastes distribution to have fatter tails.

There are some obvious cases where the herd-following story already comes up, especially in the case of blockbusters and flops, including movies (see the paper by de Vaney and Walls, cited in my arXiv paper) and cities (which are pretty obviously a network). But there are more subtle cases, and cases where a lot of parties want to pretend that emulation doesn't happen. The stock market is such a case, because none of the heroes in any Ayn Rand novels had any interest in emulating others. As I mention in the paper's conclusion, Congressional roll call votes provide another such case.

Further, here in the real world, equity traders and Congressfolk do have their own opinions, and we can't expect a lot of their positions to be based on pure emulation. Nor does the data demonstrate a bifurcated blockbuster/flop dichotomy in equity returns or vote counts. If we had a simple Central Limit Theorem at play, and people independently took positions, we'd expect a Normal distribution of returns or vote counts. Instead, we see distributions that are bell curves with fatter tails than the Normal. The paper demonstrates that we can explain this spreading-out via an emulative component to the decision process.

To give another example, I just ran across an article that finds a blockbuster-flop pattern in citations. It tries to blame it on an information overload story, but it's also pretty easy to explain via emulative behavior, either due to information gathering or simple fashion.

I found the paper via social means, from the blog of someone who was once a fellow student. He adds a useful point about Gini coefficients (though applying his analysis to the explosion of publications seems to require that we assume that as publication became cheaper, the additional papers that appear are lower-quality and less citation-worthy).

Alexander Schuessler, in his Logic of Expressive Choice (BUY!), is my favorite work that models humans as expressing identity by joining groups and emulating others. His title is an homage to Mancur Olson's Logic of Collective Action, which bangs its head against the wall for the length of a book asking why it is that people exert effort to work for politically-oriented groups when the Prisoner's Dilemma predicts that they'd leave the work to others. Schuessler's response is that if you are willing to allow the possibility that people have some desire to be part of a social group, then there's not much of a paradox left.

There can be other explanations, and the fact that I've constructed a model which assumes emulative behavior and ends in a leptokurtic outcome doesn't prove that (1) emulative behavior must lead to leptokurtic outcomes or that (2) leptokurtic outcomes are always caused by emulative behavior. But it does demonstrate that the causal story that emulation fat-tailed outcomes is a possibility for a lot of contexts.

The other thing I hope readers get out of the paper is that agent-based models can be used to generate distributions, which we can then do statistics on as with any other distribution. I'm working on a paper that goes into immense detail on this point, and I'll try to post at least one example here over the next month.



[link][no comments]


22 March 13.

My coworkers

[PDF version]

I work with lots of statisticians. Here are a few, all of whom I recommend following:

Jerzy Wieczorek: Jerzy's blog is a pretty generalist statistics blog. He gets around the city, and often writes about meetups and other such events--he's how I found out about the recent data drives at the World Bank. He's also a twitter, whom you can follow @civilstat. I'm @b__k, btw.

Chris Blakely: he does time series, which is a somewhat foreign world to me. The methods are more oriented toward Fourier decompositions than fitting the parameters of bell curves. Chris has put a lot of work into writing new computing tools for the application of time series methods, which he applies to financial series. Even his diagrams are pretty awesome.

Josh Tokle: he is more a mathematician and computer geek than a statistician. You can tell because he wrote the blog in a blogging platform you've never heard of, though his first few posts have been on statistical topics. His diagrams are a awesome in a diametrically different way from Chris's.


[link][no comments]


19 January 13.

Frequentist, Bayesian, and why I don't care

[PDF version]

Really intelligent people have made attempts to formalize the intuitive definition of odds/chance-of/probability/likelihood for a few centuries now. We have no new data: what real-world evidence on the link between our intuition about odds and the real world do we have that Pascal didn't have in the 1600s? The entropy angle is interesting, because we've only had the concept for, I dunno, 150 years, and information entropy for not even a century.

When there's a major problem that thousands of people possessing full information have written about in great detail, and it's still unresolved, that's when we conclude that it's a question of philosophy. It can be resolved in the sense that philosophical questions can be resolved, wherein distinct parties come up with internally consistent and personally satisfactory storylines and there is no objective test to determine which storyline is objectively correct.

Articles that talk about Bayesians and Frequentists often remind me of those comedians with bits about how Black people, they do it like this, .... But then White people, they do it like this: .... I end up wondering if the people being made fun of really exist here in the real world.

A lot of attacks on Frequentists are really attacks on oversimplified undergrad stats textbooks, or seat-of-the-pants statisticians who do things that seem to work but have no serious theoretical basis. Here's a review of a book by Nate Silver, written by a non-statistician, who got the impression from Silver's writing that Frequentists do nothing but produce inferior research. Using his commonsense prior knowledge, he concludes that Silver is beating up a straw man that is unlikely to be representative of real Frequentists.

Characterizing a class of people based on the subset that is least paying attention is fun for the feelings of superiority, but is otherwise a waste of time. Here's an essay (PDF) where Jayne says that he's done with Frequentists, pointing to certain people who had trouble understanding how likelihood functions relate to states of nature and calling them “handicapped”. Some day, Bayesian methods will be common enough that there are oversimplifying textbooks that provide recipes for users to misapply, and software that advertises itself as being so easy that even an undergrad who knows nothing of statistics can get Bayesianly correct estimates without knowing a lick of statistics. At which point we'll have essays about how Bayesians don't have a deep understanding of statistics.

On models

A model intermediates between data and parameters. We typically express a model in terms of a probability/likelihood function, which I will write as P(d, β) , where d is a data element (typically scalar or matrix, typically known), β is a set of parameters (typically a vector, typically unknown), and the output to P(d, β) is a probability/likelihood, a single nonnegative scalar.

At this point, you might want to go and read this entry on probability versus likelihood, which explains that the function P(d, β) expresses both subjective and objective components in a single joint distribution, and why I have to finesse the question of whether this is a probability or a likelihood--it's both. I think that if you are comfortable with how P(d| β) is verifiable and (relatively) objective, while P(β| d ) is subjective and (in most cases) can not be verified, then you have a handle on the crux of the debate.

Having gotten the caveats about stereotyping out of the way, my working definition is that Bayesians see P(d, β) as a fundamentally subjective expression, but strive to rely as much as possible on an objective probabilistic framework; Frequentists see P(d, β) as fundamentally objective, but acknowledge and strive to accommodate subjective elements where needed.

Defining the model

Isn't the distinction between Bayesians and everybody else right in the title--that one team uses Bayes's rule and everybody else doesn't? Bayes's rule has a one-line proof, and your typical stats/probability textbook covers it by page twenty. To not believe in Bayes's rule is equivalent to not believing basic standard probability theory. Models that don't explicitly use Bayes's rule at some point over the course of their typical description can frequently still be restated to make the underlying rule more apparent; explicit use of Bayes's rule is often computationally expensive, so Bayesian models are frequently reshaped to a version with no mention of Bayes's rule.

I found this article by David Draper (PDF, 95 slides) to be a nice summary of the descriptive-side situation. The first 25pp give an overview of the efforts to formalize the problem of defining a model and separating the objective from the subjective, listing the many conditions that are required to really make for a consistent theory. It points out that Bayes's rule is essential to consistency, but then turns around and seems to imply that only models that explicitly use Bayes's rule are going to be consistent, which is clearly not true (and I'm guessing/hoping that DD would acknowledge this).

Grammar tip: possessives in English end in an apostrophe-s combination. As an exception, this 's is omitted for plurals ending in s, such as the pedestrians' crossing. The exception about plurals does not apply to words that are not plural but happen to end in s. See rule #1 in Strunk and White's Elements of Style for details.

We all want our models to lean to the objective side; how do we make that happen? Here we get to the rift between the different schools. The Frequentists lean heavily on the CLT and other such objectively-proven and subjectively-applied theorems. Sometimes the CLT is preeminently defensible (as alluded above, typically back in the world of controlled experiments on non-living things); sometimes it's an irritating refusal to really accept the ambiguity of the situation.

Central limit theorems, when applicable, work. If elements really have a fixed probability of leaving the set at any given time, then lifespans really do follow a Poisson distribution. Inquiries in the physical sciences typically have a derivable distribution like these built in to the controlled experiment, while in the social sciences we typically have no idea what we're doing, and can't control anything in the data. In both the physical and social sciences, the model is a human-imposed structure on the data, but we can already see that in some situations people will be more inclined to label the final statements about parameters as objective and in some will be inclined to call them subjective. The reader will note that Bayesians tend to hang out in the social sciences.

The Bayesians point out that there is still subjective information to be had, and maybe we can codify it all into a prior distribution. OK, great, you've concentrated all your subjectivity into one place. This forces a certain format on the model, and makes you no more or less objective than you were before, but does make it easy to incorporate correctly-codified additional information.

The typical Bayesian setup involves using a prior distribution, which is allegedly subjective auxiliary information, and a likelihood distribution, which is often described as the real model. This core-plus-auxiliary terminology creates a lot of trouble, and life is easier when we take the entire pipeline, prior plus likelihood, as the unit that we call a model. This is how Apophenia does it: the apop_update function returns a single model estimated using the data and setup you provided. Now we're back where we were with the Frequentists: a single model that expresses what information we have and what beliefs we have about the structure of the model, including some portion of its inner workings that is clearly subjective and some portion that is more clearly objective. Any claim that we could somehow componentize the objective and subjective is chimerical anyway.

There are people who take the principle of maximum entropy to be an objective rule about the universe, which has been observed to hold many times over. I have a coworker who gets really pissed off when anybody mentions the principle of maximum entropy, and thinks of any work using it as subjective bunk. The maxEntropists are making another attempt to confront the problem that we have too little information and have to fill it in with some sort of assumption, and having used a different principle, wound up with a different final structure to the model.

The descriptive models of the different schools embody subjective information in different ways. But in the end, they all have the same form of an assumed structure P(d, β) .

The inferential step

To this point, I've been talking about descriptive statistics: the art of combining objective facts and subjective modeling decisions into a single final model, then estimating its parameters from the data. Bayesians, Frequentists, maxEntropists tend toward different models, that each feels is on the better side of the objective-subjective scale. Then we have the inferential step, where we take the model's final distribution describing the parameters, P(β| d ) , and try to say something interesting about β . A lot of ink is spilled at this point, about how much one should rely on certain intervals and whether to call them confidence intervals or credible intervals, and how much one can use a point on the distribution, like P(β = 3| d ) and what you would name such a thing. This all began thanks to a blog post by a coworker on the different schools; his post focuses on the interpretation of the inference step. Also, we should again remember to distinguish between human Frequentists and crappy undergrad stats textbooks.

I do like to make one quibble about language: when an author says something like `given the data, we fail to reject the null hypothesis with 91% certainty', a key element is left out; this should read `given the data and the model, we fail to reject the null hypothesis with 91% certainty'. We developed a model, and every statement we make from that point on will rely on that model and its subjective and objective foundations. Careful authors tend to get this right, undergraduate stats textbooks never do.

Quibbling after that point is largely a restatement of the same philosophy of science question from the descriptive step: the probability distribution of a parameter is fundamentally unobservable in most of the situations we're dealing with, so how do we express that we've imposed a model and are stating probabilities using this invention of ours, and how do those invented probabilities relate to reality?

Summary sentence: people have trouble dealing with things that are not easily categorized, and P(d, β) is fundamentally a combination of observable, unobservable, subjective, and objective. This is conceptually difficult--I think I've used the term mind-blowing before. Among the reasonable people, some (which I have called Bayesians) class models as subjective and then from that point reach toward saying something objective; some (what I've called Frequentists) class models as fundamentally objective, but acknowledge and strive to accommodate subjective aspects. They eventually meet in the middle, though their models will have different forms expressing the differences in perspective.


[link][no comments]


16 January 13.

In memory and on-disk databases for SQLite

[PDF version]

This is mostly a housekeeping entry on Apophenia, but I'll start with a useful trick for speeding up dealings with SQLite. Next time, some statistical epistemology.

SQLite is exactly as much as you need to produce and store data tables and deal with them using SQL. It stores the database in a single file on disk or in memory. Naturally, the in-memory database will be faster, though as soon as your program exits, your database is wiped.

Here's how I use SQLite, where possible: I open a database in memory, and use that as the primary database. To get to the data on my on-disk database, I use SQL's attach command. So I'm opening the base database using the C-side databse opening command, and then attaching an auxiliary database on the SQL side. This is why Apophenia has only one global database handle, and has never seriously needed more.

There may be scale issues, but they're easily manageable, because we generally have a reasonable handle on how big any given table is going to be, and we can create scratch tables guaranteed to be much less than a gigabyte in memory, and potentially larger tables with a name like diskdb.bigtable, where diskdb is the attached on-disk database.

If you want more exposition, I detail this sort of thing in both Modeling with Data and 21st Century C (one of the few overlaps of the two books).

Some functions for merging

If you need a (not-too-large) table from the disk, you can copy it to the database without much effort; if you have a table in memory that needs to be saved for posterity, you can copy it to the disk. E.g.

create table diskdb.results as
select * from results

I pictured the workflow wherein the user opens an in-memory database, copies up the entire on-disk database, and then gets to work, as being much more useful than it turned out to be. So I wrote a few functions to make it happen, and put them into Apophenia in 2005 (whoa--eight years ago). They just sat around, without serious testing or use, to the best of my knowledge.

So I trimmed them from Apophenia. If you still find them to be useful, here they are. The documentation is in Doxygen format. Apophenia has its own means of implementing C-standard variable argument lists, so I removed that part, leaving fixed-argument functions.

/** Merge a single table from a database on the hard drive with the database currently open.

\param db_file	The name of a file on disk. 
\param tabname	The name of the table in that database to be merged in. 
\param inout  Do we copy data in to the currently-open main db [\c 'i'] or out to the specified auxiliary db[\c 'o']? 

If the table exists in the new database but not in the currently open one, then it
is simply copied over. If there is a table with the same name in the currently open
database, then the data from the new table is inserted into the main database's table
with the same name. [The function just calls <tt>insert into main.tab select * from
merge_me.tab</tt>.]
*/
void apop_db_merge_table(char *db_file, char *tabname, char inout){
    Apop_assert_n(tabname, "I need a non-NULL tabname");
    char maine[] = "main";
    char merge_me[] = "merge_me";
    char *from = inout == 'i' ? merge_me : maine;
    char *to = inout == 'i' ? maine : merge_me ;
	if (db_file !=NULL)
		apop_query("attach database \"%s\" as merge_me;", db_file);
	int d = apop_query_to_float("select count(*) from %s.sqlite_master where name == \"%s\";", to, tabname);
	if (!d){	//just import table
		Apop_notify(2, "adding in %s", tabname);
		apop_query("create table %s.%s as select * from %s.%s;", to, tabname, from, tabname);
	}
	else	{			//merge tables.
		Apop_notify(2, "merging in %s", tabname);
		apop_query("insert into %s.%s select * from %s.%s;", to, tabname, from, tabname);
	}
	if (db_file !=NULL)
		apop_query("detach database merge_me;");
}

/** Merge a database on the hard drive with the database currently open.

\param db_file	The name of a file on disk. 
\param inout  Do we copy data in to the currently-open main db [\c 'i'] or out to the specified auxiliary db[\c 'o']? 

If a table exists in the new database but not in the currently open one, then it is
simply copied over. If there are  tables with the same name in both databases, then
the data from the new table is inserted into the main database's table with the same
name. [The function just calls <tt>insert into main.tab select * from merge_me.tab</tt>.]

\li This is SQLite-only; I'm not sure if it really makes much sense for mySQL.
*/
void apop_db_merge(char *db_file, char inout){
    apop_data *tab_list;
	apop_query("attach database \"%s\" as filedb;", db_file);
	tab_list= apop_query_to_text("select name from %s.sqlite_master where type==\"table\";"
              , inout == 'i' ? "filedb" : "main" );
    if(!tab_list) return; //No tables to merge.
	for(int i=0; i< tab_list->textsize[0]; i++)
		apop_db_merge_table(db_file, (tab_list->text)[i][0], inout);
	apop_query("detach database filedb;");
	apop_data_free(tab_list);
}

For bonus points, here is a command-line program making use of the above.

/** \file cmd_apop_merge_dbs.c	A little command-line utility to merge
 two databases. Try <tt>apop_merge_dbs -h</tt> for help.*/

/* Copyright (c) 2005--2007, 2009 by Ben Klemens.  Licensed under the modified GNU GPL v2; see COPYING and COPYING2.  */

#include <apop.h>
#include <unistd.h>

int main(int argc, char **argv){
    int  merge_ct = 0;
    char c, msg[1000], **merges = NULL;
	sprintf(msg, "%s [-v] [-t table_name] [-t next_tabname] main_db.db db_to_merge_into_main.db\n"
			     "   -t\ttable to merge. If none, do all in the source db. Use as many as you'd like.\n"
			     "   -v\tverbose\n", argv[0]); 
	if(argc<3){
		printf("%s", msg);
		return 1;
	}
	while ((c = getopt (argc, argv, "vht:")) != -1){
		switch (c){
		  case 'v':
			apop_opts.verbose	++;
			break;
		  case 'h':
			printf("%s", msg);
			return 1;
          case 't':
            merges = realloc(merges, sizeof(char*)*merge_ct++);
            merges[merge_ct-1] = optarg;
		}
	}
	apop_db_open(argv[optind]);
    if (merge_ct)
        for (int i=0; i< merge_ct; i++)
            apop_db_merge_table(argv[optind +1], merges[i]);
    else
        apop_db_merge(argv[optind +1]);
	apop_db_close('n');
}


[link][no comments]


16 January 13.

Raking with structural zeros

[PDF version]

It's popular to set certain cells to zero, thus producing a not-quite-rectangular grid of values. The popular example around my workplace is that people under 15 can't be married, so we want our grid of options to exclude those cells for (14, married), (14, divorced), ... (0, married), (0, divorced), but to be otherwise square.

If you have a row in the table of margins that doesn't meet the constraint, then it gets thrown out as bad data, by the way. I'm not sure if there's a better way to deal with such things; feel free to leave your suggestions in the comments.

Here's an example, using almost the same synthesis problem as before, but w/o the NaN trick this time, and with extra data at (3, 1) to keep things interesting.

Apophenia uses a lot of SQL on the back end, so it's natural to express the structural zeros via SQL. In this case, we have only one structural zero, at row=1 and col=1.

apop_text_to_db -O  -d="|" '-' margins sample.db <<"----------"
row | col | weight
  1 |  1 |   2.5
  1 |  2 |   2.5
  2 |  1 |   7.5
  3 |  1 |   7.5
  2 |  2 |   7.5
----------

cat <<"----------" > rake.c
#include <apop.h>

int main(){
    apop_db_open("sample.db");
    apop_data_show(
        apop_rake(.margin_table="margins", .count_col="weight",
            .contrasts=(char*[]){"row", "col"}, .contrast_ct=2,
            .structural_zeros="row=1 and col=1"
    ));
}
----------

export CFLAGS="-g -Wall -O3 `pkg-config --cflags apophenia`"
export LDLIBS="`pkg-config --libs apophenia`" CC=c99
make rake
./rake



[link][no comments]


8 December 12.

Raking to complete missing data

[PDF version]

In this episode, we start with a data set with some complete observations and some observations with certain dimensions missing, and generate a PMF for the values of the complete data that accommodates all of the information in the partially-observed observations.

Further, we're going to do this using the same off-the-shelf raking routine as used in the last few episodes to adjust one data set toward another and to generate synthetic data.

Last timeLast time, when I showed you how we can synthesize data via raking, I presented a little trick to specify the table of margins, by including margins with missing values (represented by the IEEE NaN marker). If I have this data--

X Y Z count
NaN 2 3 10
1 2 3 10
1 NaN 3 10

--then the (Y, Z) margin of (2, 3) has total weight of 20 on it, as does the (X, Z) margin of (1, 3) . For synthesis, the trick of specifying NaNs but summing to not-NaN margins let me specify margins directly, rather than thinking about how I could invent a data set that happens to have the right margins.

This time, we'll read the NaNs as missing data. Having NaNs mixed in all over like this is formally known as a non-monotone missingness pattern, but I prefer the term Swiss cheese data. Simply summing the observations to margins produces a set of margins that includes the observed information, including what information is embodied in partially observed data. But we don't just want margins, we want the probabilities for specific cells.

We can use the complete data as a starting point, and rake to find the closest data set consistent with the complete information in the margins. The final result can be proven to have some desirable properties; see Little & Rubin for details (Amazon page). They view the problem as a Bayesian updating problem, using the information in the partially-observed observations to update the distribution of the fully-observed cells, and they arrive at the same raking algorithm here. Note, by the way, that their text uses only individual contrasts (in the notation here, .contrasts=(char *[]){"X", "Y", "Z"}) but nothing keeps us from using higher margins as well, like .contrasts=(char *[]){"X|Y", "X|Z", "X|Z"}. Doing so would allow us to retain more cross-dimension information than their treatment.

Here's the cut-n-pasteable sample code. It's longer than before (and more Apophenia-specific) because I added a little function to make draws from the resulting completed table. That function estimates a model from the data set, via apop_estimate(d, apop_pmf), then draws rows of data from the PMF. The rows get written to a new data set that the function allocates and fills.

Also, there's a weighting step, currently commented out, which I'll discuss below.

apop_text_to_db -O  -d="|" '-' cheese nans.db <<"----------"
row | col | weight
  1 |  1 |   5
  1 |  2 |   5
  2 |  1 |   5
  2 |  2 |   5
  1 |  nan |   5
  2 |  nan |   15
  nan |  1 |   15
  nan |  2 |   5
----------


cat <<"----------" > rake.c
#include <apop.h>

apop_data* make_draws(apop_data *d, int count, gsl_rng *r){
    apop_data *draws=apop_data_alloc(count, d->matrix->size2);
    apop_model *m = apop_estimate(d, apop_pmf);
    for (int i=0; i< count; i++){
        Apop_row(draws, i, onerow);
        apop_draw(onerow->data, r, m);
    }
    apop_data_show(draws);
    return draws;
}

int main(){
    apop_db_open("nans.db");
    apop_table_exists("init", 'd');
    apop_query("create table init as select * from cheese "
               "where row + col is not null");
    apop_data *filledin= apop_rake(.margin_table="cheese", .count_col="weight",
            .contrasts=(char*[]){"row", "col"}, .contrast_ct=2,
            .init_table="init", .init_count_col="weight",);
    //gsl_vector_scale(filledin->weights, 1./apop_vector_sum(filledin->weights));
    apop_data_show(filledin);

    gsl_rng *r = apop_rng_alloc(342134);
    make_draws(filledin, 20, r);
}
----------

export CFLAGS="-g -Wall -O3  `pkg-config --cflags apophenia`"
export LDLIBS="`pkg-config --libs apophenia`" CC=c99
make -s rake && ./rake

One detail about the setup above: the totals might not be right. Say that our contrasts were Z and X| Y ; then the first contrast has a total not-NaN weight of 30, and the second has a total not-NaN weight of 10. The raking algorithm would thus rescale the table to sum to weight 30, then rescale to sum to weight 10, back and forth ad nauseum. The test for convergence only occurs at the end of a full round of raking, so the algorithm can still halt, but the total weight in the table will be determined by that last contrast. In this case, we'd have a total weight of ten; if we had ordered the contrasts to be X| Y then Z , we'd have weight 30. One option would be to add a reweighting step at the end of the sample code above, which I left in as a comment. It's unnecessary here because if the data gets treated as a PMF, the total weight of the vector of weights doesn't matter anyway. For other uses, you might want the scaling explicitly done.


[link][no comments]


25 November 12.

Synthesis via raking

[PDF version]

I'd promised you some raking for missing data, but I'm putting that off for one more episode.

Last time ILast time showed you a table--only the margins mattered, so I'll just print those:

- - 5
- - 15
10 10  

Apophenia's raking setup doesn't take in pure margins like this; instead it wants a data set giving the value for eack cell in the table, that it will then sum into margins. So what should we fill in for the dashes? We just want something that matches the margins, and the exact values don't matter. If (1, 1, 1, 1) somehow fit, we'd run with it.

Of course, that question is exactly the raking question all over again: what is the closest table to the all-ones table that matches the given margins?

For today's sample code, let's make it happen. But before presenting the code, let me point out a trick I used to express the margins. Let us say that we have these three cells in the table:
(1, 1) = 0 ,
(1, 2) = 0 ,
(1, NaN) = 10 .
Then the sum for row 1 is ten. Also, if a cell is zero in the initial setup, and raking only re-scales cells, then that cell will remain zero throughout the procedure--zero times anything is zero. So if we start with an initial data set with no NaNs, then the (1, NaN) cell in the margin table is a sort of phantom cell that never gets used, and only affects the margin totals. Thus, the initial data listing sets the margins as we wish via the observations with NaNs in them, and the data cells themselves [like (1, 1) = 0 ] are just filler so the procedure doesn't get confused.

I need the NaN trick because my premise is that we don't have an already-extant data set that would produce the margin table. IRL, we typically have such a data set, so synthesis consists of finding the closest data set to the all-ones set that fits the original margins as written. At the end of the procedure, you have data that encapsulates the info from the margins of the margin table, but didn't use the individual cell values at all--synthetic data.

Here's the sample code for the NaN-using scenario, which should be reasonably legible. Again, if you have the requisites installed, you can cut/paste it to the command line and watch it go.

apop_text_to_db -O  -d="|" '-' margins na.db <<"----------"
row | col | weight
  1 |  1 |   0
  1 |  2 |   0
  2 |  1 |   0
  2 |  2 |   0
  1 |  nan |   5
  2 |  nan |   15
  nan |  1 |   10
  nan |  2 |   10
----------

cat <<"----------" > rake.c
#include <apop.h>

int main(){
    apop_db_open("na.db");
    apop_query("create table init as select row, col, 1 as val "
               "from margins where row is not null and col is not null");
    apop_data_show(
        apop_rake(.margin_table="margins", .count_col="weight",
            .contrasts=(char*[]){"row", "col"}, .contrast_ct=2,
            .init_table="init", .init_count_col="val",)
    );
}
----------

export CFLAGS="-g -Wall -O3  `pkg-config --cflags apophenia`"
export CC=c99 LDLIBS="`pkg-config --libs apophenia`"
make rake
./rake

Many dimensions

So far, I've been showing you only two-dimensional data, because that's what's easy to display on a static screen. Everything has obvious extension to margins in three, four, or twenty dimensions.

There are two intersting consequences to multiple dimensions, though. The first, which you shouldn't have to care about, is that high-dimensional matrices tend to be very sparse, so processing requires some care. Last time, I mentioned a table with 1.9 billion cells, of which 3 million were nonempty; you want a procedure that can work with only the 3 million cells.

The other is that we might want to fix to the margins not only single dimensions like rows and then columns, but higher-dimensional combinations of elements, such as specifying that all X, Z combinations must match a pre-set total. That is, we might want to hold fixed higher-dimensional slices.

If you are holding fixed a higher-dimensional slices, like X| Z as above, then the lower dimensional slices are also taken care of. If we fix the value of the sums for (X = 1, Z = 1) , (X = 1, Z = 2) , ..., (X = 1, Z = n) , then the total for X = 1 has to be fixed at the sum of these more detailed targets.

The people writing regression-type models like this because it makes sense that if you are regressing on the `interaction' of X cross Z , then you would also want to include separate X and Z terms in there somewhere. Having mentioned that, I'm not going to go into too much detail on the use of raking as simulation of linear methods. I will keep the vocabulary, which uses the term contrast to refer to the X| Z combination.

Using Apophenia's syntax for this stuff (which is pretty similar to others, but for the C's compound literal (char*[]) cruft), we might have a call to the raking function like:

apop_rake(.margin_table="controltab",
        .constrasts=(char*[]){"tract|age|sex", "race|ethnicity|tract"}, 
        .contrast_ct=2);

This is going to generate synthetic data with the right tract| age| sex and tract| race| ethnicity distributions, beginning at the starting point of all ones (the default initial table when none is specified). Because only these two margins were set, we're evidently not worried about the age| race distribution or many other such combinations, which are likely to distort as the other contrasts are set.

At the extreme, if we have a 3-D table, and hold fixed the contrast X| Y| Z , then we are insisting that every cell in the table be fixed to the value given by the margin table. That is, we have zero degrees of freedom. If the margins are somehow inconsistent, then don't expect the raking procedure to converge.

Next time: structural zeros, and the promised missing data imputation.


[link][no comments]