8 August 16.

Murphy bed projects


People in sitcoms have jobs. They have a routine that allows similar things to happen every week. If people in movies have jobs, the job is an irrelevance mentioned in passing, or they are full-time spy assassin hunters.

I want my work narrative to be about projects rather than a continuous stream of existence with no set ending. People in movies lead more interesting lives.

I've made a real effort to switch entirely to project-oriented thinking, and it does feel better. I sit down to work, and I see a set of finite things to build, not a never-ending slog. Everything (including the admin stuff) is in its own repository to check out as needed.

Murphy bed projects

To give another metaphor: the Murphy beds you find in tiny studio apartments. Once the bed is folded down from the wall, it covers the space and there is nothing to do but be in bed; once the bed is folded back up into its closet, you don't think about the bed at all. I've had the joy of sleeping in a few, and I think they're great.

The metaphor clearly dovetails with the project-centric life. Work on the project until you hit your stopping point, then put up the Murphy bed, pack up the tool box, fold up the tent, or whatever other physical metaphor you want, and move on.

My home directory, conceptual view.

If you leave the cat on the Murphy bed when you fold it up, you will know. The process of being ready to fold everything away forces the discipline of stopping to ask what needs cleaning away, what threads of thought are still open, and what is to be done about it all.

If I'm going to check out the project fresh every morning, I need a makefile describing every setup detail—that's a good thing. It should have a make clean target to throw out the things I know are temp files or that can be regenerated—that's useful. I'm using revision control, which tracks some files and doesn't track others, so I have to decide early whether a given file is important enough to track, and what to do about it if not—that's so much better than my old routine of getting up to my ears in semi-important files and then feeling overwhelmed and shoving them into a temp directory I never look at again.

There's a definite trend toward being able to fold up even the entire virtual computer, which is stored in The Cloud or on your repository of virtual images. Personally, I'm not there yet, because even on an æsthetic level this isn't doing much for me beyond the usual PC-as-server setup that I have. I mean, you're always going to be typing into something. Without The Cloud, I'm going to have the usual home directory and an archive of projects from which I can pull.

My home

What's in my home directory? An archive directory, a temp directory, and that's it.

My home directory, actual view

The archive directory holds all those project repositories, a library of PDFs and data sets, and items from my history.

I'm trying to get rid of the temp directory but can't let go yet. I thought it was weird when Lisa Rost said she used her desktop as her temp directory, but now I see that she makes a lot of sense. At the end of the day, if there are stray temp files in my home, they are blatantly present and I want to destroy them.

As a half-digression, I've taken to keeping one (1) text file with all my little side-notes on all of my projects. The notes use Org mode, which is a common standard for writing outlines. My text editor, vim, with the orgmode plugin, folds the inactive outline segments out of view, so I get the same Murphy bed effect of starting with a near-blank slate, unfolding the current project's notes, and leaving everything else hidden away. This text file of side-notes has massively cut down on my count of stupid two-line text files, and it also serves as my index of all in-progress projects.

So, setup is easy: when I want to work on a project, I open a new terminal (actually, I make a new work session in tmux), git clone a copy from the bare repository in the archive directory, unfold the segment in my notes, go. When I want to shut down the project for the day, rm -rf the directory, close the tmux session, fold away the notes, go make some tea.
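That loop is easy to script. Here is a sketch of the whole cycle; the project name and the archive location are stand-ins (temporary directories, so the sketch runs anywhere), and tmux is left out since it is orthogonal to the git mechanics:

```shell
# Stand-ins for ~/archive and my home directory:
archive=$(mktemp -d)
home=$(mktemp -d)

# One-time setup: a bare repository in the archive to clone from.
git init -q --bare "$archive/myproject.git"

# Morning: check out a fresh working copy.
cd "$home"
git clone -q "$archive/myproject.git"
cd myproject
git config user.email me@example.com && git config user.name me

# ...work: edit, commit, and push everything back to the archive...
echo "today's notes" > notes.txt
git add notes.txt
git commit -q -m "One day's work"
git push -q origin HEAD

# Evening: the archive holds everything, so the working copy is disposable.
cd "$home" && rm -rf myproject
```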

Perhaps you tensed up as much as I do at the part where I rm -rf the directory. In fact, with version control there are several ways to lose data beyond just deleting a file.

  • Yes, I could delete an untracked file.
  • I could have a stash that gets deleted.
  • I could have diffs that I haven't committed.
  • I could have everything committed locally but not pushed to the archive.

So I wrote a script to check for all of these things. People have told me that some DVCS GUIs do some number of these things. It's turned out to be useful to have a command that I can call quickly, because I can call it all the time to check the state of things (even when I have terminal-only access).
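For the record, those four checks can be sketched as a plain shell function. This is a hypothetical reimplementation, not the actual git-isclean script, which also adds the green check marks and the -a cleanup mode:

```shell
# Sketch of the four cleanliness checks; run it from inside a repository.
check_clean() {
    clean=yes
    if [ -n "$(git ls-files --others --exclude-standard)" ]; then
        echo "untracked files present"; clean=no
    fi
    if [ -n "$(git stash list)" ]; then
        echo "stashed changes present"; clean=no
    fi
    if ! git diff --quiet || ! git diff --cached --quiet; then
        echo "uncommitted changes present"; clean=no
    fi
    # @{u} names the upstream branch; the check is skipped if none is set.
    if [ -n "$(git log --oneline '@{u}..HEAD' 2>/dev/null)" ]; then
        echo "unpushed commits present"; clean=no
    fi
    if [ "$clean" = yes ]; then echo "all clean"; fi
}
```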

I named the script git-isclean, and posted it on Github. If it gives me green check marks, I'm done; else it will (with the -a flag) automatically help me with the next step in cleanup. It depends on the interactive status script I wrote before, because it makes perfect sense to use it; if you don't want to use that script, change the use of git istatus to git status.

Sample usage.

Given my goal of keeping my home directory empty save for the one project I am working on right now, the need for this script is obvious. But it is useful even in less sparse workflows, because it provides a little to-do list of loose ends, worth having any time.

I hope something there was useful to you or gave you some ideas. Last time I wrote a navel-gazing post about my workflow was in 2013. Maybe I'll do another one in 2019.

1 August 16.

Version control as narrative device


I'm a convert. I went from not getting distributed version control systems (DVCSes) to writing a book chapter on one and forcing it on my friends and coworkers. I originally thought it was about mechanics like easy merging and copying revisions, but have come to realize that its benefits are literary, not mechanical. This matters in how DVCSes are taught and understood.

The competition

One way to see this is to consider the differences among the many distributed version control systems we aren't using.

Off the top of my head, there's Git (by the guy who wrote Linux, for maintaining Linux), Mercurial (adopted as the Python DVCS), Fossil (by the guy who wrote SQLite, for SQLite development), and Bazaar (adopted by GNU to support GNU projects). So, first, in terms of who `invented' DVCS, it seems like everybody saw the same problems with existing version control systems at about the same time, and developed responses in parallel. All of these are big-ticket projects with a large network and geek mindshare.

But at this point, Git won the popularity contest. Why?

It's not the syntax, which is famously unclear and ad hoc. Want to set up a new branch based on your current work? You might start typing git branch, but you're better off using git checkout -b. There's a command to regraft a branch onto a different base on the parent branch, git rebase, which is of course what you would use if you want to squash two commits into one.

Did you stage a file to be committed but want to unstage it? You mean

git reset HEAD -- fileA

The git manual advises that you set up an alias for this:

git config --global alias.unstage 'reset HEAD --'

#and now just use

git unstage fileA

I find it telling that the maintainers decided to put these instructions in the manual, instead of just adding a frickin' unstage command.

Yes, there are front-ends that hide most of this, the most popular being a web site and service named github.com, but it's a clear barrier: anything one step beyond what the web interface does will be awkward.

Early on, git also had some unique technical annoyances—maybe you remember having to call git gc manually. It has a concept, the index, that other systems don't even bother with.

All that said, it won.

The index and commit squashing

To go into further detail on the index that most DVCSes don't bother with, it is a staging area describing what your next commit will look like. Other systems simplify by just taking all changes in your working directory and bundling them into a snapshot. This is how I work about 98% of the time.

The other 2%, I have a few different things going on, say adding some text on two different parts of a document, and for whatever reason, I want to split them into two commits. The index lets me add one section, commit, then add the next and produce a second commit.
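In command form: say intro.tex and results.tex (hypothetical files) each picked up an unrelated change during the day. The index turns the one pile of edits into two commits:

```shell
# Scratch repository with two files carrying two separate intents.
cd "$(mktemp -d)" && git init -q
git config user.email me@example.com && git config user.name me
echo "new introduction text" > intro.tex
echo "new results text" > results.tex

# Stage and commit each intent on its own.
git add intro.tex
git commit -q -m "Expand the introduction"
git add results.tex
git commit -q -m "Describe the new results"
```

When both intents live in the same file, git add -p stages individual hunks, to the same effect.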

By splitting the mess of work into one item with one intent and a second with a new intent, I've imposed a narrative structure on my writing process. Each commit has some small intent attached, and I even have the ability to rearrange those intents to form a more coherent story. Did I know what I was doing and where I was going before I got there? Who knows, but it sure looks like I did in the commit log.

There's some debate about whether these rewritings are, to use a short and simple word, lying. The author of Fossil says (§3.7) that not being able to rewrite the history of the project development is a feature, because it facilitates auditing. You can find others who recommend against using git rebase for similar reasons. On the other side, there are people who insist on using git rebase, because it is rude to colleagues not to clean up after yourself. And I often stumble across tutorials on how to remove accidentally-committed passwords.

One side of the debate is generating an episodic narrative of how the project formed; the other is generating a series of news bulletins.

Personally, I am on the side of the episodic narrative writers. If you could install a keylogger that broadcasts your work, typos and all, would you do it? Would you argue that it is a matter of professional ethics that not doing so and only presenting your cleaned-up work is misleading to your coworkers?

Providing a means of squashing together the error and its fix loses the fact that the author made and caught a mistake, but that makes life more pleasant for the author and has no serious cost to others. Most information is not newsworthy.

You might draw the line in my ethical what-if at publication: once your work is out the door, correcting it should be accompanied by a statement. Git semi-enforces this by making it almost impossible to rebase after pushing to another repository. This seems to be more technical happenstance than a design decision.
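That semi-enforcement looks like this in practice; everything below happens in temporary directories, with shared.git standing in for a colleague-visible repository:

```shell
# A stand-in for the published repository, plus a working clone.
origin=$(mktemp -d)
git init -q --bare "$origin/shared.git"
cd "$(mktemp -d)"
git clone -q "$origin/shared.git" work && cd work
git config user.email me@example.com && git config user.name me

echo "draft" > notes.txt
git add notes.txt && git commit -q -m "first version"
git push -q origin HEAD            # published; the history is now shared

git commit -q --amend -m "first version, rewritten"  # rewrite published history
git push -q origin HEAD 2>/dev/null || echo "rejected: non-fast-forward"
```

The rewritten commit only goes through with git push --force, which is the sort of explicit step that is hard to take by accident.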

On a social scale, revision control is again about building a narrative. Before, we were a bunch of kids running around the field kicking a ball around, and now we're a bunch of kids playing kickball. We don't throw tarballs or semi-functional patches at each other; we send discrete commits to discrete locations under specific rules, and use those to understand what colleagues were doing. We can read their commit history to know what they were doing, because they had tools to make the commit history a structured narrative.

I'd love to see somebody take my claim that revision control develops a narrative literally and actually use a DVCS to write literature. We are all familiar with interactive fiction; why would revision control fiction, with its built-in sense of time and branching, be so unusual?

The evangelical disconnect

This is the setup for so many awkward conversations. The evangelist has such zeal because of the structuring of unstructured work facilitated by complicated features like the index and the rebase command; the newbie just wants to save snapshots. The evangelist may not be able to articulate his or her side; or the evangelist does talk about building a narrative, but a lot of people just do not care, seeing it as useless bureaucracy, or have no concept of what it means until they have been on the other side and had a chance to try it themselves. I'd be interested to see tutorials that focus on the process of narrative writing, instead of rehashing the basics that are already covered so well.

In other evangelist news, I've saved you hours reading the million Git vs Other DVCS blog entries out there. The other one is easier to use, because it has fewer narrative control features.

If you buy my claim that Git's primary benefit over near-equivalent competitors is tools to rewrite the narrative, then Git's winning over alternatives is a strong collective revealed preference. Some of these same people who reject some programming languages because the semicolons look cluttery enthusiastically embrace a system where git symbolic-ref HEAD refs/heads/baseline is a reasonable thing to type. Can you work out what it does and why I needed it? That's a strong indication of just how high-value narrative control is among geekdom.

Of course, not everybody in a crowd is of like mind. DVCSes are socially interesting because they are something that working adults have to learn, with a real payoff after the learner gets over the hump. Like double-entry bookkeeping or playing the piano, the people who have taken the time to learn have a new means of understanding the world and generating narratives that people pre-hump may not see. It's interesting to see who has the desire to pursue that goal to its conclusion and who gives up after seeing the syntactic mess.

6 December 15.

Apophenia v1.0


Apophenia version 1.0 is out!

What I deem to be the official package announcement is this 3,000-word post at medium.com, which focuses much more on the why than the what or how. If you follow me on twitter then you've already seen it; otherwise, I encourage you to click through.

This post is a little more on the what. You're reading this because you're involved in statistical or scientific computing, and want to know if Apophenia is worth working with. It is primarily a series of bullet points covering background, present, and future.

A summary

Apophenia is a library of functions for data processing and modeling.

The PDF manual is over 230 pages, featuring dozens of base models and about 250 data and model manipulation functions. So if you're thinking of doing any sort of data analysis in C, there is probably already something there for you to not reinvent. You can start at the manual's Gentle Introduction page and see if anything seems useful to you.

For data processing, it is based on an apop_data structure, which is a lot like an R data frame or a Python Pandas data frame, except it brings the operations you expect to be able to do with a data set to plain C, so you have predictable syntax and minimal overhead.

For modeling, it is based on an apop_model structure, which is different from anything I've seen in any other stats package. In Stats 101, the term statistical model is synonymous with Ordinary Least Squares and its variants, but the statistical world is much wider than that, and is getting wider every year. Apophenia starts with a broad model object, of which ordinary/weighted least squares is a single instance (apop_ols).

By assigning a format for a single model structure:

  • We can give identical treatment to models across paradigms, like microsimulations, or probabilistic decision-tree models, or regressions.
  • We can have uniform functions like apop_estimate and apop_model_entropy that accommodate known models using known techniques and models not from the textbooks using computationally-intensive generic routines. Then you don't have to rewrite your code when you want to generalize from the Normal distributions you started with for convenience to something more nuanced.
  • We can write down transformations of the form f:(model, model) $\to$ model.
    • Want a mixture of an empirical distribution built from observed data (a probability mass function, PMF) and a Normal distribution estimated using that data?
      apop_model *mix = apop_model_mixture(
                                    apop_estimate(your_data, apop_pmf),
                                    apop_estimate(your_data, apop_normal));
    • You want to fix a Normal$(\mu, \sigma)$ at $\sigma=1$? It's a little verbose, because we first have to set the parameters with $\mu=$NaN and $\sigma=1$, then send that to the parameter-fixing function:
      apop_model *N1 = apop_model_fix_parameters(
                                    apop_model_set_parameters(apop_normal, NAN, 1));
    • You want to use your mixture of a PMF+Normal as a prior to the $\mu$ in your one-parameter Normal distribution? OK, sure:
      apop_model *posterior = apop_update(more_data,
                                          .prior=mix, .likelihood=N1);
    • You want to modify your agent-based model via a Jacobian [apop_coordinate_transform], then truncate it to data above zero [apop_model_truncate]? Why not—once your model is in the right form, those transformations know what to do.
  • In short, we can treat models and their transformations as an algebraic system; see a paper I once wrote for details.

What v1 means

  • It means that this is reasonably reliable.
    • Can the United States Census Bureau rely on it for certain aspects of production on its largest survey (the ACS)? Yes, it can (and does).
    • Does it have a test bed that checks for correct data-shunting and good numeric results in all sorts of situations? Yes: I could talk all day about how much the 186 scripts in the test base do.
    • Is it documented? Yes: the narrative online documentation is novella length, plus documentation for every function and model, plus the book from Princeton University Press described on the other tabs on this web site, plus the above-linked Census working paper. There's a lot to cover, but an effort has been made to cover it.
    • Are there still bugs? Absolutely, but by calling this v1, I contend that they're relatively isolated.
    • Is it idiot-proof? Nope. For example, finding the optimum in a 20-dimensional space is still a fundamentally hard problem, and the software won't stop you from doing one optimization run with default parameters and reporting the output as gospel truth. I know somebody somewhere will write me an angry letter about how software that does not produce 100% verifiably correct results is garbage; I will invite that future correspondent to stick with the apop_normal and apop_ols models, which work just fine (and the OLS estimator even checks for singular matrices). Meanwhile, it is easy to write models that don't even have proven properties such as consistency (can we prove that as draw count $\to\infty$, parameter estimate variance $\to 0$?). I am hoping that Apophenia will help a smart model author determine whether the model is or is not consistent, rather than just printing error: problem too hard and exiting.

  • It means that it does enough to be useful. A stats library will never be feature-complete, but as per the series of blog posts starting in June 2013 and, well, the majority of what I've done for the last decade, it provides real avenues for exploration and an efficient path for many of the things a professional economist/statistician faces.

  • It means I'm no longer making compatibility-breaking changes. A lot of new facilities, including the named/optional arguments setup, vtables for special handling of certain models, a decent error-handling macro, better subsetting macros, and the apop_map facilities (see previously) meant that features implemented earlier merited reconsideration, but we're through all that now.

  • It's a part of Debian! See the setup page for instructions on how to get it from the Debian Stretch repository. It got there via a ton of testing (and a lot of help from Jerome Benoit on the Debian Scientific team), so we know it runs on a lot more than just my own personal box.

From here, the code base is in a good position to evolve:

  • The core is designed to facilitate incremental improvements: we can add a new model, or a new optimization method, or another means of estimating the variance of an existing model, or make the K-L divergence function smarter, or add a new option to an existing function, and we've made that one corner of the system better without requiring other changes or work by the user. The intent is that from here on out, every time the user downloads a new version of Apophenia, the interface stays the same but the results get better and are delivered faster, and new models and options appear.

  • That means there are a lot of avenues for you and/or your students to contribute.

  • Did I mention that you'll find bugs? Report them and we'll still fix them.

  • It's safe to write wrappers around the core. I wrote an entire textbook to combat the perception that C is a scary monster, but if the user doesn't come to the 19,000-line mountain of code that is Apophenia, we've got to bring the mountain to the user.

By the way, Apophenia is free, both as in beer and as in speech. I forget to mention this because it is so obvious to me that software—especially in a research context—should be free, but there are people for whom this isn't so obvious, so there you have it.

A request

I haven't done much to promote Apophenia. A friend who got an MFA from an art school says she had a teacher who pushed the rule that you should spend 50% of your time producing art, and 50% of your time selling it.

I know I'm behind on the promotion, so, please: blog it, tweet it, post it on Instagram, make a wikipage for it, invite me to give talks at your department. People will always reinvent already-extant code, but they should at least know that they're doing so.

And my final request: try it! Apophenia doesn't look like every other stats package, and may require seeing modeling from a different perspective, but that just may prove to be a good thing.

7 June 15.

Editing survey data (or, how to deal with pregnant men)


This post is partly a package announcement for Tea, a package for editing and imputation. But it mostly discusses the problem of editing, because I've found that a lot of people put no thought into this important step in data analysis.

If your data is a survey filled in by humans, or involves using sensors that humans operated, then your data set will have bad data.

But B, you object, I'm a downstream user of a data set provided by somebody else, and it's a clean set—there are no pregnant men—so why worry about editing? Because if your data is that clean, the provider already edited the data, and may or may not have told you what choices were made. Does nobody have gay parents because nobody surveyed has gay parents, or because somebody saw those responses and decided they must be an error? Part of the intent of Tea is to make it easy to share the specification of how the data was edited and missing data imputed, so end users can decide for themselves whether the edits make sense.

If you want to jump to working with Tea itself, start with the Tutorial/manual.

For editing bad values in accounting-type situations, you may be able to calculate which number in a list of numbers is incorrect—look up Fellegi-Holt imputation for details. But the situations where something can be derived like this are infrequent outside of accounts-based surveys.

So people wing it.

Some people will listwise delete the record with some failure in it, including a record that is missing entirely. This loses information, and can create a million kinds of biases, because you've down-weighted the deleted observation to zero and thus up-weighted all other observations. An analyst who deletes the observations with N/As for question 1 only when analyzing question 1, then restores the data set and deletes the N/As for question 2 when analyzing question 2, ..., is creating a new weighting scheme for every question.

I don't want to push you to use Tea, but I will say this: listwise deletion, such as using the ubiquitous na.rm option in R functions, is almost always the wrong thing to do.

Major Census Bureau surveys (and the decennial census itself) tend to lean on a preference ordering for fields and deterministic fixes. Typically, age gets fixed first, comparing it to a host of other fields, and in a conflict, age gets modified. Then age is deemed edited, and if a later edit involving age comes up (the details get complicated), age stays fixed and the other fields are modified. There is usually a deterministic rule that dictates what those modifications should be for each step. This is generally referred to as sequential editing.

Another alternative is to gather evidence from all the edits and see if some field is especially guilty. Maybe the pregnant man in our public health survey also reports using a diaphragm for contraception and getting infrequent pap smears. So of the pair (man, pregnant), it looks like (man) is failing several edits.

If we don't have a deterministic rule to change the declared-bad field, then all we can do is blank out the field and impute, using some model. Tea is set up to be maximally flexible about the model chosen. Lognormal (such as for incomes), Expectation-Maximization algorithm, simple hot deck draw from the nonmissing data, regression model—just set method: lognormal or EM or hot deck or ols in the spec file, and away you go.

Tea is set up to do either sequential or deterministic edits. Making this work was surprisingly complicated. Let me give you a worst-case example to show why.


Assume these edits: fail on
(age $<$ 15 and status=married),
(age $<$ 15 and school=PhD),
and (age$>$65 and childAge $<$ 10 $=>$ change childAge to age-25).
The third one has a deterministic if-then change attached; the first two are just edits.

I think these are reasonable edits to make. It is entirely possible for a 14 year old to get a PhD, just as there has existed a pregnant man. Nonetheless, the odds are much greater when your survey data gives you a 12-year old PhD or a pregnant man that somebody committed an error, intentional or not.

Then, say that we have a record with (age=14, married, PhD, childAge=5), which goes through these steps:

  • Run the record through the edits. Age fails two edits, so we will blank it out.
  • Send age to the imputation system, which draws a new value for the now-blank field. Say that it drew 67.
  • Run the record through the edits, and the deterministic edit hits: we have to change the child's age to 42.

First, this demonstrates why writing a coherent editing and imputation package is not a pleasant walk in the park, as edits trigger imputations which trigger edits, which may trigger more imputations.

Second, it advocates against deterministic edits, which can be a leading cause of these domino-effect outcomes that seem perverse when taken as a whole.

Third, it may advocate against automatic editing entirely. It's often the case that if we can see the record, we could guess what things should have been. It's very plausible that age should have been 41 instead of 14. But this is guesswork, and impossible to automate. From Census experience, I can say that what you get when you add up these little bits of manual wisdom for a few decades is not necessarily the best editing system.

But not all is lost. We've generated consistent micro-data, so no cross-tabulations will show anomalies. If we've selected a plausible model to do the imputations, it is plausible that the aggregate data will have reasonable properties. In many cases, the microdata is too private to release anyway. So I've given you a worst-case example of editing gone too far, but even this has advantages over leaving the data as-is and telling the user to deal with it.

Multiply impute

Imputation is the closely-allied problem of filling in values for missing data. This is a favored way to deal with missing data, if only because it solves the weighting problem. If 30% of Asian men in your sample didn't respond, but 10% of the Black women didn't respond, and your survey oversampled women by 40%, and you listwise delete the nonresponding observations, what reweighting should you do to a response by an Asian woman? The answer: side-step all of it by imputing values for the missing responses to produce a synthetic complete data set and not modifying the weights at all.

Yes, you've used an explicit model to impute the missing data—as opposed to the implicit model generated by removing the missing data. To make the model explicit is to face the reality that if you make any statements about a survey in which there is any missing data, then you are making statements about unobserved information, which can only be done via some model of that information. The best we can do is be explicit about the model we used—a far better choice than omitting the nonresponses and lying to the reader that the results are an accurate and objective measure of unobserved information.

We can also do the imputation multiple times and augment the within-imputation variance with the across-imputation variance that we'd get from different model-based imputations, to produce a more correct total estimate of our confidence in any given statistic. That is, using an explicit model also lets us state how much our confidence changes because of imputation.
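For reference, this is the standard combining rule (Rubin's rules): with $M$ imputations, within-imputation variances $v_i$, and per-imputation means $m_i$,

```latex
\bar m = \frac{1}{M}\sum_{i=1}^M m_i, \qquad
W = \frac{1}{M}\sum_{i=1}^M v_i, \qquad
B = \frac{1}{M-1}\sum_{i=1}^M (m_i - \bar m)^2,
```

and the total variance of the statistic is $T = W + (1 + 1/M)\,B$: the within-imputation variance plus an inflated across-imputation variance.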

There are packages to do multiple imputation without any editing, though the actual math is easy. Over in the other window, here's the R function I use to calculate total variance, given that we've already run several imputations and stored each (record,field,draw) combination in a fill-in table. The checkOutImpute function completes a data set using the fill-ins from a single imputation, generated by the imputation routine earlier in the code. There's, like, one line of actual math in there.


get_total_variance <- function(con, tab, col, filltab, draw_ct, statvar){
    v <- 0  #declare v and m
    m <- 0
    try (dbGetQuery(con, "drop table if exists tt"))
    for (i in 1:draw_ct){
        checkOutImpute(dest="tt", origin=tab, filltab=filltab, imputation_number=i-1)
        column <- dbGetQuery(con, paste("select ", col, " from tt"))
        vec <- as.numeric(column[,1]) #type conversions.
        v[i] <- statvar(vec)
        m[i] <- mean(vec)
    }
    total_var <- mean(v) + var(m)*(1+1./draw_ct)  #within + (1 + 1/M) * between
    return(c(mean(m), sqrt(total_var)))
}

# Here's the kind of thing we'd use as an input statistic: the variance of a Binomial
binom_var <- function(vec){
    p <- mean(vec)
    return(p*(1-p))
}

12 March 15.

Banning the hypothesis test


In case you missed it, a psychology journal, Basic and Applied Social Psychology, has banned the use of hypothesis tests.

Much has already been said about this, very little in support. The ASA points out that this approach may itself have negative effects and a committee is already working on a study. Civil statistician, a person who is true to his nom de blog and never says anything uncivil about anything, is very annoyed. Nature defended the party line.

This is an opportunity for statisticians to remind the world of what these $p$-values mean, exactly, and when they can or can not be trusted. A hypothesis test provides a sense of crisp, pass-or-fail clarity, but this can be a bad thing in situations where there is far too much complexity for crisp anything. How can we get readers to take $p$-values with the grain of salt that they must be taken with?

I agree with the dissenters, in the sense that if I were the editor of this journal, this is not something I would have done. If nothing else, smart people find a way to route around censorship. As noted by some of the commenters above, if you can only provide standard errors, I can do the mental arithmetic to double them to get the approximate 95% confidence intervals. Banning the reporting of the sample size and variance of the data would effectively keep me from solving for the confidence interval, but I doubt even these editors would contemplate such a ban.

The editors claim that $p$-values are `invalid'. In the very narrow sense, this is kinda crazy. The values are based on theorems that work like any other mathematical theorem: given the assumptions and the laws of mathematics that we generally all accept, the conclusion holds with certainty. But once we look further at the context, things aren't so bright-lined:

$\bullet$ A $p$-value is a statement about the likelihood of a given claim about model parameters or a statistic of the data, using the probability distribution defined by the model itself. We do not know the true probability distribution of the statistic, and so have to resort to testing the model using the model we are testing.

$\bullet$ A test is in the context of some data gathering or experimental design. Psychology is not Physics, and small details in the experimental design, such as the methods of blinding and scoring, matter immensely and are not necessarily agreed upon. In my experience as a reader, I am far more likely to dismiss a paper because a point of design made the paper too situation-specific or too fallible than because they reject the null with only 89% confidence.

$\bullet$ We are Bayesian when reading papers, in the sense that we come in with prior beliefs about whether a given fact about the world is true, and use the paper to update our belief. At the extreme, a paper on ESP that proves its existence with 99.9% confidence might marginally sway me into thinking something might be there, but in my mind I'd be arguing with the methods and it'll take a hundred such papers before I take it seriously. A paper finding that colorblind people process colors differently from typically-sighted people would get a well, yeah from me even if the hypothesis test finds 85% confidence, and in my mind I'd think about how the experiment could have been improved to get better results next time.
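
That Bayesian-reader stance can be put in numbers. Here is a toy sketch, with prior odds invented purely for illustration:

```python
def update_odds(prior_odds, bayes_factor):
    """Posterior odds = prior odds times the Bayes factor from the study."""
    return prior_odds * bayes_factor

def odds_to_prob(odds):
    return odds / (1 + odds)

# A skeptic reads the ESP paper: prior odds of one in a million.
# Even evidence with a Bayes factor of 100 barely moves the needle.
posterior_esp = odds_to_prob(update_odds(1e-6, 100))   # still only ~0.0001

# A reader with even prior odds reads the colorblindness paper:
# weak evidence (Bayes factor 3) already lands at "well, yeah".
posterior_color = odds_to_prob(update_odds(1.0, 3))    # 0.75
```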

A corollary to this last bullet point is that the editors are also Bayesian, and are inclined to believe some theories more than others. One editor may dislike the theory of rational addiction, for example, and then what keeps the editor from desk rejecting any papers that support the theory? Having only qualitative information means one less check on biases like these.

The full set of bullet points shows how a crisp $p$-value can be misleading, in terms of the modeling, of the experiment as a whole, and the manner in which readers digest the information. Assuming Normally-distributed data, the $p$-value can be derived from first principles, but the statement that the reader should reject the null with probability $1-p$ requires accepting that nothing went wrong in the context that led up to that number. (By the way, $1-p$ is the $q$-value, and I sometimes wish that people reported it instead.)
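
For the record, that first-principles derivation is mechanically simple. The sketch below (assuming a Normal sampling distribution, which is exactly the context-laden assumption at issue) turns an estimate and standard error into a two-sided $p$-value:

```python
from math import erf, sqrt

def normal_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def two_sided_p(estimate, null_value, std_err):
    """Two-sided p-value, valid only if the statistic really is Normal."""
    z = abs(estimate - null_value) / std_err
    return 2 * (1 - normal_cdf(z))

p = two_sided_p(1.0, 0.0, 0.5)   # z = 2, so p is about 0.0455
```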

Psychological problems

Psychology is not physics, where context can be controlled much more easily. To talk meta-context, a psychology study has so many challenges and confounders that the typical $p$-value in a psychology journal is a qualitatively different thing from a $p$-value in a biology or physics journal. A $p$-value can be read as a claim about the odds of getting the same result if you repeat the experiment, but defining what goes into a correct replication is itself a challenge the psych literature is grappling with in the present day. [But yes, there are high-quality, simple, reproducible psych experiments and badly designed physics experiments.]

Second, your revolution in the understanding of drosophila is going to be upstaged in the papers by the most piddling result from a psychology lab, every time. There are press agents, journalists, pop psychologists, and people selling books who have very little interest in the complexities of a study in context, and every interest in finding a study that `proves' a given agendum, and they have a large population of readers who are happy to eat it up.

Maybe you recall the study finding that obesity is contagious, perhaps because you read about it in Time magazine. With lower likelihood, you saw the follow-up studies that questioned whether the contagion effect was real, or could be explained away by the simple fact that similar people like to hang out together (homophily). Much to their credit, Slate did a write-up of some of the contrary papers. Or maybe you saw the later study that found that obesity contagion is reasonably robust to homophily effects.

I'm not going to wade into whether obesity patterns show contagion effects beyond homophily here, but am going to acknowledge that finding the answer is an imperfect process that can't be summarized by any single statistic. Meanwhile, the journalists looking for the biggest story for Time magazine aren't going to wade into the question either, but will be comfortable stopping at the first step.

So I think it's an interesting counterfactual to ask what the journalists and other one-step authors would do if a psychology journal didn't provide a simple yes-or-no and had to acknowledge that any one study is only good for updating our beliefs by a step.

I commend the editors of the BASP for being bold and running that experiment. It doesn't take much time in the psychology literature to learn that our brains are always eager to jump on cognitive shortcuts, yet it is the job of a researcher to pave the long road. No, if I were editor I would never ban $p$-values—I've pointed out a few arguments against doing so above, and the links at the head of this column provide many more valid reasons—but these editors have taken a big step in a discussion that has to happen about how we can report statistical results in the social sciences in a manner that accommodates all the uncertainty that comes before we get to the point where we can assume a $t$ distribution.

3 March 15.

Overlapping bus lines

link PDF version

I have at this point become a regular at the Open Data Day hackathon, hosted at the World Bank, organized by a coalition including the WB, Code for DC and the hyperproductive guy behind GovTrack.us.

This year, I worked with the transportation group, which is acting as a sort of outside consultant to a number of cities around the world. My understanding of the history of bus lines in any city is that they start off with some enterprising individual who decides to buy a van and charge for rides. Routes are decided by that individual, not by some central planner. With several profit-maximizing competitors, especially lucrative routes become overcrowded with redundant lines, relative to what a central planner could achieve by taking into account congestion, pollution, even headways, and system complexity.

Many places see a consolidation. For example, the Washington Metro Area Transit Authority was formed by tying together many existing private lines, and over the course of decades some changes were made to consolidate them. The process of tweaking lines laid out in the 1900s still continues, slowly, to this day.

Measuring overlap

Here's where the data comes in. A Bank project [led by Jacqueline Klopp and Sarah Williams] developed a map of Nairobi's bus maps, by sending people out on the bus with a GPS-enabled gadget, to record the position every time the bus stopped. The question the organizers [Holly Krambeck, Aaron Dibner-Dunlap] brought to Open Data Day: how much redundancy is there in Nairobi's system, and how does it compare to that of other systems?

We defined an overlap as having two stops with latitude and longitude each within .0001 degrees of each other: roughly ten meters, which is a short enough walk that you could point and say `go stand over there for your next bus'. It also makes the geographic component of the problem trivial, because we can just ask SQL to find (rounded) numbers that match, without involving Pythagoras.
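
The round-then-match rule is easy to prototype outside SQL. Here is a minimal Python sketch with invented coordinates (the real analysis below works on the full stop tables):

```python
def cell(lat, lon):
    # Round to 4 decimal places, so stops in the same
    # ~0.0001-degree cell compare equal, as in the SQL below.
    return (round(lat, 4), round(lon, 4))

def overlap(route_a, route_b):
    # Shared rounded stops over the two routes' total stop count,
    # the same ratio the pairs table computes (max 0.5 for identical routes).
    a = {cell(*s) for s in route_a}
    b = {cell(*s) for s in route_b}
    return len(a & b) / (len(a) + len(b))

r1 = [(-1.28331, 36.81721), (-1.29210, 36.82150)]
r2 = [(-1.28334, 36.81718), (-1.30000, 36.83000)]
print(overlap(r1, r2))   # 0.25: one shared cell among four stops
```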

GTFS data is arranged in routes, each of which has one or more trips. We considered only route-level overlaps, which may have a significant effect on our final results if night bus trips are very different from day bus trips. Modifying the code below to account for time is left as a future direction for now.
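
To illustrate how the GTFS tables chain together (all rows invented): trips point to routes, stop_times point to trips, and collapsing over trips gives each route its set of stops, discarding time-of-day information along the way.

```python
# trips.txt fragment: trip_id -> route_id
trips = [
    {"trip_id": "t1", "route_id": "r10"},
    {"trip_id": "t2", "route_id": "r10"},   # e.g., a night run of the same route
    {"trip_id": "t3", "route_id": "r20"},
]
# stop_times.txt fragment: trip_id -> stop_id
stop_times = [
    {"trip_id": "t1", "stop_id": "s1"},
    {"trip_id": "t1", "stop_id": "s2"},
    {"trip_id": "t2", "stop_id": "s1"},
    {"trip_id": "t3", "stop_id": "s2"},
]

# Collapse trips into per-route stop sets, as the routes_w_latlon table does.
route_of = {t["trip_id"]: t["route_id"] for t in trips}
stops_by_route = {}
for st in stop_times:
    stops_by_route.setdefault(route_of[st["trip_id"]], set()).add(st["stop_id"])

print(stops_by_route)   # {'r10': {'s1', 's2'}, 'r20': {'s2'}}
```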

The data for Chicago's CTA, Los Angeles, and the DC area's WMATA have both bus and subway/el routes.

On the horizontal axis of this plot, we have the percent overlap between two given routes, and on the vertical axis, we have the density of route pairs, among route pairs that have any overlap at all. In all cities, about 90% ($\pm$2%) of route pairs have no overlap, and are excluded from this density.

The hunch from our WB transportation experts was right: WMATA, LA, and CTA have pretty similar plots, but Nairobi's plot meanders for a while with a lot of density even up to 30% overlap.

The map and the terrain

In a perfect world, the bus map would form a grid. For example, Chicago is almost entirely a grid, with major streets at regular intervals that run forever (e.g., Western Ave changes names at its north and south ends, but runs for 48 km). The CTA's bus map looks like the city map, with a bus straight down each major street. The overlap for any N-S bus with any E-W bus is a single intersection. The routes that have a lot of overlap are the ones downtown, along the waterfront, and on a few streets like Michigan Ave.

Further East and in older cities, things fall apart and the ideal of one-street-one-bus is simply impossible.

Thad Kerosky fed the above data to QGIS to put the stops that have nonzero overlap on a map:

The bus overlaps basically produce a map of the major arterial roads and major bottlenecks in the city grid (plus the airport).

So the problem seems to partly be geography, and there's not much that can be done about that. The last time a government had the clout to blow out the historic map to produce a grid was maybe 1870, and there aren't any countries left with Emperors who can mandate this kind of thing. But that doesn't preclude the possibility of coordinating routes along those arteries in a number of ways, such as setting up trunk-and-branch sets of coordinated schedules.

How to

Keeping with the theme of not overengineering, we used a set of command line tools to do the analysis. We had a version in Python until one of the team members pointed out that even that was unnecessary. You will need SQLite, Apophenia, and Gnuplot. We also rely on a GNU sed feature and some bashisms. It processes WMATA's 1.6 million stop times on my netbook in about 70 seconds.

Start off by saving a GTFS feed as a zip file (this is the norm, e.g., from the GTFS Data Exchange), save this script as, e.g., count_overlaps, and then run

Zip=cta.zip . count_overlaps

to produce the pairs table in the database and the histogram for the given transit system.

The script produces individual plots via Gnuplot, while the plot above was via R's ggplot, which in this case isn't doing anything that Gnuplot couldn't do.

#!/usr/bin/bash
# uses some bashisms at the end

if [ "$Zip" = "" ] ; then
  echo "Please set the Zip environment variable to the zip file with your GTFS data"
else #the rest of this file

base=`basename $Zip .zip`
DB=$base.db   # output database; $DB is used throughout below
mkdir $base
cd $base
unzip ../$Zip


for i in *.txt; do sed -i -e 's/|//g' -e "s/'//g"  $i; done
for i in *.txt; do apop_text_to_db $i `basename $i .txt` $DB; done

sqlite3 $DB "create index idx_trips_trip_id on trips(trip_id);"
sqlite3 $DB "create index idx_trips_route_id on trips(route_id);"
sqlite3 $DB "create index idx_stop_times_trip_id on stop_times(trip_id);"

sqlite3 $DB << ——

create table routes_w_latlon as
    select distinct route_id, s.stop_id, round(stop_lat, 4) as stop_lat,
       round(stop_lon, 4) as stop_lon 
       from stops s, stop_times t, trips tr
       where s.stop_id = t.stop_id
       and tr.trip_id=t.trip_id ;

create index idx_trips_rid on routes_w_latlon(route_id);
create index idx_trips_lat on routes_w_latlon(stop_lat);
create index idx_trips_lon on routes_w_latlon(stop_lon);

create table pairs as
  select routea, routeb,
    (((select count(*) from
     (select distinct * from
     routes_w_latlon L, routes_w_latlon R
     where L.route_id = routea
     and R.route_id = routeb
     and L.stop_lat==R.stop_lat and L.stop_lon==R.stop_lon)) + 0.0)
    / (select count(*) from routes_w_latlon where route_id=routea or route_id=routeb))
     as corr
    from
    (select distinct route_id as routea from routes),
    (select distinct route_id as routeb from routes)
    where routea+0.0<=routeb+0.0;
——

cat <(echo "set key off;
set xlabel 'Pct overlap';
set ylabel 'Count';
set title '$base' ;
set xrange [0:.6];
set term png size 1024,800;
set out '${base}.png';
plot '-' with impulses lt 3") <(apop_plot_query -f- -H0 $DB "select corr from pairs where corr > 0 and corr < 1"| sed '1,2d') | gnuplot

fi