Data convenience matters

Last quarter, I wrote a research paper on the political establishment. The political establishment is a murky concept, and there is a lot to think about. Surprisingly, though, the most difficult decisions during the project were about which data I should use to measure my variables.

The problem, put simply, is that researchers tend to use the datasets they already have access to. They simply can’t do a study for which the dataset doesn’t exist or is too expensive to create. This sounds reasonable, but it means that the sorts of questions we ask and the way we define our variables are heavily constrained by the data available. This is scary!

In my case, this meant I had very few options for measuring the political establishment. How do I measure such a squishy topic? I eventually came up with an answer based on incumbency status. Why did I pick this option? Part of it was certainly that I think it is a decent way to measure the concepts I am aiming to understand. However, dismaying as it may be to admit, it is no coincidence that incumbency happens to be one of the few things I had available to me in large datasets. I certainly engaged in some motivated reasoning when operationalizing my variables to answer my central question.

Similarly, there is a big incentive to start from questions that available datasets can answer. The most obvious examples are the DW-NOMINATE and FEC datasets. These publicly available datasets are commonly used to examine polarization in Congress and spending in politics, respectively. Naturally, they bias us towards asking questions about polarization and money in politics. It’s what we can measure. This comes at the expense of questions that might have squishier answers, such as those about candidate quality or cultural impact. These questions do get asked, but we are biased in a meaningful way against harder-to-answer queries.

All this is to say, when we interpret data about politics, we should always be asking in the back of our minds: is this what we should be looking at, or is it just all we could find?