Martin Bento ([info]explodedview) wrote,

Methodology and Code of New Hampshire Analysis

OK, here's where I outline exactly how I did the analysis of the New
Hampshire Democratic primary, including the code. If you don't know what I'm
talking about, see the previous entry in this blog.

The short version is that I
downloaded the HTML tables of precinct results from the Secretary
of State website and loaded them into Access using its HTML import
facility (yes, yes, I know, but Access is what I had at hand, and I was
in a hurry). I then entered the voting machine data. Finally, I derived
answers using straightforward SQL code (Access has an SQL interface,
and that's pretty much what I used). SQL has built-in functions to
derive aggregates (such as totals) over groups defined by a common
value (like vote counting technique).

I got the voting total per precinct from this URL (click on the county name at the bottom to get the totals for precincts in that county).

http://www.state.nh.us/sos/presprim%202004/dpresbelk.htm

I used the following page to determine which precincts were using which counting techniques.

http://www.nh.gov/sos/voting%20machines.htm

Here are the gory details:

First of all, HTML tables are basically creatures of layout, while SQL tables have a logical structure. The HTML tables on the SOS site are broken horizontally for readability, which results in Bateman and Hamm being in the same column (because they are vertically aligned), same for Moseley-Braun and Kerry, etc. (Look at the HTML source to see what I mean). That won't do, so I inserted end and begin table markup to divide these into separate tables. I also deleted the totals, as they would distort my own. Then I imported them into Access. The table structure I used was as follows:

Table votetally
ID autonumber,
municipality text,
VotingTechUsed text,
Bateman number,
Moseley-Braun number,
..etc, for the rest of the candidates.

The ID field is called a "primary key" and is the standard database device to uniquely identify rows in a table. Autonumber means the database generates these values automatically.

This works, but the result is poorly normalized (technical term, look up "database" "normal form" if you're curious). You get multiple rows for each municipality, each having the votes for some of the candidates and zeroes for the others. To clean this up, I coalesced the votes into another table. I also took this opportunity to reduce the field I was considering to the top 5. It wasn't just Kucinich and Sharpton. The New Hampshire ballot had 22 Democrats on it, and I wanted a readable result. So I created a new table, substantially the same, but reduced to the top five:

Table Big5ivecoalesced
ID autonumber,
municipality text,
VotingTechUsed text,
Clarkvotes number,
Deanvotes number,
Edwardsvotes number,
Kerryvotes number,
Liebermanvotes number,
PopThreshold number

PopThreshold I use later when I examine the vote in terms of town population. I filled this table with a concise version of the data from the first as follows:

Insert into Big5ivecoalesced(municipality, Clarkvotes, Deanvotes, Edwardsvotes, Kerryvotes, Liebermanvotes)
Select municipality, sum(Clark), sum(Dean), sum(Edwards), sum(Kerry), sum(Lieberman)
From votetally
Group by municipality;

This doesn't really sum the votes. It just adds the real votes to the superfluous zeroes generated by the dummy rows. It does, however, give me the real vote totals per city of these candidates. I should note here that some cities have more than one "precinct". I treated these as though they were separate municipalities, which is also how they were listed by the SOS.
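For readers without Access, here is a minimal sketch of the same coalescing step using Python's built-in sqlite3 module in place of Access. The town names, the three-candidate layout, the table name "coalesced", and all vote numbers are invented for illustration; only the INSERT ... SELECT ... GROUP BY pattern comes from the query above.

```python
import sqlite3

# In-memory SQLite database standing in for Access (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE votetally ("
    "ID INTEGER PRIMARY KEY AUTOINCREMENT, municipality TEXT, "
    "Clark INTEGER DEFAULT 0, Dean INTEGER DEFAULT 0, Kerry INTEGER DEFAULT 0)"
)

# Each imported HTML fragment gives its own row per town, with zeroes for
# the candidates that lived in a different fragment. Numbers are made up.
conn.executemany(
    "INSERT INTO votetally (municipality, Clark, Dean, Kerry) VALUES (?, ?, ?, ?)",
    [
        ("Exampletown", 10, 0, 0),
        ("Exampletown", 0, 25, 0),
        ("Exampletown", 0, 0, 30),
        ("Samplefield", 4, 0, 0),
        ("Samplefield", 0, 7, 0),
        ("Samplefield", 0, 0, 5),
    ],
)

conn.execute(
    "CREATE TABLE coalesced (municipality TEXT, Clarkvotes INTEGER, "
    "Deanvotes INTEGER, Kerryvotes INTEGER)"
)

# Summing within each municipality adds the real count to the filler zeroes,
# leaving one row per town.
conn.execute(
    "INSERT INTO coalesced (municipality, Clarkvotes, Deanvotes, Kerryvotes) "
    "SELECT municipality, SUM(Clark), SUM(Dean), SUM(Kerry) "
    "FROM votetally GROUP BY municipality"
)

coalesced_rows = conn.execute(
    "SELECT municipality, Clarkvotes, Deanvotes, Kerryvotes "
    "FROM coalesced ORDER BY municipality"
).fetchall()
print(coalesced_rows)
# [('Exampletown', 10, 25, 30), ('Samplefield', 4, 7, 5)]
```

Six rows go in and two come out, one per town, each carrying the real totals.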

Next I put in the VotingTechUsed values. You could save typing by defining a default value for this column of 'hand' in the table design window. I put these in by going into the table in the datasheet view.

Now the fun.

Here's how to calculate the percentages:

SELECT [VotingTechUsed], sum([Kerryvotes]) AS Kerry, ((sum([Kerryvotes])/(sum([Kerryvotes])+sum([Deanvotes])+sum([Edwardsvotes])+sum([Clarkvotes])+sum([Liebermanvotes]))) * 100) AS Kperc,
sum([Deanvotes]) AS Dean, ((sum([Deanvotes])/(sum([Kerryvotes])+sum([Deanvotes])+sum([Edwardsvotes])+sum([Clarkvotes])+sum([Liebermanvotes]))) * 100) AS Dperc,
sum([Edwardsvotes]) AS Edwards, ((sum([Edwardsvotes])/(sum([Kerryvotes])+sum([Deanvotes])+sum([Edwardsvotes])+sum([Clarkvotes])+sum([Liebermanvotes]))) * 100) AS Eperc,
sum([Clarkvotes]) AS Clark, ((sum([Clarkvotes])/(sum([Kerryvotes])+sum([Deanvotes])+sum([Edwardsvotes])+sum([Clarkvotes])+sum([Liebermanvotes]))) * 100) AS Cperc,
sum([Liebermanvotes]) AS Lieberman, ((sum([Liebermanvotes])/(sum([Kerryvotes])+sum([Deanvotes])+sum([Edwardsvotes])+sum([Clarkvotes])+sum([Liebermanvotes]))) * 100) AS Lperc
FROM Big5ivecoalesced
GROUP BY [VotingTechUsed];

Looks scary, but the key is the GROUP BY. This divides the data into groups based on the value of VotingTechUsed. The aggregate functions (all sum in this case) are applied to each such group, i.e., each group of records with the same VotingTechUsed value. Each percentage divides what the particular candidate got by what they all got (again within the groups) to get a decimal ratio. It then multiplies this by 100 to convert it to a percentage.
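To see the GROUP BY mechanics on something small, here is a minimal sketch with two candidates and four made-up towns, again using Python's sqlite3 module as a stand-in for Access. The table name "tally" and every number are assumptions for illustration. One difference to be aware of: SQLite divides integers as integers, so the sketch multiplies by 100.0 first to force floating-point division, whereas the "/" operator in Access already returns a fractional result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tally (municipality TEXT, VotingTechUsed TEXT, "
    "Kerryvotes INTEGER, Deanvotes INTEGER)"
)
# Hypothetical towns and vote counts, two per counting technique.
conn.executemany(
    "INSERT INTO tally VALUES (?, ?, ?, ?)",
    [
        ("A", "hand", 30, 20),
        ("B", "hand", 10, 40),
        ("C", "scanner", 60, 40),
        ("D", "scanner", 90, 10),
    ],
)

# Each SUM is computed separately within each VotingTechUsed group, so the
# percentage is the candidate's group total over the group's combined total.
share_rows = conn.execute(
    "SELECT VotingTechUsed, "
    "SUM(Kerryvotes) AS Kerry, "
    "100.0 * SUM(Kerryvotes) / (SUM(Kerryvotes) + SUM(Deanvotes)) AS Kperc, "
    "SUM(Deanvotes) AS Dean, "
    "100.0 * SUM(Deanvotes) / (SUM(Kerryvotes) + SUM(Deanvotes)) AS Dperc "
    "FROM tally GROUP BY VotingTechUsed ORDER BY VotingTechUsed"
).fetchall()

for tech, kerry, kperc, dean, dperc in share_rows:
    print(tech, kerry, round(kperc, 1), dean, round(dperc, 1))
```

With these invented numbers the "hand" group totals 40 for Kerry and 60 for Dean, and the "scanner" group totals 150 and 50, so the percentages are taken over 100 and 200 votes respectively.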


The query that I used to calculate the percentage by which Kerry beat Dean is as follows:

SELECT [VotingTechUsed], sum([Kerryvotes]) AS Kerry,
( ((sum([Kerryvotes])/(sum([Deanvotes])))-1) * 100) AS Kerrymargin, sum([Deanvotes]) AS Dean
FROM Big5ivecoalesced
GROUP BY [VotingTechUsed];
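The Kerrymargin expression is a ratio measure: it reports Kerry's lead as a percentage of Dean's total, not the gap between their shares of the vote. A minimal sketch of the arithmetic in plain Python, with hypothetical function names and invented vote totals of 60 and 40:

```python
def relative_margin(leader_votes, trailer_votes):
    """Percent by which the leader's total exceeds the trailer's total.

    Same arithmetic as the Kerrymargin column: (leader / trailer - 1) * 100.
    """
    return (leader_votes / trailer_votes - 1) * 100


def share_gap(leader_votes, trailer_votes):
    """Gap between the two candidates in percentage points of their combined vote."""
    total = leader_votes + trailer_votes
    return (leader_votes - trailer_votes) / total * 100


# Invented totals, for illustration only.
margin = relative_margin(60, 40)
gap = share_gap(60, 40)
print(round(margin, 1), round(gap, 1))
```

For 60 votes against 40, the ratio measure comes to 50 percent while the share gap is 20 percentage points, so it matters which of the two a reported margin refers to.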

To eliminate the towns with more than 20,000 voters, I first went to http://bbs.vcsnet.com/State.php4?NH

This is a service that provides demographic information on voters. A little material is on their site for free, including the number of voters in each town.

For every town that had a population of over 20,000 voters, I set the population threshold to 20,000. The others I set to zero (all had a population greater than zero, presumably). I may revisit this with a more granular analysis. Here is the code:

SELECT [VotingTechUsed], sum([Kerryvotes]) AS Kerry, (sum([Kerryvotes])/(sum([Kerryvotes])+sum([Deanvotes])+sum([Edwardsvotes])+sum([Clarkvotes])+sum([Liebermanvotes]))) AS Kperc,
sum([Deanvotes]) AS Dean, (sum([Deanvotes])/(sum([Kerryvotes])+sum([Deanvotes])+sum([Edwardsvotes])+sum([Clarkvotes])+sum([Liebermanvotes]))) AS Dperc,
sum([Edwardsvotes]) AS Edwards, (sum([Edwardsvotes])/(sum([Kerryvotes])+sum([Deanvotes])+sum([Edwardsvotes])+sum([Clarkvotes])+sum([Liebermanvotes]))) AS Eperc,
sum([Clarkvotes]) AS Clark, (sum([Clarkvotes])/(sum([Kerryvotes])+sum([Deanvotes])+sum([Edwardsvotes])+sum([Clarkvotes])+sum([Liebermanvotes]))) AS Cperc,
sum([Liebermanvotes]) AS Lieberman, (sum([Liebermanvotes])/(sum([Kerryvotes])+sum([Deanvotes])+sum([Edwardsvotes])+sum([Clarkvotes])+sum([Liebermanvotes]))) AS Lperc
FROM Big5ivecoalesced
WHERE PopThreshold < 20000
GROUP BY [VotingTechUsed];

The only thing new here is the WHERE clause near the bottom, which eliminates from consideration records that do not meet this criterion.
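A minimal sketch of how the WHERE clause changes the grouped totals, using Python's sqlite3 module as a stand-in for Access. The three towns, the table name "tally", and all numbers are invented; the PopThreshold convention (20000 for large towns, 0 otherwise) follows the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tally (municipality TEXT, VotingTechUsed TEXT, "
    "Kerryvotes INTEGER, Deanvotes INTEGER, PopThreshold INTEGER)"
)
# Hypothetical data: one large town flagged with 20000, the rest with 0.
conn.executemany(
    "INSERT INTO tally VALUES (?, ?, ?, ?, ?)",
    [
        ("Smallville", "hand", 30, 20, 0),
        ("Midtown", "scanner", 60, 40, 0),
        ("Bigcity", "scanner", 900, 100, 20000),
    ],
)

query = (
    "SELECT VotingTechUsed, SUM(Kerryvotes), SUM(Deanvotes) FROM tally "
    "{where} GROUP BY VotingTechUsed ORDER BY VotingTechUsed"
)

# Without the filter every town counts toward its group.
all_rows = conn.execute(query.format(where="")).fetchall()

# WHERE runs before GROUP BY, so flagged towns never reach the sums.
small_rows = conn.execute(
    query.format(where="WHERE PopThreshold < 20000")
).fetchall()

print(all_rows)    # [('hand', 30, 20), ('scanner', 960, 140)]
print(small_rows)  # [('hand', 30, 20), ('scanner', 60, 40)]
```

The filter is applied row by row before any grouping happens, which is why the "hand" totals are untouched here while the "scanner" totals lose the one flagged town.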

Code in this entry, such as it is, is released under the GNU General Public License (GPL).

  • 8 comments

[info]mark_gubrud

February 6 2004, 17:48:24 UTC 8 years ago

You should issue a retraction

Dear Martin Bento,

Martin, Jonathan Wand has posted a reply to your original post which clearly demonstrates that your anomalous results can be accounted for in terms of geographical variations in voter preference, possibly but not necessarily rural-vs.-urban. When this effect is properly controlled for by comparing within municipalities, Prof. Wand found no statistically significant bias due to the use of the optical scan machines. I suggest you study his reports and contact him if you do not yet understand his analysis.

You argue that you tested the geographical hypothesis by eliminating towns over 20,000. However, your own numbers show that all this did was to eliminate about a quarter of the Diebold-counted votes. Since these votes favored Kerry more than the others, it is no wonder that removing a quarter of them caused the results to move somewhat in favor of Dean, but since you only removed a quarter, it is no wonder that "the results are but slightly different." Your results are fully consistent with the geographical hypothesis. If you think about it a bit, you may realize that you cannot reliably control for geographical variation as long as you lump the statewide results together.

Also, you should realize that it is unscientific for you to magnify the anomaly you saw by taking the difference between the Dean and Kerry votes and dividing by the smaller of the two. Suppose an ad agency interviews 10 people, and finds that 6 like Coke while 4 prefer Pepsi. Not satisfied with a statistically insignificant 20% difference in popularity, an ad man following your procedure would come up with "Coke is 50% more popular than Pepsi!" That's more dramatic, but just as statistically invalid.

By now, your findings are undoubtedly oozing through the internet as evidence that Kerry is the beneficiary of some Skull & Bony conspiracy to control the world. This sort of thing can be far more damaging, especially to voter turnout among the disaffected, than you might think. So frankly, you have done some (small) damage to our hopes of unseating Resident Shrub in the Fall. It doesn't seem as if that was your intention, but the damage is done, and I think it is your responsibility to try to repair it as well as you can.

Please admit that you made a mistake, and issue a clear retraction. Given the results posted by Prof. Wand, it is clear that we have no evidence of any bias due to the use of optical scan in New Hampshire. It is also important to point out that in this case a voter-verified paper record of votes cast exists, which makes any conspiracy especially unlikely. If there were valid evidence of a possible bias due to the machines, a partial recount would be enough to reveal this. But as of this time, no such evidence exists. I think you should admit that, and do your best to stop this rumor before it becomes an entrenched legend.

Best wishes,

Mark Avrum Gubrud
University of Maryland Peace Forum, and
Physics Dept., University of Maryland
College Park, MD 20740

Anonymous

February 7 2004, 19:45:49 UTC 8 years ago

Not what I see in Wand's analysis

What I see are 4 or perhaps 5 statistically significant results indicating a difference in the proportion of Dean votes between hand counts and Accuvote counts -within- different geographical areas. As Professor Wand looked at only 10 areas, this is far more than one would expect by chance alone. Now, as the Accuvote and Optech proportions do not differ significantly, tampering with the physical machines seems unlikely. However, if the programming of candidates/ballots might have been carried out by a single company for both types of physical machines, i.e. if the programming process depended upon geographical area rather than physical machine, then there might still be a slim possibility of tampering in order to skew the vote. Given the difference between hand counts and machine counts -within- the same geographical areas, it would not be unreasonable to simply investigate the programming process and ask a few more questions. Nor do I see how a discussion like this is harmful; rather, it reminds us of the desirability of a paper trail in all voting.

[info]paigegirl

February 9 2004, 17:34:05 UTC 8 years ago

Re: You should issue a retraction

I do not see why he should apologize to anyone. The Diebold machines do not leave a paper trail and therefore charges of corruption are going to come, like it or not.

[info]mark_gubrud

February 9 2004, 18:21:30 UTC 8 years ago

Re: You should issue a retraction

The Diebold machines in question here are not the touch-screen machines which don't produce a paper record. I agree that those machines pose a very serious issue and they should not be used at all until they are fitted with printers that spit out ballots the voters can verify and deposit.

All the machines in question here are optical scanners that simply count ballots people have marked by hand. So in this case the voter-verified paper records exist, and a recount could be done, if there were any reason to suspect something was wrong here. But there isn't.

Martin Bento found correlations between the vote and the use of optical scanners; he also found a correlation with the type of scanner. Jonathan Wand found that the latter correlation disappeared once he controlled for the county (region) as a common factor. The remaining correlation with the use of either type of optical scanner is expected on the basis of rural-vs.-urban differences in voter preference, since the scanners were used in the larger (more urban) precincts.

Bento claimed he could discount a rural-vs.-urban common factor hypothesis on the basis of a very weak ad-hoc test, but in fact his results tend to confirm the hypothesis. Until someone can show by some valid statistical method that there is a reason to doubt that the observed correlation has a common factor explanation, there is no evidence here to support any allegations or insinuations of fraud.

[info]paigegirl

February 9 2004, 18:41:45 UTC 8 years ago

Re: You should issue a retraction

I agree that those machines pose a very serious issue and they should not be used at all until they are fitted with printers that spit out ballots the voters can verify and deposit.

Exactly.

All the machines in question here are optical scanners that simply count ballots people have marked by hand. So in this case the voter-verified paper records exist, and a recount could be done, if there were any reason to suspect something was wrong here. But there isn't.

I can see your point in that case. If there is a paper trail to fall back on there is no issue. Thanks for responding.

Anonymous

February 7 2004, 04:09:07 UTC 8 years ago

diebold

Am I missing something? I understood that some doubt was raised about the statistical difference between the optical scan and the touch screen. I haven't seen any math here that explains or discounts the larger difference between either machine and hand counts. I certainly haven't seen anything that comes close to requiring an apology, especially given what we already know about the unreliability of voting machines.
Jan

[info]mark_gubrud

February 8 2004, 00:54:18 UTC 8 years ago

Okay - let's get this straight

People have short attention spans, so let me start with the conclusion:

There is no evidence here of any election fraud. Martin Bento had no basis to say his results indicate that "computers" favored Kerry over Dean. He leaped to an unwarranted inference, based on a deeply flawed analysis, did not apply any statistical tests of significance, and even reported his results in an artificially exaggerated form, Madison Avenue-style.

Now, I was a bit careless in my wording before. That's probably because so much is at stake in this election. I assume Mr. Bento is not out to help George W. Bush. But he really could send Karl Rove a bill for services rendered; this has to be worth a few thousand votes in November, at least. Even if Martin will take responsibility and admit his error.

All of Bento's results, and all of the facts before us, are fully consistent with a purely geographical effect. Dean was more popular relative to Kerry in rural areas and further from Boston, as one New Hampshire resident explained here. Kerry was preferred more strongly in urban areas and downstate. We have no reason to think Bento's results reflect anything more than that.

Bento, lumping all statewide results together, showed that there was a significant correlation between preference for Kerry over Dean and the use of optical scan. But his results also indicated there was a significant correlation between preference and which optical scanner was in use.

Wand showed that when you control for regional variations, there is no significant effect with respect to which type of scanner. Right away, that should tell you there is a problem with Bento's methodology: it produced an anomalous effect that went away when additional controls were imposed.

However, Wand's results still showed a significant correlation between Dean/(Dean+Kerry) and hand vs. optical. This is consistent with a geographical effect, since within each region there are precincts which are rural as well as those which are urban, and it is the urban ones which tend to use the optical scanners.

Wand controlled for regional variation, but not for density variation within regions. His results indicate that any difference in the results with the two types of scanners can be attributed to correlations between regional variations in voter preference and regional variations in the type of scanner used. Such correlations would be expected to exist due to random variation alone.

To reiterate, Wand did not control for rural-vs.-urban differences within a region. Therefore it is to be expected, on the rural-vs.-urban hypothesis, that there would still be a correlation between optical scan vs. hand and relative preference for Dean and Kerry. And that's what his results show.

Bento claimed he could reject a rural-vs.-urban hypothesis on the basis of a run in which he eliminated towns over 20,000. This is not a valid test! Why 20,000? Why not some other number? Setting a different cutoff would produce different results. What really needs to be done is to test for a correlation between precinct population density and voting patterns, for a given counting method, instead of just cutting the sample into two groups with an arbitrary cutoff.

But in fact, eliminating the largest towns moved Bento's results for the Diebold votes (which apparently included all the towns over 20,000) in the direction of Dean. Thus, Bento's test tends to confirm that there is an effect due to population density.

When you add that all up, you are left with the conclusion that there is nothing here which can be taken as evidence for a vote rigging hypothesis. The fact that there is a correlation is consistent with such a hypothesis, but it is also consistent with a geographical (regional variation PLUS local density) hypothesis. Since the geographical effect was to be expected, Bento's results cannot be considered evidence of election fraud.

But that won't stop the allegation, especially as packaged so nicely by Bento, with the artificially exaggerated numbers that "bring the matter into sharper focus" and the misleading argument about towns over 20,000, from wafting its redolent way down the information sewer main and into the collective delirium.

Your turn, Martin.

Mark Avrum Gubrud

[info]explodedview

February 9 2004, 19:49:25 UTC 8 years ago

Re: Okay - let's get this straight

Mr. Gubrud posted this comment both here and in the previous blog entry. I see no reason to get tangled up in two parallel discussions, and that discussion is busier, so I replied there. Anyone interested in following this particular debate should go to that entry. Anyone interested in specific discussion of the code and such should probably comment here.