### Introduction:

I'm interested in whether there is a relationship between voters' gender and their preference among candidates. To be specific, I want to test whether there is a statistically significant difference in voting preference between male and female voters. The result of this research might raise questions such as whether voting in general is biased by voters' gender, which would be an interesting topic for further research.

### Data:

I am using the [American National Election Studies (ANES)](http://www.electionstudies.org/) [1] dataset for this research. The data are collected by surveying voters in the United States before and after every presidential election. Although the surveys are administered to a random sample of U.S. voters, no experiment is involved in the data collection, so my analysis of this dataset is an observational study. The results can be generalized only to voters in the 2008 election; they cannot be generalized to all Americans or to other elections such as 2012. In addition, I cannot draw a causal relationship between voters' gender and their voting preference.

Each case in the dataset represents a voter who was surveyed by ANES. I will use two variables: the voter's gender and the candidate the voter preferred in the 2008 election. The voter's gender is coded as "gender_respondent" in the dataset, and it is a categorical variable taking the values "Female" or "Male". The candidate the voter preferred is coded as "interest_whovote2008" in the dataset, and it is also a categorical variable, taking the values "Barack Obama", "John Mccain" and "Other".

### Exploratory data analysis:

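    # Keep only respondents who reported voting in the 2008 election
    anesVot <- subset(anes, interest_voted2008 == "Yes, Voted")
    # Keep the two variables of interest
    anesVot <- subset(anesVot, select = c(gender_respondent, interest_whovote2008))
    # Drop respondents who did not report a candidate
    anesVot <- na.omit(anesVot)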

Since the dataset also includes people who didn't vote in the 2008 election, I first create a subset of the original dataset that contains only the people who actually voted in 2008, keeping the two variables of interest. Among the people who reported voting, those who didn't report which candidate they voted for are then excluded. The resulting dataset, anesVot, is used in the rest of this analysis.

    N <- nrow(anesVot)
    N

    ## [1] 4520

The resulting dataset anesVot contains 4520 observations. Each case represents an individual who not only voted in the 2008 election but also reported who he or she voted for.

    table(anesVot$gender_respondent)

    ## 
    ## Female   Male 
    ##   2342   2178

    table(anesVot$interest_whovote2008)

    ## 
    ##    Barack Obama     John Mccain Other {Specify} 
    ##            2704            1702             114

    numFemale <- 2342
    numMale <- 2178
    VBarack <- 2704
    VJohn <- 1702
    VOther <- 114

Overall, there are 2342 females and 2178 males in the dataset. There are 2704 voters who voted for Barack Obama, 1702 who voted for John McCain, and 114 who voted for a candidate other than these two.

    table(anesVot$gender_respondent, anesVot$interest_whovote2008)

    ##         
    ##          Barack Obama John Mccain Other {Specify}
    ##   Female         1479         820              43
    ##   Male           1225         882              71

    femaleVBarack <- 1479
    round(femaleVBarack/numFemale, 2)

    ## [1] 0.63

    maleVBarack <- 1225
    round(maleVBarack/numMale, 2)

    ## [1] 0.56

63% of female voters voted for Barack Obama, compared with 56% of male voters.

    # Counts taken from the cross-tabulation: 820 women and 882 men
    femaleVJohn <- 820
    round(femaleVJohn/numFemale, 2)

    ## [1] 0.35

    maleVJohn <- 882
    round(maleVJohn/numMale, 2)

    ## [1] 0.4

35% of female voters voted for John McCain, compared with 40% of male voters.

    # Counts taken from the cross-tabulation: 43 women and 71 men
    femaleVOther <- 43
    round(femaleVOther/numFemale, 2)

    ## [1] 0.02

    maleVOther <- 71
    round(maleVOther/numMale, 2)

    ## [1] 0.03

2% of female voters and 3% of male voters voted for a candidate other than the two major competitors. So it seems from these calculations that female voters were somewhat more likely than male voters to prefer Barack Obama, while male voters were somewhat more likely to prefer John McCain or another candidate.
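The per-gender shares can also be cross-checked in one step, which avoids retyping individual cell counts. The following is a minimal sketch, assuming the counts are entered by hand from the cross-tabulation above; the names `observed` and `rowShares` are illustrative and do not come from the original analysis.

```r
# Observed counts copied from the gender-by-candidate cross-tabulation.
observed <- matrix(
  c(1479, 820, 43,
    1225, 882, 71),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("Female", "Male"),
    candidate = c("Barack Obama", "John Mccain", "Other")
  )
)

# margin = 1 divides each cell by its row total, so every row gives the
# share of each candidate within one gender.
rowShares <- round(prop.table(observed, margin = 1), 2)
rowShares
```

With the real data, the same idea would apply to `table(anesVot$gender_respondent, anesVot$interest_whovote2008)` directly.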

    MaleVot <- c(1225, 882, 71)
    FemaleVot <- c(1479, 820, 43)
    barplot(rbind(FemaleVot, MaleVot), beside = TRUE, col = c("red", "blue"),
        names.arg = c("Barack Obama", "John Mccain", "Other"))
    legend("topright", c("Female", "Male"), pch = 15, col = c("red", "blue"))

The plot above shows graphically the differences in preferred candidate between the two genders described earlier.

This difference could be real; however, it could also be simply due to chance. If we drew another sample from the people who voted in the 2008 election, the difference might not appear at all. So I want to carry out a Chi-square independence test.
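To make the "due to chance" idea concrete, one option is to simulate tables in which gender and candidate are unrelated and see how large the gap in Obama's share between women and men typically gets. This is only an illustrative sketch and not part of the original analysis: it assumes the row and column totals are held fixed (which is what `r2dtable()` from base R's `stats` package does), and the seed, the number of simulations, and all variable names are arbitrary choices.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

femaleTotal <- 2342
maleTotal <- 2178
colTotals <- c(2704, 1702, 114)  # Obama, McCain, Other

# Observed gap in Obama's share between female and male voters.
observedDiff <- 1479 / femaleTotal - 1225 / maleTotal

# Random 2 x 3 tables with the same margins, generated under independence.
simTables <- r2dtable(2000, c(femaleTotal, maleTotal), colTotals)

# Gap in Obama's share (column 1) for each simulated table.
simDiff <- vapply(
  simTables,
  function(tab) tab[1, 1] / femaleTotal - tab[2, 1] / maleTotal,
  numeric(1)
)

# Fraction of simulated tables with a gap at least as large as the observed one.
mean(abs(simDiff) >= abs(observedDiff))
```

A small fraction here would suggest that chance alone rarely produces a gap this large; the Chi-square test formalizes the same question across all cells of the table.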

### Inference:

Since I want to test whether two categorical variables are related, I will use the Chi-square independence test.

The null hypothesis for my Chi-square independence test is that voter gender and preference among candidates are independent. The alternative hypothesis is that voter gender and preference among candidates are dependent.

Let's check the conditions for carrying out the Chi-square independence test. The data were collected from a random sample of U.S. voters without replacement. According to the official [Federal Election Commission report](http://www.fec.gov/pubrec/fe2008/federalelections2008.pdf) [2], 131,313,820 votes were cast in the 2008 election.

    N/131313820 < 0.1

    ## [1] TRUE

So the sample is less than 10% of the population, and the observations can reasonably be treated as independent.

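The expected count for each cell is calculated using the formula

$$E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{table total}}$$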

Calculating expected values for each cell will produce a table as follows:

    column0 <- rbind("Male", "Female", "Total")
    columnBarack <- rbind(numMale/N * VBarack, numFemale/N * VBarack, 1 * VBarack)
    columnJohn <- rbind(numMale/N * VJohn, numFemale/N * VJohn, 1 * VJohn)
    columnOther <- rbind(numMale/N * VOther, numFemale/N * VOther, 1 * VOther)
    columnTotal <- rbind(1 * numMale, 1 * numFemale, 1 * N)
    expectedTable <- data.frame(A = column0, B = columnBarack, C = columnJohn,
        D = columnOther, E = columnTotal)
    names(expectedTable) <- c(" ", "Barack", "John", "Other", "Total")
    expectedTable

    ##          Barack   John  Other Total
    ## 1   Male   1303  820.1  54.93  2178
    ## 2 Female   1401  881.9  59.07  2342
    ## 3  Total   2704 1702.0 114.00  4520

So the condition that each cell has an expected count of at least 5 is satisfied.
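The same expected counts can be obtained more compactly from the row and column totals. This is a sketch under the assumption that the observed counts are typed in from the cross-tabulation; `observed` and `expected` are illustrative names that do not appear in the original code.

```r
# Observed counts copied from the gender-by-candidate cross-tabulation.
observed <- matrix(
  c(1479, 820, 43,
    1225, 882, 71),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("Female", "Male"),
    candidate = c("Barack Obama", "John Mccain", "Other")
  )
)

# Expected count = (row total * column total) / table total for every cell.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)

# Check the expected-count condition for all cells at once.
all(expected >= 5)
```

Using `outer()` computes every row-total by column-total product in a single call, so no cell has to be written out by hand.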

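Recall that the formula for the Chi-square statistic is:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ and $E_i$ are the observed and expected counts in cell $i$, and $k$ is the number of cells. The formula for the degrees of freedom is:

$$df = (R - 1) \times (C - 1)$$

where $R$ and $C$ are the numbers of rows and columns in the table.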

    # Each term is (observed - expected)^2 / expected for one cell
    Chisquare <- (1479 - 1401.055)^2/1401.055 + (1225 - 1302.945)^2/1302.945 +
        (820 - 881.877)^2/881.877 + (882 - 820.123)^2/820.123 +
        (43 - 59.06814)^2/59.06814 + (71 - 54.93186)^2/54.93186
    Chisquare

    ## [1] 27.08

    df <- (3 - 1) * (2 - 1)
    df

    ## [1] 2

    pchisq(Chisquare, df, lower.tail = FALSE)

    ## [1] 1.317e-06
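Typing each expected value into the statistic by hand is error-prone, so it may help to compute the same quantities from the table itself. The sketch below assumes the observed counts are entered from the cross-tabulation; `observed`, `expected`, `chiSquare`, `degreesOfFreedom`, and `pValue` are illustrative names.

```r
# Observed counts copied from the gender-by-candidate cross-tabulation.
observed <- matrix(
  c(1479, 820, 43,
    1225, 882, 71),
  nrow = 2, byrow = TRUE
)

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Sum of (observed - expected)^2 / expected over all cells.
chiSquare <- sum((observed - expected)^2 / expected)

# (rows - 1) * (columns - 1)
degreesOfFreedom <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the Chi-square distribution.
pValue <- pchisq(chiSquare, degreesOfFreedom, lower.tail = FALSE)

c(statistic = chiSquare, df = degreesOfFreedom, p.value = pValue)
```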

R has a built-in function that can do this calculation for us.

    chisq.test(table(anesVot$gender_respondent, anesVot$interest_whovote2008))

    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  table(anesVot$gender_respondent, anesVot$interest_whovote2008)
    ## X-squared = 27.08, df = 2, p-value = 1.317e-06

The output agrees with the calculation carried out earlier.

The resulting p-value (1.317e-06) is well below any conventional significance level (for example, 0.05), which suggests that we should reject the null hypothesis and conclude that there is a dependent relationship between voter gender and preference among candidates.
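The test itself only says that the two variables are related, not which cells drive the result. One way to look at that is through the standardized residuals that `chisq.test()` returns in its `stdres` component. This is a hedged sketch rather than part of the original analysis; it assumes the counts are entered by hand from the cross-tabulation, and `observed` and `result` are illustrative names.

```r
# Observed counts copied from the gender-by-candidate cross-tabulation.
observed <- matrix(
  c(1479, 820, 43,
    1225, 882, 71),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("Female", "Male"),
    candidate = c("Barack Obama", "John Mccain", "Other")
  )
)

result <- chisq.test(observed)

# Positive values mark cells with more voters than independence predicts,
# negative values mark cells with fewer.
round(result$stdres, 2)
```

Cells with large positive or negative residuals are the ones that contribute most to the departure from independence.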

### Conclusion:

This research finds that, among voters in the 2008 election, gender and candidate preference are associated. Because this is an observational study, the result does not show that gender causes the preference. It seems that Barack Obama won more favour among women than John McCain did.

As for future studies, I would like to find data for earlier and later elections and see whether the same kind of dependent relationship appears there.

### References

[1] The American National Election Studies. http://www.electionstudies.org/

[2] Official Federal Election Commission report. http://www.fec.gov/pubrec/fe2008/federalelections2008.pdf