Tuesday, June 1, 2010

Final Project Presentation

Objective: Legibly represent changes in household expenditure rankings between study years (1986, 1990, 1997, 2006). Attempt to include factors such as cash crop production, HIV prevalence, and geography (village membership).

##old method: expenditure transition matrices in Excel spreadsheets
Not very easy to understand. Go on, keep staring.

-----------------------------------
##new method: R
#prepare workspace
library(lattice)
library(car)
library(MASS)
library(RColorBrewer)
#load data
malawi <- read.csv("MwExpends2.csv")
attach(malawi)
#only households whose tobacco status did not change (yes or no for all years)
malawi.lim <- read.csv("MwExpends4.csv")

#compare scatterplots
par(mfrow=c(2,2))
plot(rank90, rank86, main="Household Ranking, 1986-1990", xlab="Rank 1990", ylab="Rank 1986", pch=19, col=tob90)
plot(rank97, rank90, main="Household Ranking, 1990-1997", xlab="Rank 1997", ylab="Rank 1990", pch=19, col=tob97)
plot(rank06, rank97, main="Household Ranking, 1997-2006", xlab="Rank 2006", ylab="Rank 1997", pch=19, col=tob06)
plot(rank06, rank97, main="Household Ranking, 1997-2006", xlab="Rank 2006", ylab="Rank 1997", pch=19, col=hiv06v2)
#file: "scatterplot.quad"

Apologies for the lack of legends; I'm still struggling with those. In the first three plots, red = tobacco growers. In the last plot (bottom right), red = HIV presence.
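For what it's worth, a legend could probably be added along the following lines. This is only a sketch with made-up data: `rank.a`, `rank.b` and `grower` are hypothetical stand-ins for columns like rank90, rank86 and tob90, and it assumes the grouping variable is coded 1 = no, 2 = yes, so that R's default palette draws 1 in black and 2 in red (a reddish shade in newer versions of R).

```r
# Hypothetical stand-in data: two rankings and a 1/2 grouping code
set.seed(1)
n <- 40
rank.a <- sample(n)                        # e.g. ranks in the later year
rank.b <- sample(n)                        # e.g. ranks in the earlier year
grower <- sample(1:2, n, replace = TRUE)   # assumed coding: 1 = no, 2 = yes

# Numeric colors index the default palette, so reuse the codes in the legend
legend.labels <- c("No tobacco", "Tobacco")
legend.colors <- sort(unique(grower))

plot(rank.a, rank.b, pch = 19, col = grower,
     main = "Household Ranking (made-up data)",
     xlab = "Later rank", ylab = "Earlier rank")

# A keyword position avoids having to guess x/y coordinates
legend("topright", legend = legend.labels, col = legend.colors,
       pch = 19, bty = "n", cex = 0.8)
```

With par(mfrow=c(2,2)), legend() would need to be called once after each plot() so that every panel gets its own key.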


#add lines
?????????
Well, it would be fun, but I don't know what kind of lines to add. A regression or loess line wouldn't make sense, because neither variable is dependent on the other.
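One line that might still make sense is the 45-degree identity line. It doesn't treat either year as the dependent variable; it just marks "no change in rank", so the further a point sits from it, the more that household moved. Below is a rough sketch with made-up ranks (`r1` and `r2` are hypothetical, not my data). Spearman's correlation is included as a symmetric summary of how well two rankings agree.

```r
set.seed(2)
n <- 50
r1 <- sample(n)                      # hypothetical ranks, earlier year
r2 <- rank(r1 + rnorm(n, sd = 15))   # hypothetical ranks, later year

plot(r2, r1, pch = 19, xlab = "Later rank", ylab = "Earlier rank")
abline(a = 0, b = 1, lty = 2)        # identity line: rank unchanged

# Rank correlation is the same whichever year is listed first
rho <- cor(r1, r2, method = "spearman")
rho
```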


------------------------------------------
#scatterplot matrix
splom(malawi.lim[ c(2,3,4,5)], groups=tob)
#file: "splom.limit"

Legend: purple = tobacco, blue = non-tobacco

------------------------------------------
#boxplots
par(mfrow=c(2,2))
boxplot(diff86.90~tob90,data=malawi, main="Change in Household Rankings 1986-1990", xlab="1 = No Tobacco, 2 = Yes Tobacco", ylab="Change in Rank")
boxplot(diff90.97~tob97,data=malawi, main="Change in Household Rankings 1990-1997", xlab="1 = No Tobacco, 2 = Yes Tobacco", ylab="Change in Rank")
boxplot(diff.97.06~tob06,data=malawi, main="Change in Household Rankings 1997-2006", xlab="1 = No Tobacco, 2 = Yes Tobacco", ylab="Change in Rank")
boxplot(diff.97.06~hiv06v2,data=malawi, main="Change in Household Rankings 1997-2006", xlab="1 = No HIV Deaths, 2 = 1 or more HIV Deaths", ylab="Change in Rank")
#file: "boxplots"


----------------------------------------
#parallel coordinate plots: yearly change in rank
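#malawi[,,3], malawi[,,4], etc. each return the whole data frame, so the rbind() I first used here only stacked four identical copies; using the data frame directly draws the same lines
mw <- malawi
parcoord((mw)[, c(3,4,5,6)])
#file: "matplot"

What a mess!!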


------------------------------------------
#colored by hiv06
parcoord((mw)[, c(3,4,5,6)], col=hiv06v2)
#file: "matplot.hiv06"

Not much better. Remember that red = HIV, but the HIV data were gathered in 2006. The impact of the disease was arguably negligible before that year of study. This means that only the last one-third of the plot is meaningful, and it doesn't really tell us much.

------------------------------------------
#by cluster
parcoord((mw)[, c(3,4,5,6)], col=cluster)
#file: "matplot.cluster"

Well this one has pretty colors, but all it really tells us is that there is not really any pattern in change in ranking discernible by village cluster. Thus my failed attempt at being geographical has stalled.

------------------------------------------
#starting (1986) quartile compared to ending (2006) quartile
par(mfrow=c(2,1))
parcoord((mw)[, c(3,4,5,6)], col=q86)
parcoord((mw)[, c(3,4,5,6)], col=q06)
#file: "matplot.quartiles86.06"

Hmmm...this is a mild improvement. By color-coding the lines according to expenditure quartiles, one can at least pick out beginning (upper plot) and ending (lower plot) positions. It makes it a bit easier to confirm that the rankings are VERY variable. Some of the colors do tend to stay in place, especially blue, for the highest quartile. Do the rich stay rich?
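A quick numerical check to go with the plot would be to cross-tabulate starting and ending quartiles and see how many households in the top quartile stay put. The sketch below uses made-up expenditures; `exp86` and `exp06` are hypothetical names rather than columns from my files, and I'm assuming quartiles are cut from each year's own distribution.

```r
set.seed(3)
n <- 200
exp86 <- rlnorm(n)                        # hypothetical 1986 expenditures
exp06 <- exp86 * rlnorm(n, sdlog = 0.8)   # hypothetical 2006 expenditures

# Assign each household to a quartile (1 = lowest, 4 = highest)
to.quartile <- function(x) {
  cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
      include.lowest = TRUE, labels = 1:4)
}
q.start <- to.quartile(exp86)
q.end   <- to.quartile(exp06)

# Rows = starting quartile, columns = ending quartile
trans <- table(start = q.start, end = q.end)
trans

# Share of each starting quartile that ends up in each quartile (rows sum to 1)
round(prop.table(trans, margin = 1), 2)
```

The diagonal holds the households that ended in the quartile they started in, so the bottom-right cell speaks most directly to the question about the richest group.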

------------------------------------------
#by 20-year tobacco status
#malawi.lim[,,2] etc. each return the whole data frame, so use it directly instead of rbind()-ing four identical copies
mw.l <- malawi.lim
parcoord((mw.l)[, c(2,3,4,5)], col=tob)
#file: "matplot.tob.lim"

I suppose this cleans it up a bit. In order to use the parallel coordinate plot to track tobacco growers, I had to eliminate the households whose tobacco status changed from year to year, which was the vast majority of them. I was left with 37 households (again, black = no, red = yes). From this, one could say that the richer families who grew tobacco and stayed with it by and large stayed in the top 50% of rankings. I probably could not have told you that from the aggregate data we've been using in the past. So...interesting, but a tad underwhelming.

Monday, May 24, 2010

Assignment #6 - Spatial Autocorrelation

---------------
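#load packages: maptools for readShapePoly, spdep for the neighbor and Moran functions, classInt and RColorBrewer for the class-interval maps
library(maptools)
library(spdep)
library(classInt)
library(RColorBrewer)
orcounty <- readShapePoly("orcounty.shp",proj4string=CRS("+proj=longlat"))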
plot(orcounty)
#save image now: "plot.orcounty.pdf"



---------------
summary(orcounty)
coordinates(orcounty)
centers=coordinates(orcounty)

centers=data.frame(centers)
points(centers,col="blue",cex=1.2)
text(centers,labels=rownames(centers),cex=1.5)
orcounty.centers = coordinates(orcounty)
#save image now: "orcounty.labels.pdf"



---------------
k=1
knn1 = knearneigh(orcounty.centers,k,longlat=T)
orcounty.knn1=knn2nb(knn1)

plot(orcounty)
plot(orcounty.knn1, orcounty.centers, col="blue",add=T)

#save image now: "knn1.pdf"



---------------
plot(orcounty)
k=2

knn2 = knearneigh(orcounty.centers,k,longlat=T)

orcounty.knn2=knn2nb(knn2)
plot(orcounty.knn2, orcounty.centers, col="blue",add=T)

#save image now: "knn2.pdf"


---------------
plot(orcounty)
k=3
knn3 = knearneigh(orcounty.centers,k,longlat=T)

orcounty.knn3=knn2nb(knn3)

plot(orcounty.knn3, orcounty.centers, col="blue",add=T)

#save image now: "knn3.pdf"

---------------
plot(orcounty)
k=4

knn4 = knearneigh(orcounty.centers,k,longlat=T)

orcounty.knn4=knn2nb(knn4)

plot(orcounty.knn4, orcounty.centers, col="blue",add=T)

#save image now: "knn4.pdf"



---------------
plot(orcounty)
k=5
knn5 = knearneigh(orcounty.centers,k,longlat=T)
orcounty.knn5=knn2nb(knn5)

plot(orcounty.knn5, orcounty.centers, col="blue",add=T)

#save image now: "knn5.pdf"


---------------
d=100
orcounty.dist.100 = dnearneigh(orcounty.centers,0,d,longlat=T)

plot(orcounty)

plot(orcounty.dist.100, orcounty.centers,add=T,lwd=2,col="red")

#save image now: "d100.pdf"


---------------
d=200

orcounty.dist.200 = dnearneigh(orcounty.centers,0,d,longlat=T)
plot(orcounty)

plot(orcounty.dist.200, orcounty.centers,add=T,lwd=2,col="red")

#save image now: "d200.pdf"



---------------
d=150
orcounty.dist.150 = dnearneigh(orcounty.centers,0,d,longlat=T)
plot(orcounty)

plot(orcounty.dist.150, orcounty.centers,add=T,lwd=2,col="red")
#save image now: "d150.pdf"



---------------
d=15
orcounty.dist.15 = dnearneigh(orcounty.centers,0,d,longlat=T)

plot(orcounty)

plot(orcounty.dist.15, orcounty.centers,add=T,lwd=2,col="red")
#save image now: "d15.pdf"


---------------
d=1000
orcounty.dist.1000 = dnearneigh(orcounty.centers,0,d,longlat=T)
plot(orcounty)
plot(orcounty.dist.1000, orcounty.centers,add=T,lwd=2,col="red")
#save image now: "d1000.pdf"



---------------
orcounty.lags=nblag(orcounty.knn2,2)
plot(orcounty)
plot(orcounty.lags[[2]],orcounty.centers, add=T,lwd=3,col="green",lty=2)

#save image now: "orcounty.lag2.pdf"


---------------
w.cols = 1:36
w.rows = 1:36

w.mat.knn = nb2mat(orcounty.knn1, zero.policy=TRUE)
w.mat.knn

image(w.cols,w.rows,w.mat.knn,col=brewer.pal(3,"BuPu"))

#save image now: "knn1.matrix.pdf"



---------------
w.mat.dist = nb2mat(orcounty.dist.100, zero.policy=TRUE)
image(w.cols,w.rows,w.mat.dist,col=brewer.pal(9,"PuRd"))
#save image now: "d100.matrix.pdf"


---------------
breaks = round(quantile(orcounty$MEDIANRENT))
colors = c("red","orange","yellow","green")
plot(orcounty,col=colors[findInterval(orcounty$MEDIANRENT,breaks,all.inside=TRUE)])
#save image now: "orcounty.medianrent.pdf"


---------------
display.brewer.all()
nclr = 4
plotclr = brewer.pal(nclr,"PuRd")
class = classIntervals(orcounty$MEDIANRENT,nclr,style="quantile")
colcode = findColours(class,plotclr)
plot(orcounty,col=colcode)
title(main="Median Rent in Oregon",sub="Quantiles")

#legend code not working:

##legend(71.5,35,legend=names(attr(colcode, "table")),fill=attr(colcode, "palette"), cex=0.75,bty="n")

#tried to alter parameters (guessing this is what I should do)

##legend(321.2,242.2,legend=names(attr(colcode, "table")),fill=attr(colcode, "palette"), cex=0.75,bty="n")

#didn't work, so just save image as is: "orcounty.medianrent.PuRd"
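
Looking back, the likely culprit is that legend() takes its x/y position in the plot's own coordinates. On a longlat map of Oregon those are degrees (roughly -124 to -117 longitude and 42 to 46 latitude), so (71.5, 35) and (321.2, 242.2) fall well outside the plotting region. A keyword position such as "bottomleft" sidesteps the problem. Here is a sketch in base R with made-up points instead of the real shapefile; `rent`, `x`, `y` and the four colors are all hypothetical stand-ins.

```r
# Made-up stand-in for orcounty$MEDIANRENT and the county centers
set.seed(4)
rent <- round(runif(36, 250, 650))
x <- runif(36, -124, -117)   # rough longitude range of Oregon
y <- runif(36, 42, 46)       # rough latitude range of Oregon

# Quantile classes and a hand-picked four-color purple-red ramp
breaks  <- quantile(rent)
plotclr <- c("#F1EEF6", "#D7B5D8", "#DF65B0", "#CE1256")
cls     <- findInterval(rent, breaks, all.inside = TRUE)

plot(x, y, pch = 21, bg = plotclr[cls], cex = 2,
     xlab = "Longitude", ylab = "Latitude")

# The plot's coordinate range; numeric legend positions must fall inside it
par("usr")

# Class labels built from the break points, placed by keyword
labs <- paste(head(breaks, -1), tail(breaks, -1), sep = " - ")
legend("bottomleft", legend = labs, fill = plotclr, cex = 0.75, bty = "n")
```

The same keyword should work with the original legend() call, keeping the legend= and fill= arguments taken from attr(colcode, "table") and attr(colcode, "palette").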

---------------
moran.plot(orcounty$MEDIANRENT,nb2listw(orcounty.dist.200),labels=orcounty$NAME)
#save image now: "orcounty.moran.pdf"

---------------

moran.test(orcounty$MEDIANRENT,nb2listw(orcounty.dist.200, style="W"))
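
To remind myself what moran.test() is actually measuring, here is Moran's I worked out by hand on a toy example. It uses four made-up regions in a row, where each region's neighbors are the ones next to it, with row-standardized weights (the same idea as style="W"). None of this is the Oregon data.

```r
x <- c(10, 12, 20, 22)    # made-up values for four regions in a chain
n <- length(x)

# Binary adjacency for the chain 1-2-3-4
adj <- matrix(0, n, n)
adj[cbind(1:3, 2:4)] <- 1
adj <- adj + t(adj)

# Row-standardize so each region's weights sum to 1
W <- adj / rowSums(adj)

# Moran's I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2
z  <- x - mean(x)
S0 <- sum(W)
I  <- (n / S0) * sum(W * outer(z, z)) / sum(z^2)
I                         # 56/104, about 0.54

# Expected value under no spatial autocorrelation
-1 / (n - 1)
```

Similar values sit next to each other in this toy chain, so I comes out positive and well above the expected value of about -0.33.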


Monday, May 17, 2010

Final Project Proposal

This project grows out of a number of data-related issues I’ve been thinking about for a few years now. I joined the Zomba project (as I informally refer to it) in 2007 as a data processor. It is a series of research projects begun in 1986 by Pauline Peters (Harvard Center for International Development) in a rural area of the Zomba District in southern Malawi. I was hired to process the data from her most recent round of data collection, in 2006, and I continued the data collection in Malawi during the summer of 2008. My current priority is to make sure our use of the data is statistically sound. The analysis is meant to show how households’ wellbeing compares with one another.

The Zomba project has consisted of ethnographic and survey-based data collection from approximately 230 households from 6 clusters of villages in the area. The general objective is to study the response of smallholder household food security to cash crop initiatives and HIV infection. Below is a description of which data have been collected over the years, followed by a list of the problems I’d like to address.

The Data (household level)

· income indicators: expenditures (monthly/annual totals, percentiles, and totals by category, e.g., food, labor, household supplies); occupations; household assets (total value and percentiles); income from crop sales (maize and tobacco); size of landholdings;

· demographics: number of members; age of household head; dependency ratio; headship (female de jure, female de facto, joint, male, child); occupation; education level; relationship and extended family; HIV prevalence (self-reported); morbidity and mortality;

· lifestyle: daily activities; mobility;

· agriculture: crops grown; maize yield; tobacco yield; farming strategies (qualitative); fertilizer use; field size, location and use;

· [NEW] GPS coordinates for each household and some roads/paths;

· [Forthcoming] GPS polygons of farmers’ fields;

Problems/Questions

· The data are not normally distributed, but by and large we have been using parametric statistics that assume they are. I need to find a more appropriate approach.

· The data are biased and cannot be described as representing a general population. Dr. Peters’ original intent was to compare tobacco-growing households to non-growers. Thus, she sampled households to select an equal number of tobacco-growers and non-growers. Though she tried to select a representative sample based on other household indicators, the inclusion of a high proportion of tobacco growers biased the sample in favor of higher income and landholdings. How does this bias affect analysis? How can it be corrected for in statistical analysis? (i.e., am I simply stuck with including a footnote explaining the bias? I’m pretty sure I am…)

· These data have never been analyzed geographically. The study area is geographically limited (roughly 30 km across), so Dr. Peters, an anthropologist, believes there should be no geographical effect on the data (that is, no spatial autocorrelation). It was only in the 2008 round that I collected GPS coordinates on each household. I was unable to record coordinates for other locations, such as clinics, wells or agricultural depots. Are we in danger of ignoring geography?

· Some of the visualization methods used by Peters--both for analysis and presentation--can be improved upon. Also, I would like to integrate spatial data into these visualizations.
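
On the first problem in the list, rank-based methods are one candidate, since they don't assume any particular distribution. The sketch below uses made-up, right-skewed numbers (the group names and variables are hypothetical, not the Zomba data). It swaps the two-sample t-test for a Wilcoxon rank-sum test and Pearson's correlation for Spearman's.

```r
set.seed(5)
# Made-up, right-skewed expenditures for two hypothetical groups
growers     <- rlnorm(30, meanlog = 3.4, sdlog = 0.9)
non.growers <- rlnorm(30, meanlog = 3.0, sdlog = 0.9)

# Rank-based alternative to the two-sample t-test
wt <- wilcox.test(growers, non.growers)
wt$p.value

# Rank correlation, e.g. between landholding size and expenditure
land   <- rlnorm(30)
expend <- land * rlnorm(30, sdlog = 0.5)
ct <- cor.test(land, expend, method = "spearman")
ct$estimate
```

Whether these are the right tools for the real data is a separate question. In particular, they do nothing about the sampling bias described in the second problem.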



References: (in alpha order)

Hargreaves, J., Morison, L., Gear, J., Kim, J., Makhubele, M., Porter, J., et al. (2007). Assessing household wealth in health studies in developing countries: a comparison of participatory wealth ranking and survey techniques in rural South Africa. Emerging Themes in Epidemiology, 4(1):4.

(This article compares our current method of wealth assessment to one I’ve been thinking of trying out. It uses some statistical techniques that might come in handy.)

Jayne, T.S., Takashi Yamano, Michael T. Weber, David Tschirley, Rui Benfica, Antony Chapoto and Ballard Zulu. (2003). Smallholder income and land distribution in Africa: implications for poverty reduction strategies. Food Policy, 28(3):253-275.

(This article uses spatial and statistical analysis to examine similar data to that collected for the Zomba project. It also comes from a journal I frequently read and reference, and would like to contribute to, so it makes a good example.)

Miller, D.C. (2002). Handbook of Research Design and Social Measurement, 6th ed. Thousand Oaks, CA: Sage Publications.

(General notes for research design.)

O’Sullivan, D. and D.J. Unwin. (2003). Geographic Information Analysis. Hoboken, NJ: John Wiley and Sons.

(First steps to integrating spatial data into our existing data.)

Peters, P.E., Walker, P.A., & Kambewa, D. (2008). The Effects of Increasing Rates of HIV/AIDS-related Illness and Deaths on Rural Families in the Zomba District, Malawi: a Longitudinal Study: RENEWAL Program.

(This is the most recent report produced by Peters and colleagues. I will use this to draw examples of our current use of statistical analysis and some of the problems I see.)

Serneels, S. and E. Lambin. (2001). Proximate Causes of Land-use Change in Narok District, Kenya: a Spatial Statistical Model. Agriculture, Ecosystems, and Environment, 85(1):65-81.
(Another new technique for future research? The secondary author is a prominent political ecologist whose work closely parallels my own.)

Smith, L. C., & Subandoro, A. (2008). Measuring food security using household expenditure surveys. Washington, D.C.: International Food Policy Research Institute.

(This is specific to one of the types of data we use. It should be noted that expenditures are always cross-referenced with other indicators, as they are a proxy.)

Tuesday, May 11, 2010

Fishnet Maps--Assignment #5


Look everybody! I managed to create something publishable for my blog!

Tuesday, April 6, 2010