Comment:SXSW Action - Sample Data

This is the commentary page for discussing improvements to the SXSW Action - Sample Data article.

This is not a forum for general discussion about the article's subject. As in all areas of the GridRepublic community, we ask that you be polite, assume good faith on the part of other participants, avoid personal attacks, and in general be welcoming.

My approach (dictated by having just a few minutes) was rather straightforward:

load the data:

malaria_sample <- read.csv("/home/rob/malaria/data.txt", sep="\t")
str(malaria_sample)

'data.frame':	2074 obs. of  26 variables:

Then a very simple OLS model, using rcs() spline terms from the rms package:

library(rms)  # provides datadist() and rcs()
x <- datadist(malaria_sample)
options(datadist="x")
attach(malaria_sample)
model1 <- lm(lossfunction ~
rcs(parameter_3, 5)+
rcs(parameter_4, 5)+
rcs(parameter_5, 5)+
# rcs(parameter_6, 5)+
rcs(parameter_7, 5)+
# rcs(parameter_8, 5)+
rcs(parameter_9, 5)+
rcs(parameter_10, 5)+
rcs(parameter_11, 5)+
# rcs(parameter_12, 5)+
rcs(parameter_13, 5)+
rcs(parameter_14, 5)+
rcs(parameter_15, 5)+
# rcs(parameter_16, 5)+
rcs(parameter_17, 5)+
rcs(parameter_18, 5)+
rcs(parameter_19, 5)+
rcs(parameter_20, 5)+
rcs(parameter_21, 5) +
# rcs(parameter_22, 5)+
# rcs(parameter_23, 5)+
rcs(parameter_24, 5)+
rcs(parameter_25, 5)+
rcs(parameter_26, 5)+
rcs(parameter_27, 5)+
rcs(parameter_28, 5)+
rcs(parameter_29, 5)+
rcs(parameter_30, 5))

I didn't say it was elegant, but...

Multiple R-squared: 0.6076,	Adjusted R-squared: 0.5908  
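The long formula above could also be assembled programmatically instead of typed term by term. This is only a sketch: it assumes the columns really are named parameter_3 through parameter_30 as in the post, and it only builds the formula object, so it runs without rms or the data file. The names kept, spline_terms and model1_formula are made up for illustration.

```r
# Indices of the parameters kept in model1; the commented-out ones are dropped
kept <- setdiff(3:30, c(6, 8, 12, 16, 22, 23))

# One 5-knot restricted cubic spline term per parameter, as text
spline_terms <- paste0("rcs(parameter_", kept, ", 5)")

# Assemble the formula; nothing is evaluated until it is handed to lm()
model1_formula <- reformulate(spline_terms, response = "lossfunction")
print(model1_formula)

# With rms loaded and the data read in, the fit would then be:
# model1 <- lm(model1_formula, data = malaria_sample)
```

Passing data = malaria_sample to lm() would also avoid the need for attach().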

Then I did what I should not have done (Prof. Harrell) and looked at only the four "significant" variables, 7, 14, 21 and 27. That model was:

model2 <- lm(lossfunction ~
rcs(parameter_7, 5)+
rcs(parameter_14, 5)+
rcs(parameter_21, 5) +
rcs(parameter_27, 5))

which yielded an adjusted R2 of 0.576. Here are the parameter estimates:


Call:
lm(formula = lossfunction ~ rcs(parameter_7, 5) + rcs(parameter_14, 5) +
    rcs(parameter_21, 5) + rcs(parameter_27, 5))

Residuals:
    Min      1Q  Median      3Q     Max
-419.21  -81.37  -18.05   63.25  936.70

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)
(Intercept)                          6.715e+02  2.797e+01  24.008  < 2e-16 ***
rcs(parameter_7, 5)parameter_7       8.321e-06  1.640e-06   5.074 4.26e-07 ***
rcs(parameter_7, 5)parameter_7'     -1.799e-02  4.723e-03  -3.808 0.000144 ***
rcs(parameter_7, 5)parameter_7''     2.387e-02  6.368e-03   3.748 0.000183 ***
rcs(parameter_7, 5)parameter_7'''   -5.922e-03  1.689e-03  -3.506 0.000465 ***
rcs(parameter_14, 5)parameter_14    -6.359e-03  9.878e-04  -6.437 1.51e-10 ***
rcs(parameter_14, 5)parameter_14'    1.968e-01  4.146e-02   4.747 2.20e-06 ***
rcs(parameter_14, 5)parameter_14''  -7.236e-01  1.832e-01  -3.950 8.06e-05 ***
rcs(parameter_14, 5)parameter_14'''  6.740e-01  2.048e-01   3.291 0.001016 **
rcs(parameter_21, 5)parameter_21    -9.010e-06  2.960e-07 -30.444  < 2e-16 ***
rcs(parameter_21, 5)parameter_21'    1.610e-04  8.521e-06  18.898  < 2e-16 ***
rcs(parameter_21, 5)parameter_21''  -1.094e-03  7.321e-05 -14.948  < 2e-16 ***
rcs(parameter_21, 5)parameter_21'''  9.655e-04  6.962e-05  13.869  < 2e-16 ***
rcs(parameter_27, 5)parameter_27     1.822e+05  7.527e+04   2.420 0.015587 *
rcs(parameter_27, 5)parameter_27'   -4.914e+06  2.700e+06  -1.820 0.068833 .
rcs(parameter_27, 5)parameter_27''   1.045e+07  5.248e+06   1.991 0.046636 *
rcs(parameter_27, 5)parameter_27''' -7.282e+06  3.096e+06  -2.352 0.018763 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 158.7 on 2057 degrees of freedom
Multiple R-squared:  0.5796,	Adjusted R-squared:  0.5763
F-statistic: 177.2 on 16 and 2057 DF,  p-value: < 2.2e-16

You can re-run it yourself. None of the four parameters behaves in a way that suggests a consistent gradient with lossfunction (e.g. lossfunction steadily increasing or diminishing as parameter_x increases or decreases). Ahh, for a bit of theory!
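One way to check for a consistent gradient is to predict from the spline fit over a grid of one parameter and see whether the fitted curve only ever goes up or only ever goes down. The sketch below uses simulated data, not the malaria file, and swaps in ns() from the base splines package as a stand-in for rcs() (a natural cubic spline with 4 degrees of freedom, i.e. 5 knots counting the boundaries). The variable names are hypothetical.

```r
library(splines)

set.seed(7)
n <- 400
x <- runif(n, 0, 10)                        # stand-in for one parameter
y <- 200 + 50 * sin(x) + rnorm(n, sd = 20)  # deliberately non-monotone response

# Natural cubic spline fit, comparable in flexibility to rcs(x, 5)
fit <- lm(y ~ ns(x, df = 4))

# Predict along an even grid of x and look at the direction of each step
grid <- data.frame(x = seq(min(x), max(x), length.out = 50))
grid$fitted <- predict(fit, newdata = grid)
steps <- diff(grid$fitted)

# TRUE only if the fitted curve never changes direction
monotone <- all(steps >= 0) || all(steps <= 0)
print(monotone)
```

Plotting grid$fitted against grid$x would show the shape directly. With rms itself, the same idea would apply to the real model, one parameter at a time.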

So it does seem that strictly linear assumptions about the relationships between the independent and dependent variables are not valid.

The lossfunction is also highly skewed:

fivenum(lossfunction)
[1]   62.67194  122.56499  172.84854  279.84904 1875.62055
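To get a rough feel for how much a log transform would reduce that skew, one can work from the five numbers quoted above alone. This is a sketch. It relies on log being monotone, so the log of each quoted value is (approximately, for the median and hinges) the matching summary value of log(lossfunction). The helper tail_ratio is made up for illustration.

```r
# Five-number summary of lossfunction as quoted in the post
five <- c(min = 62.67194, q1 = 122.56499, median = 172.84854,
          q3 = 279.84904, max = 1875.62055)

# Length of the upper tail relative to the lower tail; 1 would be symmetric
tail_ratio <- function(s) {
  unname((s["max"] - s["median"]) / (s["median"] - s["min"]))
}

ratio_raw <- tail_ratio(five)        # on the original scale
ratio_log <- tail_ratio(log(five))   # on the log scale
print(c(raw = ratio_raw, log = ratio_log))
```

The ratio falls considerably on the log scale but stays above 1, so the transformed response would still be somewhat right-skewed.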

Time to put kids to bed...

Obviously PCA is the next place to go, though I'm not sure it will get us much further. Robust regression, or maybe a log transform of lossfunction, might be next steps.
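A minimal sketch of the PCA step, using base R's prcomp() on simulated stand-in columns rather than the real data. The column names and the correlation structure are invented. Scaling is switched on because the coefficient magnitudes above (from about 1e-06 to 1e+06) suggest the real parameters sit on very different scales.

```r
set.seed(42)
n <- 200
z <- rnorm(n)  # shared signal so that two columns are correlated

# Hypothetical stand-ins for parameter columns on very different scales
params <- data.frame(
  parameter_a = z + rnorm(n, sd = 0.3),
  parameter_b = 1000 * (z + rnorm(n, sd = 0.3)),
  parameter_c = rnorm(n)
)

# Centre and scale so no column dominates purely because of its units
pca <- prcomp(params, center = TRUE, scale. = TRUE)

# Share of total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Component scores, which could replace the raw parameters in a regression
scores <- as.data.frame(pca$x)
print(head(scores, 3))
```

Whether the leading components relate to lossfunction at all is a separate question that only the real data can answer.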

Cut and Paste of R source code and output

Cut and paste from RStudio (0.95) into your web tool works very badly. Is there a trick?


More data

It says on another page that the data set contains "tens of thousands of records", but the one for download here has only 2074. Tens of thousands of records would not make for an unwieldy file, it seems. Why not put up the whole data set?


analyzing interactions

On a first look, it seems that the parameters interact in complex ways. I can't see any names with these posts, but the first poster's method is interesting as a first step. I think we need to get a better handle on how different parameter combinations work together. I'm thinking of running a pairwise MIC, and trying to think of a way to do more complex groupings. Ideas?

- Aaron
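As a starting point for the pairwise screening idea, here is a sketch on simulated data. It does not compute MIC. It swaps in Spearman rank correlation from base R as a simpler stand-in, then compares a model with and without one interaction term. The columns p1, p2 and p3 and the simulated interaction are invented for illustration.

```r
set.seed(1)
n <- 500
sim <- data.frame(p1 = runif(n), p2 = runif(n), p3 = runif(n))

# Response built with a p1:p2 interaction and a skewed (log-normal) shape
sim$lossfunction <- exp(1 + 2 * sim$p1 * sim$p2 + rnorm(n, sd = 0.2))

# Pairwise Spearman correlations (stand-in for MIC), one row per pair
rho <- cor(sim, method = "spearman")
idx <- which(upper.tri(rho), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(rho)[idx[, "row"]],
  var2 = colnames(rho)[idx[, "col"]],
  rho  = rho[idx]
)
pairs <- pairs[order(-abs(pairs$rho)), ]
print(pairs)

# Does adding the p1:p2 interaction improve on the additive model?
fit_add <- lm(log(lossfunction) ~ p1 + p2 + p3, data = sim)
fit_int <- lm(log(lossfunction) ~ p1 * p2 + p3, data = sim)
cmp <- anova(fit_add, fit_int)
print(cmp)
```

A rank correlation only picks up monotone pairwise association, so it will miss some of what MIC is meant to catch. The nested-model comparison is one way to test a specific pair once a candidate interaction has been spotted.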