OLS results


read 61 lines from gdp.dat
read 90 lines from work.dat

gdp.dat vs work.dat

NOTES:
1  41 missing x values discarded
2  12 missing y values discarded

13 (27.1%) points > 1 \sigma
2 (4.2%) points > 2 \sigma

y = 0.00122705*x + 46.0427

This is the model. Ostensibly, each additional $bn USD of GDP is associated with about 0.0012 percentage points of work force participation. But see the T-test below to check whether the result could be explained by chance alone.

sdx 6555.62
rss 5487.46
se 10.9221
r 0.107955

limits for beta at 90.0% CI
tc = 1.67866 at 46 d.f.
beta in 0.00122705 +- 0.00279677 = [-0.00156972, 0.00402381]

The limits say \beta could plausibly lie anywhere between -0.00157 and +0.00402. Because this 90% interval (90% being the default confidence for such tests) includes zero, we cannot rule out that \beta is "really" zero and the fitted slope arose simply by chance. The model given above is then only a "statistically weak" one.

T-tests on beta:
H0 beta == 0.000000 against H1 beta != 0.000000
calculated t = 0.736493 at 46 d.f.
|t| <= tc (1.67866 2-sided); accept H0
H0 beta == 0.000000 against H1 beta > 0.000000
t <= tc (1.30023 right tail); accept H0

Probabilities:
P(beta!=0.000000) = 0.534830
P(beta>0.000000) = 0.767415

This "P value" says that, based on the dataset, the probability that \beta is really positive is put at about 77%, i.e. there is about a 23% chance \beta is really <= 0. (Strictly, 0.23 is the one-sided p-value: the chance of seeing a slope at least this large if \beta were really zero.) Most statisticians would be uncomfortable using the model when the likelihood that we are completely wrong is this high.

limits for alpha at 90.0% CI
tc = 1.67866 at 46 d.f.
alpha in 46.0427 +- 2.90141 = [43.1413, 48.9441]

r2 = 0.0116544

The "r2" shows how much of the variation in the dependent variable corresponds to variation in the independent variable. It is a crude measure of the model's "explanation power", as distinct from its "confidence" (i.e. the 90%, above) or its "strength" (i.e. the actual value of \beta). In this case the model "explains" only about 1% of the variation in the participation rate.
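
The limits and t value for beta can be rechecked by hand from the summary figures printed above. The following is only a sketch: it assumes "sdx" is the square root of the summed squared deviations of x (the output does not define it; this reading is inferred because se/sdx reproduces the printed half-width), and it takes tc straight from the output rather than computing it from the t distribution.

```python
# Summary figures copied from the output above.
beta = 0.00122705   # fitted slope
se = 10.9221        # residual standard error
sdx = 6555.62       # ASSUMPTION: sqrt(sum((x - mean_x)**2)); not defined in the output
tc = 1.67866        # critical t printed for the 90% limits at 46 d.f.
r = 0.107955        # Pearson correlation

se_beta = se / sdx                            # standard error of the slope
half_width = tc * se_beta                     # half-width of the limits for beta
ci = (beta - half_width, beta + half_width)   # limits for beta
t = beta / se_beta                            # t statistic for H0: beta == 0
r2 = r * r                                    # coefficient of determination

print(f"beta in {beta} +- {half_width:.6g} = [{ci[0]:.6g}, {ci[1]:.6g}]")
print(f"t = {t:.6g}; reject H0 (2-sided)? {abs(t) > tc}")
print(f"r2 = {r2:.6g}")
```

Under that assumption the figures should agree with the printed ones to within rounding of the inputs. Squaring r gives an r2 of about 0.0117, i.e. the fit accounts for only about 1% of the variation in y.
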
An r2 this low means there could well be hundreds of variables "similar" to the one selected here that would do at least as well at explaining why the participation rate is as observed in the dataset. In other words, GDP is not a good explanatory variable, even if we were to accept the relationship as "statistically significant".

calculated Spearman corr = 0.214177
Testing:
H0: vars are independent
|r| <= rc (0.306000 2-sided) at 5%; accept H0

The Spearman correlation compares the ordering of the data with respect to each of the variates. If the orderings are "similar" the Spearman tends to +1.0; if the orderings are exactly opposite (i.e. highest GDP corresponds to lowest participation rate and vice versa) it tends to -1.0. The value calculated here, .21, is compared against the so-called "critical value". If the absolute value of the computed Spearman value is greater than the critical value, the relationship is significant, i.e. we reject the so-called "null hypothesis" that the vars are actually independent. Here .21 is below the critical value of .306, so at the 5% level (95% confidence) the Spearman test cannot reject independence: the vars are apparently independent.

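
As a rough illustration of how a Spearman value arises, the sketch below ranks each variate and takes the ordinary (Pearson) correlation of the ranks. The helper names and the toy numbers are made up for the example, and giving tied values the mean of their ranks is an assumption; the tool's own handling of ties is not shown in the output.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Ordinary product-moment correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    saa = sum((p - mean_a) ** 2 for p in a)
    sbb = sum((q - mean_b) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data (illustrative only, not from gdp.dat or work.dat).
x = [10, 20, 30, 40]
same_order = spearman(x, [1, 2, 3, 4])      # orderings agree
opposite_order = spearman(x, [4, 3, 2, 1])  # orderings exactly reversed

# Decision rule applied to the values printed in the output above.
significant = abs(0.214177) > 0.306000
print(same_order, opposite_order, significant)
```

The last line applies the same comparison the output reports: the computed value does not exceed the critical value, so independence is not rejected.
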
tag             x       y       yp
zimbabwe        6.2     71.5    46.0503*
cote_d'ivoire   9       39.4    46.0537
hungary         28      45.1    46.077
egypt           33      27.7    46.0832
nigeria         34      31.1    46.0844
ireland         38      36.8    46.0893
peru            38      39.9    46.0893
singapore       39      63.1    46.0905
colombia        42      44.5    46.0942
nz              42      47.1    46.0942
philippines     46      64.5    46.0991
malaysia        46      37.6    46.0991
pakistan        47      28      46.1004
portugal        58      47.8    46.1139
israel          59      51.7    46.1151
greece          66      39.6    46.1237
poland          71      48.7    46.1298
hk              77      50.1    46.1372
thailand        90      55.7    46.1531
argentina       91      38.1    46.1544
s_africa        91      38.9    46.1544
norway          103     50.9    46.1691
turkey          104     37.5    46.1703
indonesia       111     42.6    46.1789
finland         122     51.1    46.1924
denmark         122     56.7    46.1924
iran            127     26      46.1985
austria         158     45.8    46.2366
taiwan          176     41.6    46.2587
belgium         192     41.9    46.2783
sweden          219     69.3    46.3114*
switzerland     226     52.4    46.32
mexico          252     29.6    46.3519
s_korea         274     60.6    46.3789
netherlands     279     45.8    46.385
india           285     34      46.3924
australia       288     63.8    46.3961
china           424     49.6    46.563
brazil          447     43.2    46.5912
russia          480     52.6    46.6317
spain           487     38.9    46.6403
canada          569     66.3    46.7409
uk              964     50.3    47.2256
italy           1072    42      47.3581
france          1168    43.3    47.4759
germany         1692    49.6    48.1189
japan           3337    52.5    50.1374
us              5686    50.3    53.0197

The dataset is presented along with the modelled values (yp). Data points that lie more than 2 standard deviations from the model's regression line are marked with an asterisk (*). They can later be removed to see whether the OLS model improves in quality. While outliers can both create and obscure a "real" relationship, they typically add "noise" to the dataset and obscure a relationship that exists. This is more so when there are dozens (rather than "a few") data points in the set.
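
For readers who want to re-run the fit, here is a minimal sketch that recomputes the regression line from the rows listed above and flags the points marked with an asterisk. It assumes the listed x and y values are the unrounded inputs, and that "> 2 \sigma" means a residual larger in absolute value than twice the residual standard error (se, with n - 2 degrees of freedom); neither is stated in the output.

```python
import math

# Rows copied from the table above as "tag x y" triples
# (x = GDP in $bn USD, y = work force participation in %).
DATA = """
zimbabwe 6.2 71.5  cote_d'ivoire 9 39.4  hungary 28 45.1  egypt 33 27.7
nigeria 34 31.1  ireland 38 36.8  peru 38 39.9  singapore 39 63.1
colombia 42 44.5  nz 42 47.1  philippines 46 64.5  malaysia 46 37.6
pakistan 47 28  portugal 58 47.8  israel 59 51.7  greece 66 39.6
poland 71 48.7  hk 77 50.1  thailand 90 55.7  argentina 91 38.1
s_africa 91 38.9  norway 103 50.9  turkey 104 37.5  indonesia 111 42.6
finland 122 51.1  denmark 122 56.7  iran 127 26  austria 158 45.8
taiwan 176 41.6  belgium 192 41.9  sweden 219 69.3  switzerland 226 52.4
mexico 252 29.6  s_korea 274 60.6  netherlands 279 45.8  india 285 34
australia 288 63.8  china 424 49.6  brazil 447 43.2  russia 480 52.6
spain 487 38.9  canada 569 66.3  uk 964 50.3  italy 1072 42
france 1168 43.3  germany 1692 49.6  japan 3337 52.5  us 5686 50.3
"""

tokens = DATA.split()
rows = [(tokens[i], float(tokens[i + 1]), float(tokens[i + 2]))
        for i in range(0, len(tokens), 3)]
n = len(rows)

mean_x = sum(x for _, x, _ in rows) / n
mean_y = sum(y for _, _, y in rows) / n
sxx = sum((x - mean_x) ** 2 for _, x, _ in rows)
sxy = sum((x - mean_x) * (y - mean_y) for _, x, y in rows)

beta = sxy / sxx                  # least-squares slope
alpha = mean_y - beta * mean_x    # least-squares intercept

# Residuals, residual sum of squares, and residual standard error.
residuals = [(tag, y - (alpha + beta * x)) for tag, x, y in rows]
rss = sum(res ** 2 for _, res in residuals)
se = math.sqrt(rss / (n - 2))

# ASSUMPTION: a point is flagged when |residual| > 2 * se.
flagged = [tag for tag, res in residuals if abs(res) > 2 * se]

print(f"n = {n}")
print(f"y = {beta:.6g}*x + {alpha:.6g}")
print(f"rss = {rss:.6g}, se = {se:.6g}")
print("flagged:", flagged)
```

Dropping the flagged tags from DATA and re-running is one simple way to try the outlier removal described above.
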


Kym Horsell /
Kym@KymHorsell.COM
