Document:What Time Is It

From AIDS Wiki
NOTWITHSTANDING ANY OTHER NOTICE ON THIS PAGE, the material on this page is NOT available under the GNU Free Documentation License; in accordance with Title 17 U.S.C. section 107, it is posted in the manner of bulletin boards in schools and workplaces, to encourage public education and citizen awareness, without profit or payment, for persons and entities engaging in non-profit research and educational activities and purposes only.

What Time Is It? You Mean Now?
by Darin Brown

"You Bet Your Life"
9 October 2006
[postscript 6 November 2008]

My recent "Jelly not Jam" column [1] attracted a lot of attention on various "science blogs". One of the two major points people seemed to be making runs along the following lines:

They're focusing on the [fact] that because it's much more difficult to predict at the individual level exactly how much CD4 loss will correlate with viral load, well, the whole thing must just be bunk! What they apparently don't realize is that again, this is common in most studies looking at predictive criteria such as these. For example, on average, elevated blood cholesterol levels correlate with an increased risk of cardiovascular disease (CVD). But, one person may have very high blood cholesterol level and show no signs of disease, while another may have fairly low cholesterol and still have CVD. It's always tougher to apply these population-based measurements at the individual level, since population-based data by their nature average out these individual variations. [2]

Few predictive criteria in medicine are perfect. However, this does not make every imperfect predictive criterion valid. What "they" apparently don't realize is that the purpose of the coefficient of determination is to provide a numerical measure of how well one variable predicts another. [3] In the study in question, this value was 4%. [4] This is statistically as good as nil. In other words, there is no correlation between viral load and CD4 cell loss to speak of. Almost none of the loss of CD4 counts can be explained by viral load levels. The rest of the paper is nothing more than an attempt to obscure this central fact. [5]
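
To make the 4% figure concrete, here is a minimal Python sketch (not taken from the paper) that computes the correlation coefficient r, the coefficient of determination r², and the slope of the line of best fit. The data are synthetic; the seed, the sample size, and the helper names `pearson_r` and `ols_slope` are assumptions made for illustration, and the noise level is chosen so that r² comes out near 0.04. It also illustrates the point made in footnote 5: rescaling the response changes the slope but leaves the correlation untouched.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ols_slope(xs, ys):
    """Slope of the least-squares line of best fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)  # fixed seed, only so the run is repeatable
n = 5000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Assumed model: y = 0.2*x + noise with noise variance 0.96,
# so the population r is 0.2 and the population r^2 is 0.04.
ys = [0.2 * x + math.sqrt(0.96) * rng.gauss(0.0, 1.0) for x in xs]

r = pearson_r(xs, ys)
r_squared = r * r
slope = ols_slope(xs, ys)

# Rescale the response (a change of units): the slope grows 250-fold,
# while the correlation does not move.
ys_scaled = [250.0 * y for y in ys]
r_scaled = pearson_r(xs, ys_scaled)
slope_scaled = ols_slope(xs, ys_scaled)

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}, slope = {slope:.3f}")
print(f"after rescaling: r = {r_scaled:.3f}, slope = {slope_scaled:.1f}")
```

An r² near 0.04 means that roughly 4% of the variance in y is accounted for by a straight-line fit on x; the remaining 96% or so is scatter around that line.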

The other major point was that Figure 1 [also given as the final row of Table 1] was being ignored:

Figure 1 from the paper shows clearly that if you take a large number of HIV infected people and divide them into groups according to their HIV RNA levels you find that on average people with lower than 500 HIV RNA copies/ml had small decreases in CD4+ counts per year and on average people with viral load above 40000 had much larger decreases in CD4+ cells per year. There is a simple relationship on average between viral load and rate of CD4+ cell depletion. What the paper does say is that for one particular individual viral load measurements are not a good predictor of the rate of CD4+ cell depletion. It is pure 'rethinker' spin to imply that this paper demonstrates that there is no relationship between viral load and CD4+ cell depletion. [6]

Figure 1 of Rodríguez et al., showing the "simple linear relationship" of median subgroup responses.

There are several reasons for doubting the significance of Figure 1.

First, the subgroup breakdown for "HIV RNA" levels is biologically absurd. Almost all subjects have viral load levels corresponding to no more than a handful of infectious HIV/mL [7]. Moreover, the choice of boundary values appears to be completely arbitrary. Given this, what is the point in examining such "broad categories", except to smooth out the lack of correlation found in the total population?

The CD4 counts also raise questions. For example, the lowest class of viral load level (500 or less) corresponds to a loss of 20 CD4 cells/mm³/year, while the highest category (over 40,000) corresponds to a loss of 78 CD4 cells/mm³/year. But a person's CD4 count may vary between 160 and 240 over a period of several months... [8]

In other words, the inherent variation in CD4 cell counts can be as large as 80 cells/mm³ over half a year, yet we are asked to believe that a difference of 58 cells/mm³/year between the subgroups cited above is significant. If we were to observe differences in CD4 cell loss consonant with the predictions of the study, how would we know they weren't simply due to normal variation in CD4 cell counts? [9]

There are a number of recognized criteria which generally should be satisfied to justify a "subgroup analysis" such as the one appearing in Figure 1:

  1. Subgroup analysis should be treated cautiously, especially when there is little evidence of correlation in the first place.
  2. Subgroup analysis is most reliable when there are a priori, biologically plausible explanations for subgroup differences.
  3. Subgroup analysis is most reliable when the magnitude of the difference is clinically important.
  4. In the absence of a plausible biological explanation, subgroups should not be examined in isolation with different models or tests; the same model or test should be used for the entire data set. [10]
  5. Especially in the absence of a plausible biological explanation, subgroup analysis cannot necessarily rule out "confounding variables" (distinguishing correlation from causation).

All of these points call into question the justification for Figure 1 and conclusions that can be drawn from it, in particular the notion that it provides significant support for the HIV hypothesis. There are many other explanations which more plausibly account for Figure 1:

  1. So far, I haven't discussed biological or mathematical problems with viral load. Suffice it to say that there is doubt as to what exactly "viral load" is measuring in the first place. It could very well be that "viral load" functions much as fever does: as a general indicator of illness, but with little predictive power (correlation) for individual cases. [11]

  2. "Co-factors" other than HIV must be involved. But where has it been shown that such non-viral "co-factors" depend on HIV? [12] Indeed, it would be a truly remarkable virus that, with only 9,000 base pairs, could somehow induce a multiplicity of indirect cell-killing mechanisms unrelated to how much virus is actually circulating in the blood.

  3. But the most plausible explanation to me is that Figure 1 is just a mathematical artifact. If you take a cloud of data points [Figure 3] that are essentially random (almost no correlation), break them into 5 subgroups by magnitude of the predictor variable, and choose the median outcome of the response variable for each subgroup, this will have the effect of obscuring the lack of correlation. It's the statistical equivalent of squinting your eyes so you can't see any details anymore.
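
The smoothing effect described in item 3 can be sketched with a small simulation. The data below are synthetic, not the study's: `xs` stands in for a predictor and `ys` for a response, both in arbitrary standardized units, with the noise level assumed so that r² is about 0.04. Splitting into five equal-sized subgroups is also an assumption (the paper used fixed viral load cutoffs), and the helper name `pearson_r` is hypothetical.

```python
import math
import random
import statistics


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed, only so the run is repeatable
n = 50_000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Assumed model: weak linear signal buried in noise (population r^2 = 0.04).
ys = [0.2 * x + math.sqrt(0.96) * rng.gauss(0.0, 1.0) for x in xs]

r_squared = pearson_r(xs, ys) ** 2

# Sort individuals by the predictor and cut them into 5 equal-sized subgroups,
# then keep only the median response of each subgroup.
pairs = sorted(zip(xs, ys))
group_size = n // 5
medians = []
for g in range(5):
    group = pairs[g * group_size:(g + 1) * group_size]
    medians.append(statistics.median([y for _, y in group]))

print(f"individual-level r^2 = {r_squared:.3f}")
print("subgroup medians:", [round(m, 2) for m in medians])
```

In this sketch the five medians rise in an orderly way from the lowest subgroup to the highest, while the individual-level r² stays near 0.04. The medians track only the small average shift between subgroups; they carry no information about the spread within each subgroup, which is where almost all of the variance sits.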

Figure 3 of Rodríguez et al., visually displaying the lack of correlation between viral load and CD4 cell loss.

Postscript: 6 November 2008

Objective: To estimate the proportion of variability in rate of CD4 cell loss predicted by presenting plasma HIV RNA levels in untreated HIV-infected persons.... Main Outcome Measures: The extent to which presenting plasma HIV RNA level could explain the rate of model-derived yearly CD4 cell loss, as estimated by the coefficient of determination (R²). Results: In both cohorts, higher presenting HIV RNA levels were associated with greater subsequent CD4 cell decline. In the study cohort, median model–estimated CD4 cell decrease among participants with HIV RNA levels of 500 or less, 501 to 2,000, 2,001 to 10,000, 10,001 to 40,000, and more than 40,000 copies/mL were 20, 39, 48, 56, and 78 cells/μL, respectively. Despite this trend across broad categories of HIV RNA levels, only a small proportion of CD4 cell loss variability (4%-6%) could be explained by presenting plasma HIV RNA level. — Rodríguez et al., JAMA, 2006; 296: 1498-1506.

Common Errors Involving Correlation — We now identify three of the most common sources of errors made in interpreting results involving correlation... 2. Another error arises with data based on averages. Averages suppress individual variation and may inflate the correlation coefficient. One study produced a 0.4 linear correlation coefficient for paired data relating income and education among individuals, but the linear correlation coefficient became 0.7 when regional averages were used. — Elementary Statistics, 10th edition, Mario F. Triola

Ecological Correlation: Correlations based on averages can be arbitrarily misleading if they are interpreted to be about individuals. Correlations based on averages are usually too high, because they ignore the variability across individuals. Correlation of averages is called ecological correlation.... Ecological correlations are correlation coefficients of averages across groups of individuals, rather than correlation coefficients for individuals. Ecological correlations tend to be stronger than the correlation coefficient for individuals, although the opposite is also possible. Beware arguments about association that rely on ecological correlations. — "Statistics Tools for Internet and Classroom Instruction with a Graphical User Interface", Philip B. Stark, Professor of Statistics, UC Berkeley
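
The inflation described in both excerpts can be seen in a short worked calculation. This is a sketch under a simple assumed linear model, not a derivation taken from either source; the symbols β, σ_x², σ_ε², τ², and n, and the example numbers, are introduced here purely for illustration.

```latex
% Assumed individual-level model, with noise independent of the predictor:
%   y_i = \beta x_i + \varepsilon_i,
%   Var(x_i) = \sigma_x^2,  Var(\varepsilon_i) = \sigma_\varepsilon^2.
\[
  r^2_{\mathrm{individual}}
    = \frac{\beta^2 \sigma_x^2}{\beta^2 \sigma_x^2 + \sigma_\varepsilon^2}
\]

% Averaging n individuals per group shrinks the noise variance to
% \sigma_\varepsilon^2 / n. Writing \tau^2 for the variance of the group
% means of x, the squared correlation of the averages is approximately
\[
  r^2_{\mathrm{group}}
    \approx \frac{\beta^2 \tau^2}{\beta^2 \tau^2 + \sigma_\varepsilon^2 / n}
\]

% Example (assumed values): \beta = 0.2, \sigma_x^2 = 1, \sigma_\varepsilon^2 = 0.96
\[
  r^2_{\mathrm{individual}} = \frac{0.04}{0.04 + 0.96} = 0.04
\]
% With n = 1000 per group and \tau^2 \approx 1:
\[
  r^2_{\mathrm{group}} \approx \frac{0.04}{0.04 + 0.00096} \approx 0.98
\]
```

Under these assumptions, the correlation of the averages approaches 1 as the group size grows whenever β is non-zero, no matter how small the individual-level r² is.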

Footnotes and references

  1.   Brown, Darin, 2006. "It Must Be Jelly, 'cause Jam Don't Shake Like That", "You Bet Your Life", 2 October 2006.
  3.   The coefficient of determination r² is the square of the correlation coefficient r. Whereas r can range between −1 and 1, indicating a positive or negative correlation, r² can range between 0 and 1, with 0 indicating no linear predictive power and 1 indicating full linear predictive power.
  4.   Rodríguez B et al., "Predictive Value of Plasma HIV RNA Level on Rate of CD4 T-Cell Decline in Untreated HIV Infection", JAMA, 2006; 296: 1498-1506.
  5.   There seems to be some confusion between the slope of the line of best fit and correlation. The line of best fit is just that – the line that fits the data best. Correlation is a measure of how well the data fit that line. Simply finding a line of best fit with non-zero slope implies nothing about correlation. It is possible to have a large slope for the line of best fit with little correlation, and also possible to have a very small slope with perfect correlation.
  6.   I have also read comments pointing out that the confidence intervals in Figure 1 have little overlap, and suggesting that this provides strong evidence for "overall predictive value among groups of similar viral load". I believe this fact has been widely misinterpreted. Many people seem to read the 95% confidence interval as meaning: "95% of the cell loss rates in a given subgroup lie within the confidence interval." This is untrue. Nor is it correct to state: "With probability 95%, the median cell loss rate lies in the interval." The correct interpretation is: "If we were to repeat the sampling many times and compute a confidence interval each time, then about 95% of those intervals would contain the true median cell loss rate."
  7.   Piatak et al., Science 259, 1749-1754, 1993; Duesberg P, Bialy H. 1995. "HIV an illusion." Nature 375:197.
  9.   It should also be noted that the difference in CD4 cell loss/year between the lowest and highest subgroups given in the "confirmatory" MACS cohort is even less than that above, roughly 40 counts/year. In this case, the possible inherent variation in the measurement eclipses the difference in "predicted CD4 cell loss" by a factor of four.
  10.   At first I was perplexed trying to reconcile Figures 1 and 2. Each gives the median random-effects model estimate of CD4 cell loss in cells/mm³/year. However, the two data sets are quite different. The data in Figure 1, quoted in the abstract, are 20.2, 39.3, 47.7, 55.9, and 77.7 cells/mm³/year for the 5 subgroups of increasing HIV RNA level. However, Figure 2 gives the data 37.3, 42.8, 45.8, 48.9, and 52.2, and there is significant overlap between the cell loss rates in different subgroups, which would indicate that the "simple relationship on average between viral load and rate of CD4+ cell depletion" is not supported. The difference between the two data sets is that the data for Figure 2 were computed using the same model for the entire data set, while for Figure 1, each subgroup was studied in isolation and different random-effects models were used for each subgroup.
  11.   "The human genome has about 3 billion base pairs, while that of HIV has only about 10,000. Because of this difference, human cells produce a great deal more RNA than HIV does. RNA from human cells could be released in large quantities during times of rapid cell death, which is what occurs during the infectious and inflammatory processes commonly present in people diagnosed HIV-positive... The high rate of false positive results from HIV RNA assays suggests that some of the 3 billion base pairs in the human genome could be producing RNA that is mistakenly attributed to HIV." (Irwin, Matthew, "False Positive Viral Loads: What Are We Missing?", 2001.)
  12.   Duesberg PH, "AIDS Acquired by Drug Consumption and Other Non-Contagious Risk Factors", Pharmac. & Ther. Vol. 55: 201-277, 1992, Section 3.5.7, "HIV to Depend on Cofactors for AIDS".

© 2006 by Darin Brown
Originally published at "You Bet Your Life"
