Metodološki zvezki, Vol. 14, No. 2, 2017, 75–81

Bad Luck of Cancer – or Misinterpreted Statistics?

Janez Stare¹

Abstract

A paper in Science (January 2015) claimed that the majority of cancers, 65% to be precise, are due to bad luck and are therefore not preventable. In this paper we show that the analyses presented in that paper give absolutely no grounds for such a claim. Some of the arguments have in the meantime appeared elsewhere, but some have not. We also show that the authors' assumptions and their data can only support a claim of no more than 5% of cancers being random.

1 Introduction

Tomasetti and Vogelstein (2015) published a paper in Science in which they analyzed the association between the number of stem cell divisions of given tissues in a lifetime and the probability of cancers of those tissues. They found a strong correlation of more than 0.8 (or $R^2 = 0.65$) between the logarithms of these variables and, based on this, concluded that a great majority of cancers, approximately two thirds of them, occur randomly due to stem cell divisions, and that other factors contribute only to the remaining third of all cancers. Their work immediately met with some negative reactions, published mainly as letters in Science, but their results were essentially not disputed. Only a year later did Wu, Powers, Zhu, and Hannun (2016), in a paper published in Nature, show that correlation cannot say anything about the proportion of random cancers.

In this paper we give a more detailed and more versatile criticism of the Tomasetti and Vogelstein analysis, and also show that, using their assumptions, one cannot claim that more than 5% of all cancers used in their analysis are random.

There are different ways to show that correlation, or $R^2$, cannot come close to estimating the proportion of random cancers. In the first section we show in a direct way that such an estimation is impossible, and give an illustration which completely mimics the analysis by Tomasetti and Vogelstein, but on a different data set, which makes the mistake more obvious.
In the second section we show how understanding the properties of $R^2$ makes it obvious that something is wrong with the conclusion about the proportion of random cancers, and, finally, we show that Tomasetti and Vogelstein's assumptions and their data can only support a claim of no more than 5% of cancers being random.

¹ Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia; janez.stare@mf.uni-lj.si

2 A Direct Way of Showing that $R^2$ cannot Measure Randomness of Cancer

The model that Tomasetti and Vogelstein assume is

$$p_i = 10^{a} d_i^{b}$$

where $p_i$ is the probability of cancer for tissue $i$, $d_i$ is the number of stem cell divisions for that tissue, and $a$ and $b$ are coefficients to be determined from the data. Taking logarithms we get

$$\log p_i = a + b \log d_i$$

and by fitting a regression line to the points $(\log d_i, \log p_i)$ we get the estimates of the coefficients and an $R^2$ equal to 0.65.

If we say that cancers which occur regardless of people's lifestyle, environment or similar are random, and other cancers non-random, then we can write every probability $p_i$ as a sum of two probabilities: the probability of a cancer being random (or stochastic, denoted by $s_i$) and the probability of a cancer being non-random (or non-stochastic, denoted by $ns_i$). So let us write $p_i = s_i + ns_i$ and

$$\log(s_i + ns_i) = a + b \log d_i$$

If we knew all $s_i$, then the proportion of random cancers (PRC) among all cancers would simply be

$$\mathrm{PRC} = \frac{\sum_i s_i}{\sum_i p_i}$$

The paper by Tomasetti and Vogelstein hints that this proportion is estimated by the $R^2$ that they obtained, so

$$R^2 = \frac{\sum_i s_i}{\sum_i p_i} = 0.65.$$

It should be obvious that regression analysis will give the same results, and so the same $R^2$, regardless of how each $p_i$ is decomposed into $s_i$ and $ns_i$, because only the sums $p_i$ enter the regression. This means that $R^2$ has absolutely nothing to do with PRC. This simple fact is illustrated in Figure 1 for a subset of the values in the Tomasetti and Vogelstein paper. The proportion of random cancers is varied, but the total probabilities remain the same.
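The argument can also be checked numerically. The following is a minimal Python sketch, assuming made-up values for $d_i$ and $p_i$ (they are not the Tomasetti and Vogelstein data) and a hypothetical helper `fit_and_prc`. It splits the same $p_i$ into $s_i$ and $ns_i$ in two different ways and compares the resulting $R^2$ and PRC.

```python
import math


def r_squared(x, y):
    """Squared Pearson correlation, i.e. R^2 of a simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Hypothetical illustration values, NOT the data from the paper.
d = [1e7, 1e8, 1e9, 1e10, 1e11, 1e12]            # stem cell divisions per tissue
p = [0.0002, 0.003, 0.0008, 0.01, 0.005, 0.04]   # lifetime cancer probabilities


def fit_and_prc(random_share):
    """Split each p_i into s_i + ns_i using the given shares, then return
    (R^2 of log p on log d, proportion of random cancers)."""
    s = [share * pi for share, pi in zip(random_share, p)]
    ns = [pi - si for pi, si in zip(p, s)]
    total = [si + nsi for si, nsi in zip(s, ns)]  # equals p_i up to rounding
    log_d = [math.log10(di) for di in d]
    log_p = [math.log10(t) for t in total]
    return r_squared(log_d, log_p), sum(s) / sum(total)


r2_low, prc_low = fit_and_prc([0.05] * len(p))   # 5% of every p_i is random
r2_high, prc_high = fit_and_prc([0.9] * len(p))  # 90% of every p_i is random

print(f"R^2 = {r2_low:.6f}, PRC = {prc_low:.2f}")
print(f"R^2 = {r2_high:.6f}, PRC = {prc_high:.2f}")
```

Under these assumptions the two values of $R^2$ agree up to rounding error, while PRC moves from 0.05 to 0.9, since the regression only ever sees the sums $s_i + ns_i$.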
In Figure 1, the actual situation should, under the Tomasetti and Vogelstein assumption, look something like subfigure (c), but the values for randomness (dark areas) can be almost anything.

Figure 1: Different proportions of random cancers do not change the overall proportions (three panels, each with an axis running from 0.00 to 0.04)

We illustrate this simple fact using another example, completely analogous to the above, but much more obvious.

Let us look at the European countries and record their areas and population sizes. Data can be found, for example, at http://bit.ly/1K2oosV. Of course, any set of countries will do. So $d_i$ now represent areas, and $p_i$ are the proportions of each country's population in the total population of Europe. We use the same model as Tomasetti and Vogelstein (so logs of proportions and areas) and fit a regression line (Figure 2). We get an $R^2$ of 0.86. If, as an example, $s_i$ and $ns_i$ represent the proportions of smokers and non-smokers of country $i$ in the total European population, can we say that 86% of Europeans are smokers? Obviously not. But this is exactly the argument that Tomasetti and Vogelstein make. The point is, again, that $s_i$ and $ns_i$ can be any two numbers adding up to $p_i$, and we will always have the same $R^2$.

Figure 2: Regression of proportions of countries' populations in the total European population on countries' areas (log scales)

3 Indirect Ways of Showing that $R^2$ cannot Measure Randomness of Cancer

Tomasetti and Vogelstein calculate the correlation for a certain range of the number of stem cell divisions. If $R^2$ did indeed estimate the proportion of random cancers, for cancers included in their analysis, then we should get a similar result if we calculate this proportion for a subrange. For example, if we did this separately for the tissues with stem cell divisions below the median and above the median, then we should be able to combine these results into the overall number. For example, if $R^2$ below the median was $a$, and
For example, if R 2 below the median was a, and 78 Stare above the median b (of course, any subdivision would do), then a weighted average of these two numbers should give us 0.65. In fact, the numbers that we get for R 2 are 0.43 and 0.21 (Figure 3). So, parts have less random cancers than the whole? On the other hand, if, in future, say, other cancers are included which have numbers of stem cell di- visions lower than the present minimum, or higher than the present maximum, R 2 will increase. l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l Stem cell divisions in 10^ probablity of cancer in 10^ Data from the paper 6 7 8 9 10 11 12 13 −5 −4 −3 −2 −1 0 overall R 2 = 0.65 R 2 = 0.43 R 2 = 0.21 Figure3: Proportionofexplainedvariationdependsonthechosenintervalofthe independentvariable Another way of showing that R 2 cannot estimate PRC is to assume that all probabili- tiesintheirdataweremultipliedbyacertainfactor. Onecanimagineacountrywhichhas more (less) risk due to some extra factor (or lack of some factor). Since the probabilities of random cancers cannot change, their proportion in the overall number of cancers will nowbedifferent,smallerorlarger,dependingonthefactor. ButR 2 willnotchange! This isillustratedinFigure4. 4 ANoteonData Data which Tomasetti and Vogelstein paper analyze contain some points that should not be there. For example, probabilities for lung cancer are given separately for smokers and nonsmokers. Theseareconditionalprobabilities,givenvaluesofsomeextravariable,and are certainly not the probabilities which one would predict based on the number of the Bad Luck of Cancer ... 79 Figure4: Increaseordecreaseofcancerprobabilitiesbyagivefactordoesnotchangethe R 2 80 Stare stem cell divisions. They argue that leaving them like this does not change the results of theanalysis,butthisisnotavalidargument. Weareinterestedinprobabilitiesofcancers, given the number of stem cell divisions, nothing more. 
For our calculations in the next section we corrected these data, so that every number of stem cell divisions has just one corresponding probability of a cancer of that tissue. We used data from the Tomasetti and Vogelstein supplementary material to do this. For example, the lifetime risk of lung cancer is 0.0045 for nonsmokers and 0.081 for smokers. Assuming (as Tomasetti and Vogelstein report in their supplementary material) that the proportion of smokers is 0.3, the overall probability of lung cancer is

$$0.0045 \times 0.7 + 0.081 \times 0.3 = 0.02745.$$

5 A Different Way of Estimating the Probability of Random Cancers

If we assume, as Tomasetti and Vogelstein do in their model, that probabilities of cancers depend only on the number of divisions, then the candidates for random cancers are those lying low on the graph. In Figure 5 we connected two low points on the graph, and values on this line could be seen as (logs of) probabilities for random cancers. Anything above them must be non-random. Of course, assuming that those two points themselves represent probabilities of random cancer for those two tissues is probably overestimating the true probabilities. And still, the total probability of random cancer, calculated in this way, is very low.

Figure 5: Regression line through two points representing possible random cancers

Another, and even better, way of calculating the overall probability of random cancers is to do the following:

1. Find the cancer with the lowest probability per division.
2. Take this (or part of this) probability to be the probability of random cancer per division.
3. Multiply this probability by the number of divisions for each tissue.
4. This gives us the probability of random cancer for each tissue.

When we apply the above to the Tomasetti and Vogelstein data, it turns out that the lowest probability per division is for small intestine cancer. Assuming (probably unreasonably) that all small intestine cancers are random, continuing with steps 2 to 4 above, and then summing up all the obtained probabilities, we get that the overall probability of random cancers is 1.6%!
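The four steps above can be sketched in Python as follows. The tissues and numbers are invented for illustration and are not the Tomasetti and Vogelstein data; the function name and the `share` parameter (the part of the lowest per-division probability taken as random in step 2) are likewise assumptions of this sketch. The last lines use the simple ratio PRC from Section 2, which ignores the question of independence among cancers.

```python
# Hypothetical tissues: name -> (lifetime stem cell divisions d_i,
#                                lifetime cancer probability p_i).
# These values are invented and are NOT the data from the paper.
tissues = {
    "tissue A": (1e12, 0.002),
    "tissue B": (1e10, 0.01),
    "tissue C": (1e8, 0.0005),
    "tissue D": (1e11, 0.05),
}


def random_cancer_probabilities(data, share=1.0):
    """Return the baseline tissue and the random-cancer probability per tissue."""
    # Step 1: find the cancer with the lowest probability per division.
    per_division = {name: p / d for name, (d, p) in data.items()}
    baseline = min(per_division, key=per_division.get)
    # Step 2: take this probability (or a part of it) as the per-division
    # probability of random cancer.
    rate = share * per_division[baseline]
    # Steps 3 and 4: multiply by each tissue's number of divisions to get
    # the probability of random cancer for that tissue.
    random_prob = {name: rate * d for name, (d, p) in data.items()}
    return baseline, random_prob


baseline, random_prob = random_cancer_probabilities(tissues)

# Simple ratio of random to all cancers (PRC), ignoring independence issues.
total_random = sum(random_prob.values())
total_all = sum(p for _, p in tissues.values())
prc = total_random / total_all

print("baseline tissue:", baseline)
print(f"proportion of random cancers: {prc:.3f}")
```

By construction, no tissue's random-cancer probability can exceed its observed probability, and only the baseline tissue has all of its cancers counted as random.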
This figure of 1.6% translates into no more than 5% of all cancers in their analysis being random, depending on how much independence we want to assume among cancers. Of course we do not believe these numbers; we simply show that the claim of most cancers being random rests on very shaky ground.

6 Conclusion

There is no way one can claim any proportion of cancers being random based on the analysis of Tomasetti and Vogelstein. In fact, their inherent assumption of all divisions being equally likely to produce cancer yields a very low estimate of the proportion of random cancers.

References

[1] Tomasetti, C. and Vogelstein, B. (2015): Variation in cancer risk among tissues can be explained by the number of stem cell divisions. Science, 347, 78–81.
[2] Wu, S., Powers, S., Zhu, W., and Hannun, Y. A. (2016): Substantial contribution of extrinsic risk factors to cancer development. Nature, 529, 43–47.