Metodološki zvezki, Vol. 4, No. 1, 2007, 83-98 Statistically Sound Distribution Plots in Excel Gaj Vidmar1 Abstract Excel is the most widespread and the most powerful general-purpose spreadsheet software, but it is not popular with statisticians. Nevertheless, as a natural means for organising, displaying and analysing large amounts of data, spreadsheets keep gaining importance in statistical education and practice. Aiming at improving such practice rather than fruitlessly and indiscriminately condemning it, the paper provides general considerations on the topic, pointers to the huge body of relevant literature and software, and several concrete examples of data visualisation in Excel in the sense of univariate, bivariate and multivariate distribution plotting. Original and improved Excel solutions for producing dot-density plots, dot plots, stem-and-leaf plots, windowgrams, coplots and parallel coordinates plots are presented, as well as for performing the Box-Cox transformation. Additionally, further possibilities opening with the forthcoming Excel 2007 version, use of various commercial and freeware add-ins, and integration of Excel with statistical software are discussed. 1 Background Microsoft® Excel (henceforth referred to as Excel for brevity) is by far the most widespread and arguably the most powerful spreadsheet, but it is not a very popular tool with statisticians. This is partly justified (McCullough and Wilson, 1999, 2002), but to a large extent also based on prejudice and ignorance, as demonstrated by the publications, software and examples referenced, discussed or introduced in this paper. As a natural means for organising, displaying and analysing large amounts of data, spreadsheets have gradually but relentlessly found their way into the world of mathematics and statistics. Their potential for interactive teaching was already realised at the onset of widespread Internet use (Hunt and Tyrrell, 1995), and nowadays high-profile introductory statistics textbooks are being published that 1 University of Ljubljana, Faculty of Medicine, Institute of Biomedical Informatics; gaj.vidmar@mf.uni-lj.si 84 Gaj Vidmar are either entirely based on Excel (e.g., Anderson, Sweeney and Williams, 2005) or demonstrate its use in parallel with a major statistical software package (e.g., Hawkes and Marsh, 2005). Though its limitations and shortcomings, mainly in terms of numerical algorithms and missing data handling, can take considerable effort and expertise to overcome (Heiser, 2005), Excel has proven to be a particularly valid tool for combining mathematics, statistics and engineering education (de Levie, 2004; Liengme, 2002; Neuwirth and Arganbright, 2004). In the form of electronic or printed monographs with accompanying publicly available add-ins, Excel is also becoming a prominent tool for combining statistical teaching and statistical practice (Steppan, Werner and Yeater, 2001; Myerson, 2005). Last but not least, the market niche of general-purpose or specialised statistical add-ins for Excel as low-cost alternatives to stand-alone statistical packages would not be flourishing as it is2 if the producers were only exploiting Excel's deficiencies and omissions, rather than capitalising on Excel's qualities and potential for storing, manipulating, analysing and presenting statistical data. The field of applying Excel in statistics that is subject to particularly harsh criticism, but also witnessing particularly intense development, is data visualisation. The large public knowledge base consisting of the usenet newsgroup and websites of experts3 is an enormous source of ready-made solutions (e.g., for bivariate density charts4), instructions for working around the limitations of Excel's charting facilities (e.g., to create boxplots5), and support for task automation via macros and add-ins. At the same time, professional products are covering the range from large-scale categorical data visualisation6 through dynamic multivariate exploration7 to heat-maps for gene microarray data analysis8. 2 successful products include (tentatively ordered by decreasing scope) XLSTAT-Pro (http://www.kovcomp.co.uk/xlstat), statistiXL (http://www.statistixl.com), WinSTAT (http://www.winstat.com), Simetar© (http://simetar.com), SPC XL (http://www.sigmazone.com), SigmaXL® (http://www.sigmaxl.com) and MegaStat® (http://blue.butler.edu/~orris/megastat) 3 especially recipients of the Microsoft® Most Valuable Professional title, such as (in alphabetical order) F. Cinquegrani (http://www.prodomosua.eu/ppage02.html), T. Mehta (http://www.tushar-mehta.com/), J. Peltier (http:// peltiertech.com) and A. Pope (http://www.andypope.info) 4 http://www.prodomosua.eu/zips/density.exe or http://www.j-walk.com/ss/excel/files/gradcontour.htm 5 http://www.mis.coventry.ac.uk/~nhunt/boxplot.htm or http://peltiertech.com/Excel/Charts/BoxWhisker.html 6 e.g., Treemap freeware implementations (http://research.microsoft.com/community/treemapper) and commercial extensions (http://www.panopticon.com) 7 e.g., commercial Miner3DTM (http://www.miner3d.com) and freeware VisuLab (http://www.inf.ethz.ch/personal/hinterbe/Visulab) 8 e.g., BRB ArrayTools (http://linus.nci.nih.gov/BRB-ArrayTools.html) Statistically Sound Distribution Plots in Excel 85 2 About the paper and the presented solutions With the presented solutions, the author filled the main gaps left in the collection of basic distribution plots available to an Excel user aware of the above-named resources. The examples are primarily intended for teaching and demonstrational purposes, but they are also useful for producing publication-quality graphics. They range from univariate through bivariate to multivariate visualisations and related procedures. The remainder of the paper is organised as follows. First, the univariate dot-density plot is addressed together with simple and multi-way dot plots, discussing the existing resources and providing implementation of the two possible solutions, i.e., character-based cell-charts and modified scatter-plots (Section 3). Next, an implementation of the stem-and-leaf plot with a macro is presented (Section 4). Automated histogram plotting and the use of Excel's built-in Solver add-in for optimisation are combined in the presented implementation of the modified Box-Cox power transformation towards normal distribution (Section 5). Kernel density estimation is introduced through windowgrams, which are implemented only with conditional summing formulas (Section 6). In combination with coplot demonstration (scatter-plots conditioned on a third variable) based on array formulas (Section 7), the aim is to make advanced concepts and techniques understandable and accessible to non-mathematical audience. Parallel coordinates plot is implemented with a macro for quick production of presentation graphics as well as for exploratory purposes (Section 8). Additionally, further possibilities for data visualisation with the forthcoming Excel 2007 version and the integration of Excel with academically oriented statistical packages are discussed (Section 9). Section 10 offers some concluding thoughts. To emphasise the actual user experience and the instructional aspects of the solutions, worksheet screenshots are presented in the figures rather than just the resulting plots. The aim of the text in the worksheets is to make them sufficiently self-explanatory while still stimulating the user's self-discovery. Because of the instructional purpose, some compromise had to be made with respect to the principles of good data visualisation, but elimination of visual clutter (Tufte, 1983) and avoidance of unnecessary use of colour (Wainer and Thissen, 1981) has been pursued relatively strictly. Since random data is generated solely for demonstrational purposes, Excel's built-in random number generation function (RAND) is used despite its known deficiencies (particularly in older versions; see Heiser, 2005, for details). All the presented solutions were developed on the Microsoft® Windows® platform and tested with Excel versions from 97 upwards. Those not using macros should be completely compatible with the corresponding Apple® Macintosh® versions of Microsoft® Office, while those including VBA code could work with Microsoft® Office for Mac® versions prior to the 2004 edition. The workbooks 86 Gaj Vidmar can be downloaded from the webpage of the Biostatistical Centre of the Institute of Biomedical Informatics, Faculty of Medicine, University of Ljubljana (http://www.mf.uni-lj.si/ibmi-english/biostat-center, follow the link to Software). Some other statistical Excel workbooks can also be found there. The workbooks described in this paper are designed to be readily available for use upon download. The combination of instructions, comments, plot and axis titles, column headings and formatting elements should provide all the necessary information for the interested instructor, student or practitioner. The scope of four workbooks is primarily instructional: the solutions for dot-density plots, dot plots, windowgrams and coplots require a small further effort from the users who want to transfer them into daily statistical practice. Three workbooks have dual scope: the solutions for stem-and-leaf plots, modified Box-Cox transformation and parallel coordinates plots can be used not only as instructional aids, but also as "proper" applications. 3 Dot-density plots and dot plots Dot plots (Cleveland, 1985) have recently attracted notable attention in business setting and in the context of management dashboards (Kyd, 2006; Robbins, 2006). As spreadsheets are particularly dominant in such setting, it is not surprising that detailed step-by-step instructions for constructing them were aptly published by Excel experts (O'Day, 2006; Peltier, 2006). Conceptually, dot plots can be considered as developed from dot-density plots, and the two methods are also technically related in terms of possible Excel implementation. Dot-density plots can be simply and efficiently implemented as cell-charts with the REPT function. As explained and automated by Peltier (2006), the same principle can be used for constructing dot plots, including additional possibilities offered by conditional formatting (e.g., different symbols corresponding to the value of another variable, or highlighting outliers). The other main option for constructing dot plots in Excel involves workarounds for changing the appearance of scatter-plots. Below, each plot type is presented in a separate subsection, whereby for instructional purposes, dot-density plots are implemented as cell-charts, while the scatter-plot based technique is used to construct univariate and multi-way dot plots. 3.1 Dot-density plots Figure 1 shows the extremely simple procedure for creating a dot-density plot as a character-based cell-chart. The worksheet also instructs the user how to copy to Statistically Sound Distribution Plots in Excel 87 clipboard a section of the worksheet exactly as it appears on screen, which is a basic and very useful, yet relatively little known Excel feature. f x 1 1 2 3 4 5 6 7 8 9 10 × ×× ××× ××××× × ×× ××× ××××××××× =REPT("x";B2) 2 3 5 white cell fill colour 0 1 select the chart, SHIFT+Edit + Copy Picture... 2 3 9 0 x 1 23456789 10 f 1 2 3 5 0 12 3 9 0 =REPT("l";B2) Wingdings Font Figure 1: Excel solution for dot-density plots with a character-based cell-chart (horizontally oriented above, vertically oriented below). label value see Note 1 see Note 2 Note 1: 1 … no. of data points one 1 1 1 2 1 3 1 4 1 Note 2: second series for displaying labels two 2 three 3 The chart is based on simple tricks: five 5 ! axes have neither lines nor tickmarks ! major gridlines are white so that they hide every second minor gridline five------------------------------------------------• three------------------------•---------------------- two-----------»--------------------------------- ! data labels of second series are positioned to the left of data points ( to assign them, use one of the excellent freeware add-ins: XY Chart Labeler by R. Bovey, or J-Walk Chart Tools by J. Wlakenbach) Figure 2: Excel solution for constructing a dot plot. 88 Gaj Vidmar Pro A - Audi Pro-------------------------------------•------------ Expert Hobby Novice--------------------------•--------------- 10% 15% 20%) 25%) 30%) B - Audi Expert Hobby--------------------------*---------------------- Novice--------------------------•------------------ C - Audi Pro Expert------------•--------------------------------------- Hobby---------------•------------------------------------ Novice D - Audi Expert-----------------------•---------------------------- Hobby---------------•------------------------------------ Novice 10%o 15%) 20%o 25%) 30%) A - Mercedes Pro----------------------------------------•-------- Expert Hobby Novice--------------------------•-------------------------- 10%) 15%) 20%) 25%) 30%) B - Mercedes Pro Expert Hobby--------------------------•-------------------------- Novice--------------------------•---------------------- C - Mercedes Pro Expert----------------•-------------------------------- Hobby-------------------•----------------------------- Novice —•------------------------------------------- D - Mercedes Pro-------------------•----------------------------- Expert--------------------------•-------------------------- Hobby--------------------------•-------------------------- Novice 10%) 15%) 20%) 25%) 30%) Data ADDITIVECAR DRIVER % Audi Audi Audi Audi Mercedes Mercedes Mercedes Mercedes Audi Audi Audi Audi Mercedes Mercedes Mercedes Mercedes Audi Audi Audi Audi Mercedes Mercedes Mercedes Mercedes Audi Audi Audi Audi Mercedes Mercedes Mercedes Mercedes Novice Hobby Expert Pro Novice Hobby Expert Pro Novice Hobby Expert Pro Novice Hobby Expert Pro Novice Hobby Expert Pro Novice Hobby Expert Pro Novice Hobby Expert Pro Novice Hobby Expert Pro For charting purposes 1 10A - Audi 2 10 3 10 4 10 A - Mercedes B - Audi B - Mercedes C - Audi C - Mercedes D - Audi D - Mercedes see previous worksheet for explanation Figure 3: Excel worksheet demonstrating multi-way dot plots with designed-experiment data. 3.2 Dot plots For dissemination purposes, the presented solutions for a single dot plot and multi-way dot plots are stored together in one workbook with two worksheets (named "Single" and "Multi-way", respectively). Construction of dot plot in Excel from a scatter-plot by hiding axes and major gridlines and introducing a series with data labels to replace the categorical axis is displayed in Figure 2. It requires the installation of an add-in for linking data Statistically Sound Distribution Plots in Excel 89 labels to cell contents, which is in any case a mandatory tool for serious data visualisation with Excel. In Figure 3, multi-way dot plots are illustrated with a realistically constructed dataset from the field of experimental design. The visible mechanics of the solution does not harm the presentation-quality display, which is obtained by setting the Print Area (indicated by the dashed line) to include just the plots and setting the Header (e.g., to "Complete block design experiment with single observation per cell") and the Footer (e.g., to "Effect of car, driver experience, and fuel additive on emission reduction") in the Page Setup dialog to the appropriate explanation. 4 Stem-and-leaf plot Although stem-and-leaf plot epitomises the best of the pre-computer era of statistical graphics (Tukey, 1977), it should neither be under-estimated nor excluded from introductory statistics curricula. In addition to serving teaching purposes, it can replace tables of raw data (not limited to whole numbers) in publications. Like dot-density plot, it is a perfect candidate for construction with proper formatting of spreadsheet cells. Count Stem Leaves Number of cases each digit represents: 1 5 0 |23399 1 |56 2 |9 3 | 4 |15 5 |035788 6 |048 7 |355 8 |013559 9 |6 10 |4 2 1 2 6 3 3 6 1 1 Figure 4: Sample output of the steam-and-leaf macro formatted for presentation. The basic algorithm is straight-forward enough for ad-hoc manual implementation, whereby the amount of help from computer's calculation would depend on the user's proficiency with the spreadsheet application. For routine use, especially with larger amounts of data and larger numbers, macro automation is required. It was programmed by Maxwell (2006), but some debugging was needed and workbook design had to be added to make it a self-explanatory instructional tool that can also be used for producing publication-quality plots. Instructions for the user to run the macro either via menu or using keyboard shortcut, and to specify which column contains the data, are placed in the first worksheet. Output is produced in a separate worksheet, which also provides instructions for 90 Gaj Vidmar formatting the plot (Figure 4 presents the result for the sample data) and copying it to the clipboard as a picture. 5 Box-cox transformation The implementation of the modified Box-Cox transformation towards normal distribution is shown in Figure 5. The modification (subtraction of l instead of 1; equation 5.1) of the originally proposed transformation (Box and Cox, 1964) is clearly displayed in the worksheet. The reference for the Excel implementation (Swanson, Tayman and Barr, 2000) is also provided, together with the correction of the formula for the maximum-likelihood estimation of the parameter l. Tx (xl-l)/ä ;l 10 ln(x) ;l = 0 (5.1) x Tx Tx-MTx ln x 1,305647 2,76 0,35 0 ,27 1,203121 2,68 0,46 0 ,18 1,079289 2,56 0,62 0 ,08 2,460413 3,51 0,03 0 ,90 4,881052 4,49 1,31 1 ,59 3,578878 4,03 0,45 1 ,28 4,384206 4,33 0,95 1 ,48 2,93886 3,75 0,16 1 ,08 0,685625 2,13 1,49 -0 ,38 0,36412 1,61 3,03 -1 ,01 0,769149 2,23 1,25 -0 ,26 3,05205 3,80 0,20 1 ,12 2,663747 3,62 0,07 0 ,98 0,867819 2,35 1,01 -0 ,14 2,027755 3,27 0,01 0 ,71 0,9949 2,48 0,76 -0 ,01 1,160278 2,64 0,51 0 ,15 2,890844 3,73 0,14 1 ,06 2,658033 3,61 0,07 0 ,98 0,524934 1,90 2,11 -0 ,64 0,931626 2,42 0,87 -0 ,07 2,330271 3,44 0,01 0 ,85 0,609102 2,02 1,76 -0 ,50 8,382877 5,41 4,26 2 ,13 1,398655 2,84 0,26 0 ,34 1,773059 3,11 0,06 0 ,57 3,71418 4,08 0,53 1 ,31 4,060207 4,21 0,74 1 ,40 4,18903 4,26 0,82 1 ,43 5,013684 4,54 1,40 1 ,61 1,320288 2,77 0,33 0 ,28 0,889462 2,37 0,96 -0 ,12 0,135212 0,96 5,70 -2 ,00 1,211183 2,68 0,45 0 ,19 1,829431 3,15 0,04 0 ,60 1,089243 2,57 0,61 0 ,09 2,40821 3,49 0,02 0 ,88 3,840979 4,13 0,60 1 ,35 4,861809 4,49 1,29 1 ,58 3,410325 3,96 0,37 1 ,23 5,82018 4,78 2,04 1 ,76 3,533266 4,01 0,43 1 ,26 1,812653 3,13 0,05 0 ,59 7,012216 5,09 3,04 1 ,95 1,711201 3,07 0,08 0 ,54 7,096034 5,12 3,11 1 ,96 0,950625 2,44 0,84 -0 ,05 1,056819 2,54 0,65 0 ,06 4,159933 4,25 0,80 1 ,43 3,82705 4,12 0,60 1 ,34 3,477401 3,98 0,40 1 ,25 N 51 MTx ssd slnx L 3,35 48,11 34,62 -23,199 Original distribution min 0,14 bin_width 1,03 UL f 0,65 4 1,68 16 2,71 10 3,74 8 4,77 6 5,81 3 6,84 1 7,87 2 max 1 Transformed distribution min 0,96 bin width 0,56 UL f 1,24 1 1,80 1 2,35 5 2,91 13 3,47 6 4,02 10 4,58 11 5,14 3 max 1 14 12 10 8 6 4 2 0 1,24 1,80 2,35 2,91 3,47 4,02 4,58 5,14 max Tx Paste your data in column A, auto-fill-down (or shrink) columns B-D, then find the ML estimate for lambda with Solver! (of course, you can also change lambda manually and observe the effect) Solver parameters: Target cell: $L$2 Set to: max By changing cells: $G$2 (optionally, use constraints, e.g. $G$2 >= -5 and $G$2 <= 5) Reference Swanson, D.A., J. Tayman, and C.F. Barr. 2000. "A Note on the Measurement of Accuracy for the Subnational Demographic Estimates." Demography 37:193-201. Note, though, that their formula ml(A,} - - {nil) (ln[(l/vj)2(fl - y)1] + (k. - 1} (SIti(jc())) has a mistake ! The parentheses are wrong -- the correct formula is ml(l) = (-n/2)(ln(Var(y))) + (l-1)(Sum(ln(x))) ! Figure 5: Worksheet for modified Box-Cox transformation. Statistically Sound Distribution Plots in Excel 91 Automated histogram binning (based on the FREQUENCY array function) and updating is particularly instructive if the user inputs various values of l manually and observes the effect on the transformed distribution. At the same time, the implementation can be used to introduce the user to maximum-likelihood estimation and the Solver add-in. Although the Solver is far from an ideal and universal optimisation tool, it is very useful for a wide array of applications in statistics, probability, operation research, econometrics and related fields, such as teaching generalised linear models (Graham, 2000) or performing robust regression (Barreto and Maharry, 2006). Figure 6: Worksheet for demonstrating windowgrams using rectangular and triangular kernel (15 rows of data cut out). The essential feature of the presented solution that makes it useful for data analysis practice is that the value of N is based on counting the cells with the x data, and that all the vector quantities involved in the calculations are addressed 92 Gaj Vidmar via named ranges that refer to the OFFSET function based on N. Hence, input data of practically any length can be entered, imported or pasted into the x column. 6 Windowgram Smoothing is one of the simplest examples of a graphical method that can substantially facilitate understanding of a given problem, or even enable insight that is far superior to what students can infer from parametric statistical models they are familiar with (Weldon, 2005). Windowgrams provide the most gentle and the least mathematically oriented introduction to kernel density estimation; the latter has long been an essential part of statistics, yet it is too seldom taught in introductory statistics courses (Weldon, 2004). The instructional spreadsheet, which features rectangular kernel (with adjustable window width) and triangular kernel (with 7-point window, as shown in the plot title) is presented in Figure 6. Both windowgrams are implemented only with conditional summing formulas. In-cell comments provide further guidance and explanation: pointing the cursor to the cell for entering the kernel width displays the message "?3 (larger value produces more smoothing)", while the comment of the cell with the heading f for rectangular kernel gives an excerpt from the Excel's help on some commonly used array formulas using the SUMIF, COUNTIF, SUM and IF functions. As shown in the worksheet, the data are 50 values sampled from a normal distribution with m = 50 and s = 15. As an exercise, the user can find the smallest d that produces a completely flat density estimate with rectangular kernel. 7 Coplot Coplots – short for conditional plots – are a another name for what is usually called panel plots, trellis display (Becker, Cleveland and Shyu, 1996) or lattice graphics. Like windowgrams, they are arguably under-used in introductory statistics courses (Weldon, 2004). Similarly, the interactive Excel solution (Figure 7) demonstrates that they can be implemented in a spreadsheet without programming and hence more likely understood by non-mathematicians. The comments in the headers of the columns a and b explain how the data is generated (7.1). Excel's Data Validation feature is used to guide break-point entry upon selecting the input cells: since c ranges from 1 to 50, the lower value is limited to the interval [1 , 46] and thus the valid range for the upper value is between the lower value + 1 and 48. To emphasise that non-linearity should regularly be considered when studying relations between a pair of quantities, quadratic trend is fitted to the points in each panel (using Excel's convenient Statistically Sound Distribution Plots in Excel 93 routine built into the scatter-plot). The scatter-plot of the total sample below the panels allows clear visual comparison with the conditional plots. a ~ U(0,1) b ~U(0,1) + 0 ; c < lower 2a2 ; lower