From the July 28, 2014 SoMa Tech Talk series.
Abstract: Evan Miller will be speaking about visually appealing ways to “supercharge” traditional descriptive visualizations with inferential statistics. Evan is the author of the popular Wizard statistics application for Mac.
Transcript
00:12 Yeah. So, I’m talking to you about the visual display of statistical information, and I wanna talk a little bit about visualization more generally ’cause I think sort of what it’s doing to the data analysis market is very interesting. So, we have all these kind of visualization tools coming on the market that make it much easier to visualize data than it was before. Like back in the ’60s or ’70s, you kind of had to do everything by hand, kind of manually go through every row, and now you’ve got D3, and you got Tableau and you’ve got it so a lot more people can visualize data than could before. And I think it’s really good in a lot of ways ’cause it’s good to visualize data, right? But, there’s a downside to that, which is you have a lot of people who don’t really know a lot about data doing visualization, and that can have some negative or unforeseen consequences.
01:12 So the best example of this idea is summed up in this picture. So, this might be familiar to people taking a stats class. But the… You have two pictures of random dots, right? And I guess, supposedly, there was a study with this that asked people which picture they thought was randomly generated and which one had sort of some patterns built into it. So, which one where the points were all just completely independent of one another, uniform, zero-one, zero-one distribution, if you’re into that sort of thing, and which one maybe had something else going on? And the answer here is that a lot of people think that the one on the right looks sort of more random and uniform ’cause everything’s kind of spread out, and the left one looks a little more clustered. You might say, “Oh, I see a little cluster down here, I see a cluster over here. I found a pattern. Let me tell my boss.” And in reality, the one on the left is the one that was really randomly generated, and some things just kind of clustered together just randomly. Whereas when this picture was generated on the right, they kind of added some more constraints saying, “Don’t put a point where there are already some points.”
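The two pictures can be imitated with a short sketch. This is not the procedure used in the study the speaker mentions; the point count, the minimum distance of 0.05, and the function names are all assumptions made for illustration. The first generator draws independent uniform points, and the second rejects any point that lands too close to one already placed.

```python
import math
import random

def uniform_points(n, rng):
    """n points drawn independently and uniformly on the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n)]

def spread_out_points(n, min_dist, rng, max_tries=100_000):
    """Uniform proposals, rejecting any point closer than min_dist to an accepted one."""
    points = []
    tries = 0
    while len(points) < n and tries < max_tries:
        tries += 1
        candidate = (rng.random(), rng.random())
        if all(math.dist(candidate, p) >= min_dist for p in points):
            points.append(candidate)
    return points

def closest_pair_distance(points):
    """Smallest distance between any two points (brute force)."""
    return min(math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:])

rng = random.Random(0)  # fixed seed so the sketch is repeatable
independent = uniform_points(100, rng)
constrained = spread_out_points(100, 0.05, rng)

print("closest pair, independent:", closest_pair_distance(independent))
print("closest pair, constrained:", closest_pair_distance(constrained))
```

The independent points will typically contain a few near-coincident pairs, which is exactly the kind of accidental clustering that the eye reads as a pattern.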
02:31 And the reason I think this is important is that you really need statistics to know when a pattern that you see is a real pattern, and that to me is the real danger of visualization without statistics, is you can create a lot of pictures, and maybe you think you see some patterns on data. But, without having sort of formal statistical knowledge about it, you might be finding clusters where there aren’t any or kind of having insights that aren’t really insights. So, I think it’s really important to apply that, and so, for me, visualization in statistics has a couple of roles. One is to sort of trick people who are used to doing sort of regular data visualization into learning more about statistics, so providing like accessible pretty pictures hoping to entice more people into learning about stats and realizing the importance of applying statistical rigor and formal statistical tests to their data. But, there’s a second role that I think is also very important, just summarized by this. So, I don’t know if we have any Stata users in the audience.
03:45 So, this is a picture I took from the “FiveThirtyEight” blog. So, “FiveThirtyEight” is a bunch of statisticians and researchers writing about applied problems and trying to make their findings accessible to the general public. And right now, the best tool that they have for doing that is taking a screen grab of this program whose interface was designed in 1984. This is on the blog itself, which I think is really sad. And, I mean, sort of most obviously, it’s just a bunch of numbers without any context, and while experts are often used to seeing these and kind of have some intuition built in, I really don’t think this is the best way to digest this kind of information. So, I think visualization has a second role besides tricking people into learning about statistics, which is helping professional statisticians be more productive and see their numbers in context and see, “Okay, my F stat is 7.13. What should it be? What’s a big value for that? What’s a small value for that?” And so, that’s the other role that I see.
04:56 So, I’m gonna talk a little bit about both these roles going forward here. Before I get into specific examples, I wanna talk to you about some books. These are a couple of classics in the data visualization field. You might recognize one or both of them. This one on the left is very influential. John W. Tukey, who’s actually a very good scientist and statistician, wrote this kind of odd book if you go read it. It’s very important. So, this is where the box plot was introduced. John Tukey invented the box plot and with like the median and the upper quartile and the lower. He did it in this book, but if you read the book, it’s like 400 pages. There’s a lot of other stuff besides box plots that he just kind of invented and gave strange names to and told people to use even though they don’t make any sense at all. He’s just taking square roots and logarithms for no discernible reason.
06:02 He has all these weird names for things. He just invented this whole vocabulary, like the hinge and fence and, I don’t know. I thought he was a crackpot after reading it. [laughter] This, like, kind of crazy old man because he came up with all sorts of visualization techniques, but there’s really no coherence to any of it. At the same time, he came up with the box plot, which is a pretty cool contribution. And, sort of on the other end of the spectrum, what I wanted to put in opposition to Tukey is this Grammar of Graphics by Leland Wilkinson. Leland Wilkinson basically tried to bring some order to statistical graphics and particularly Tukey’s graphics by trying to unify them into this kind of coherent mathematical framework.
06:53 So, for example, and there’s all this kind of unnecessary formalism, in my opinion, in The Grammar of Graphics, but you might know it more commonly from the ggplot2 package in R, which is inspired by Leland Wilkinson. And so, just to give you an example of mathematical formalism, he… You know like proportion bars? They’re like a proportion bar. It’s like, “Oh, okay. If you’ve got this much red, that means like 60% of Republicans. We got this much blue, that’s like Democrats.” And he says, “Well, guess what? If you take a proportion bar and plot it in polar coordinates, you get a pie chart.” It’s like, “Okay, you can do that.” But, I don’t think it’s… He tries to kind of formalize it and create this coherent system for all possible data graphics, which I don’t really think is a good… I think it’s sort of a failed, sort of a doomed approach to these things. But, what I like about these two guys is they kind of represent these two almost world views for trying to understand things out there.
08:08 Do any of you know the phrase “the hedgehog and the fox”? Any hands? All right, we’ve got one… We’ve got two. So, I like to think of these guys as the hedgehog and the fox. That phrase sort of most famously comes from a book by Isaiah Berlin. It’s about Leo Tolstoy. But, it comes from, I guess, this Greek fountain inscription… It’s like a little fragment of poetry. It says, “The fox knows many things, but the hedgehog knows one big thing.” And I always really liked that quote because I think that there are really just different personality types, people who kind of understand a lot of little things very well and have a lot of tricks in their bag, and then people who try to synthesize everything into one big coherent, impregnable framework. So, I really see Tukey as kind of being the fox who’s got like a lot of little tricks that work really well, and Wilkinson having this unitary view.
09:15 One final criticism I want to offer on Wilkinson and explain to you why I’m talking about all this, is the name itself, “Grammar of Graphics.” So, that actually comes from a much older book, called “The Grammar of Ornament” by Owen Jones, and this was published in 1856, and it’s a really interesting book. So, Owen Jones went out. He was trying to… He actually is one of the first sort of color perception theorists, and the way he came up with his observations about color, like which colors work well together, is he went out and found every example of decorative tiling and decorative art that he could in every culture that he could. So, he’s got a chapter on, like, Byzantine decorative art and a chapter on like Roman decorative art, and just kind of lists all the examples that he found out there, and then from those examples, tried to synthesize some general principles.
10:14 And I really think that this is sort of the right synthesis of these two approaches, trying to both support all the successful things that are out there that might not necessarily be generated by like the single grammar, but also kind of trying to provide some unifying guidance. So, I think the best spiritual successor of this book isn’t the Leland Wilkinson, but it’s more the Edward Tufte. So, if you’ve read his books, he really just goes out sort of the same way that Owen Jones does and tries to find successful data graphics and then synthesize a few principles out of those. And sometimes, they work, and sometimes, maybe a good example doesn’t accord with the principle that he’s derived.
11:05 So, anyway, this is a very long-winded way of saying that I don’t have a unified theory of statistical graphics, and I don’t think one exists, and I think that’s okay. The rest of this talk is basically gonna be a bag of tricks for statistical visualization, things that work pretty well, and hopefully, some heuristics to go along with them. So, let’s go along for the ride here. All right, so I’ve got four basic areas that we’ll talk about. The first one we’ll talk about is Bayesian statistics, then I’m going to talk about sort of its twin, Frequentist statistics and representing that visually, and then I’m going to talk about the data graphics you’d see in Tukey or Wilkinson and try to apply statistical knowledge to those to make them more interesting and informative, and then I’m going to talk about my ideal data graphic that I found on Google Image Search.
12:00 Alright, so Bayesian beliefs. A quick refresher. Bayesians and Frequentists don’t really get along because they see the world fundamentally differently. Bayesians sort of say that everything you see, that’s all you know. That is truth, that is given, and if you think they’re related in some way, that’s just the world in your head, and you just have a belief about that, and you gotta update your beliefs. The Frequentists, on the other hand, kind of assume there’s a true relationship between the data out there, and all this data you see is like randomly generated based on those parameters. Anyway, that’s like the two sentence difference. The point is, Bayesians have beliefs, and they need to model those, and they need to think about those, and they need to represent them visually, often. Unfortunately, I haven’t seen beliefs represented visually very well. I’ve seen some things on some blogs that just kinda made me mad, so I thought I’d take this opportunity to point them out.
13:04 So, this is a graph from a blog about a Bayesian belief of some kind, and this is fairly typical. It’s like your Bayesian belief will be a continuous probability distribution function, and you can plot it because you say, “Okay, this is a function. Let me just do the most obvious thing. Let me plot that function.” But, I’ve got a couple criticisms of this graphic that I just want people to be conscious of, when they’re trying to publicize or sort of publish results about Bayesian functions or probability density functions. One is that the y-axis here actually doesn’t help you at all. Knowing that f of 0.1 = 10.5 is… There’s zero things you can do with that piece of information. So, I intend to fix it. You can just wipe out these numbers, and life is okay.
14:10 So, the way these density functions work is, the only thing that matters is the area under the curve. The height of the curve just does not matter at all, like what that value is. The reason for that being, the probability of you having any particular value like 0.1024568 is exactly zero. The only thing that you care about is integrating over an interval and saying, “What’s the probability that it’s between 0.1 and 0.2, or between 0.0 and 0.1?” So, the emphasis on these graphics really should be the area under the curve.
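A small sketch of the height-versus-area point, under assumptions that are not from the talk: the belief is taken to be a normal density centred at 0.1 with a standard deviation of 0.038, chosen only so that the curve is tall and narrow. The height at the peak comes out well above 1, so it cannot be a probability, while the area over an interval is one.

```python
from statistics import NormalDist

# Hypothetical belief: a narrow normal density (mu and sigma are illustrative choices).
belief = NormalDist(mu=0.1, sigma=0.038)

height = belief.pdf(0.1)                   # height of the curve at its peak
prob = belief.cdf(0.2) - belief.cdf(0.1)   # area under the curve between 0.1 and 0.2

print(f"height at 0.1:     {height:.2f}")
print(f"P(0.1 < x < 0.2):  {prob:.3f}")
```

Only the second number answers a question anyone would actually ask about the belief.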
14:49 So, I took a moment with Photoshop. This is my real ghetto PDF graphic. But, one possible way you might do this, and this isn’t an ultimate answer, but you can maybe kinda draw different areas that have the same, or draw boxes that have the same area just to communicate, say, there’s a 25% chance it’s in this range, and a 25% there, and a 25% there, etc. And you can do that with unadorned lines, or with alternating blue and white areas, whatever. But, yeah, if you’re plotting these functions, you should be thinking about the area under the curve, not the curve itself.
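One way to find where such equal-area boundaries go is to ask the distribution for its quartiles. The sketch below reuses a hypothetical normal belief (the parameters are assumptions, not values from the slide) and checks that each of the four resulting regions holds 25% of the probability.

```python
from statistics import NormalDist

# Hypothetical belief; any continuous distribution with an inverse CDF would do.
belief = NormalDist(mu=0.1, sigma=0.038)

# Cut points that split the area under the curve into four equal parts.
cuts = [belief.inv_cdf(q) for q in (0.25, 0.50, 0.75)]

# Cumulative probability at each boundary, including the two open ends.
cumulative = [0.0] + [belief.cdf(c) for c in cuts] + [1.0]
areas = [hi - lo for lo, hi in zip(cumulative, cumulative[1:])]

print("cut points:", [round(c, 4) for c in cuts])
print("area of each region:", [round(a, 4) for a in areas])
```

Drawing vertical lines at `cuts`, or shading alternate regions, gives the kind of graphic described above.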
15:34 So, here’s another graphic from a blog, and this is trying to put actually two PDFs on the same plot. Okay. So, on the left here, you just have the regular PDFs and kinda put on the same axes. Over here on the right, it’s a little more sophisticated. So, here’s each parameter space. So, it’s kinda like taking the blue one and kinda rotating it out here. So, we’ve got a big square and then plotting all the values for each combination of this version A and version B. For example, this point right here means that version B is equal to 0.4, and version A equals, yeah, 0.2, and the value there is dark blue. And I liked this graphic when I first saw it because it’s a good way to just go and say, “Okay, you’ve got this joint probability distribution function, and you should plot it always, right?” But, I started thinking a little bit more about this graphic, and particularly this sort of Eye of Sauron sitting in the bottom left, and as I thought about it, I actually liked it less and less.
17:00 So, if you look at it, the actual values being plotted are over here. So, it’s this nice, smooth function that kind of goes up and goes back down. But when you’re looking at plotting it in this color dimension, which is one of Wilkinson’s dimensions you can plot things in, this doesn’t look smooth at all, right? It looks kinda like eight rings. A light blue, cyan, green, yellow, etc. There’s no real way to see the same continuity there. And that’s purely a function of our own color perception, and just how the cones in our eyes work. So, I think the graphic suffers from some problems. The other is that, I know the value equals red, or 2 or 25 or something, but again, there’s nothing I can do with that piece of information. It’s the same as the y-axis on the density curve.
18:01 The only thing that matters here is the volume under the curve rather than the value itself. So, in the same way, we saw in one dimension, the only thing that matters is the area. Here, the only thing that matters is the integral, the volume, and there’s no way to see that in this graphic. Again, I decided to rectify the situation by taking to Photoshop, and I think something like this would work a bit better at trying to represent that belief in a way that you actually know what you’re looking at. So, here, what I have done is created multiple levels and said that the volume, the contained probability, in each of these spaces is equal to 99% or 95% or 90%. This is a kind of a graph where you can do a little bit more and have more of a take home message. “Okay there’s a 99% chance it’s in here.” Which you couldn’t before, and what’s neat about math is you can notice some things. So, like since this is 95%, and that’s 99%, you know the stuff in this ring is equal to 4%. So, there are other representations you can do to show that these are kind of volumes that can be added together just with simple arithmetic, and you can get some take home information.
19:31 Anyway, that’s all I have to say about Bayesian statistics and beliefs. But, I think people can do a lot better, and this is actually a fairly difficult graph. Well, not in Photoshop, it took me like two minutes. But, if you want to do this right in your plotting program, it requires a fair amount of sophistication, and you have this numerical problem of finding the smallest area that contains some percent and drawing an isometric line or whatever. But, it’s something that… It’s not a function that you’ll see in a typical plotting package, but that doesn’t mean you shouldn’t try to do it. So, all these… These plotting packages that are out there aren’t really optimized for presenting statistical concepts and I think… It really pays off to try to go the extra mile, or if you implement one of these plotting packages, it would be great if you could support functions like this too.
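The numerical problem of finding the smallest region that contains a given percent can be sketched on a grid. Everything specific here is an assumption for illustration: the two beliefs are taken to be independent Beta distributions with made-up parameters, and the grid resolution is arbitrary. The idea is to sort equal-sized grid cells from densest to least dense and keep adding cells until the requested probability is covered; the density of the last cell added is the contour height to draw.

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density at 0 < x < 1, computed in log space for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def density_threshold(cell_mass, level):
    """Mass of the last cell needed when the densest cells are taken first."""
    total = sum(cell_mass)
    running = 0.0
    for mass in sorted(cell_mass, reverse=True):
        running += mass
        if running >= level * total:
            return mass
    return 0.0

n = 100
mids = [(i + 0.5) / n for i in range(n)]        # cell midpoints on (0, 1)
pdf_a = [beta_pdf(x, 21, 81) for x in mids]     # hypothetical belief about version A
pdf_b = [beta_pdf(x, 41, 61) for x in mids]     # hypothetical belief about version B

# Probability mass of each cell of the joint belief (independence is assumed).
cell_mass = [pa * pb / n**2 for pa in pdf_a for pb in pdf_b]

thresholds = {level: density_threshold(cell_mass, level) for level in (0.90, 0.95, 0.99)}
region_size = {level: sum(m >= t for m in cell_mass) for level, t in thresholds.items()}

for level in thresholds:
    print(f"{level:.0%} region: {region_size[level]} cells")
```

Because the regions are nested, the ring between the 95% and 99% contours holds about 4% of the probability, which is the simple arithmetic the speaker points out.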
20:32 Alright, enough about Bayesianism. Hypothesis testing. So, this is actually where most of my work is, so I spend a lot of time thinking about how to visually represent a hypothesis test. So, now we are going away from the Bayesian half of the world, entering the Frequentist half of the world and trying to present ideas from Frequentist statistics in a way that lay-people can understand, and also in a way that’s useful to people who are used to just seeing these giant tables of numbers. So, typically, if you are in a statistics class, like learning about statistics and hypothesis testing, you get a chart that looks like this. It’ll say something like “Under the null hypothesis, mu equals zero.” Some percent of the time, you’ll see a test statistic in this range, and some of the time you’ll see it in that range, and if it’s far enough then it’s below your significance level.
21:36 And you say it’s significant, and you’ve got the P value and the rejection area, all sorts of stuff. You have all these concepts nicely embedded in this graphic. When you’re reading the book, you’re like “Alright, alright. I got, I kinda understand the concept here.” And then you get to the problem set, and you’re computing this test statistic by hand, and you’re going to the back of the book, and you’re looking at this… [laughter] To try to decide whether something is significant or not, which is, this table should never be printed ever again. [laughter] We have computers that can produce the value that you, I don’t know.
22:20 My very basic exhortation, like the absolute least that you can do if you’re presenting the results of a hypothesis test, is combine the values from the test with the textbook illustration. Okay? So, here’s what I do in my Wizard program. I just added this early this year. But, just take the textbook graphic and adapt it to the values that apply to the person’s data. Alright? So, here’s the textbook graphic of all the major concepts. You’ve got your test statistic here at some value. Gee, that’s less than this critical value beyond which we have 5% of the overall probability under the null hypothesis. The area beyond the test statistic, that’s equal to the P value. So, like 0.0977. That’s equal to all this dark blue, and that’s just what an F distribution looks like. [chuckle] And you don’t even know what the F distribution looks like. This is the table, but it’s very difficult to get any kind of intuition here, especially with something that has two parameters like 7 and 15, 26 degrees and I am like, I have no idea whether 1.731 is a big value or a small value just by looking at that, but when I have the overall picture, and I got the critical value. Okay. That’s getting close to the critical value or that’s not. So, this is the absolute least amount I think people should be doing any time they work with a P value because P values are always associated with some probability distribution.
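The quantities in that graphic can be computed directly rather than read from a table. The following is a rough sketch using only the Python standard library, not the method Wizard uses. The test statistic 1.731 is from the talk; the degrees of freedom (7 and 1526) are an assumption, since the transcript is unclear on them, and the step count and bisection bounds are arbitrary choices. The P value is the area under the F density beyond the test statistic, and the critical value is the point with 5% of the area beyond it.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom, for x > 0."""
    log_beta = math.lgamma(d1 / 2) + math.lgamma(d2 / 2) - math.lgamma((d1 + d2) / 2)
    log_pdf = (0.5 * (d1 * math.log(d1 * x) + d2 * math.log(d2)
                      - (d1 + d2) * math.log(d1 * x + d2))
               - math.log(x) - log_beta)
    return math.exp(log_pdf)

def f_cdf(x, d1, d2, steps=4000):
    """Area under the curve from 0 to x, approximated with the midpoint rule."""
    width = x / steps
    return sum(f_pdf((i + 0.5) * width, d1, d2) for i in range(steps)) * width

def f_critical(alpha, d1, d2, lo=0.0, hi=50.0):
    """Value with area alpha beyond it, found by bisection on the CDF."""
    for _ in range(40):
        mid = (lo + hi) / 2
        if f_cdf(mid, d1, d2) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

d1, d2 = 7, 1526          # assumed degrees of freedom (not confirmed by the source)
statistic = 1.731         # test statistic quoted in the talk

p_value = 1 - f_cdf(statistic, d1, d2)     # area beyond the test statistic
critical = f_critical(0.05, d1, d2)        # 5% of the area lies beyond this value

print(f"p-value:        {p_value:.4f}")
print(f"critical value: {critical:.3f}")
```

With the curve, the statistic, and the critical value in hand, drawing the shaded textbook picture for the user's own data is a plotting exercise.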
24:01 It’s always an area on the curve, basically. So, if it’s an area on the curve… There’s the curve, and you should just draw it. And it’s not that hard, so please do that. Alright. So, here’s the deal on visualization, this is a little bit more involved. So, this is taken from my website. I got a little two sample T tester there. But, I wanted to talk about the graphics in here a little bit, ’cause they’re really sneaky. It’s hard to appreciate exactly why they’re sneaky, but I’ll try to explain what exactly is going on here. So, just ignore these two for a minute, and just look at the top one. This is representing a confidence interval around the estimated mean, or like some sample of these eight numbers. And normally when you’re thinking about confidence intervals, it’s the sort of a set of numbers… The set of hypotheses that can’t be rejected. It’s given a confidence level, it’s called. It’s like one minus this, you get this level, and when you think about the individual hypothesis… When you’re testing a hypothesis that’s equal to 40, normally, you plant the null distribution here, at sort of the center of there, and then look at the actual value, which would be over here. So, like… Technically, the dashed line should be here, but we’re just looking at the single hypothesis that’s equal to 40, and this should be centered here and equal to the area beyond the curve over here or over there.
25:39 But, mathematically, since this thing is symmetric, you can kinda cheat. And make this… This visualization where you just put the whole distribution in the middle, like right at the test statistic, which is never what you would do with… You don’t… The whole point of a null hypothesis is, you don’t set it equal to the test statistic. It’s a sort of an external hypothesis, but if you do this, then visually and kinda geometrically, it does what you want. So, at a given confidence level of 95%, 95% of the area under this curve is between these two dashed lines. And so, you can kinda visually predict what’s going to happen to the confidence interval when you change the level. So, if we, like ratcheted it up to 99%, you could see that these would go out. That’s just this geometric trick that just sort of works, and I think this is sort of a good visualization for connecting the confidence level to the interval that people see, even though it’s this cheaty thing.
26:49 Yeah. The other thing I want to say about this graphic is or I want to talk about this thing at the bottom a little bit, which actually tries to test the hypothesis that the means in these two groups are different, and the first thing one should notice, one of the first sort of lessons you get in this class is that just because the confidence intervals overlap doesn’t mean that you won’t reject the hypothesis. God, I’m using so many negatives in that sentence. Already got lost.
[chuckle]
27:24 The point is, you might think looking at the overall data set, the two groups could be different, even though their confidence intervals overlap, if that makes sense. And you can kinda see this in this graphic, ’cause you’ve got some overlap here, but the test says these things are different. Okay. So, this test is… You can’t just derive it from these two tests. This takes into account a little bit more information. But, it’s trying to achieve sort of the same goals as before of showing you, what’s the P value in relation to the test statistic? And here you can see, “Alright, here’s my test statistic, and this is like the width of this curve is related to this standard error, and the area beyond zero is the ‘P’ value.”
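A quick numerical illustration of overlapping intervals with a significant difference. The group means and standard errors are hypothetical, and a normal (z) approximation stands in for the two-sample t-test shown on the slide, which is an assumption made to keep the sketch within the standard library.

```python
from statistics import NormalDist

z = NormalDist()
crit = z.inv_cdf(0.975)   # two-sided 95% critical value

# Hypothetical group summaries: (mean, standard error of the mean).
mean_1, se_1 = 0.0, 0.3
mean_2, se_2 = 1.0, 0.3

ci_1 = (mean_1 - crit * se_1, mean_1 + crit * se_1)
ci_2 = (mean_2 - crit * se_2, mean_2 + crit * se_2)
intervals_overlap = ci_1[1] > ci_2[0]

# The test of the difference uses the standard error of the difference,
# which is smaller than the sum of the two individual standard errors.
se_diff = (se_1**2 + se_2**2) ** 0.5
z_stat = (mean_2 - mean_1) / se_diff
p_value = 2 * (1 - z.cdf(abs(z_stat)))

print("intervals overlap:", intervals_overlap)
print(f"p-value for the difference: {p_value:.4f}")
```

The individual intervals overlap, yet the test of the difference rejects at the 5% level, because the difference has its own standard error that cannot be read off the two intervals by eye.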
28:17 But again, this is kinda cheating, because we’ve got this dark area over here, which doesn’t make a lot of sense. The way this actually should be represented is you put this curve right at zero, and you’d say, “Oh my test statistic was 24,” and then it’s like dark orange beyond that, and dark orange back at -24 too. But, the benefit of shifting this and cheating is that… You have… You trick people into being Bayesians basically. What you can do is line these up, so you have multiple estimates. As you’re comparing, you can put them on the same axes, and trick people into thinking in a Bayesian way. And what I mean by that is if you put some of these together, people think, “Oh. This is the value, but it’s really like this random distribution over this long thing.” And, it’s useful in particular just for like instantly being able to compare say multiple coefficients, and judge the relative P values for their statistical significance. And this graphic, in particular, I like, because these heights are different for a reason. So the fundamental property of all PDFs is they integrate to one, right? So, you get a wider one, it’s gotta be kinda shorter, so that the integral is still one, and with C to C you have to order these sort of like a family portrait, like the tallest, the shortest.
30:01 And the P values are kind of directly geometrically comparable. So, these exist in the same space, and you can see that the dark area is, in this one, it’s larger than the dark area of this one, and you know that that P value is larger. So, that’s a nice visual representation of having two of these multiple coefficients in the same space. Alright. Anyway, it’s a good way to represent these kind of Delta D or Beta coefficients from regression models.
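The relationship between width, height, and the dark area can be checked with two made-up estimates. The coefficients and standard errors below are hypothetical, and normal curves are assumed in place of whatever distribution the original graphic used.

```python
from statistics import NormalDist

# Hypothetical (coefficient, standard error) pairs sharing one axis.
estimates = {"narrow": (2.0, 0.5), "wide": (2.0, 1.5)}

summary = {}
for name, (coef, se) in estimates.items():
    curve = NormalDist(mu=coef, sigma=se)
    peak = curve.pdf(coef)                          # wider curve must be shorter
    tail = min(curve.cdf(0.0), 1 - curve.cdf(0.0))  # dark area beyond zero
    summary[name] = {"peak": peak, "p_value": 2 * tail}

for name, values in summary.items():
    print(f"{name:>6}: peak height {values['peak']:.3f}, p-value {values['p_value']:.4f}")
```

Both curves enclose an area of one, so tripling the standard error cuts the peak height to a third, and the larger dark area beyond zero corresponds to the larger P value.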
30:36 Alright. Enough about hypothesis testing and P values and all that stuff. This is sort of the most fun area, I think, which is trying to improve your traditional charts with some statistical knowledge. And I tried out a bunch of different techniques just over the sort of months and years trying to represent this stuff and show people what a confidence interval is, and I came up with a single, unifying dictum that works most of the time. And that is: “When in doubt, cut it out.” Alright. So, you’ve got a bar chart, and you want to show what’s the… How can we show the confidence interval around say 12.4%? Maybe it’s 7%, maybe it’s 14%, just like kinda gray it out. Add an extra, little box there, and you’ve got a good visual representation. So, it doesn’t have a whole curve the same way the previous graphic did, but this is a nice sort of very quick way to show what the statistical uncertainty is.
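To know where the gray box on a proportion bar should start and stop, an interval around the proportion is needed. The talk does not say which interval is used; the sketch below uses the Wilson score interval as one reasonable choice, and the counts (31 out of 250, which is 12.4%) are invented for illustration.

```python
from statistics import NormalDist

def wilson_interval(successes, n, level=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * ((p * (1 - p) / n + z**2 / (4 * n**2)) ** 0.5) / denom
    return centre - half, centre + half

successes, n = 31, 250          # hypothetical counts giving 12.4%
low, high = wilson_interval(successes, n)

# Solid bar up to the lower bound, gray "cut out" box from there to the upper bound.
print(f"solid from 0 to {low:.1%}, gray from {low:.1%} to {high:.1%}")
```

The bar is drawn solid up to the lower bound and grayed out between the two bounds, so the reader sees at a glance how much of the bar is uncertain.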
31:45 So, similarly… Okay [chuckle] This is just me showing that I’ve gotten quite a bit of mileage out of this technique. These are different things from my own website. I like this technique a lot, just for representing a little uncertainty, add a little extra gray box at the end to show that it’s actually between 8.5 and 22%. It’s not 16 exactly or whatever. It works for histograms too. So, you got a histogram, and you want to show the uncertainty, you just kind of cut out the confidence intervals for each of these bits, and you have a nice visual representation. You can start to see, “Okay, these things actually will be helpful.” They look different at first, but actually these confidence intervals overlap. Maybe we should look a little closer. Keeping in mind that’s not the same thing as the [32:35] ____ test.
32:36 Similarly, we’ve got a pie chart. It’s like cut out confidence intervals. It’s a really simple thing to do that shows you how much uncertainty there is in the data set, which is important for things like this when you’re comparing multiple groups. You think, “Oh man, yeah. I see a pattern here in my program that visualizes data.” There are many more men in this group than that group. But, if you cut out confidence intervals, you might have second doubts. So, this is a lower bound on the confidence intervals, which basically means like the actual number could be anywhere in the white zone. This graphic’s a little weird the way I’ve done it because you’ve got white areas on the sides. So, sorta technically, this piece should be moved over to the left, and that should be moved to the right, but I thought this looked kind of cooler, so I did it this way. [laughter] Yeah, I don’t believe in [33:47] ____.
33:48 All right, but the point of this is that you can show people visually that maybe that pattern’s not there, and you can just use these white areas to represent that nicely. Some of the white. This is one, I haven’t actually done it, I just thought of this the other day, but I think it would work. Box plots. You can have confidence intervals around medians and percentiles, like personal walls or like, I think, a couple were, well, whatever. You can come up with them. When in doubt, cut it out. It’s like that was the median. Now you say “Hey, the median might be like anywhere in there.” And it’s… What I really like about this… So, you can do it for your other statistics too, is that it gives people something familiar to start with and then kind of ease them into thinking about confidence intervals and ranges and things. But, so, I think this is a good graphic even though it’s just… It’s actually implemented as white boxes pasted into my Keynote document here. [laughter] But, I think that would work pretty well.
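One standard way to get a confidence interval around a median, which may or may not be the one the speaker had in mind, uses order statistics: for continuous data, the number of observations below the true median follows a Binomial(n, 1/2) distribution, so a pair of symmetric ranks can be chosen to give at least the desired coverage. The function name and the example data are illustrative.

```python
from math import comb

def median_confidence_interval(data, level=0.95):
    """Distribution-free interval for the median from symmetric order statistics.

    Returns (lower, upper, coverage), or None when the sample is too small
    to reach the requested level.
    """
    xs = sorted(data)
    n = len(xs)
    best = None
    for j in range(1, n // 2 + 1):
        # Coverage of [x_(j), x_(n-j+1)] is P(j <= B <= n - j), B ~ Binomial(n, 1/2).
        coverage = sum(comb(n, i) for i in range(j, n - j + 1)) / 2**n
        if coverage >= level:
            best = (xs[j - 1], xs[n - j], coverage)   # keep the narrowest valid interval
    return best

sample = list(range(1, 21))     # hypothetical data: the integers 1..20
lower, upper, coverage = median_confidence_interval(sample)
print(f"median is somewhere in [{lower}, {upper}] with coverage {coverage:.3f}")
```

The white box on the box plot would then run from `lower` to `upper` instead of a single median line.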
35:03 So, here’s the last one in this section, which is the CDF, the Cumulative Distribution Function, and you have the observed CDF in your data, but you can construct a confidence interval around it about what sort of underlying distribution that was actually drawn from. Using like the Kolmogorov-Smirnov test, which I’ll endorse more heartily in a minute. So, for that, using the fantastic power of the Photoshop, we can show this range, and say “Hey the CDF might be anywhere in there. The lines might not match up.” [laughter] So, I think that it could work pretty well. So, holy grail of statistical graphics, something I spend time thinking about. So, the main… Well, one of the main problems with what I’ve shown you so far… And one of the problems I kind of face in trying to design these graphics, is summed up here, which is that we kind of show people confidence intervals in two groups.
36:13 But that’s not the same as doing a formal test of difference between the two groups. It’s the same overlapping confidence interval problem I was talking about earlier. So, right now, I basically have two different graphics that I have to show people. I gotta show one graphic which has each individual group, and then I have to show the formal test with the test statistic, and the distribution, and the whole hypothesis and all that, which is okay. But, in my mind, it would be a lot better if I could kinda tie this back into here, somehow, have one graphic where you get to see everything here at once, and I don’t think there’s always a good solution here. It’s just for practical reasons.
37:02 So, for example, the chi-squared test, it’s basically a sum of squares of stuff. So, it’s like this difference between this and this squared, plus the difference between this and this squared, to a first approximation. If you try to show that graphically or geometrically, it gets really nasty ’cause you’re like… You’ve got a difference, and you can show that graphically okay. But, squaring it, you end up with giant squares on your page, and then you have the sum of squares, so then you’re trying to like add up the sum of these giant squares everywhere. That’s really ugly, and it’s hard to represent. Incidentally, that’s one of the reasons it’s hard to represent linear regression graphically, and come up with nice pictures for people ’cause linear regression is a least squares problem, so it’s minimizing the sum of the squares of the errors. It looks at the distance between the line and the point, and then you’ve got a big square out of that, and then you’re trying to add up all those squares, and it’s just like squares everywhere. It’s not informative. That’s just purely, ’cause that’s where the test comes from.
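For reference, the sum-of-squares structure looks like this in code. The counts are hypothetical; the usual Pearson statistic also divides each squared difference by the expected count, which is the detail glossed over by "to a first approximation".

```python
# Hypothetical observed and expected counts for three categories.
observed = [10, 20, 30]
expected = [20, 20, 20]

# One term per category: squared difference, scaled by the expected count.
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_squared = sum(terms)

print("terms:", terms)
print("chi-squared statistic:", chi_squared)
```

Each term is the area of a square (scaled), and the statistic is their total, which is why a faithful picture ends up as a pile of squares.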
38:16 So, this is the graphic I showed you earlier. I think it’s a little more successful in connecting the two groups to the test of the difference between them in the sense that you put… So, it’s sort of putting them on the same axes, we can put this here, and it lines up with the zero. And this here, it kinda lines up with the 24. So, this is a test of saying is one bigger than two? So, you visually see the subtraction between one and two, in this graphic. But it’s not… It’s still not totally successful, because it’s… One problem is with the degrees of freedom… Like this shape is not clear like how wide it should be based on these two graphics. That doesn’t sort of fall out naturally. So, it’s not a totally coherent picture, but those two things combined.
39:16 Alright. Finally, we come to the image I found on Google Search, or Google Images. This is a graphic that pulls it off. It’s in Dutch, or something. [laughter] But I’ll try to explain what’s going on here. I really think this is like the ultimate in statistical graphics. This is a visual representation of testing the difference between an observed distribution and a theoretical distribution. This is a CDF a cumulative distribution function. So, the black line is the observed CDF. So, at least 20% of the observations are 1.6 or less. This blue line is the theoretical one that we’re testing against. Is this the normal distribution or not? The way the… It’s nice, ’cause visually, first, you can give an approximation. Go down and say, “Okay. This looks like they’re kind of the same.” But, you actually do a formal test here in using this Kolmogorov-Smirnov test. And the test statistic, instead of being something squared, which is just gonna totally mess you up, it’s actually just the maximum difference between these two lines. So, wherever that, this sort of red thing, you sort of drew one everywhere, got the biggest one, which happens to be here in this picture. That’s your test statistic. Graphically, you can test that by drawing these green lines, so this is sort of the boundary, and if D goes outside the boundary, you reject the hypothesis.
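The test statistic and the boundary lines can be sketched in a few lines. The statistic below is the standard one-sample Kolmogorov-Smirnov distance. For the boundary, the sketch swaps in the Dvoretzky-Kiefer-Wolfowitz inequality, which gives a simple and slightly conservative half-width; that choice, the samples, and the function names are assumptions, and the original graphic may use exact Kolmogorov-Smirnov critical values instead.

```python
import math
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Largest vertical gap between the empirical CDF and a theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def dkw_band_halfwidth(n, alpha=0.05):
    """Half-width of a (1 - alpha) band around the empirical CDF (DKW inequality)."""
    return math.sqrt(math.log(2 / alpha) / (2 * n))

reference = NormalDist()          # theoretical distribution being tested against
n = 50

# Two hypothetical samples: one that hugs the normal curve, one shifted by 1.
close_sample = [reference.inv_cdf((i + 0.5) / n) for i in range(n)]
shifted_sample = [x + 1.0 for x in close_sample]

band = dkw_band_halfwidth(n)
for name, sample in (("close", close_sample), ("shifted", shifted_sample)):
    d = ks_statistic(sample, reference.cdf)
    verdict = "reject" if d > band else "do not reject"
    print(f"{name:>7}: D = {d:.3f}, band = {band:.3f} -> {verdict}")
```

Plotting the empirical CDF with lines `band` above and below it reproduces the green boundaries: if the theoretical curve ever leaves that corridor, D exceeds the band and the hypothesis is rejected.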
41:08 I think that’s just great, ’cause you can kind of instantly see, does the hypothesis succeed or fail, just by looking at these lines and seeing whether the blue one crosses over the green one at any point. What’s nice about this too is that it works for comparing multiple groups, in addition to just this empirical distribution. Theoretically, you compare like two observed distributions and say, “Are they drawn from the same thing?” It’s a really nice… I think this is a really great graphic for kind of getting all that information back into one image that is very low resolution and in Dutch. [laughter] Yeah. So.
42:01 You guys should like… If you’re into stats at all, you should know more about the Kolmogorov-Smirnov test because I think it’s a great test, and you can represent it graphically. It’s basically for… For seeing if two groups come from the same distribution, and most people are always testing the means between two groups. This can test everywhere. So, if maybe one’s more skewed than the other, even though they have the same means. And so, there’s… Anyway, it’s just like a 70-page book, very beautiful book. Check it out at your local library. And that’s my last slide. It’s… Yeah, so I got more… Like most of these graphics are either taken from Photoshop on my computer or from my online tools I have and the desktop program where a couple of those came from. Thank you.
The post Evan Miller – Understanding Statistics with Visualization appeared first on Thumbtack Engineering.