Probability-generating functions
205 points by todsacerdoti 7 days ago | 48 comments
  • shiandow 7 days ago |
    Probably worth mentioning the moment-generating function as well, since it's a bit more elementary than the characteristic function and shares many of the properties. It's also more simply related to the probability-generating function: you can go from one to the other with a basic change of coordinates (t -> log(x)). I'd also estimate that the moment-generating function is easier to calculate in most cases.

    In fact most properties of the PGF come from the moment-generating/characteristic function, including why the second derivative is related to the variance. The second derivative of the moment-generating function at zero is the second moment E[X^2]. The second derivative of the logarithm of the MGF at zero is the variance, by definition.

    The one property that's somewhat unique to the PGF is how composition relates to drawing a randomly-sized sample, which I can see could be useful.
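
    For concreteness: with G_X(x) = E[x^X] and M_X(s) = E[e^{sX}], the change of coordinates is just

      G_X(x) = E[e^{X log x}] = M_X(log x)

    and the random-sample-size property is: if S = X_1 + ... + X_N with N independent of the iid X_i, then

      G_S(t) = E[ E[t^{X_1 + ... + X_N} | N] ] = E[ G_X(t)^N ] = G_N(G_X(t))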

    • adiM 7 days ago |
      One can also think of probability-generating functions as (flipped) Z transforms, moment-generating functions as (flipped) Laplace transforms, and characteristic functions as Fourier transforms of the respective PMF/PDF. A lot of their properties then follow from simple properties of Signals and Systems.
      • glial 7 days ago |
        Do you have a reference that explains this in more detail? I'd be curious to know.
        • adiM 7 days ago |
          Don't have a reference off the top of my head, but the main idea is as follows:

          The definition of MGF of a random variable with PDF f(x) is

          E[e^{sX}] = int_{-inf}^{inf} f(x) e^{sx} dx

          The definition of Laplace Transform of a signal f(t) is

          F(s) = int _{-inf}^{inf} f(t) e^{-st} dt

          Hence the MGF is a 'flipped' Laplace transform.

          Now we know that the MGF of a sum of independent RVs is the product of their MGFs. So if we take the inverse Laplace transform, the density of the sum is the convolution of the individual densities.

          Similarly, if we take a derivative in the frequency domain, that is the same as multiplying by the time variable in the time domain: so M'_X(s) is the 'flipped Laplace transform' of x f(x), and its value at s=0 is the 'DC gain' of that signal, i.e. E[X].

          And so on... the properties are all immediate consequences of the definition of the MGF, and since the definition is essentially the same as that of a Laplace transform, there is an equivalent property in signals and systems as well.
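
          As a quick numerical sanity check of the product/convolution property (my own toy example, not from the article: an Exponential plus a Gamma):

            import numpy as np

            # The MGF of a sum of independent RVs is the product of their MGFs,
            # i.e. multiplication in the transform domain <-> convolution of densities.
            rng = np.random.default_rng(0)
            X = rng.exponential(scale=1.0, size=1_000_000)       # MGF: 1/(1-s) for s < 1
            Y = rng.gamma(shape=2.0, scale=1.0, size=1_000_000)  # MGF: 1/(1-s)^2 for s < 1

            s = 0.3
            lhs = np.mean(np.exp(s * (X + Y)))                     # Monte Carlo MGF of X+Y at s
            rhs = np.mean(np.exp(s * X)) * np.mean(np.exp(s * Y))  # product of the two MGFs
            print(lhs, rhs, 1 / (1 - s) ** 3)                      # agree up to sampling error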

  • sampo 7 days ago |
    Herbert S. Wilf (1990): Generatingfunctionology. https://www2.math.upenn.edu/~wilf/gfology2.pdf
  • JPC21 7 days ago |
    Generating functions are also a great tool in combinatorics (see for example the book Analytic Combinatorics by Flajolet and Sedgewick).
  • jampekka 7 days ago |
    Lost the plot at "... and for sensible v, this is equivalent to ..." :(
    • Fellshard 7 days ago |
      For that, I would look at early calculus/pre-calc, I think, examining infinite series and their properties and equivalences.

      There are certain forms like that which have well-known values they converge to as you keep adding terms out to infinity. Sometimes that convergence is only possible if your domain is limited, e.g. to [0,1].

      • jampekka 7 days ago |
        That clears it up, thanks! I didn't figure out that "sensible" referred to convergent G(t).
        • foldU 7 days ago |
          You also don't really have to worry about convergence, since these are formal power series.
        • madcaptenor 7 days ago |
          It doesn't mean that in general; I think the author is saying "for sensible v" somewhat informally to mean "for v for which this makes sense".
    • coliveira 7 days ago |
      This is just the sum of a geometric sequence.
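
      Spelling that step out (assuming the expression in the article is indeed geometric in t):

        1 + x + x^2 + x^3 + ... = 1/(1 - x)   for |x| < 1

      which follows from S - xS = 1 if S denotes the sum.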
  • pedroth 7 days ago |
    "I have long struggled with understanding what probability-generating functions are and how to intuit them. There were two pieces of the puzzle missing for me, and we’ll go through both in this article."

    Great article. For more, I really recommend Analytic Combinatorics:

    https://ac.cs.princeton.edu/home/

    • hintymad 7 days ago |
      Second this. This class is a classic example of a conceptual blockbuster: once one learns it, the complexity analysis of algorithms will never be the same again. In general, if a techie wants to spend their spare time learning new stuff, they will be better off focusing on such conceptual material, as the return compounds over the years.
  • KvanteKat 7 days ago |
    For those interested in looking slightly more into the characteristic function, it may be worth pointing out that the characteristic function is equal to the Fourier transform (with the sign of the argument reversed) of the probability distribution in question.

    In my own experience teaching probability theory to physicists and engineers, establishing this connection is often a good way of helping people build intuition for why characteristic functions are so useful, why they crop up everywhere in probability theory, and why we can extract so much useful information about a distribution by looking at its characteristic function (since this group of students tends to already be rather familiar with Fourier transforms).

    • bc569a80a344f9c 7 days ago |
      I had not made that connection and find that incredibly useful. Thank you for pointing that out.
    • ysofunny 7 days ago |
      but isn't a characteristic function just "the" way to bridge the gap between sets, functions, and logic(? ...a 3way bridge!?)

      I mean, it was useful for me to think about like a translation between sets and logic (this variable x is in the set xor not) into functions (a function f(x) that returns 1 or true whenever x is in set S)

      how the heck is that a fourier transform!??

      • jamessb 7 days ago |
        You're thinking of a "characteristic function" in the sense of the "indicator function" of a subset (https://en.wikipedia.org/wiki/Indicator_function), which is a different thing from the characteristic function of a probability distribution.
      • KvanteKat 7 days ago |
        You can think of it like this:

        - The characteristic function of a random variable X is defined as the function that maps t --> ExpectedValue[ exp( i * t * X ) ]

        - Computing this expected value is the same as regarding t as a constant and integrating the function x --> exp( i * t * x) with respect to the distribution of X, i.e. if X has the density f, we compute the integral of f(x) * exp( i * t * x) with respect to x over the domain of f.

        - On the other hand: computing the Fourier transform of f (here representing the density of X) and evaluating it at the point t (i.e. computing (F(f))(t) if F represents the Fourier transform) is the same as fixing t and computing the integral of f(x) * exp( -i * t * x) with respect to x.

        - Rearranging the integrand in the previous expression to f(x) * exp( i * -t * x), we see that it is the same as the integrand used in the characteristic function, only with a -t instead of a t.

        Hope that helps :)
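
        A quick numerical illustration of the sign-flip relationship (my own example: a standard normal, whose characteristic function is exp(-t^2/2)):

          import numpy as np

          # E[exp(i t X)] for X ~ N(0,1), versus the Fourier transform
          # F(f)(w) = int f(x) exp(-i w x) dx of its density, evaluated at w = -t.
          x = np.linspace(-10.0, 10.0, 200_001)
          dx = x[1] - x[0]
          f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)       # density of X

          def fourier(w):
              return np.sum(f * np.exp(-1j * w * x)) * dx  # Riemann-sum Fourier transform of f

          t = 1.7
          char_fn = np.sum(f * np.exp(1j * t * x)) * dx    # E[exp(i t X)]
          print(char_fn.real, fourier(-t).real, np.exp(-t**2 / 2))  # all three agree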

      • beagle3 7 days ago |
        “Characteristic function” is (was) an overloaded term.

        What you described is more often referred to as an “indicator function” these days, with “characteristic function” denoting the transform (Fourier, Laplace, Z, depending on context). Closely related to “moment-generating functions”, to the point of being almost interchangeable.

        • ysofunny 7 days ago |
          so the same thing, but the characteristic function as I knew it before these posts is a rudimentary 2-variable finite version: point and line (but the line is a curve, a circle because e).

          but the new and improved 21st-century characteristic functions are n-variable and have a full continuous spectrum of values between zero (false) and one (true), but only potentially, lest the infinite realize itself (which would make the theories illogical).

          this way of thinking about this makes sense to me, even if it's ever so slightly wrong by some nitpickable point https://en.wikipedia.org/wiki/Moment-generating_function

    • jamessb 7 days ago |
      Yes, this provides good intuition about why it is useful: the PDF of the sum of two random variables is the convolution of the original PDFs. A convolution is awkward to work with, but by the convolution theorem it is a multiplication in the Fourier domain. This immediately suggests that the Fourier transform of a PDF would be a useful thing to work with.

      If you don't say that this is what you are doing then it all seems quite mysterious.
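
      A small numerical sketch of exactly that (my own example: two independent Uniform(0,1) variables, whose sum has the triangular density on [0,2]):

        import numpy as np

        # Density of X + Y for independent X, Y ~ Uniform(0,1): convolve the densities
        # directly, or equivalently multiply their (zero-padded) Fourier transforms.
        dx = 0.001
        f = np.ones(1000)                 # Uniform(0,1) density sampled on [0, 1)
        n = 2 * len(f) - 1                # length needed for a linear convolution

        direct = np.convolve(f, f) * dx   # convolution in the original domain
        via_fft = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(f, n), n) * dx  # product in Fourier domain

        s = np.arange(n) * dx
        triangle = np.minimum(s, 2 - s)            # exact density of X + Y
        print(np.max(np.abs(direct - via_fft)))    # agree to floating-point precision
        print(np.max(np.abs(direct - triangle)))   # small discretization error (about dx)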

      • creata 7 days ago |
        > the PDF of the sum of two random variables is the convolution of the original PDFs

        (Probably obvious to everyone reading, but the variables should be independent.)

        • schmidtleonard 7 days ago |
          But I'd rather assume the variables are independent and then blame statistics when I get the wrong answer!
          • bokenator 5 days ago |
            This is a good place to use cumulants. Instead of working with joint characteristic functions, which gets messy, it lets you isolate the effects of correlation into a separate term. The only limitation is that this doesn't work if the moment doesn't exist.
    • fermisea 7 days ago |
      As a physicist, the moment when everything just clicked was when I realised that connected Feynman diagrams were basically the cumulants of that distribution. Then almost everything in physics is about "what is the characteristic/moment/cumulant generating function?" and associated Legendre transforms
      • lr1970 6 days ago |
        > As a physicist, the moment when everything just clicked was when I realised that connected Feynman diagrams were basically the cumulants of that distribution.

        And the generating function of the cumulants is the logarithm of the generating function of the distribution (Fourier transform).
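
        Spelled out, with K_X(t) = log E[e^{tX}] the cumulant-generating function:

          K_X(t) = kappa_1 t + kappa_2 t^2/2! + kappa_3 t^3/3! + ...

        where kappa_1 is the mean, kappa_2 the variance, kappa_3 the third central moment; and for independent X and Y the cumulants simply add, since K_{X+Y} = K_X + K_Y.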

      • GemesAS 6 days ago |
        A little-known bit of history is that Feynman developed a diagrammatic method for expressing the moments of PGFs in his study of the stochastic theory of fission chains. This was before his work on QED. See:

        https://www.osti.gov/biblio/1775045

      • credit_guy 6 days ago |
        Wow. I am not a physicist, but I use pdfs and moments and cumulants all the time. I came up with my own method to calculate cumulants for affine processes using some recursions, and they work. But if I hear you right, I might have stumbled upon something that Feynman did 70 years ago, and he probably did it better. Any good links you can recommend?
    • nycticorax 7 days ago |
      I feel like it's almost criminal of textbook writers not to mention this when introducing the characteristic function... At least as an aside or a footnote, for readers already familiar with Fourier transforms.
  • vector_spaces 7 days ago |
    I think the 12th order G(t) example is missing another term with coefficient 1/5 -- since these coefficients must sum to 1
    • stocknoob 7 days ago |
      Unless it’s already been fixed, the queen’s term (t^12) has 2/5 in the article.
  • gorgoiler 7 days ago |
    ”If we want to encode the vector [6,2,8,4] in a single expression we can create a function containing those numbers: f(x) = 6 + 2x + 8x² + 4x³

    …or if you flip the vector and use x=10:

      6284
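
    In code, the flip-and-evaluate step is just:

      coeffs = [6, 2, 8, 4]
      # reversed -> [4, 8, 2, 6]; evaluating that polynomial at x = 10 recovers the digits 6 2 8 4
      print(sum(c * 10**i for i, c in enumerate(reversed(coeffs))))  # 6284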
    • AgentMatt 7 days ago |
      Indeed. However, note that this is limited to encoding values between 0 and 9.
    • Sharlin 7 days ago |
      Yes, but the polynomial form generalizes to coefficients of an arbitrary field, not just naturals. If your vector were, say, [1.3, 2.197656, pi, -1/2, 3*2i] then there wouldn’t be a reasonable base you could pick for a place-value representation.
  • esafak 7 days ago |
    I rather like the derivation of the CLT using MGFs.

    https://courses.cs.washington.edu/courses/cse312/20su/files/...

  • Sharlin 7 days ago |
    Probably worth noting that, as we now know, polynomials (over a field) are a vector space, not just convertible to one. The set of formal variables { x^0, x^1, x^2, … } is an orthogonal basis.
    • sampo 6 days ago |
      A basis yes, but not orthogonal.
      • Sharlin 6 days ago |
        How so? It’s impossible to express x^n as any linear combination of x^k, k != n.
        • sampo 6 days ago |
          • Sharlin 6 days ago |
            Okay, I think I get it… I thought one would/could define orthogonality of polynomials in terms of an inner product equivalent to the familiar K^n dot product, ie. sum of pairwise products of equal-degree coefficients, but I guess polynomials just don’t work that way. I can’t say I understand the hows and whys of the "correct" inner product given by Wikipedia :/
            • sampo 6 days ago |
              > I thought one would/could define orthogonality of polynomials in terms of an inner product equivalent to the familiar K^n dot product, ie. sum of pairwise products of equal-degree coefficients

              Well, you can. But then you essentially get the real coordinate space, and the fact that they are polynomials and not just real-valued vectors is just an extra story slapped on top that has no relevance.

              Polynomial vector spaces become useful when we treat polynomials as functions, and the vector space of polynomials as a subspace of some other space of functions. And with function spaces the inner product is defined as an integral of the product of two functions (possibly with a weight function thrown in).

              Or maybe the coefficient-based inner product spaces of polynomials have some uses, too. But they are less common than the function-based (integral) inner product.
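
              A concrete example of the function-space version: with the inner product <p, q> = int_{-1}^{1} p(x) q(x) dx, the monomials are not orthogonal, since e.g.

                <1, x^2> = int_{-1}^{1} x^2 dx = 2/3 != 0

              and the basis that is orthogonal for this inner product is the Legendre polynomials 1, x, x^2 - 1/3, ...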

              • Sharlin 4 days ago |
                Yeah, that's what I figured. Thanks for your comments, I learned something new!
  • jldugger 7 days ago |
    I've always wondered why the hell generating functions existed, and I think this line sums it up:

    > When de Moivre invented much of modern probability in the mid-1700s, he didn’t have vectors! Vectors are an 1800s invention.

    Doesn't explain why we still teach them 300 years later though. That's what the second half of the article covers.

  • derefr 7 days ago |
    Is there a relationship between algebraic polynomial encoding of sequences, and https://en.wikipedia.org/wiki/G%C3%B6del_numbering_for_seque... ?

    Does an encoding of a sequence in a given Gödel numbering, also somehow "retrievably" encode the probability space of the sequence's terms?

  • mturmon 7 days ago |
    In a footnote OP says:

    > I’m not yet good enough to intuitively get why the curvature of the probability-generating function would be related to variance, but I’d be happy to receive pointers here.

    Here’s my intuition for this.

    The characteristic function is the Fourier transform of the density.

    If the density is in “t” units, the ch.f. is in f = 1/t units. It is the “inverse domain.” (I’m using “f” to suggest frequency, i.e. the Fourier coordinate.)

    Of course it is not a simple coordinate transformation! But some intuition does carry over.

    This is reflected in all sorts of ways. It’s one reason why the IFT formula is so functionally close to the FT formula.

    Anyway.

    Because of this, the behavior of the FT (ch.f.) very close to the origin (“f=0”) tells about the tails of the distribution (t = 1/f is large).

    In particular, high curvature around the origin tells you the tails are heavy. That’s the variance.

    This extends to the fourth moment. You can get even sharper curvature around the origin of the FT (ch.f. at f=0) with a large coefficient on the fourth order term. This corresponds to a large fourth moment of the pdf, or a high kurtosis.
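
    Concretely, expanding the ch.f. around the origin (using f for its argument as above):

      phi_X(f) = E[e^{i f X}] = 1 + i E[X] f - E[X^2] f^2/2 - i E[X^3] f^3/6 + E[X^4] f^4/24 + ...

    so the second derivative at f = 0 is -E[X^2]: for a centered X, the curvature at the origin is (minus) the variance, and the f^4 coefficient carries the fourth moment.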

    It’s useful to recall that, because of analytic continuation, knowing all the derivatives at the one point f=0 determines the ch.f. everywhere, and thereby determines the complete density. This corresponds to the fact that knowing all the moments determines the full density.

    So in a very real sense, you only need the ch.f. in a tight neighborhood of the origin!

    (Provided all moments are finite.)