{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n.. redirect-from:: /gallery/statistics/histogram_features\n\n# Histogram bins, density, and weight\n\nThe `.Axes.hist` method can flexibly create histograms in a few different ways,\nwhich is flexible and helpful, but can also lead to confusion.  In particular,\nyou can:\n\n- bin the data as you want, either with an automatically chosen number of\n  bins, or with fixed bin edges,\n- normalize the histogram so that its integral is one,\n- and assign weights to the data points, so that each data point affects the\n  count in its bin differently.\n\nThe Matplotlib ``hist`` method calls `numpy.histogram` and plots the results,\ntherefore users should consult the numpy documentation for a definitive guide.\n\nHistograms are created by defining bin edges, and taking a dataset of values\nand sorting them into the bins, and counting or summing how much data is in\neach bin.  In this simple example, 9 numbers between 1 and 4 are sorted into 3\nbins:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\nimport numpy as np\n\nrng = np.random.default_rng(19680801)\n\nxdata = np.array([1.2, 2.3, 3.3, 3.1, 1.7, 3.4, 2.1, 1.25, 1.3])\nxbins = np.array([1, 2, 3, 4])\n\n# changing the style of the histogram bars just to make it\n# very clear where the boundaries of the bins are:\nstyle = {'facecolor': 'none', 'edgecolor': 'C0', 'linewidth': 3}\n\nfig, ax = plt.subplots()\nax.hist(xdata, bins=xbins, **style)\n\n# plot the xdata locations on the x axis:\nax.plot(xdata, 0*xdata, 'd')\nax.set_ylabel('Number per bin')\nax.set_xlabel('x bins (dx=1.0)')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Modifying bins\n\nChanging the bin size changes the shape of this sparse histogram, so its a\ngood idea to choose bins with some care with respect to your data.  Here we\nmake the bins half as wide.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "xbins = np.arange(1, 4.5, 0.5)\n\nfig, ax = plt.subplots()\nax.hist(xdata, bins=xbins, **style)\nax.plot(xdata, 0*xdata, 'd')\nax.set_ylabel('Number per bin')\nax.set_xlabel('x bins (dx=0.5)')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can also let numpy (via Matplotlib) choose the bins automatically, or\nspecify a number of bins to choose automatically:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fig, ax = plt.subplot_mosaic([['auto', 'n4']],\n                             sharex=True, sharey=True, layout='constrained')\n\nax['auto'].hist(xdata, **style)\nax['auto'].plot(xdata, 0*xdata, 'd')\nax['auto'].set_ylabel('Number per bin')\nax['auto'].set_xlabel('x bins (auto)')\n\nax['n4'].hist(xdata, bins=4, **style)\nax['n4'].plot(xdata, 0*xdata, 'd')\nax['n4'].set_xlabel('x bins (\"bins=4\")')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Normalizing histograms: density and weight\n\nCounts-per-bin is the default length of each bar in the histogram.  However,\nwe can also normalize the bar lengths as a probability density function using\nthe ``density`` parameter:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fig, ax = plt.subplots()\nax.hist(xdata, bins=xbins, density=True, **style)\nax.set_ylabel('Probability density [$V^{-1}$])')\nax.set_xlabel('x bins (dx=0.5 $V$)')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This normalization can be a little hard to interpret when just exploring the\ndata. The value attached to each bar is divided by the total number of data\npoints *and* the width of the bin, and thus the values _integrate_ to one\nwhen integrating across the full range of data.\ne.g. ::\n\n    density = counts / (sum(counts) * np.diff(bins))\n    np.sum(density * np.diff(bins)) == 1\n\nThis normalization is how [probability density functions](https://en.wikipedia.org/wiki/Probability_density_function) are defined in\nstatistics.  If $X$ is a random variable on $x$, then $f_X$\nis is the probability density function if $P[a<X<b] = \\int_a^b f_X dx$.\nIf the units of x are Volts, then the units of $f_X$ are $V^{-1}$\nor probability per change in voltage.\n\nThe usefulness of this normalization is a little more clear when we draw from\na known distribution and try to compare with theory.  So, choose 1000 points\nfrom a [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution), and also calculate the\nknown probability density function:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "xdata = rng.normal(size=1000)\nxpdf = np.arange(-4, 4, 0.1)\npdf = 1 / (np.sqrt(2 * np.pi)) * np.exp(-xpdf**2 / 2)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If we don't use ``density=True``, we need to scale the expected probability\ndistribution function by both the length of the data and the width of the\nbins:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fig, ax = plt.subplot_mosaic([['False', 'True']], layout='constrained')\ndx = 0.1\nxbins = np.arange(-4, 4, dx)\nax['False'].hist(xdata, bins=xbins, density=False, histtype='step', label='Counts')\n\n# scale and plot the expected pdf:\nax['False'].plot(xpdf, pdf * len(xdata) * dx, label=r'$N\\,f_X(x)\\,\\delta x$')\nax['False'].set_ylabel('Count per bin')\nax['False'].set_xlabel('x bins [V]')\nax['False'].legend()\n\nax['True'].hist(xdata, bins=xbins, density=True, histtype='step', label='density')\nax['True'].plot(xpdf, pdf, label='$f_X(x)$')\nax['True'].set_ylabel('Probability density [$V^{-1}$]')\nax['True'].set_xlabel('x bins [$V$]')\nax['True'].legend()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "One advantage of using the density is therefore that the shape and amplitude\nof the histogram does not depend on the size of the bins.  Consider an\nextreme case where the bins do not have the same width.  In this example, the\nbins below ``x=-1.25`` are six times wider than the rest of the bins.   By\nnormalizing by density, we preserve the shape of the distribution, whereas if\nwe do not, then the wider bins have much higher counts than the thinner bins:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fig, ax = plt.subplot_mosaic([['False', 'True']], layout='constrained')\ndx = 0.1\nxbins = np.hstack([np.arange(-4, -1.25, 6*dx), np.arange(-1.25, 4, dx)])\nax['False'].hist(xdata, bins=xbins, density=False, histtype='step', label='Counts')\nax['False'].plot(xpdf, pdf * len(xdata) * dx, label=r'$N\\,f_X(x)\\,\\delta x_0$')\nax['False'].set_ylabel('Count per bin')\nax['False'].set_xlabel('x bins [V]')\nax['False'].legend()\n\nax['True'].hist(xdata, bins=xbins, density=True, histtype='step', label='density')\nax['True'].plot(xpdf, pdf, label='$f_X(x)$')\nax['True'].set_ylabel('Probability density [$V^{-1}$]')\nax['True'].set_xlabel('x bins [$V$]')\nax['True'].legend()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Similarly, if we want to compare histograms with different bin widths, we may\nwant to use ``density=True``:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fig, ax = plt.subplot_mosaic([['False', 'True']], layout='constrained')\n\n# expected PDF\nax['True'].plot(xpdf, pdf, '--', label='$f_X(x)$', color='k')\n\nfor nn, dx in enumerate([0.1, 0.4, 1.2]):\n    xbins = np.arange(-4, 4, dx)\n    # expected histogram:\n    ax['False'].plot(xpdf, pdf*1000*dx, '--', color=f'C{nn}')\n    ax['False'].hist(xdata, bins=xbins, density=False, histtype='step')\n\n    ax['True'].hist(xdata, bins=xbins, density=True, histtype='step', label=dx)\n\n# Labels:\nax['False'].set_xlabel('x bins [$V$]')\nax['False'].set_ylabel('Count per bin')\nax['True'].set_ylabel('Probability density [$V^{-1}$]')\nax['True'].set_xlabel('x bins [$V$]')\nax['True'].legend(fontsize='small', title='bin width:')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Sometimes people want to normalize so that the sum of counts is one.  This is\nanalogous to a [probability mass function](https://en.wikipedia.org/wiki/Probability_mass_function) for a discrete\nvariable where the sum of probabilities for all the values equals one.  Using\n``hist``, we can get this normalization if we set the *weights* to 1/N.\nNote that the amplitude of this normalized histogram still depends on\nwidth and/or number of the bins:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fig, ax = plt.subplots(layout='constrained', figsize=(3.5, 3))\n\nfor nn, dx in enumerate([0.1, 0.4, 1.2]):\n    xbins = np.arange(-4, 4, dx)\n    ax.hist(xdata, bins=xbins, weights=1/len(xdata) * np.ones(len(xdata)),\n                   histtype='step', label=f'{dx}')\nax.set_xlabel('x bins [$V$]')\nax.set_ylabel('Bin count / N')\nax.legend(fontsize='small', title='bin width:')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The value of normalizing histograms is comparing two distributions that have\ndifferent sized populations.  Here we compare the distribution of ``xdata``\nwith a population of 1000, and ``xdata2`` with 100 members.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "xdata2 = rng.normal(size=100)\n\nfig, ax = plt.subplot_mosaic([['no_norm', 'density', 'weight']],\n                             layout='constrained', figsize=(8, 4))\n\nxbins = np.arange(-4, 4, 0.25)\n\nax['no_norm'].hist(xdata, bins=xbins, histtype='step')\nax['no_norm'].hist(xdata2, bins=xbins, histtype='step')\nax['no_norm'].set_ylabel('Counts')\nax['no_norm'].set_xlabel('x bins [$V$]')\nax['no_norm'].set_title('No normalization')\n\nax['density'].hist(xdata, bins=xbins, histtype='step', density=True)\nax['density'].hist(xdata2, bins=xbins, histtype='step', density=True)\nax['density'].set_ylabel('Probability density [$V^{-1}$]')\nax['density'].set_title('Density=True')\nax['density'].set_xlabel('x bins [$V$]')\n\nax['weight'].hist(xdata, bins=xbins, histtype='step',\n                  weights=1 / len(xdata) * np.ones(len(xdata)),\n                  label='N=1000')\nax['weight'].hist(xdata2, bins=xbins, histtype='step',\n                  weights=1 / len(xdata2) * np.ones(len(xdata2)),\n                  label='N=100')\nax['weight'].set_xlabel('x bins [$V$]')\nax['weight'].set_ylabel('Counts / N')\nax['weight'].legend(fontsize='small')\nax['weight'].set_title('Weight = 1/N')\n\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ".. tags:: plot-type: histogram, domain: statistics\n\n.. admonition:: References\n\n   The use of the following functions, methods, classes and modules is shown\n   in this example:\n\n   - `matplotlib.axes.Axes.hist` / `matplotlib.pyplot.hist`\n   - `matplotlib.axes.Axes.set_title`\n   - `matplotlib.axes.Axes.set_xlabel`\n   - `matplotlib.axes.Axes.set_ylabel`\n   - `matplotlib.axes.Axes.legend`\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.13.2"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}