Distribution

Blake Tolman
Mar 5, 2021

As a Data Scientist, you will often encounter different statistical distributions when working with data. Part of the job is selecting which distribution best represents a given set of data, and understanding how the data was generated is key to identifying the trends it shows.

There are many different types of distributions out there, but for the purposes of this article we can look at a handful that cover the majority of situations the average person will come across.

So what is a distribution? A statistical distribution can be summarized as a representation of the frequencies of potential events, or the percentage of time each event occurs. Distributions can be separated into two main categories: discrete and continuous.

Discrete distributions have a countable number of outcomes, which means that the potential outcomes can be put into a list. An example of this would be rolling a six-sided die. You know that when rolling the die once, you will obtain a number between 1 and 6, with each outcome equally likely, as denoted in the table below:

Graphically, it would look like this:

Note how, with a fair die, the chance of throwing each number is exactly 1/6 (about 0.167). When working with discrete data, you use a Probability Mass Function (PMF) to describe it.
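To make this concrete, here is a minimal sketch (using NumPy and Matplotlib, which are not part of the original example) that builds and plots the PMF of a fair die by hand:

```python
# A fair six-sided die: the PMF assigns probability 1/6 to each face.
import numpy as np
import matplotlib.pyplot as plt

faces = np.arange(1, 7)      # possible outcomes: 1 through 6
pmf = np.full(6, 1 / 6)      # each outcome is equally likely

plt.bar(faces, pmf)
plt.xlabel("Die face")
plt.ylabel("P(X = x)")
plt.title("PMF of a fair six-sided die")
plt.show()
```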

The other distribution category to be aware of is continuous. A continuous distribution has an infinite number of possible values, and the probability associated with any single exact value is zero. An example of a continuous distribution would be the temperature on any given day. If we were to look at the temperatures in NYC on June 1st, we would get a graph like the one shown below:

Thinking about this, you could say that the temperature generally ranges between 65 and 95 degrees, with the average around 80 degrees Fahrenheit. Just as the PMF describes discrete data, a Probability Density Function (PDF) is used to describe continuous data.
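As a rough sketch, suppose we model those June 1st temperatures as approximately normal with a mean of 80°F and a standard deviation of 5°F (both values assumed here purely for illustration). SciPy's norm can then give us the PDF, and areas under that curve give probabilities for ranges of temperatures:

```python
# Illustrative only: temperatures modeled as roughly normal with
# an assumed mean of 80°F and standard deviation of 5°F.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

temps = np.linspace(60, 100, 200)
density = norm.pdf(temps, loc=80, scale=5)   # PDF evaluated at each temperature

plt.plot(temps, density)
plt.xlabel("Temperature (°F)")
plt.ylabel("Density")
plt.title("PDF of an assumed temperature distribution")
plt.show()

# Probability of a day between 65°F and 95°F = area under the curve on [65, 95]
# (about 99.7% for these assumed values, since that range is ±3 standard deviations):
print(norm.cdf(95, loc=80, scale=5) - norm.cdf(65, loc=80, scale=5))
```

Notice that the probability of hitting any single exact temperature is zero; only ranges carry probability, which is why we look at areas under the curve rather than the height at one point.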

Examples of Discrete Distributions

The Bernoulli Distribution:

The Bernoulli distribution represents the probability of success for a certain experiment (the outcome being “success or not”, so there are two possible outcomes). A coin toss is a classic example of a Bernoulli experiment with a probability of success 0.5 or 50%, but a Bernoulli experiment can have any probability of success between 0 and 1.
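As a quick sketch with SciPy (an assumed tool choice, not from the article), a fair coin can be modeled as a Bernoulli trial with p = 0.5:

```python
# A Bernoulli trial with success probability p.
from scipy.stats import bernoulli

p = 0.5                              # probability of success (a fair coin)
print(bernoulli.pmf(1, p))           # P(success)  -> 0.5
print(bernoulli.pmf(0, p))           # P(failure)  -> 0.5
print(bernoulli.rvs(p, size=10))     # simulate ten coin tosses (1 = success)
```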

The Poisson Distribution:

The Poisson distribution represents the probability of n events in a given time period when the overall rate of occurrence is constant. A typical example is pieces of mail. If the overall rate at which you receive mail is constant, the number of items received on a single day (or month) follows a Poisson distribution. Other examples might include visitors to a website, customers arriving at a store, or clients waiting to be served in a queue.
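Here is a short sketch of the mail example, assuming an average of 4 items per day (a made-up rate for illustration):

```python
# If mail arrives at an assumed constant average rate of 4 items per day,
# the daily count follows a Poisson distribution with lambda = 4.
from scipy.stats import poisson

lam = 4                              # average number of items per day (assumed)
print(poisson.pmf(0, lam))           # probability of getting no mail on a day
print(poisson.pmf(4, lam))           # probability of getting exactly 4 items
print(poisson.rvs(lam, size=7))      # simulate a week of daily mail counts
```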

The Uniform Distribution:

The uniform distribution occurs when all possible outcomes are equally likely. The dice example shown before follows a uniform distribution with equal probabilities for throwing values from 1 to 6. The dice example follows a discrete uniform distribution, but continuous uniform distributions exist as well.
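Here is a small sketch of the die as a discrete uniform distribution, using SciPy's randint (again, an assumed tool choice):

```python
# scipy.stats.randint is a discrete uniform over the integers low, ..., high - 1.
from scipy.stats import randint

die = randint(low=1, high=7)    # outcomes 1 through 6
print(die.pmf(3))               # 1/6 for any face
print(die.rvs(size=10))         # ten simulated rolls
```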

Examples of Continuous Distributions

The Normal or Gaussian distribution:

The normal distribution is the single most important distribution, and you will come across it constantly. It follows a bell shape and is a foundational distribution for many models and theories in statistics and data science. A normal distribution turns up very often when dealing with real-world data, including people's heights and weights, errors in measurements, or grades on a test.
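A minimal sketch of sampling from a normal distribution with NumPy, using an arbitrary mean and standard deviation (the numbers are assumptions for illustration, not real height data):

```python
# Draw samples from an assumed normal distribution and check how they spread.
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=10, size=10_000)   # e.g. heights in cm

print(heights.mean())   # close to 170
print(heights.std())    # close to 10

# Roughly 68% of samples fall within one standard deviation of the mean:
print(np.mean(np.abs(heights - 170) < 10))
```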

As you can begin to see, there are many types of distributions for many different scenarios. Each distribution type describes a different set of data and trends. An example of several different distribution shapes can be seen below. The horizontal axis in each chart represents the set of possible numeric outcomes, and the vertical axis describes the probability of the respective outcomes.
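As a rough sketch, a comparison chart like that could be generated by sampling from each of the distributions above and plotting the resulting histograms side by side (the specific parameters are assumed here for illustration):

```python
# Sample from several distributions and compare their shapes.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
samples = {
    "Bernoulli(p=0.5)": rng.binomial(n=1, p=0.5, size=1000),
    "Poisson(lam=4)": rng.poisson(lam=4, size=1000),
    "Uniform(1-6)": rng.integers(1, 7, size=1000),
    "Normal(0, 1)": rng.normal(0, 1, size=1000),
}

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
for ax, (name, data) in zip(axes, samples.items()):
    ax.hist(data, bins=20)
    ax.set_title(name)
    ax.set_xlabel("Outcome")
axes[0].set_ylabel("Frequency")
plt.tight_layout()
plt.show()
```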
