Benford’s Law
“God built the universe on numbers.”
If you are working with a dataset of numerical values, you might expect the leading digits to be uniformly distributed, with each of the digits 1 to 9 appearing about 11% of the time. In many real-world datasets, however, this is far from the case.
Benford’s Law (also known as the Newcomb-Benford Law, the Law of Anomalous Numbers or the First-Digit Law) states that in many datasets of decimal numbers, the leading digit is much more likely to be 1 than any other digit. More formally, a leading digit d is expected to appear with probability:
$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right), \qquad d = 1, 2, \dots, 9$$
For example, the digit 1 is expected to lead about 30.1% of the time, and the digit 9 only about 4.6%.
This can be generalized to other bases as well.
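As a minimal sketch of that generalization: in base b, the expected probability of a leading digit d (for d = 1, ..., b − 1) is log_b(1 + 1/d). The function name `benford_probabilities` below is a hypothetical helper introduced for illustration, not part of any library.

```python
from math import log

def benford_probabilities(base=10):
    """Expected leading-digit probabilities under Benford's Law in a given base.

    Hypothetical helper for illustration: returns {d: log_base(1 + 1/d)}
    for d = 1, ..., base - 1.
    """
    if base < 2:
        raise ValueError("base must be at least 2")
    return {d: log(1 + 1 / d, base) for d in range(1, base)}

# Decimal digits: 1 is the most likely leading digit, 9 the least likely
for digit, p in benford_probabilities(10).items():
    print(digit, round(p, 3))

# The same formula works in other bases, e.g. octal (digits 1 to 7)
print(benford_probabilities(8))
```

Because the terms log_b((d + 1)/d) telescope, the probabilities sum to 1 in every base.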
The pattern was first noted by the astronomer Simon Newcomb in 1881:
That the ten digits do not occur with equal frequency must be evident to any one making much use of logarithmic tables, and noticing how much faster the first pages wear out than the last ones. The first significant figure is oftener 1 than any other digit, and the frequency diminishes up to 9.
Later, in 1938, the physicist Frank Benford reintroduced the law in his paper:
It has been observed that the pages of a much used table of common logarithms show evidences of a selective use of the natural numbers. The pages containing the logarithms of the low numbers 1 and 2 are apt to be more stained and frayed by use than those of the higher numbers 8 and 9. Of course, no one could be expected to be greatly interested in the condition of a table of logarithms, but the matter may be considered more worthy of study when we recall that the table is used in the building up of our scientific, engineering, and general factual literature. There may be, in the relative cleanliness of the pages of a logarithm table, data on how we think and how we react when dealing with things that can be described by means of numbers.
Let’s look at two examples of Benford’s Law in practice.
I. Powers of two
The sequence of powers of 2, {2, 4, 8, 16, 32, …}, is a clear example of the Law of Anomalous Numbers.
# Import the necessary libraries
import matplotlib.pyplot as plt  # for plotting
from math import log10  # for base-10 logarithms
# Create a dataset of the first 200 powers of 2 (2**0 to 2**199)
dataset = [2 ** i for i in range(200)]
# Create a list of all the leading digits
leading = [int(str(x)[0]) for x in dataset]
# Possible leading digits: 1 to 9
digits = list(range(1, 10))
# Calculate the frequency of each digit
total = [leading.count(digit) for digit in digits]
freq = [x / sum(total) for x in total]
# Predictions of Benford's Law
predi = [log10(1 + 1 / d) for d in digits]
# Plot the observed frequencies (bars) and the predictions (line)
plt.bar(range(len(digits)), freq, tick_label=digits, color='orange')
plt.plot(predi, color='blue')
plt.show()
Quick code explanation:
- After importing the required libraries, the first 200 powers of 2 are generated and assigned to the variable ‘dataset’.
- The leading digits are extracted and counted, and their frequencies are stored in the variable ‘freq’.
- The predictions of Benford’s Law are calculated according to the formula mentioned above, and the results are stored in the variable ‘predi’.
- Finally, the results of the second and third steps are plotted.
The resulting plot looks like this:
The orange bars show the actual frequencies of the leading digits, while the blue line shows the predictions of Benford’s Law. As the graph shows, the predictions and the results are almost identical.
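To put a number on "almost identical", one option is to compute the largest absolute gap between the observed and the predicted frequencies. The sketch below recomputes both for the first 200 powers of 2. The names `leading_digit_frequencies` and `max_abs_deviation` are hypothetical helpers introduced here for illustration, and the maximum gap is only one of several possible measures of fit.

```python
from collections import Counter
from math import log10

def leading_digit_frequencies(numbers):
    """Observed relative frequency of each leading digit 1-9 (positive integers)."""
    counts = Counter(int(str(n)[0]) for n in numbers)
    total = sum(counts.values())
    return [counts[d] / total for d in range(1, 10)]

def max_abs_deviation(observed):
    """Largest absolute gap between observed frequencies and Benford's predictions."""
    expected = [log10(1 + 1 / d) for d in range(1, 10)]
    return max(abs(o - e) for o, e in zip(observed, expected))

powers_of_two = [2 ** i for i in range(200)]
freq = leading_digit_frequencies(powers_of_two)
print(f"Largest deviation from Benford's Law: {max_abs_deviation(freq):.4f}")
```

A small value means that no single digit strays far from its predicted share.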
II. Populations of countries
Another great example of this phenomenon can be seen in the populations of countries. To check whether the law holds, we follow the same steps as in the previous example. You can find the dataset here.
# Import the necessary libraries
import pandas as pd  # for working with CSV files
import matplotlib.pyplot as plt  # for plotting
from math import log10  # for base-10 logarithms
# Define the path to the data
path = r'population_by_country_2020.csv'
# Import the data with only the two columns that we need
data = pd.read_csv(path, usecols=['Country (or dependency)', 'Population (2020)'])
# Rename the columns for easier access
data = data.rename(columns={'Country (or dependency)': 'Country', 'Population (2020)': 'Population'})
# Create a list of all the leading digits in the populations
leading = [int(str(x)[0]) for x in data['Population']]
# Possible leading digits: 1 to 9
digits = list(range(1, 10))
# Calculate the frequency of each digit
total = [leading.count(digit) for digit in digits]
freq = [x / sum(total) for x in total]
# Predictions of Benford's Law
predi = [log10(1 + 1 / d) for d in digits]
# Plot the observed frequencies (bars) and the predictions (line)
plt.bar(range(len(digits)), freq, tick_label=digits, color='green')
plt.plot(predi, color='black')
plt.show()
Quick code explanation:
This code is very similar to the previous one, except that it uses an external dataset. After the CSV file is loaded and reduced to the country and population columns, the leading digits of the populations are extracted and the results are plotted in the same way.
The graph shows that the actual results are indeed very similar to the predictions.
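The expression `int(str(x)[0])` used to extract leading digits works for positive integers such as population counts, but it would fail or mislead on negative numbers, values below 1, or missing entries. A more defensive variant might look like the sketch below. Here `leading_digit` is a hypothetical helper, and the choice to skip zeros and missing values is an assumption rather than part of the original analysis.

```python
import math

def leading_digit(x):
    """Return the first significant decimal digit of x, or None if undefined.

    Hypothetical helper: zeros, NaN and infinities have no leading digit
    and are reported as None so the caller can drop them.
    """
    if isinstance(x, int):
        # Exact for arbitrarily large integers; no float conversion needed
        return int(str(abs(x))[0]) if x != 0 else None
    x = float(x)
    if x == 0 or math.isnan(x) or math.isinf(x):
        return None
    # Scientific notation puts the first significant digit up front;
    # 17 decimals avoid rounding 9.99... up to 1.0e+01
    return int(f"{abs(x):.17e}"[0])

# Illustrative values only, not taken from the population dataset
values = [1439323776, 0.00456, -123.4, 0, float("nan")]
digits_found = [leading_digit(v) for v in values]
leading = [d for d in digits_found if d is not None]
print(leading)  # → [1, 4, 1]
```

The filtered list `leading` can then be counted exactly as in the examples above.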
References
Newcomb, Simon (1881): “Note on the Frequency of Use of the Different Digits in Natural Numbers”, American Journal of Mathematics
Benford, Frank (1938): “The Law of Anomalous Numbers”, Proceedings of the American Philosophical Society