Generating a normal distribution in SQL

09.2020 | Category: | Tags: development , sql help

09.2020

Category:

SQL and PostgreSQL are perfect tools to analyze data. However, they can also be used to create sample data which has to possess certain statistical properties. One thing many people need quite often is a normal distribution. The main question therefore is: How can I create this kind of sample data?

Table of Contents

Tablefunc: Creating normal distributions

The first thing you have to do is to enable the tablefunc extension, which is actually quite simple to do:

test=# CREATE EXTENSION tablefunc;
CREATE EXTENSION

1 2	test=# CREATE EXTENSION tablefunc; CREATE EXTENSION

“tablefunc” is there by default if “postgresql-contrib” has been installed. Once the module has been enabled the desired functions will already be there:

test=# df *normal_rand*
                                         List of functions
 Schema |    Name     |    Result data type    |             Argument data types             | Type 
--------+-------------+------------------------+---------------------------------------------+------
 public | normal_rand | SETOF double precision | integer, double precision, double precision | func
(1 row)

1

2

3

4

5

6

test=# df *normal_rand*

List of functions

Schema | Name | Result data type | Argument data types | Type

--------+-------------+------------------------+---------------------------------------------+------

public | normal_rand | SETOF double precision | integer, double precision, double precision | func

(1 row)

The normal_rand function takes 3 parameters:

Number of rows to be generated
Average value
Desired standard deviation

If you want to run the function, you can simply put it into the FROM-clause and pass the desired
parameters to the function:

test=# SELECT row_number() OVER () AS id, x 
	FROM normal_rand(10, 5, 1) AS x 
	ORDER BY 1;
 id |         x          
----+--------------------
  1 |  4.332941804386073
  2 |  4.905662881624426
  3 |  3.661038976418651
  4 | 6.0087510163144415
  5 |  4.934066454147052
  6 |  5.909371874123449
  7 |  5.016528121699469
  8 |  4.640932937572484
  9 |  7.695984939477616
 10 |  5.647677569953539
(10 rows)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

test=# SELECT row_number() OVER () AS id, x

FROM normal_rand(10, 5, 1) AS x

ORDER BY 1;

id | x

----+--------------------

1 | 4.332941804386073

2 | 4.905662881624426

3 | 3.661038976418651

4 | 6.0087510163144415

5 | 4.934066454147052

6 | 5.909371874123449

7 | 5.016528121699469

8 | 4.640932937572484

9 | 7.695984939477616

10 | 5.647677569953539

(10 rows)

In this case 10 rows were created. The average value is 5 and the standard deviation has been set to 1. At first glance the data looks ok.

Testing your sample data

Let us test the function and see if it really does what it promises. 10 rows won't be enough for that so I decided to repeat the test with more data:

test=# SELECT count(*), avg(x), stddev(x) 
	FROM normal_rand(1000000, 5, 1) AS x;
  count  |        avg        |      stddev       
---------+-------------------+-------------------
 1000000 | 5.000273593685213 | 1.000627143792761
(1 row)

1

2

3

4

5

6

test=# SELECT count(*), avg(x), stddev(x)

FROM normal_rand(1000000, 5, 1) AS x;

count | avg | stddev

---------+-------------------+-------------------

1000000 | 5.000273593685213 | 1.000627143792761

(1 row)

Running the test with 1 million rows clearly shows that the output is perfect. The average value is very close to 5 and the same holds true for the standard deviation. You can therefore safely use the output to handle all your calculations.

Once you have a Gaussian distribution, you can nicely turn it into some other distribution of your choice or simply built on this data.

Finally...

If you want to know more about data, statistical distributions and so on you can check out one of our other posts about fraud detection.

Generating a normal distribution in SQL

Tablefunc: Creating normal distributions

Testing your sample data

Finally...

Leave a Reply Cancel reply

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts