SQL and PostgreSQL are perfect tools to analyze data. However, they can also be used to create sample data which has to possess certain statistical properties. One thing many people need quite often is a normal distribution. The main question therefore is: How can I create this kind of sample data?

Tablefunc: Creating normal distributions

The first thing you have to do is to enable the tablefunc extension, which is actually quite simple to do:

test=# CREATE EXTENSION tablefunc;
CREATE EXTENSION

“tablefunc” is there by default if “postgresql-contrib” has been installed. Once the module has been enabled the desired functions will already be there:

test=# \df *normal_rand*
                                         List of functions
 Schema |    Name     |    Result data type    |             Argument data types             | Type 
--------+-------------+------------------------+---------------------------------------------+------
 public | normal_rand | SETOF double precision | integer, double precision, double precision | func
(1 row)

 

The normal_rand function takes 3 parameters:

  • Number of rows to be generated
  • Average value
  • Desired standard deviation

If you want to run the function, you can simply put it into the FROM-clause and pass the desired
parameters to the function:

test=# SELECT row_number() OVER () AS id, x 
	FROM normal_rand(10, 5, 1) AS x 
	ORDER BY 1;
 id |         x          
----+--------------------
  1 |  4.332941804386073
  2 |  4.905662881624426
  3 |  3.661038976418651
  4 | 6.0087510163144415
  5 |  4.934066454147052
  6 |  5.909371874123449
  7 |  5.016528121699469
  8 |  4.640932937572484
  9 |  7.695984939477616
 10 |  5.647677569953539
(10 rows)

In this case 10 rows were created. The average value is 5 and the standard deviation has been set to 1. At first glance the data looks ok.

Testing your sample data

Let us test the function and see if it really does what it promises. 10 rows won’t be enough for that so I decided to repeat the test with more data:

test=# SELECT count(*), avg(x), stddev(x) 
	FROM normal_rand(1000000, 5, 1) AS x;
  count  |        avg        |      stddev       
---------+-------------------+-------------------
 1000000 | 5.000273593685213 | 1.000627143792761
(1 row)

Running the test with 1 million rows clearly shows that the output is perfect. The average value is very close to 5 and the same holds true for the standard deviation. You can therefore safely use the output to handle all your calculations.

Once you have a Gaussian distribution, you can nicely turn it into some other distribution of your choice or simply built on this data.

Finally …

If you want to know more about data, statistical distributions and so on you can check out one of our other posts about fraud detection.