SQL and PostgreSQL are perfect tools to analyze data. However, they can also be used to create sample data which has to possess certain statistical properties. One thing many people need quite often is a normal distribution. The main question therefore is: How can I create this kind of sample data?
Table of Contents
The first thing you have to do is to enable the tablefunc extension, which is actually quite simple to do:
1 2 |
test=# CREATE EXTENSION tablefunc; CREATE EXTENSION |
“tablefunc” is there by default if “postgresql-contrib” has been installed. Once the module has been enabled the desired functions will already be there:
1 2 3 4 5 6 |
test=# df *normal_rand* List of functions Schema | Name | Result data type | Argument data types | Type --------+-------------+------------------------+---------------------------------------------+------ public | normal_rand | SETOF double precision | integer, double precision, double precision | func (1 row) |
The normal_rand function takes 3 parameters:
If you want to run the function, you can simply put it into the FROM-clause and pass the desired
parameters to the function:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
test=# SELECT row_number() OVER () AS id, x FROM normal_rand(10, 5, 1) AS x ORDER BY 1; id | x ----+-------------------- 1 | 4.332941804386073 2 | 4.905662881624426 3 | 3.661038976418651 4 | 6.0087510163144415 5 | 4.934066454147052 6 | 5.909371874123449 7 | 5.016528121699469 8 | 4.640932937572484 9 | 7.695984939477616 10 | 5.647677569953539 (10 rows) |
In this case 10 rows were created. The average value is 5 and the standard deviation has been set to 1. At first glance the data looks ok.
Let us test the function and see if it really does what it promises. 10 rows won't be enough for that so I decided to repeat the test with more data:
1 2 3 4 5 6 |
test=# SELECT count(*), avg(x), stddev(x) FROM normal_rand(1000000, 5, 1) AS x; count | avg | stddev ---------+-------------------+------------------- 1000000 | 5.000273593685213 | 1.000627143792761 (1 row) |
Running the test with 1 million rows clearly shows that the output is perfect. The average value is very close to 5 and the same holds true for the standard deviation. You can therefore safely use the output to handle all your calculations.
Once you have a Gaussian distribution, you can nicely turn it into some other distribution of your choice or simply built on this data.
If you want to know more about data, statistical distributions and so on you can check out one of our other posts about fraud detection.
+43 (0) 2622 93022-0
office@cybertec.at
You are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Leave a Reply