Normalization by Osmosis

Part 2: Practical Examples of Data Normalization in PostgreSQL

Having gone through a theoretical introduction in part 1, it is now time to discuss some more practical examples. I'll first talk about the…

type modeling

… of columns. It is important to realize that during the time Codd formulated the base 3 normal forms (see Part 1 for more details), the thinking about data types supported by a database was formed by a relatively primitive set of types available in commonly used programming languages. The model by itself does not make any assumptions about types, but the reality of using PostgreSQL today - with its fabulously rich set of available data types - is very, very different. Particularly if we realize how many of these types are considered simple, which of them can really be considered…

atomic structures

…and, when and why is it appropriate to see them as such?

For the most extreme example of such structures, we can consider simple textual data — namely char, varchar and text types. In its most simplistic form (hello C) they can be considered arrays of bytes, accessible by offset. If we consider multibyte encoding like UTF-8, it becomes an array of characters. Not very practical, but thanks to a rich set of operators in PostgreSQL, it is still quite doable.

On the level of standard features, a database offers capabilities for performing equal/unequal comparisons and some other scalar operations, and for accelerating those operations with indices.

But PostgreSQL goes further:

textual data can be processed and constrained with the use of pattern matching (e.g. regular expressions) and those operations can be accelerated by GIN indices using trigrams. Fuzzy matching can be performed (with trigrams or Levenshtein distance) answering questions in a more subtle manner than true/false. Domains can be defined over a typical textual type that gives this type much more internal structure. That means the database starts to understand its data more.

Going even further, a full-text search mechanism can be defined over textual data, and then the array of characters turned into information that has some language, some discernible data items in it (email addresses, phone numbers, prices, words), and some of those items can have an additional structure that can be processed: a base form that can be derived from them, or synonyms to other base forms.

Other “base” types can provide similarly extended capabilities — date, timestamp and interval, despite being simple scalars under the hood, are equipped with so many operators that their internal structure, intricate and irregular, can be both processed naturally and accessed easily.

And then there are types openly defined by PostgreSQL as complex, and thus violating 1NF outright: staring into the twilight zone with geometric types and then range types defined over scalar types, and them boldly moving (in historical order) into the realm of arrays, complex types, hstore, XML and JSON — with extremely rich sets of operators, operator classes and index methods associated with them, which allow you to process them efficiently and conveniently on the database level, without the necessity of moving data back and forth into the application realm.

At the end of the day, what counts is not whether or not data elements (fields) are really atomic, but if they are at the same time…

opaque structures

…inaccessible from within a database. The very essence of an opaque structure (or file) is that the database kernel neither “understands” its content nor does it have any specific means of processing it. Two archetypical examples: First, a JPG image is stored in a bytea field. For the database kernel it is just a string of bytes. Second, a CSV file is stored in a text field. The database can sort the column according to text collation, can access words inside it using string processing functions, and can even apply a full-text search. But the database has no “idea” that the field stores a set of records with syntax and meaning.

For PostgreSQL, in all the cases like XML and JSON, they are certainly not opaque structures. And by storing those as text, we are not using all of the server’s capabilities, essentially wasting its potential: for simplifying the application code, and for achieving the best performance.

At any rate, it is important to distinguish between two use cases:

A complex type is used as an extension to a relational model and treated like an object through a well-defined interface. It certainly covers the use of geometric and range types, and to some degree the use of arrays, hstore, XML and JSON types — the latter on the condition that they are considered as documents with specific, distinguished properties.
In such cases, all those types fit well into the normalization rules and will not significantly alter a relational data model.
A complex type is used as a replacement for a relational model. This is especially true when using arrays, hstore and JSON to create structures that could be modeled as tables and columns.
In these cases, the relational principles are broken - and clearly 1NF cannot be applied. Whatever is the rationale for doing so (and there is no doubt there are valid reasons I am going to discuss in the third part) this decision must be made consciously, keeping in mind there is a price to pay:
- exploiting benefits of a built-in optimiser is much harder if not impossible
- data manipulation is more tedious
- the access language is very different to SQL (e.g. XPath for XML, jsonpath for JSON)
- schema enforcement is much harder

Hybrid data model

Building a hybrid data model which encompasses both a relational and non-relational approach is possible and well supported in PostgreSQL. I will discuss hybrid data models in the next part. In any case, even if a formal approach like normalization cannot be directly applied, it is important to remember what its objective is: to reduce data duplication and unintentional dependencies, in this way reducing the opportunity for anomalies (inconsistencies and errors) to occur.

To end our discussion of practical examples of data normalization, I want to talk about a few typical situations…

when following the book makes your life harder

…in simple cases like deciding how to model structures on the level of a single (or a few) columns.

When trying to create a representation of various real-life objects in a database, it is often helpful to remember that…

People do not have primary keys

…and not only real people for that matter. Things that in daily use might seem to have a clearly identifiable structure that everyone intuitively understands might not really need that structure from the modeling point of view :

It might be irrelevant for that particular application
It might not be generic enough to cover all actual and projected use cases in all circumstances.

And I am not talking about some requirements that might occur in the future — I am referring specifically to requirements that are known now to the designer but might be masked by assumptions and preconceptions.

Some issues when this can manifest itself are:

introducing too much structure (or decomposing a whole into parts more than necessary)
introducing overly strict or unnecessary constraints, particularly with regard to the format of data
limiting a valid set of values by imposing insufficiently flexible dictionaries or enumerations

All of them have a particularly nasty habit of popping up, especially when the set of values to be stored in the database is a bit more international than a designer expected.

A few examples:

Storing names of people (and titles or salutations) often tempts the designer to perform decomposition into separate fields for the given name and family name. Or worse, adding a separate field for a middle name (or initial) and a dictionary for allowed salutations. To add insult to injury — impose upper and lower limits of length to all those fields and constrain them with an allowed set of characters (like precluding whitespaces).
- What if a person always has two names and two surnames (like in Spain or Portugal)
- What if the order is reversed (last, first — like in Hungary)
- What if for that particular person, the concept of a surname does not make sense at all (like in Iceland)?
- and so on…

Storing addresses — same story, just a bit more complicated, because it is even more dependent on cultural and national factors and conventions than with personal names. Even when there is a reasonable probability that the model will contain mostly data from one geographical location, most often that does not imply exclusivity. Within one location, the assumptions as to what components are requisite, what are allowed and in what order can vary greatly. This is an area that contains a lot of unstructured data - and trying to force it into a rigid framework can easily backfire.
Storing dictionaries of categories for people or places that can be strongly personal, specific to a relatively narrow cultural circle, or highly dependent on one’s nationality or language. Examples of this are titles, salutations, or the type of place someone lives or works in. Or even the proverbial gender assignment limited to {M,F}

Theory vs practice

All of these situations, which in theory would aid in finding and modeling the data structures within the database, can in practice make data less useful, more prone to contain errors, impose lots of work in ever more elaborate data validation and sanitation rules, and in the end lead to a much less stable schema than is to anyone’s liking.

Yes, there are situations where modeling must be done precisely:

legal or compliance — when particular applications impose regulations on how data must be stored
very specific applications — when there is an actual business requirement, openly expressed by data users and data uses to decompose data in a particular way (like storing information to provide medical services or requiring processing full legal names)
compatibility (often backward compatibility) — when the need to exchange data with 3rd parties impose the model structure (like exchanging information regarding localities with a local cadaster / land registry or with an airline, when the info regarding gender is really mandatory and limited to {M,F} )

However, neither of those will make the use of data easier or more natural.

So all the benefits provided by a well-normalized schema should be balanced against project practicalities and the designer’s ability to support his work, otherwise, it can easily turn into an exercise in…

how to make developers (and often users) hate you

…with unmaintainable, underperforming code delivering an overall bad experience to all parties involved. Because while being able to answer how many users have a second name Bettina or compute the distribution of family names with regard to which building floor they live on might be an interesting possibility, it is often an exercise in futility (unless the model is supposed to support a national census application).

So before applying data normalization to a model, I would strongly recommend:

asking the question why? What is the purpose of this data being stored and how are they going to be processed? And the answer to that question often belongs not to the modeler, but to the author of an actual use case.
admitting that personal experiences and preconceived notions of a modeller might be worthless in a broader context, especially in modelling natural concepts and objects other people might have opinions about.. This is not to undermine the value of years of practical experience, but to underline the importance of a fresh look and open mind with every modeling task.
focusing on a core business need, being honest about naming an object and its properties, describing their true purposes and resisting the urge to collect as much detail as possible. Often less is more and it leads to more maintainable data of better quality.
trying to rather under-model than overdo relational structures. Strive for simplicity and store as little as necessary to achieve the business purpose of an application — especially considering that, should the need arise to store more or in more detail, PostgreSQL allows you to store semi-structured, non-relational data that can be made relational, if necessary..

Now that I’ve given you an overview of practical examples of data normalization in PostgreSQL...

be sure to stay alert for the next installment of this blog, which will describe building a hybrid data model which encompasses both a relational and non-relational approach. Hybrid data modeling is not only possible but also well-supported in PostgreSQL.

<< Read Part 1 of Normalization by Osmosis: Data Normalization in PostgreSQL

For further reading about data normalization, see the CYBERTEC blogpost on primary keys and unique constraints.

Normalization by Osmosis

Part 1: A Guide to Data Normalization in PostgreSQL

By Michał Małecki

Introduction

Although I installed my first PostgreSQL database on RedHat Linux (not RHEL) 6.0 around 1999, I probably limited my interaction with this 6.5 release to starting the server with SysinitV scripts. PostgreSQL was at that time an underdog to MySQL (or so I thought) so it was cool. At that time I knew nothing about SQL (except the name), databases or data modelling.

Not long after that, I found myself a student of a big technical university - and to my surprise, I was hired by the university IT centre to help with the deployment of a brand-new student records management system being developed internally. Suddenly, without any magical improvement in my knowledge and skills, I became an Oracle DBA. You’ll learn on the job, they said. Indeed, on the job I learned that Big Red is a thing (or rather — a beast).

I served my days as best as I could, sitting between two huge ERD diagrams (A0 size) I printed out and meticulously glued together from smaller sheets. I still was not much smarter about data modelling, but since the system was developed by the university’s most experienced academic staff, it was supposed to be designed correctly. And correctly designed it was.

Without any formal training and proper tooling...

(no money for anything more that 70 or so named user licenses — Win SQLPlus was all I had) I was trying to find my way through the maze of tables, views and foreign keys. I found myself writing ad-hoc reports and queries, correcting operators’ mistakes, and so on. Over the years, as if by some magic influence, the ancient knowledge of how to properly decompose and normalize data diffused into my mind, my thinking and my reflexes from those two huge diagrams, as if by osmosis.

Since then, whenever I see a database schema, I can immediately say whether or not it feels right. Much later on, I learned the reasons behind those feelings.

And yes, the database was designed in (mostly) Codd’s 3rd normal form.

So… what am I talking about?

The concept of…

normal forms

…is older than SQL, or practical relational database software, for that matter. It was introduced by F. Edgar Codd @ IBM in his seminal work in 1970. That work laid the foundation for all relational databases we use today, including PostgreSQL.

Unfortunately, despite the fact that these concepts have very practical applications, they are somewhat academic in nature and origin; in order to talk about them, I need to introduce some theory and define some terms.

Note: The following few paragraphs are not specific to PostgreSQL, so you can apply them to any relational database. Also, if you already know all this like the back of your hand, feel free to skip ahead as much as is comfortable for you.

The important thing to remember is: The objective of introducing the concept of normal forms was to avoid…

data anomalies

…by means of the elimination of all possible data duplication. That’s what it is really all about:

avoid ambiguity (update anomaly)
avoid loss of information (delete anomaly)
avoid incompleteness (insert anomaly)

How do anomalies look in practice? Let’s imagine we are back in the early 1970’s, and our thriving paperback romance bookstore uses an application that keeps all data in one set of records (e.g., in an ISAM file):

Author	Origin	Title	Pages	Price
Rommy la Dequell	Altdorf, HU	Power of silence	123	$0.99
Angie Nadajoskova	Bratislava, SK	Irritated soul	234	$23.99
Rommy la Dequell	Altdorf, AT	Desire to suffer	456	$1.99

And we need to add another new author we signed up, but do not yet have details about the new thriller-romance he is writing:

Michael Birg

Bludenz, AT

$4.99

So we can see the following anomalies:

First, our best selling author of two books (record 1 & 3) has an inconsistency: Is Altdorf in Hungary or in Austria? Or has the author moved? Second, when the book “Irritated Soul” ran out of print, while removing it from the record we also lost all the information about the author - that we might very well want to keep (record 2). And finally, the new record cannot be added — it just fails to deliver the essential information. We cannot use it as it is.

These problems might seem naive, or even ridiculous today, now that we are all accustomed to the fantastic capabilities of PostgreSQL and other RDBMSes. But remember, this is 50 years ago when all joins were programmed by hand, and more importantly, remember that those anomalies can still appear today, although in a less obvious form.

As I do not expect anyone to have the same knowledge diffusion experience I have described above, there are a few…

basic concepts

…and terms I need to clarify first:

We have a record (also called a tuple or row) consisting of named and typed fields (also called attributes or columns). A collection of identically typed rows is a table (also known as a relation). Fields in a record or records in a table are not guaranteed to have any particular order. So far so good; all users of any relational database should already be familiar with these terms.

Now about…

keys

A trivial way to identify any row in a table is to create a key that is a combination of all fields: this is the trivial superkey (values which are part of a key are underlined below)

Author	Origin	Title	Pages	Price
Rommy la Dequell	Altdorf, HU	Power of silence	123	$0.99

It is usually possible to create superkeys that can still identify rows, but do not contain all fields.

Author	Origin	Title	Pages	Price
Rommy la Dequell	Altdorf, HU	Power of silence	123	$0.99

Author	Origin	Title	Pages	Price
Rommy la Dequell	Altdorf, HU	Power of silence	123	$0.99

If a column belongs to a superkey,

...it is a prime attribute (or key attributes). If it does not, it is a non-prime attribute (non-key attribute). This distinction is important in describing what can depend on what. E,g, in the examples above, the numbers of pages or the prices may be the same for other books; there is no functional dependency.

If it is no longer possible to remove a field from a superkey without losing its ability to identify a row, we have come to a candidate key — there may be more than one such combination of fields.

Author	Origin	Title	Pages	Price
Rommy la Dequell	Altdorf, HU	Power of silence	123	$0.99

Author	Origin	Title	Pages	Price
Rommy la Dequell	Altdorf, HU	Power of silence	123	$0.99

To reliably access any particular row and enforce uniqueness, one of the candidate keys must be selected as the primary key — it consists of a designated combination of one or more columns (when there are multiple columns it is a compound key), and their combined contents must be unique and non empty.

Author	Origin	Title	Pages	Price
Rommy la Dequell	Altdorf, HU	Power of silence	123	$0.99

Keys of the record can be natural — when they are derived from natural, inherent attributes of a real live thing described by the record — or surrogate. A natural key comes straight out of one or more than one existing columns; an example would be a social security number. A surrogate key has its own, separate column and is artificially created due to the need to uniquely identify each record. Surrogate keys have their benefits: they can be easily generated, cheaply stored and sorted (bigint). If necessary, they can have guaranteed global uniqueness (UUID). This type of key is often automatically used when a schema is autogenerated by some persistence framework or ORM.

However, surrogate keys do have one big disadvantage:

they do not say anything about the actual properties of whatever real-world object the row describes. As a consequence, they cannot tell us anything about it; in particular, they can’t identify its uniqueness. As such, the only uniqueness a surrogate key can guarantee is their own one. And that means that in practice, duplicates can occur; the very thing normalization is trying to avoid.

On the other hand, since in the real world, objects are identified rather by a combination of features and not by a single one, natural primary keys are usually compound — that makes them rather inconvenient to use in practice. Also, repeating them in other tables as foreign keys leads to data repetition in the form of multiple rows and tables containing the same (possibly long) values over and over again which is just an inefficient use of space.

Since we haven't even started normalizing, and yet the use of a surrogate key in the role of primary key is common practice, here is an example:

Let’s create a table describing people, and let’s include a sufficient variety of attributes to capture their uniqueness: name, surname, place and date of birth, height, eye-colour, net worth, favourite dish, pet name and so on. Let’s decide on the combination of these attributes that will for our use case uniquely identify each person. Now, that exact combination will be our compound primary key — so that the database can assure only one record with exactly these values of selected attributes could exist in a table.

However, if we just decide to give such a person a number and make it a primary key, nothing will prevent us from having multiple identical records — for all intents and purposes, describing exactly the same person — differing just by the meaningless value of that one artificially assigned number (row 1 & 4) or worse, containing small differences caused by typos or mistakes (row 3). Note however, that a typo made in row 2 would not have been prevented anyway, since it happened in a column which was originally part of a primary key.

Id	Author	Origin	Title	Pages	Price
1	Rommy la Dequell	Altdorf, HU	Power of silence	123	$0.99
2	Rommy la Dequell	Altdorf, AT	Power of silence	123	$0.99
3	Rommy la Dequel	Altdorf, HU	Power of silence	123	$0.99
4	Rommy la Dequell	Altdorf, HU	Power of silence	123	$0.99

Note: To cope with the issue of assigning uniqueness and yet still help the database do its job in an efficient manner, the primary key should always be natural, compound if necessary to assure uniqueness. To conveniently refer to the table rows, however, create one additional surrogate key and make it unique. Then we will separate the concern of assuring row uniqueness (server just in one place by primary key) from the concern of providing a way to conveniently refer to that row from other tables (server by column with just unique constraint). That will suffice. See the CYBERTEC blog on unique constraints for more info.

The last concept I need to introduce is the functional dependency between attributes. An attribute is functionally dependent on another attribute if its value always appears with the particular value of an attribute it is dependent on. This dependency is unidirectional. You can imagine it as a function or as a mapping table, unequivocally transforming one value into another.

Trivial example would be to have two columns expressing some physical property (e.g. weight), one metric, the other – imperial. The value of one is functionally dependent on the other, since the conversion formula is well known.

Less trivial example would be to assign a less precise property (e.g. category) based on some more specific property (e.g. price or number of pages). There is somewhere (usually in application) a precise formula to derive first from the second. If it is applied consistently and without exceptions — we have a functional dependency. That dependency can be encoded in an additional table in relation form and then calculated upon each query or in a view. Note however, that if there is a business intention to have exceptions (that is to assign category in an arbitrary way — even if only incidentally) there is no strict functional dependency and it needs to be reflected in schema design.

Back to normal forms —some are mostly…

relevant

…to daily practice, but with the constant evolution of database software capabilities (especially PostgreSQL) the division “relevant/ not relevant” should be taken with a grain of salt.

Each normal form has a precise requirement that must be satisfied, and builds upon the requirement expressed in the previous form, so let’s talk about them in order:

0NF — aka NNF (Not Normalized Form) requires no additional assurances except that a record has a primary key that uniquely identifies it.

1NF — requires that no field has a table as its value. The original definition was even more strict: that a field value is atomic (that is — not compound). As it is with current RDBMSes, and especially PostgreSQL this presents a rather challenging requirement; let’s stick with the «no table as value» form and discuss what atomic, compound and opaque mean later.

- non-compliant

Person

Born in

Born on

Mother’s name

Born in Land

Mother born in

Mother born on

Lived in

Anna Gruber

Klagenfurt

12.10.1967

Angela Bauer

Kärnten

Salzburg

30.02.1948

Wien

Linz

Betty Huber

Bregenz

14.05.1989

Regina Wagner

Vorarlberg

Graz

22.11.1970

Linz

Salzburg

- compliant — composite field values (lived in) are moved to a separate table

Person	Born in	Born on	Born in Land	Mother’s name	Mother born in	Mother born on
Anna Gruber	Klagenfurt	12.10.1967	Kärnten	Angela Bauer	Salzburg	30.02.1948
Betty Huber	Bregenz	14.05.1989	Vorarlberg	Regina Wagner	Graz	22.11.1970

Person	Born in	Born on	Lived in
Anna Gruber	Klagenfurt	12.10.1967	Wien
Anna Gruber	Klagenfurt	12.10.1967	Linz
Betty Huber	Bregenz	14.05.1989	Linz
Betty Huber	Bregenz	14.05.1989	Salzburg

compliant with unique surrogate key (Person ID)

Person	Born in	Born on	Person ID	Born in Land	Mother’s name	Mother born in	Mother born on
Anna Gruber	Klagenfurt	12.10.1967	1	Kärnten	Angela Bauer	Salzburg	30.02.1948
Betty Huber	Bregenz	14.05.1989	2	Vorarlberg	Regina Wagner	Graz	22.11.1970

Person ID	Lived in
1	Wien
1	Linz
2	Linz
2	Salzburg

2NF — requires that (for a compound primary key) no non-key attribute depends just on part of the primary key. In other words, non-key attributes must depend on the whole primary key. If such a partial dependency is identified, the table must be split, and both dependent attribute(s) and part of the primary key it depends upon moved to a separate table.

- non-compliant

Person	Born in	Born on	Person ID	Born in Land	Mother’s name	Mother born in	Mother born on
Anna Gruber	Klagenfurt	12.10.1967	1	Kärnten	Angela Bauer	Salzburg	30.02.1948
Betty Huber	Bregenz	14.05.1989	2	Vorarlberg	Regina Wagner	Graz	22.11.1970

compliant — fields (born in land) dependent on part of the primary key (born in) are moved to a separate table

Person	Born in	Born on	Person ID	Mother’s name	Mother born in	Mother born on
Anna Gruber	Klagenfurt	12.10.1967	1	Angela Bauer	Salzburg	30.02.1948
Betty Huber	Bregenz	14.05.1989	2	Regina Wagner	Graz	22.11.1970

This assumes that for the cities table below, the city name and only the city name is a primary key (which usually is not the case in the real world).

City	Land
Klagenfurt	Kärnten
Bregenz	Vorarlberg

3NF — requires that there is no transitive dependency (through another field depending directly on the primary key) on the primary key. This means that if the values of two or more fields consistently appear together, they should be moved to a separate table, and one of them should be made a key, so that information repetition is avoided.

- non-compliant

Person	Born in	Born on	Person ID	Mother’s name	Mother born in	Mother born on
Anna Gruber	Klagenfurt	12.10.1967	1	Angela Bauer	Salzburg	30.02.1948
Betty Huber	Bregenz	14.05.1989	2	Regina Wagner	Graz	22.11.1970

compliant — fields (mother born in & mother born on) transitively dependent (through mother’s name) on the primary key (person, born in & born on) are moved to separate rows. This is possible because “mother” is also a “person”. Should this be a different object type, it will be moved to a separate table.

Person	Born in	Born on	Person ID	Mother ID
Anna Gruber	Klagenfurt	12.10.1967	1	3
Betty Huber	Bregenz	14.05.1989	2	4
Angela Bauer	Salzburg	30.02.1948	3	«null»
Regina Wagner	Graz	22.11.1970	4	«null»

BCNF — Boyce-Codd Normal Form aka 3.5NF is just a bit of a stronger version of 3NF; often, a table in 3NF is also compliant with BCNF.
It requires that no part of a primary key is functionally dependent on a non-key attribute. The difference is quite subtle — till now I have been talking about non-key attributes which are dependent on key attributes. Now it is the opposite way round.

- non-compliant — Born in is part of the primary key and functionally depends on non-key attribute Birth Post Code

Person	Born in	Born on	Person ID	Mother ID	Birth Post Code
Anna Gruber	Klagenfurt	12.10.1967	1	3	9073
Betty Huber	Bregenz	14.05.1989	2	4	6900
Angela Bauer	Salzburg	30.02.1948	3	«null»	5020
Regina Wagner	Graz	22.11.1970	4	«null»	8032

compliant — attribute "Birth Post Code" replaces "Born in" in the primary key. Also, the structure of the Cities table is changed (and is much more realistic). Note that with this change, the data are correctly normalized, but less readable.

Person	Birth Post Code	Born on	Person ID	Mother ID
Anna Gruber	9073	12.10.1967	1	3
Betty Huber	6900	14.05.1989	2	4
Angela Bauer	5020	30.02.1948	3	«null»
Regina Wagner	8032	22.11.1970	4	«null»

Post Code	City	Land
9073	Klagenfurt	Kärnten
6900	Bregenz	Vorarlberg
5020	Salzburg	Salzburg
8032	Graz	Steiermark

This is usually where information about normal forms ends, particularly because the remaining normal forms are the ones any practitioner would usually not be able to quote from memory, and are generally considered…

not-so relevant

…for practical applications. This however is not quite true as some are currently — even non-deliberately — applied in day-to-day design practice and it is a good thing.
These forms were introduced after the original Edgar F. Codd work. Tables violating them actually can happen, and should be taken into account. I have given you examples below:

4NF — requires that there is no multivalued dependency on non-super key attributes

- non-compliant — this table has only one key (no non-primary attributes), but assuming that each trainer is able to deliver all his courses in every language he can speak, adding a new course would require inserting multiple rows. Note however, that if this assumption is not true (e.g. some trainers do deliver only specific courses in particular languages) we cannot stop here and must go one step further to 5NF.

Trainer	Course	Language
Michal	PostgreSQL Professional	EN
Michal	PostgreSQL Professional	PL
Pavlo	PostgreSQL Professional	EN
Pavlo	PostgreSQL Professional	SK
Michal	Administration & Performance Tuning	EN
Michal	Administration & Performance Tuning	PL
Hans	Administration & Performance Tuning	EN
Hans	Administration & Performance Tuning	DE
Hans	PostgreSQL Professional	EN
Hans	PostgreSQL Professional	DE

compliant — table is split in two

Trainer	Course
Michal	PostgreSQL Professional
Michal	Administration & Performance Tuning
Pavlo	PostgreSQL Professional
Hans	PostgreSQL Professional
Hans	Administration & Performance Tuning

Trainer	Language
Michal	EN
Michal	PL
Pavlo	EN
Pavlo	SK
Hans	EN
Hans	DE

5NF — requires that there are no joint dependencies that do not have only superkey components. In other words a table is in 5NF if it cannot be split (reduced) further into smaller tables that have different (smaller) key without losing information.

- non-compliant — let’s use the table from the example above, but insert additional data: Pavlo started to offer an additional course, but does not have all materials in all languages yet

Trainer	Course	Language
Michal	PostgreSQL Professional	EN
Michal	PostgreSQL Professional	PL
Pavlo	PostgreSQL Professional	EN
Pavlo	PostgreSQL Professional	SK
Pavlo	Administration & Performance Tuning	EN
Michal	Administration & Performance Tuning	EN
Michal	Administration & Performance Tuning	PL
Hans	Administration & Performance Tuning	EN
Hans	Administration & Performance Tuning	DE
Hans	PostgreSQL Professional	EN
Hans	PostgreSQL Professional	DE

compliant — one additional table must be introduced.

Trainer	Course
Michal	PostgreSQL Professional
Michal	Administration & Performance Tuning
Pavlo	PostgreSQL Professional
Hans	PostgreSQL Professional
Hans	Administration & Performance Tuning

Trainer	Language
Michal	EN
Michal	PL
Pavlo	EN
Pavlo	SK
Hans	EN
Hans	DE

Course	Language
PostgreSQL Professional	EN
PostgreSQL Professional	DE
PostgreSQL Professional	PL
PostgreSQL Professional	SK
Administration & Performance Tuning	EN
Administration & Performance Tuning	DE
Administration & Performance Tuning	PL

DKNF — requires that there are no constraints other than those expressed by domain or key constraints. In other words, constraints must be precisely defined, there is no place for an arbitrary or fuzzy choice for constrained values.

- non-compliant — the table does not contain any precise indication about how to select values for the columns Length and Pace

Course	Hours	Days	Length	Pace
PostgreSQL Professional	15	3	normal	standard
Administration & Performance Tuning	25	5	long	standard
High Availability & Patroni	24	3	normal	intense

compliant — the values for Length and Pace are precisely constrained, note that in order to determine the pace of the course, an additional calculation must be performed

Course	Hours	Days
PostgreSQL Professional	15	3
Administration & Performance Tuning	25	5
High Availability & Patroni	24	3

Length	min days	max days
short	1	1
normal	2	3
long	4	5

Pace	min hours per day	max hours per day
introductory	3	4
standard	5	5
intense	6	8

6NF — requires that a row must not contain more than one non-primary attribute in addition to a primary key. As impractical as this normal form might seem, it has its applications in e.g. data warehouses and very sparse schemas, but it is most efficient in so-called columnar storage engines. For typical OLTP, 6NF is not very practical and is too close for comfort to the dreaded EAV antipattern.

Now that you are familiar with data normalization in PostgreSQL and normal forms, keep alert for my next blog on the practical usage of these forms. With a little practice, you'll soon have an intutitve feeling for which form belongs to what situation. Or, you can try printing out this blog and posting it on the wall behind your computer, if you too want to try out "Normalization by Osmosis!"

To continue reading, see Part 2 of this series, Practical Examples of Data Normalization.

You may also be interested in this CYBERTEC blog on auto-generating primary keys.