Normalization by Osmosis

Part 2: Practical Examples of Data Normalization in PostgreSQL

Having gone through a theoretical introduction in part 1, it is now time to discuss some more practical examples. I’ll first talk about the…

type modeling

… of columns. It is important to realize that at the time Codd formulated the first three normal forms (see Part 1 for more details), the thinking about data types a database should support was shaped by the relatively primitive set of types available in commonly used programming languages. The relational model itself makes no assumptions about types, but the reality of using PostgreSQL today – with its fabulously rich set of available data types – is very, very different. Particularly once we notice how many of these types are considered simple, the question becomes which of them can really be considered…

atomic structures

…and when and why it is appropriate to see them as such.

For the most extreme example of such structures, consider simple textual data — namely the char, varchar and text types. In their most simplistic form (hello, C) they can be treated as arrays of bytes, accessible by offset. With a multibyte encoding like UTF-8, they become arrays of characters. Not very practical, but thanks to PostgreSQL’s rich set of string operators, it is still quite doable.
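
As a quick illustration (the values are made up, but the functions are standard PostgreSQL string operators), text really can be addressed by character offset:

    -- treating a text value as a sequence of characters, accessible by offset
    SELECT substring('Grüß Gott' from 2 for 3) AS slice,  -- 'rüß' (character offsets, not bytes)
           position('Gott' in 'Grüß Gott')     AS pos,    -- 6
           char_length('Grüß Gott')            AS chars,  -- 9 characters
           octet_length('Grüß Gott')           AS bytes;  -- 11 bytes in UTF-8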

On the level of standard features, a database offers the capability to perform equality and inequality comparisons and some other scalar operations, and to accelerate those operations with indices.

But PostgreSQL goes further:

textual data can be processed and constrained with pattern matching (e.g. regular expressions), and those operations can be accelerated by GIN indices built on trigrams. Fuzzy matching can be performed (with trigrams or Levenshtein distance), answering questions in a more subtle manner than a plain true/false. Domains can be defined over an ordinary textual type, giving it much more internal structure. In other words, the database starts to understand its data better.
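
Here is a minimal sketch of those ideas; the products table and the iso_country domain are invented for illustration, and pg_trgm has to be installed as an extension:

    -- fuzzy and pattern matching, accelerated by a trigram GIN index
    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    CREATE TABLE products (name text);
    CREATE INDEX products_name_trgm ON products USING gin (name gin_trgm_ops);

    SELECT name, similarity(name, 'postgress') AS score
    FROM products
    WHERE name % 'postgress'            -- similarity match, more subtle than true/false
    ORDER BY score DESC;

    -- a domain gives a plain textual type more internal structure
    CREATE DOMAIN iso_country AS text
        CHECK (VALUE ~ '^[A-Z]{2}$');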

Going even further, a full-text search mechanism can be defined over textual data. The array of characters then turns into information that has a language and discernible data items (email addresses, phone numbers, prices, words), and some of those items have additional structure that can be processed: a base form that can be derived from them, or synonyms linking them to other base forms.
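
A rough sketch of what that looks like in practice (the articles table is hypothetical; the functions and the GIN index are standard full-text search machinery):

    -- full-text search: an array of characters becomes language-aware information
    CREATE TABLE articles (id bigint GENERATED ALWAYS AS IDENTITY, body text);

    CREATE INDEX articles_fts ON articles
        USING gin (to_tsvector('english', body));

    SELECT id
    FROM articles
    WHERE to_tsvector('english', body) @@ websearch_to_tsquery('english', 'normal forms');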

Other “base” types can provide similarly extended capabilities — date, timestamp and interval, despite being simple scalars under the hood, are equipped with so many operators that their internal structure, intricate and irregular, can be both processed naturally and accessed easily.
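
A few self-contained examples of that richness (the values are arbitrary; the operators and functions are all built in):

    -- date/time scalars with intricate, irregular structure handled by built-in operators
    SELECT date '2024-01-31' + interval '1 month'     AS end_of_february, -- 2024-02-29, leap year handled
           extract(isodow FROM date '2024-02-29')     AS weekday,         -- 4 = Thursday
           age(date '2024-02-29', date '1996-02-29')  AS elapsed,         -- 28 years
           justify_interval(interval '36 hours')      AS tidy;            -- 1 day 12:00:00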

And then there are types that PostgreSQL openly defines as complex, and which thus violate 1NF outright: staring into the twilight zone with geometric types and range types defined over scalar types, then boldly moving (in historical order) into the realm of arrays, composite types, hstore, XML and JSON — with extremely rich sets of operators, operator classes and index methods associated with them, which allow you to process them efficiently and conveniently at the database level, without the need to move data back and forth into the application realm.
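
A small sketch with invented table and column names, combining a range type, an array and a jsonb document with some of their native operators and index support (btree_gist is an extension shipped with PostgreSQL):

    CREATE EXTENSION IF NOT EXISTS btree_gist;  -- lets plain scalars join ranges in a GiST index

    CREATE TABLE bookings (
        room    int,
        during  tstzrange,
        tags    text[],
        details jsonb,
        EXCLUDE USING gist (room WITH =, during WITH &&)  -- no overlapping bookings per room
    );

    SELECT *
    FROM bookings
    WHERE tags && ARRAY['vip']                -- array overlap
      AND details @> '{"catering": true}';    -- jsonb containment (GIN-indexable)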

At the end of the day, what counts is not whether data elements (fields) are really atomic, but whether they are at the same time…

opaque structures

…inaccessible from within a database. The very essence of an opaque structure (or file) is that the database kernel neither “understands” its content nor has any specific means of processing it. Two archetypal examples: First, a JPG image is stored in a bytea field. For the database kernel it is just a string of bytes. Second, a CSV file is stored in a text field. The database can sort the column according to text collation, can access words inside it using string processing functions, and can even apply full-text search. But the database has no “idea” that the field stores a set of records with syntax and meaning.
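
To make the contrast concrete, here is a hypothetical table holding both archetypes (the names are invented):

    -- a hypothetical table holding both archetypes
    CREATE TABLE attachments (photo bytea, csv_line text);

    SELECT octet_length(photo)            AS photo_bytes,  -- about all the kernel can say about a JPG
           string_to_array(csv_line, ',') AS cells         -- the text can be split, but the database
    FROM attachments;                                      -- still has no idea what the fields mean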

For PostgreSQL, formats like XML and JSON are certainly not opaque structures. By storing them as plain text, we are not using all of the server’s capabilities and are essentially wasting its potential, both for simplifying the application code and for achieving the best performance.
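
A brief sketch of the difference, using a hypothetical events table:

    -- the same document stored as plain text and natively as jsonb
    CREATE TABLE events (payload_text text, payload jsonb);

    -- as text, every access needs a cast, and malformed documents only blow up at query time
    SELECT (payload_text::jsonb) ->> 'type' FROM events;

    -- as jsonb, operators apply directly and a GIN index can accelerate containment queries
    CREATE INDEX events_payload_gin ON events USING gin (payload jsonb_path_ops);
    SELECT payload ->> 'type' FROM events WHERE payload @> '{"type": "signup"}';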

At any rate, it is important to distinguish between two use cases:

  • A complex type is used as an extension to a relational model and treated like an object through a well-defined interface. It certainly covers the use of geometric and range types, and to some degree the use of arrays, hstore, XML and JSON types — the latter on the condition that they are considered as documents with specific, distinguished properties.
    In such cases, all those types fit well into the normalization rules and will not significantly alter a relational data model.
  • A complex type is used as a replacement for a relational model. This is especially true when using arrays, hstore and JSON to create structures that could be modeled as tables and columns.
    In these cases, the relational principles are broken – and clearly 1NF cannot be applied. Whatever the rationale for doing so (and there is no doubt there are valid reasons, which I am going to discuss in the third part), this decision must be made consciously, keeping in mind that there is a price to pay:

    • exploiting the benefits of the built-in optimizer is much harder, if not impossible
    • data manipulation is more tedious
    • the access language is very different from SQL (e.g. XPath for XML, jsonpath for JSON), as the sketch after this list shows
    • schema enforcement is much harder
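
To illustrate that price, assume a hypothetical orders table in which line items live inside a jsonb document instead of a child table:

    -- a hypothetical orders table where line items replace a child table
    CREATE TABLE orders (
        id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        doc jsonb NOT NULL
    );

    -- relationally this would be roughly:
    --   SELECT order_id, sum(qty * price) FROM order_items GROUP BY order_id;
    -- with the document model, access shifts to set-returning functions and casts:
    SELECT o.id,
           sum((item ->> 'qty')::numeric * (item ->> 'price')::numeric) AS total
    FROM orders AS o
    CROSS JOIN LATERAL jsonb_array_elements(o.doc -> 'items') AS item
    GROUP BY o.id;

    -- schema enforcement also gets harder: nothing stops an item without a price,
    -- short of hand-written CHECK constraints or triggers.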

Hybrid data model

Building a hybrid data model that encompasses both relational and non-relational approaches is possible and well supported in PostgreSQL. I will discuss hybrid data models in the next part. In any case, even if a formal approach like normalization cannot be applied directly, it is important to remember what its objective is: to reduce data duplication and unintentional dependencies, and in this way reduce the opportunity for anomalies (inconsistencies and errors) to occur.

To end our discussion of practical examples of data normalization, I want to talk about a few typical situations…

when following the book makes your life harder

…in simple cases like deciding how to model structures on the level of a single (or a few) columns.

When trying to create a representation of various real-life objects in a database, it is often helpful to remember that…

People do not have primary keys

…and not only real people, for that matter. Things that in daily use might seem to have a clearly identifiable structure that everyone intuitively understands might not really need that structure from the modeling point of view:

  • It might be irrelevant for that particular application
  • It might not be generic enough to cover all actual and projected use cases in all circumstances.

And I am not talking about some requirements that might occur in the future — I am referring specifically to requirements that are known now to the designer but might be masked by assumptions and preconceptions.

Some of the ways in which this can manifest itself are:

  • introducing too much structure (or decomposing a whole into parts more than necessary)
  • introducing overly strict or unnecessary constraints, particularly with regard to the format of data
  • limiting a valid set of values by imposing insufficiently flexible dictionaries or enumerations

All of them have a particularly nasty habit of popping up when the set of values to be stored in the database turns out to be a bit more international than the designer expected.

A few examples:

  • Storing names of people (and titles or salutations) often tempts the designer to decompose them into separate fields for the given name and the family name, or worse, to add a separate field for a middle name (or initial) and a dictionary of allowed salutations. To add insult to injury, upper and lower length limits are imposed on all those fields, constrained to an allowed set of characters (like precluding whitespace).
    • What if a person always has two given names and two surnames (like in Spain or Portugal)?
    • What if the order is reversed (last, first — like in Hungary)?
    • What if for that particular person, the concept of a surname does not make sense at all (like in Iceland)?
    • and so on…
  • Storing addresses — same story, just a bit more complicated, because it is even more dependent on cultural and national factors and conventions than with personal names. Even when there is a reasonable probability that the model will contain mostly data from one geographical location, most often that does not imply exclusivity. Within one location, the assumptions as to what components are requisite, what are allowed and in what order can vary greatly. This is an area that contains a lot of unstructured data – and trying to force it into a rigid framework can easily backfire.
  • Storing dictionaries of categories for people or places that can be strongly personal, specific to a relatively narrow cultural circle, or highly dependent on one’s nationality or language. Examples of this are titles, salutations, or the type of place someone lives or works in. Or even the proverbial gender assignment limited to {M,F}.
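
For what it is worth, here is a deliberately minimal sketch of the opposite approach; the table and columns are purely illustrative:

    -- a deliberately minimal person record: one display name, structured detail only on demand
    CREATE TABLE person (
        id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        full_name    text NOT NULL,   -- however the person writes it, in any script or order
        name_details jsonb            -- optional decomposed parts, added only when a use case requires them
    );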

Theory vs practice

All of these measures, which in theory should aid in finding and modeling the data structures within the database, can in practice make data less useful and more prone to errors, require ever more elaborate data validation and sanitization rules, and in the end lead to a much less stable schema than anyone would like.

Yes, there are situations where modeling must be done precisely:

  • legal or compliance — when regulations applicable to a particular domain dictate how data must be stored
  • very specific applications — when there is an actual business requirement, openly expressed by the users and uses of the data, to decompose data in a particular way (like storing information needed to provide medical services, or processing full legal names)
  • compatibility (often backward compatibility) — when the need to exchange data with 3rd parties imposes the model structure (like exchanging information about localities with a local cadastre / land registry, or with an airline, where the information regarding gender really is mandatory and limited to {M,F})

However, none of those will make the use of the data easier or more natural.

So all the benefits provided by a well-normalized schema should be balanced against project practicalities and the designer’s ability to support their work; otherwise it can easily turn into an exercise in…

how to make developers (and often users) hate you

…with unmaintainable, underperforming code delivering an overall bad experience to all parties involved. Being able to answer how many users have Bettina as a second name, or to compute the distribution of family names with regard to the building floor people live on, might be an interesting possibility, but it is usually an exercise in futility (unless the model is supposed to support a national census application).

So before applying data normalization to a model, I would strongly recommend:

  • asking the question why? What is the purpose of storing this data and how is it going to be processed? The answer to that question often belongs not to the modeler, but to the author of an actual use case.
  • admitting that the personal experience and preconceived notions of a modeler might be worthless in a broader context, especially when modeling natural concepts and objects other people might have opinions about. This is not to undermine the value of years of practical experience, but to underline the importance of a fresh look and an open mind with every modeling task.
  • focusing on the core business need, being honest about naming an object and its properties, describing their true purposes and resisting the urge to collect as much detail as possible. Often less is more, and it leads to more maintainable data of better quality.
  • trying to under-model rather than overdo relational structures. Strive for simplicity and store as little as necessary to achieve the business purpose of the application — especially considering that, should the need arise to store more, or in more detail, PostgreSQL allows you to store semi-structured, non-relational data that can be made relational if necessary.

Now that I’ve given you an overview of practical examples of data normalization in PostgreSQL…

be sure to stay alert for the next installment of this blog, which will describe building a hybrid data model that encompasses both relational and non-relational approaches. Hybrid data modeling is not only possible but also well-supported in PostgreSQL.

<< Read Part 1 of Normalization by Osmosis: Data Normalization in PostgreSQL

For further reading about data normalization, see the CYBERTEC blogpost on primary keys and unique constraints.
