I once received a mail with question: Can you tell me why I can’t select the column `references`?


=# \d v_table_relation
View "public.v_table_relation"
Column      | Type                                | Collation | Nullable | Default
------------+-------------------------------------+-----------+----------+---------
schema      | information_schema.sql_identifier   |           |          |
table       | information_schema.sql_identifier   |           |          |
columns     | information_schema.sql_identifier[] |           |          |
references  | jsonb[]                             |           |          |

=# select * from v_table_relation ;
schema     | table      | columns         | references
-----------+------------+-----------------+--------------------------------------------------------------------------------------------------------------------
public     | a          | {b,c}           | {}
public     | a2         | {b,c}           | {"{\"toTable\": \"a\", \"toSchema\": \"public\", \"toColumns\": [\"b\", \"c\"], \"fromColumns\": [\"b\", \"c\"]}"}
workspace  | t_employee | {id,name,state} | {}
(3 rows)

=# select references from v_table_relation;
ERROR: syntax error at or near "references"
LINE 1: select references from v_table_relation;

 

Well, the quick answer will be: because REFERENCES is a keyword you should use double quotes around it, e.g.


=# select references from v_table_relation;

But why do you sometimes need to explicitly quote identifiers and sometimes it works without them? And how do you know what keywords exactly are used in the PostgreSQL? And what keywords may be used as object names? And so on…

As usual one may find the answers in the PostgreSQL manual:
www.postgresql.org/docs/current/sql-keywords-appendix.html

There we have detailed table that lists all tokens that are key words in the SQL standard and in PostgreSQL. From the manual you will know about reserved and non-reserved tokens, SQL standard compatibility and much more which is out of scope of this post.

There are two more ways to know what tokens can or cannot be used in certain situations. First is for true programmers, and the second is for true admins. Choose your pill, Neo.

Investigating the sources

PostgreSQL uses LALR(1)* parser to work with SQL statements. Implementation of the grammar (parser itself) may be found in the gram.y and implementation of the lexical analyzer (lexer) in the scan.l file.

Query first chewed to by lexer, which splits the text into tokens and sends them to the parser. It is logical to assume that lexer is the first who knows if the token is a keyword or not. Examining the source of it we find such lines:

<pre><code>
...
{identifier}	{
			const ScanKeyword *keyword;
			char	   *ident;
			SET_YYLLOC();

			/* Is it a keyword? */
			keyword = ScanKeywordLookup(yytext,
							yyextra->keywords,
							yyextra->num_keywords);
			if (keyword != NULL)
			{
				yylval->keyword = keyword->name;
				return keyword->value;
			}

			/*
			 * No.  Convert the identifier to lower case, and truncate
			 * if necessary.
			 */
			ident = downcase_truncate_identifier(yytext, yyleng, true);
			yylval->str = ident;
			return IDENT;
		}
...
</code></pre>

So, the lexer knows if the identifier is a keyword, but this doesn’t give us much, since during lexical analyzis we’re lacking context. And that is what parser takes care of.

Our journey started from simple SELECT statement producing error, so let’s try to examine gram.y from the top to bottom. Our route will be like this:

stmtblock -> stmtmulti -> stmt -> CreateStmt

Here we see that our columns are listed in a OptTableElementList node, so let’s dig it:

OptTableElementList -> TableElementList -> TableElement -> columnDef -> ColId

Bamn! We found the right place!


/*
 * Name classification hierarchy.
 *
 * IDENT is the lexeme returned by the lexer for identifiers that match
 * no known keyword.  In most cases, we can accept certain keywords as
 * names, not only IDENTs.  We prefer to accept as many such keywords
 * as possible to minimize the impact of "reserved words" on programmers.
 * So, we divide names into several possible classes.  The classification
 * is chosen in part to make keywords acceptable as names wherever possibl
 */
/* Column identifier --- names that can be column, table, etc names.
 */
ColId:      IDENT                                   { $$ = $1; }
            | unreserved_keyword                    { $$ = pstrdup($1); }
            | col_name_keyword                      { $$ = pstrdup($1); }
        ;

By the way, I’m absolutely sure PostgreSQL sources are the best academic reading in the world. From here we see that we want to accept as much “reserved” keywords as possible, but this produces difficulties in different places causing grammar conflicts (reduce\reduce mostly, because shift\reduce are solved automatically). Thus, to prevent these errors developers divided keywords into several classes, e.g. “unreserved_keyword” and “col_name_keyword”. And only one step left: to find these classes. Piece of cake! Let’s check “unreserved_keyword” node:

...
/*
 * Keyword category lists.  Generally, every keyword present in
 * the Postgres grammar should appear in exactly one of these lists.
 *
 * Put a new keyword into the first list that it can go into without causing
 * shift or reduce conflicts.  The earlier lists define "less reserved"
 * categories of keywords.
 *
 * Make sure that each keyword's category in kwlist.h matches where
 * it is listed here.  (Someday we may be able to generate these lists and
 * kwlist.h's table from a common master list.)
 */

/* "Unreserved" keywords --- available for use as any kind of name.
 */
unreserved_keyword:
			  ABORT_P
...

/* Column identifier --- keywords that can be column, table, etc names.
 *
 * Many of these keywords will in fact be recognized as type or function
 * names too; but they have special productions for the purpose, and so
 * can't be treated as "generic" type or function names.
 *
 * The type names appearing here are not usable as function names
 * because they can be followed by '(' in typename productions, which
 * looks too much like a function call for an LR(1) parser.
 */
col_name_keyword:
			  BETWEEN
...

/* Type/function identifier --- keywords that can be type or function names.
 *
 * Most of these are keywords that are used as operators in expressions;
 * in general such keywords can't be column names because they would be
 * ambiguous with variables, but they are unambiguous as function identifiers.
 *
 * Do not include POSITION, SUBSTRING, etc here since they have explicit
 * productions in a_expr to support the goofy SQL9x argument syntax.
 * - thomas 2000-11-28
 */
type_func_name_keyword:
			  AUTHORIZATION
...

/* Reserved keyword --- these keywords are usable only as a ColLabel.
 *
 * Keywords appear here if they could not be distinguished from variable,
 * type, or function names in some contexts.  Don't put things here unless
 * forced to.
 */
reserved_keyword:
			  ALL
...
		;

%%

Asking the server

There is one more way knowing how restrictive keyword usage is: execute query

SELECT * FROM pg_get_keywords();

Function pg_get_keywords() returns list of SQL keywords and their categories:

The catcode column contains a category code: U for unreserved, C for column name, T for type or function name, or R for reserved. The catdesc column contains a possibly-localized string describing the category.