Logical Structure of Postgres

Logical Structure of Postgres

Postgres Cluster -> Databases -> Schemas -> Tables

This is the logical structure we have in Postgres.

Cluster manages multiple databases, which contain schemas.

Within schemas, we have tables, fundamental units for storing data.

To view your databases, you can use pg_database, and for tables, views, indexes and other relations, use pg_class.

Physical Structure of Postgres

Physical Structure of Postgres

Postgres is just a directory in your file system.

You can access the directory by $PGDATA.

If you open it, you will find multiple files related to configs.

It also has multiple subdirectories; a few of them are base and global, which are important.

The base subdirectory stores all your databases.

In simple words, the tables you create come under the base subdirectory.

Table Structure in Postgres

Table Structure in Postgres

Your Postgres tables look like this.

We have 8 KB pages, which hold our data, and each is numbered from 0 to n.

On top, we have 24bytes of header data which contains some generic info about the page like checksum, lower, upper and more.

Then we have some line pointers, which are just pointers that point to actual tuples or data.

At the end, we have actual data sorted from bottom up.

To identify a tuple, we use a tuple ID (TID), which consists of two numbers.

One of them is the page number, and the line pointer number.

e.g. TID(4, 3) indicates 4 is the page number and 3 is the line identifier.

Process Architecture of Postgres

Process Architecture of Postgres

Postgres follows a client-server architecture.

Apps connecting to it are considered clients, and Postgres is itself considered as a server.

Postgres manages everything using processes.

It uses the Postgres Server process or the Postmaster process to handle all admin-level work, i.e. managing other processes.

For handling client queries, it spins up a new process called as Backend Processes.

But the problem is that for each new client connection, it spins up a new backend process, which leads to high CPU and memory consumption.

For that reason, we use pgBouncer or pgPool-II for connection pooling.

Then we have background processes, which handle the rest of the task like replications, streaming, vacuuming, etc.

Query Processing in Postgres

Query Process

Whenever a backend process receives a query to process, it passes through 5 phases.

  1. Parser: parse the query into a parse tree
  2. Analyzer: do semantic analysis and generate a query tree
  3. Rewriter: transfer it using rules if you have any
  4. Planner: generate a cost-effective plan
  5. Executor: execute the plan to generate the result

1. Parser

Parse Tree

The parser parses the query into a tree-like structure, and the root node will be the SelectStmt.

Its main functionality is to check the syntax, not the semantics of it.

That means, if your syntax is wrong, the error will be thrown from here, but if you make some semantic error, i.e. using a table that doesn't exist, it will not throw an error.

2. Analyzer

Query Tree

The analyzer takes the parsed tree as input, analyzes it and forms a query tree.

Here, all semantics of your query are being checked, like whether the table name exists or not.

The main components of a query tree are:

  1. targetlist: the list of columns or expressions we want in our result set. If you use the * sign here, it will be replaced by all columns explicitly.
  2. rengetable: the list of all relations that are being used in the query. It also holds information like the OID and the name of the tables.
  3. jointree: it holds the FROM and WHERE clause. It also contains information about your JOIN strategies with ON or USING conditions.
  4. sortclause: the list of sorting clauses While the query tree has more components, these are some primary ones.

3. Rewriter

Rewriter transforms your query tree using the rule system you have defined.

You can check your rules using the pg_rules system view.

For example, it attaches your views as a subquery.

4. Planner

testdb: EXPLAIN SELECT * FROM tbl_a WHERE id < 300 ORDER BY data;

                          QUERY PLAN
---------------------------------------------------------------
 Sort  (cost=182.34..183.09 rows=300 width=8)
   Sort Key: data
   ->  Seq Scan on tbl_a  (cost=0.00..170.00 rows=300 width=8)
         Filter: (id < 300)
(4 rows)

Plan Tree

The planner receives a query tree as input and tries to find a cost-efficient query plan to execute it.

The planner in Postgres uses a cost-based optimisation instead of a rule-based optimisation.

You can use the EXPLAIN command to see the query plan.

In the tree form, it has a parent node where the tree starts, called PlannedStmt.

In child nodes, we have interlinked plan nodes, which are executed in a bottom-up approach. That means, it will execute the SqeScan node first, then SortNode.

5. Executor

Using the plan tree, the executor will start executing the query.

It will allocate some memory areas, like temp_buffers and work_mem, in advance to store the temporary tables if needed.

It uses MVCC to maintain consistency and isolation for transactions.