What if SQL were a Lisp: SQLisp

Having been around SQL quite a bit since I joined dayjob 4 years ago I've thought about the language quite a bit. I'll be up front, I'm no SQL power user. I don't write much complex SQL like many table join monsters, but I find parsing and processing it is very interesting. Declarative languages like SQL are unique: a general formula is given and its up to the implementation to sort out what is returned within reason.

I'm not going to bore you with SQL's background. What I'll say is that IBM had fear in their hearts when making this language. The more SQL that I write the more convinced its syntax mismatches is what makes it so...odd. My only guess is that 'analysts' were assumed to not be able to handle pre-fix math?

Weird SQL Syntax Choices

Mix of Prefix and Infix

The most glaring aspect of SQL that sticks out to me is the mixture of pseudo-prefix and infix operators. For example, take a look at the simple SELECT list:

SELECT 1, 2, 3, 4

Obviously the SELECT operator comes first, followed by a list of elements. Similar to a CONS list.

Now list look at arithmetic expressions:

1 + 1

Obviously, this in an Infix expression, the + operator is between the two numbers.

SELECT 1 + 1

This mixture of style is confusing? Doesn't the following look much more consistent and reasonable?

SELECT + 1 1, i
FROM my_table;

Inconsistencies

The , operator is overloaded. It delineates expressions in a list, but it also can map to the cross-join operator in some dialects like:

SELECT i, j, k
FROM my_table, my_table2;

Is tricky to read at first.

Random parenthesis requirements

Why does the value list of an INSERT require parenthesis? Why does the target column list require them?

INSERT INTO my_table (i, j, k) VALUES (1, 2, 3);

Why does the SELECT list not require these parenthesis? Or rather, why is INSERT not more shaped like the SELECT statement:

INSERT i, j, k
INTO table
VALUES 1, 2, 3

Or rather, why does SELECT not follow INSERT's format:

SELECT FROM my_table (i, j, k) WHERE 1=1;

Either of these forms would be 'most' consistent.

What if SQL was a Lisp

Lets just jump the shark completely. For the sake of parsers and as we'll see logical optimizers, we could save much hassle if SQL were a consistent Lisp.

(SELECT
    (CONS (SUM (CONS (COLUMN i 0) nil)) nil)
    (FROM my_table)
    (AND
        (= (COLUMN i 0) (CONST '1' INT))
        (= (COLUMN i 0) (COLUMN j 1))))

It takes a bit of time to get used to this, but what is defined is equivalent to the following query:

SELECT SUM(i)
FROM my_table
WHERE i = 1 AND j = i;

There are obvious disadvantages to the SQLisp query above. It is quite a bit more verbose, there are more constructs, and it takes longer to type. All of that being said, lets look at the advantages.

  • Consistency.
    • Everything is a CONS list.
    • The select list is just a series of CONS.
    • The inputs to the SUM functions are also cons.
    • This makes it more obvious that functions are really var-argable like DECODE().
  • Everything is a function.
    • SELECT is a 3 arg function.
    • Projection is a 2 arg function.
    • FROM is a single arg function.
    • Equality is a function.
    • Everything is a function.
  • Less ambiguity
    • In this case, what is the type of 1 in the comparison?
    • Without layering explicit casts to INT or BIGINT or whatever type you desire, some unexpected behavior can happen depending on the database you are using and the implicit upcasting rules.
    • (CONST '1' INT) is a much clearer intent, even though it does lose the elegancy of plain inline values.

If we think of SQL as not just something users are writing but as the lingua franca of data tools then this opens up a level of consistency in the API that SQL has embodied, which it is dearly lacking. Why do Tableau, Superset, QueryBook and the multitude of other tools require an insane amount of logic for each database to work around painful and tedious ambiguities? Intent can be expressed and not derived all while keeping the original semantics. This isn't a 'new' version of SQL, this is SQL deconstructed and simplified. SQL ran through the Pacojet and turned into a nice creamy foam.

Optimization

If we take this concept further to Equality Saturation we have a powerful rules-based logical optimizer in our reach. egg is a great Rust library that does the heavy lifting for this. E-graphs and Equality Saturation provide a way to perform non-destructive optimizations based on a cost function. In our example above, lets consider the following 'rules':

rewrite!("AND commutative"; "(AND ?a ?b)" => "(AND ?b ?a)"),
rewrite!("= commutative"; "(= ?a ?b)" => "(= ?b ?a)"),

rewrite!(
    "Optimzie col-to-col and col-to-const conjnct equivalency";
    "(AND
        (= (COLUMN ?a_oid ?a_idx) (CONST ?b_val ?b_type))
        (= (COLUMN ?a_oid ?a_idx) (COLUMN ?c_oid ?c_idx)))" =>
    "(AND
        (= (COLUMN ?a_oid ?a_idx) (CONST ?b_val ?b_type))
        (= (COLUMN ?c_oid ?c_idx) (CONST ?b_val ?b_type)))"
),

This gives us a powerful interface to add simple logical optimizations to the queries. If we apply these rules to the query above, it gets transformed to:

(SELECT
    (CONS (SUM (CONS (COLUMN i 0) nil)) nil)
    (FROM my_table)
    (AND
        (= (COLUMN i 0) (CONST '1' INT))
        (= (CONST '1' INT) (COLUMN j 1))))

Notice the new equality predicate. These rules successfully transformed the expression i = 1 and i = j to the equivalent i = 1 and j = 1. While simple, this is a powerful optimization as now we don't need to to a column by column comparison while evaluating the conjunction.

And this is just the beginning.

Notes

After writing this I have found that this isn't necessarily a new idea. Some interesting variations of pther rules can be found in this library from risinglightdb.


1029 Words

2024-01-30