Florent Capelli, Oliver Irwin, Sylvain Salvati
CRIL, Université d’Artois
DATA Lab @ Northeastern
11 April 2025
Q \coloneq R(x_1, x_2) \wedge S(x_1, x_3) \wedge T(x_2, x_3)
R | x_1 | x_2 |
---|---|---|
0 | 0 | |
0 | 1 | |
2 | 1 |
S | x_1 | x_3 |
---|---|---|
0 | 0 | |
0 | 2 | |
2 | 3 |
T | x_2 | x_3 |
---|---|---|
0 | 2 | |
1 | 0 | |
1 | 2 |
:::
One recursive call:
Total complexity: number of recursive calls times \tilde{O}(m) where m is the number of atoms.
At most (|dom|+1) \sum_{i=1}^n |Q_i(\mathbb{D})| calls.
Complexity: \tilde{O}(m|\mathsf{dom}|\cdot \sum_{i=1}^{n}|Q_i(\mathbb{D})|).
Consider databases for Q with a bound N on the table size: \mathcal{D}_{Q}^{\leqslant N}= \{\mathbb{D}\mid \forall R \in Q, |R^\mathbb{D}| \leqslant N\}
and let: \mathsf{wc}(Q, N) = \mathsf{sup}_{\mathbb{D}\in\mathcal{D}_{Q}^{\leqslant N}}~|Q(\mathbb{D})|
\mathsf{wc}(Q,N) is the worst case: the size of the biggest answer set possible with query Q and databases where each table are bounded by N.
We know how to compute \rho(Q) such that \mathsf{wc}(Q,N) = \tilde{O}(N^{\rho(Q)}) (this is known as the AGM-bound but we do not need it yet).
A join algorithm is worst case optimal (wrt \mathcal{D}_{Q}^{\leqslant N}) if for every Q, N \in \mathbb{N} and \mathbb{D}\in \mathcal{D}_{Q}^{\leqslant N}, it computes Q(\mathbb{D}) in time \tilde{O}(f(|Q|) \cdot \mathsf{wc}(Q,N))
Rich literature:
\begin{align*} |Q_i(\mathbb{D})| &= |\bigwedge_{R\in Q}\prod_{x_1\dots x_i} R^\mathbb{D}|\\ &= |\bigwedge_{R\in Q} R^{\mathbb{D}'}| \\ &= |Q(\mathbb{D}')| \end{align*}
where R^{\mathbb{D}'} = \prod_{x_1\dots x_i} R^\mathbb{D}\times \{0\}^{X_R \setminus x_1, \dots, x_i}
Crucial observation:
The complexity of the branch and bound algorithm is
\tilde{O}(m|\mathsf{dom}|\cdot \sum_{i=1}^n|Q_i(\mathbb{D})|)
\tilde{O}(m|\mathsf{dom}| \cdot n\mathsf{wc}(Q, N))
\tilde{O}(mn \cdot {\color{red}|\mathsf{dom}|} \cdot \mathsf{wc}(Q, N))
R | x | y |
---|---|---|
1 | 2 | |
2 | 1 | |
3 | 0 |
⇝
\tilde{R}^b | x^2 | x^1 | x^0 | y^2 | y^1 | y^0 | |
---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 1 | 0 | ||
0 | 1 | 0 | 0 | 0 | 1 | ||
0 | 1 | 1 | 0 | 0 | 0 |
We hence compute Q(\mathbb{D}) in time \tilde{O}(m n \cdot \mathsf{wc}(Q, N))!
Given Q and \mathbb{D}, sample \tau \in Q(\mathbb{D}) with probability 1 \over |Q(\mathbb{D})| or fail if Q(\mathbb{D}) = \emptyset.
Naive algorithm:
Complexity using WCOJ: \tilde{O}(\mathsf{wc}(Q,N)).
We can do better: (expected) time \tilde{O}({\mathsf{wc}(Q,N) \over |Q(\mathbb{D})|+1} poly(|Q|))
PODS 23: [Deng, Lu, Tao] and [Kim, Ha, Fletcher, Han]
Let’s do a modular proof of this fact!
Sampling answers reduces to sampling -leaves in a tree with (,)-labeled leaves.
In our case, we do not know \ell(t)…
Only makes sense if \sum_i upb(t_i) \leq upb(t).
Las Vegas uniform sampling algorithm:
Repeat until output: O({upb(r) \over \ell(r)}) expected calls, where r is the root.
AGM bound: there exists positive rational numbers (\lambda_R)_{R \in Q} such that |Q(\mathbb{D})| \leq \prod_{R \in Q}|R^\mathbb{D}|^{\lambda_R} \leq \mathsf{wc}(Q,N)
Define upb(t) = \prod_{R \in Q}|{\color{red}R^\mathbb{D}[\tau_t]}|^{\lambda_R} \leq \mathsf{wc}(Q,N):
Given a super-additive function upperbounding the number of -leaves in a tree at each node, we have:
Las Vegas uniform sampling algorithm:
Repeat until output: O({upb(r) \over \ell(r)})={\mathsf{wc}(Q,N) \over 1+|Q(\mathbb{D})|} expected calls.
Final complexity: binarize to navigate the tree in \tilde{O}(nm): \tilde{O}(nm \cdot {\mathsf{wc}(Q,N) \over 1+|Q(\mathbb{D})|})
Matches existing results, proof more modular.
So far we have considered worst case wrt this class:
Each relation is subject to a cardinality constraint of size N.
What if we know that our instance has some extra properties (e.g., a functional dependency)
In this case, we say that our algorithm is worst case optimal wrt \mathcal{C}.
Q = R(x_1,x_2) \wedge S(x_2,x_3).
We have: \mathsf{wc}(Q,N) = N^2.
Is our simple join worst case optimal for this class?
Short answer: yes if x_2 is set before x_1.
Recall the complexity of our algorithm: \tilde{O}(m |\mathsf{dom}| \sum_{i=1}^n |Q_i(\mathbb{D})|)) where Q_i = \bigwedge_{R \in Q} \prod_{x_1,\dots, x_i} R
A class of database \mathcal{C} for Q is prefix closed for order \pi = (x_1,\dots,x_n) if for each i and \mathbb{D}\in \mathcal{C}:
|Q_i(\mathbb{D})| \leq \mathsf{wc}(\mathcal{C})
\mathcal{D}_{Q}^{\leqslant N} is prefix closed (for any order)!
Our algorithm is (almost) worst case optimal as long as we use an order for which \mathcal{C} is prefix closed!
F = (X_1 \rightarrow Y_1, \dots, X_k \rightarrow Y_k) is a set of functional dependencies:
\mathcal{C}_F^N = \{\mathbb{D}\mid \mathbb{D}\text{ respects $F$} \} \cap \mathcal{D}_{Q}^{\leqslant N}
is prefix closed for order \pi (exactly the same proof as for cardinality constraints).
Hence our algorithm is worst case optimal wrt \mathcal{C}_F^N (as long as we follow \pi).
We need to show that this functional dependencies transfer in the binarised setting but it is almost immediate.
A degree constraint is a constraint (X,Y,N_{Y|X}) where X \subseteq Y. A relation R verifies the constraint if
|\max_{\tau \in \mathsf{dom}^X} \prod_{Y} R[\tau] | \leq N_{Y|X}
\Delta = \{(X_1, Y_1, N_{1}) \dots, (X_k, Y_k, N_k)\} set of degree constraints.
\mathcal{C}_\Delta^N = \{\mathbb{D}\mid \mathbb{D}\text{ respects $\Delta$} \} \cap \mathcal{D}_{Q}^{\leqslant N}
is prefix closed for order \pi (exactly the same proof as for cardinality constraints).
Hence our algorithm is worst case optimal wrt \mathcal{C}_\Delta^N (as long as we follow \pi).
We need to show that this functional dependencies transfer in the binarised setting but it is almost immediate.
We can find (\lambda_R) such that \prod_{R \in Q} |R^\mathbb{D}|^{\lambda_R} \leq \tilde{O}(\mathsf{wc}(Q, \mathcal{C}_\Delta^N)) for any \mathbb{D}\in \mathcal{C}_\Delta^N (polymatroid bound).
Define upb(t) := \prod_{R \in Q} |R^\mathbb{D}[\tau_t]|^{\lambda_R}:
We have sampling with complexity \tilde{O}(nm \cdot {\mathsf{wc}(Q,\mathcal{C}_\Delta^N) \over 1+|Q(\mathbb{D})|})
Future work: