Direct Access for Conjunctive Queries with Negations

Florent Capelli, Oliver Irwin

CRIL, Université d’Artois

Séminaire KRDB

23 Janvier 2025

Direct Access on Join Queries

Join Queries

Join Query : Q(x1,,xn)=i=1kRi(𝐱𝐢)Q(x_1, \dots, x_n) = \bigwedge_{i=1}^k R_i(\mathbf{x_i})

where 𝐱𝐢\mathbf{x_i} is a tuple over X={x1,,xn}X = \{x_1,\dots,x_n\}

Example: Q(city,country,name,id)=People(id,name,city)Capitals(city,country)Q(city, country, name, id) = People(id, name, city) \wedge Capitals(city, country)

People
id name city
1 Alice Paris
2 Bob Lens
3 Chiara Rome
4 Djibril Berlin
5 Émile Dortmund
6 Francesca Rome
Capitals
city country
Berlin Germany
Paris France
Rome Italy
Q(𝔻)Q(\mathbb{D})
city country name id
Paris France Alice 1
Rome Italy Chiara 3
Berlin Germany Djibril 4
Rome Italy Francesca 6

Direct Access

Quickly access Q(𝔻)[k]Q(\mathbb{D})[k], the kthk^{th} element of Q(𝔻)Q(\mathbb{D}).

Q(𝔻)Q(\mathbb{D})
city country name id
Paris France Alice 1
Rome Italy Chiara 3
Berlin Germany Djibril 4
Rome Italy Francesca 6

Q(𝔻)[2]Q(\mathbb{D})[2]? (Rome,Italy,Chiara,3)(Rome, Italy, Chiara, 3).

Naive Direct Access

Naive algorithm: materialize Q(𝔻)Q(\mathbb{D}) in an array, sort it. Access.

city country name id
\dots \dots \dots \dots
Berlin Germany Djibril 4
\dots \dots \dots \dots
Paris France Alice 1
\dots \dots \dots \dots
Rome Italy Chiara 3
Rome Italy Francesca 6
\dots \dots \dots \dots

Q(𝔻)[1432]=??Q(\mathbb{D})[1432] = ??

Precomputation : O(#Q(𝔻))O(\#Q(\mathbb{D})) at least (may be worse), very costly

Access : O(1)O(1), nearly free

Orders ?

  1. Order by weights
  2. Lexicographical orders
    • order on the vars of QQ
    • order on domain DD of 𝔻\mathbb{D}

Variable order (city,country,name,id)(city, country, name, id):

city country name id
Berlin Germany Djibril 4
Paris France Alice 1
Rome Italy Chiara 3
Rome Italy Francesca 6

In this talk: only lexicographical orders.

Applications

Direct Access generalizes many tasks that have been previously studied:

  • Uniform sampling without repetitions
  • Ranked enumeration
  • Counting queries:
    • how many answers between τ1\tau_1 and τ2\tau_2?
    • how many answers extend a partial answer etc.

Beating the Naive Approach

Beating Naive Direct Access

Naive Direct Access:

  • Preprocessing at least O(#Q(𝔻))O(\#Q(\mathbb{D})).
  • Access time O(1)O(1).

Can we have better preprocessing and reasonable access time?

“Ideal” complexity:

  • O(|𝔻|)O(|\mathbb{D}|) preprocessing
  • O(log|𝔻|)O(\log |\mathbb{D}|) access time

Complexity of Direct Access

Computing #Q(𝔻)\#Q(\mathbb{D}) given QQ and 𝔻\mathbb{D} is #P\#P-hard.

No Direct Access algorithm with good guarantees for every QQ and 𝔻\mathbb{D}.

Data complexity assumption: for a fixed QQ, what is the best preprocessing f(|𝔻|)f(|\mathbb{D}|) for an access time O(polylog|𝔻|)O(polylog |\mathbb{D}|)?

In this work, all presented complexity in data complexity will also be polynomial for combined complexity.

An easy query?

Q(a,b,c)=A(a,b)B(b,c).Q(a,b,c) = A(a,b) \wedge B(b,c).

G a a b b a--b c c b--c

Direct Access for lexicographical order induced by (a,b,c)(a,b,c)?

  • Precomputation O(|𝔻|)O(|\mathbb{D}|)
  • Access time O(log|𝔻|)O(\log |\mathbb{D}|)
a b
0 0
1 1
2 1
b c
0 0
0 1
0 2
1 1
1 2

Precomputation :

  • #Q(0,0,_)=3\#Q(0,0,\_) = 3
  • #Q(1,1,_)=2\#Q(1,1,\_) = 2
  • #Q(2,1,_)=2\#Q(2,1,\_) = 2

Access Q[5]Q[5]:

  • a=0,b=0a=0,b=0: not enough solutions
  • a=1,b=1a=1,b=1: enough! 33 solutions smaller than (1,1,_)(1,1,\_)
  • Look for the second solution of B(1,_)B(1,\_): a=1,b=1,c=2a=1,b=1,c=2

A not so easy query

Q(a,c,b)=A(a,b)B(b,c).Q(a,c,b) = A(a,b) \wedge B(b,c).

G a a b b a--b c c b--c

Direct Access for lexicographical order induced by (a,c,b)(a,c,b)?

  • Precomputation O(|𝔻|2)O(|\mathbb{D}|^2)
  • Access time O(log|𝔻|)O(\log |\mathbb{D}|)

Reduces to multiplying two {0,1}\{0,1\}-matrices M,NM,N over \mathbb{N}:

  • (i,j)A(i,j) \in A iff M[i,j]=1M[i,j]=1, (j,k)N(j,k) \in N iff N[j,k]=1N[j,k]=1
  • #Q(i,j,_)=(MN)[i,j]\#Q(i,j,\_) = (MN)[i,j]
  • Direct Access can be used to find #Q(i,j,_)\#Q(i,j,\_) with O(log|𝔻|)O(\log |\mathbb{D}|) queries.

Characterizing preprocessing time

Given a query QQ and order π\pi on its variables, we can compute ι(Q,π)\iota(Q,\pi) such that:

  • Tractable Direct access for QQ on 𝔻\mathbb{D}:
    • preprocessing Õ(|𝔻|ι(Q,π))\tilde{O}(|\mathbb{D}|^{\iota(Q,\pi)})
    • access O(log|𝔻|){O}(\log |\mathbb{D}|)
  • Tight fine-grained lower bounds:
    • if possible with Õ(|𝔻|k)\tilde{O}(|\mathbb{D}|^k) preprocessing with k<ι(Q,π)k < \iota(Q,\pi)
    • then Zero-Clique Conjecture is false
      (we can find 00-weighted kk-cliques in graphs in time <|G|kε< |G|^{k-\varepsilon})
  • Function ι\iota closely related to fractional hypertree width.
  1. Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries, N. Carmeli, N. Tziavelis, W. Gatterbauer, B. Kimelfeld, M. Riedewald
  2. Tight Fine-Grained Bounds for Direct Access on Join Queries, K. Bringmann, N. Carmeli, S. Mengel

End of the story?

So, if we understand everything for Direct Access and lexicographical orders, what is our contribution?

Signed Join Queries

Definition

Q=i=1kPi(𝐳𝐢)Q = \bigwedge_{i=1}^k P_i(\mathbf{z_i}) i=1l¬Ni(𝐳𝐢)\bigwedge_{i=1}^l \lnot N_i(\mathbf{z_i})

Negation interpreted over a given domain DD:

NN
x1x_1 x2x_2 x3x_3
0 1 0
¬N\lnot N on {0,1}\{0,1\}
x1x_1 x2x_2 x3x_3
0 0 0
0 0 1
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
  • ¬N(x1,,xk)\neg N(x_1,\dots,x_k) encoded with |D|k#N|D|^{k}-\#N tuples.
  • Relation with SAT: ¬N\neg N is x1¬x2x3x_1 \vee \neg x_2 \vee x_3

Positive Encoding not Optimal

Q(x1,,xn)=¬N(x1,,xn)Q(x_1,\dots,x_n) = \neg N(x_1,\dots,x_n), domain {0,1}\{0,1\}.

Positive encoding: preprocessing O(2n)O(2^n)

N
x1x_1 x2x_2 x3x_3
0 1 0
1 0 1
  • Q(𝔻)[1]Q(\mathbb{D})[1]? x1=0,x2=0,x3=0x_1 = 0, x_2 = 0, x_3=0 ie [0]2[0]_2!
  • Q(𝔻)[2]Q(\mathbb{D})[2]? x1=0,x2=0,x3=1x_1 = 0, x_2 = 0, x_3=1 ie [1]2[1]_2!
  • Q(𝔻)[3]Q(\mathbb{D})[3]? x1=0,x2=1,x3=0x_1 = 0, x_2 = 1, x_3=0 ie [2]2[2]_2? x1=0,x2=1,x3=1x_1 = 0, x_2 = 1, x_3=1 ie [3]2[3]_2!
  • Q(𝔻)[k]Q(\mathbb{D})[k]? [k1+pk]2[k-1+p_k]_2
    where pk={tNtk}p_k=\{t \in N \mid t \leq k\}

Linear preprocessing!

Hardness of subqueries

Q1=R(1,2,3)S(1,2)T(2,3)U(3,1)Q_1 = R(1, 2, 3) \land S(1, 2) \land T(2, 3) \land U(3, 1)

linear preprocessing

Q2=S(1,2)T(2,3)U(3,1)Q_2 = S(1, 2) \land T(2, 3) \land U(3, 1)

non-linear preprocessing

Subqueries may be harder to solve than the query itself!

Subqueries and negative atoms

Q1=Q_1' = ¬R(1,2,3)\lnot R(1, 2, 3) S(1,2)T(2,3)U(3,1)\land S(1, 2) \land T(2, 3) \land U(3, 1)

Equivalent to Q2Q_2 if R=R=\emptyset

Q2=S(1,2)T(2,3)U(3,1)Q_2 = S(1, 2) \land T(2, 3) \land U(3, 1)

non-linear preprocessing (triangle)

DA for Q=PNQ = P \wedge N implies DA for Q=PNQ = P \wedge N' for every NNN' \subseteq N!

Measuring hardness of SJQ

Good candidate for Q=Q+QQ = Q^+ \wedge Q^-:

Signed-HyperOrder Width sfhow(Q,π)=maxQQι(Q+Q,π)sfhow(Q,\pi) = \max_{Q' \subseteq Q^-} \iota(Q^+ \wedge Q', \pi)

For QQ a (positive) JQ signed JQ, and π\pi a variable ordering, we can solve DA with

  • Preprocessing Õ(|𝔻|ι(Q,π))\tilde{O}(|\mathbb{D}|^{\iota(Q,\pi)}) Õ(|𝔻|sfhow(Q,π))\tilde{O}(|\mathbb{D}|^{sfhow(Q,\pi)})
  • Access time O(log|𝔻|)O(\log |\mathbb{D}|)

Our contribution : new island of tractability for Signed JQ!

A word on sfhow

Signed Fractional HyperOrder Width (and incidentally, our result) generalizes:

  • β\beta-acyclicity (#SAT and #NCQ are already known tractable)
  • signed-acyclicity (Model Checking for SCQ known to be tractable)
  • Nest set width (SAT / Model Checking for NCQ known to be tractable)
  • A non-fractional version show can be defined (better combined complexity)

Basically, everything that is known to be tractable on SCQ/NCQ.

  1. Understanding model counting for β-acyclic CNF-formulas, J. Brault-Baron, F. C., S. Mengel
  2. De la pertinence de l’énumération: complexité en logiques propositionnelle et du premier ordre, J. Brault-Baron
  3. Tractability Beyond ß-Acyclicity for Conjunctive Queries with Negation, M. Lanzinger

Our algorithm: a circuit approach

Relational Circuits

x1x_1 x2x_2 x3x_3
0 0 0
0 0 1
0 1 0
0 1 1
1 0 1
1 0 2
1 1 1
1 1 2
1 2 0
1 2 1
2 0 1
2 0 2
2 2 1
2 2 2

Ordered Relational Circuits

Factorized representation of relation RDXR \subseteq D^X:

  • Inputs gates : \top & \bot
  • Decision gates
  • Cartesian products: ×\times-gates

Ordered: decision gates below xix_i only mention xjx_j with j>ij>i.

Direct Access on Relational Circuits

For CC on domain DD, variables x1,,xnx_1,\dots,x_n, DA possible :

  • Preprocessing: O(|C|log|D|)O(|C| \log |D|)
  • Access time: O(nlog|D|)O(n \log |D|)

Preprocessing

Idea : for each gate vv over xix_i and for each domain value dd

compute the size of the relation where xix_i is set to a value ddd'\leqslant d

Preprocessing

Direct Access 7th solution

Compute the 7th solution \to 111

Direct Access the 13th solution

Compute the 13th solution \to 221

Solving DA for SCQ

SCQ Q(x1,,xn)Q(x_1,\dots,x_n), π=(x1,,xn)\pi = (x_1,\dots,x_n).

Preprocessing: Õ(|𝔻|sfhow(Q,π))\tilde{O}(|\mathbb{D}|^{sfhow(Q,\pi)})

  1. Construct π\pi-ordered circuit CC of size Õ(|𝔻|sfhow(Q,π)f(Q))\tilde{O}(|\mathbb{D}|^{sfhow(Q,\pi)} f(Q))
  2. Preprocess CC in time O(|C|log|𝔻|)O(|C| \log |\mathbb{D}|).

Direct Access : O(log|𝔻|)O(\log |\mathbb{D}|)

  1. Directly on CC
  2. in time O(nlog|D|)O(n\log |D|) !

QQ, nn considered constant here!

  • The hidden constants f(Q)f(Q) are exponential in |Q||Q| for bounded sfhow(Q)sfhow(Q).
  • But polynomial in QQ for bounded show(Q)show(Q) (non fractional question).

DPLL: building circuits

Compilation based on a variation of DPLL :

  1. Q(𝔻)=dD[x1=d]×Q[x1=d](𝔻)Q(\mathbb{D}) = \biguplus_{d \in D} [x_1 = d] \times Q[x_1=d](\mathbb{D})
  2. Q(𝔻)=Q1(𝔻)×Q2(𝔻)Q(\mathbb{D}) = Q_1(\mathbb{D}) \times Q_2(\mathbb{D}) if Q=Q1Q2Q = Q_1 \wedge Q_2 with var(Q1)var(Q2)=var(Q_1) \cap var(Q_2) = \emptyset
  3. Top down induction + caching
https://florent.capelli.me/cytoscape/dpll.html

A comment on the complexity of DPLL

  • If implemented this way, gives a |𝔻|sfhow(Q)+1|\mathbb{D}|^{sfhow(Q)+1} complexity…
  • Workaround: reencode the domain in binary and build a circuit iteratively testing the bits of each variable.

Going further

Work in progress

  1. Aggregation
    Q(x1,,xk,F(xk+1,,xn))Q(x_1,\dots, x_k, F(x_{k+1}, \dots, x_n)), generalizing work by I. Eldar, N. Carmeli, B. Kimelfeld.
  2. Understanding combined complexity for sfhow(Q)sfhow(Q), the fractional version of showshow
  3. Comparing showshow and β\beta-hypertree width (the most general parameter for which the complexity is still unknown).