Proposal: Make Org-mode fast for real

#emacs

First: in what way is org-mode slow? It's easiest to illustrate in terms of some things that have been developed as a reaction to org-mode's slowness:

org-ql
org-roam
org-node (I am the developer)

I have barely used org-ql, but as I understand the main distinction between org-ql and the other two:

org-ql searches the current file in realtime. It waits for you to input a search query, then it uses that to search for a specific kind of thing. It's supposed to be very fast, and I believe it can also operate on all files mentioned in an Agenda buffer, so it even works as a limited multi-file search.
- I say limited because you're not going to have that many files in org-agenda. Both org-agenda and org-ql (correct me here) would run into a performance tarpit if you add 100 or 1000 files to the list `org-agenda-files`.
  - That's the root of the issue. We'll get back to it later.
- Even if org-ql would perform well with 1000 files, you have to know what to search for, beforehand. It is not an exploration tool.
  - Admittedly, perhaps a command could be written that uses org-ql to search for "any heading" in "any file" and make a minibuffer prompt out of that – the result would be a similar idea to org-roam's/org-node's minibuffer prompts. But there is probably no need to use a search engine like org-ql if you're going to write such a simplistic query.
org-node (and org-roam) visits all files at some point in time, to cache as much info as possible about those files.
- It's like you wrote a few simplistic org-ql searches and cached the results. But not really.
  - I briefly wondered if I could design org-node to just run on top of org-ql, but they don't have the same tasks. Org-node has to correlate all those results so that it can do things like take some Org entry title and return what entries have that title and what's in their PROPERTIES drawers, all while operating purely off cache so you can write functions that loop over every entry in existence in less than 20ms.
    
    To run on org-ql would be a lot like running on ripgrep, an experiment I already tried. It's a mess of having to do many search passes and correlating different sets of results, and necessarily slower than just giving org-node its own parser.

To my proposal, what if upstream Org did such caching?

That's actually the idea with the org-element-cache, but it is not ambitious enough (yet). It's still the case that most functions that work with Org have to visit the relevant file, turn on org-mode, and then use the org-element functions to grab the info they need. But almost all the CPU cycles are burned at the "visit and turn on org-mode" step.

That's why having 1000 org-agenda-files causes the agenda to take several minutes to build. It has to turn on org-mode 1000 times.

I envision that a function should be able to just ask Org "hey, in that file, get me that piece of information" and Org will return the information without visiting that file at all.

Concretely: say the first time Org loads, it spins up an async process that visits every file in `org-agenda-files`, `org-id-locations`, `recentf-list` and other variables, and returns the org-element tree for each. Then Org has a nice set of hash tables it can just look up.

(Of course, store each file's last-modification time to know if it needs re-scanning.)

The end result might be a lot of commands are suddenly instant, and things like agenda and org-ql can cope with an unlimited number of files the same as if they were concatenated into one file.

If we further extend org-element-cache so that it even contains a copy of the fulltext of all entries, that would enable a fulltext search that competes with ripgrep, and can be filtered by additional metadata in a way you can't do with ripgrep.

The code-bases of org-node and org-roam could then shrink to 1/10 of the original LoC.

Created 2024-Aug-17 (5 months ago)