Dependencies

Sun
09/04/11

What is a Dependency?

Whenever you create a page you generate dependencies.

For example, a page may include its own content, the list of the last 3 pages created on your website, the list of the last 5 news feed titles, and the teaser of another page in a box.

These are all called dependencies.

When you change a page's own content, that page is marked as changed and in general we're done.

However, in cases where a page A is referenced in a page B, we need to refresh page B with the new content in page A.

For example, in our prevent example we mentioned the idea of showing a teaser from another page. This means the teaser of page A is shown in a box on page B. B depends on A since when A changes, B is likely to need to change too to reflect the new content.

As we build pages and have users enter data and references like these, we can automatically add dependencies. For example, when the user marked that page B had to show the teaser of page A, we added a dependency in A pointing to B. The result is that whenever page A is modified we can automatically mark page B as out of date and requiring a rebuild.

Most Recent Lists

One of the main things that people like to have on their dynamic website is a list of the most recent comments, most recent posts, most recent visitors, etc.

All of those generally happen to appear on all the pages. In other words, it breaks the possibility of caching an entire page.

In order to palliate to this problem, we can look into applying cTemplate in stages. Entries that do not change often generate the most advanced cache. The data that changes all the time the least advanced cache.

By that I mean that the page creation process is similar to the following:

Generate the page content with all the content that the page owns—cache the result, it will change only when the page is edited;
Attach all the parts that are marked as low frequencies (i.e. the teaser of another specific node)—cache this result, it will change if the page content or the attached nodes and other low level parts change which should be rare (use a priority mechanism?);
Attach parts with high frequency changes (i.e. list of latest posts, list of most read posts, list of comments)

Point 2 and 3 can be expanded with as many levels as we'd like (although think about the number of caches we want to keep for a given page...)

We also want to calculate the frequency automatically so that way administrators do not have to themselves determine what needs to go here or there. For example, new websites may rare get comments (low frequency). Older websites might get thousands every day (high frequency.) This means the list of cache can dynamically change over time.

XSLT variables and Dependency computation

It is to be noted that the dependency may not be to another page or list. Some specific cases are much more complex.

For example, XSLT allows for show the current year which breaks on each January 1st at midnight. If entering time as HH:MM then the page goes out of date once a minute1.

Cascading Effects

As in a Makefile a dependency may hide another. For example, a Makefile could include something like this:

A: B

B: C

C: D

This means A depends on B, B depends on C, C depends on D.

At any one point, we should never need to know that A depends on D. The system has an automatic ripple effect:

When you change D, it marks that C needs rebuilding.
After C was changed, the system marks that B needs rebuilding.
When B is changed, the system marks that A needs rebuilding.
Now the system can rebuild A.

Dependency Loops

As you create pages, you may say that A shows B. At a later time, you may edit B and ask B to show some part of A.

In general this is fine, especially if rebuilding A or B does not require the other to be rebuilt. The order should also not matter since the content should not be bound by any information such as when the page was last rebuilt.

In order to avoid rebuilding in a loop, any information about a build can be displayed anywhere, but it is not considered in the dependency determination. Thus page A may display the time and date when page B was last built and this will be wrong if A is not rebuilt when B is modified or rebuilt again. The [select("/snap/page/body/created")] token would still be useful if you want to display it in the page itself as information to yourself or your users.

See: Token Replacement feature [core]

Computation of Dependencies

It is our belief that the computation of dependencies can take a long time, especially with the use of very complex layouts.

The computation should take place in the backend. Whenever something changes that can affect the dependencies, the page that changed is marked as requiring a dependency re-computation.

In order to allow easy re-computations, we want to save all the information about all the dependencies in both: the source and destination. This means...

Page A depends on page B
Page A has a dependency link to page B
Page B has a dependency link back to page A

With these two pointers we can easily update both pages as required.

First case: A modification to B emits a rebuild signal to A.

Second case: A modification to A removes the dependency on B, we can use the second pointer to remove the dependency from B so further modifications to B do not emit rebuild signals to A.

Rebuild Order

In order to avoid rebuilding the same pages over and over again, we want to build in a smart order. Also, as we build a cache, we need a time threshold to avoid rebuilding pages that have changed while we were building others in the cache.

Depending on the amount of time necessary to rebuild the entire cache, the order should also be affected by a page priority mechanism. For example, if most of your viewers arrive on the front page, you may want to keep the front page up to date to the minute. This would mean that the system needs to check whether the front page changed each minute. If so, recompute it immediately, ignoring all the other update requests.

So the order is computed using two main mechanism: priority and dependencies.

Let's say you have 5 pages, regular pages A through D and the front page F.

A depends on B
B depends on C
B depends on D
F depends on A, B, and C

We will assume that the time required to rebuild the pages is so long that page F needs to be preemptive once in a while.

In our graph we notice two things:

Following all the dependencies, F depends on all the other pages.
B depends on two other pages.

So, before we rebuild B, we want to build C and D as required. This means, if you modify C and D, then the system wants to rebuild B, it will make sure that both, C and D are rebuild first, then B is rebuilt.

Say our process goes like this:

We first detect that C needs rebuilding. We do that since it does not depend on anything else.
We notice that B depends on C. Emit a signal to B.
We now want to rebuild B. We check its dependencies and see that it has 2.
1. Check whether C was rebuilt, in our case it was.
2. Check whether D was rebuilt, not yet! Go ahead with that rebuild.
Build D.
We come back to building B, we now do it since we checked all its dependencies and they all were rebuilt.

This is how B will be rebuilt in the proper order.

In that previous list of steps, we could say that each step is separated by a check on F. And if F was detected as modified and needy of an update, it would be updated before moving to the next step as presented for rebuilding B.

Implementation

The basic algorithm to implement the dependency order would look something like this:

Get list of pages that need to be rebuilt.
Sort the list.
1. First by priority (if A has a higher priority than B, move A first)
2. Second by dependencies (if A depends on B, move B first)
Rebuild the first page in the list.
Repeat from step 1 until done.

The implementation may include a way to break the loop after a certain amount of time or a certain number of pages were rebuilt to give a chance to other websites to be refreshed too.

Note that the priority is computed using different criteria. For example a page that needs rebuild for more than 1 week should appear first in the list, way before a page such as your front page. In other words, the user specified priority is a fairly small factor in the ordering as shown in step 2.

Dependency Granularity

The question could be: What generates a rebuild?

For example, page A depends on page B means that A needs to be rebuilt when B changes. However, page A may depends on a very specific piece of information from B that may rarely change. For example, you may be showing the creation date of B. The creation date should never change since there is only one point in time when a page gets created.

This means we need a deeper granularity than just page A and page B. In our example, we'd make a strong distinction between B.content and B.creation_date. B.content may change when the user edits page B. B.creation_date should never change (although we can allow changing the creation time, by default that should not be allowed.)

Since the dependencies are determined using XSLT variables, we can use that exact value to determine whether the dependency emits a rebuild event or not.

The granularity can cause problems with multiple dependencies as in:

A depends on D.creationg_date
B depends on D.content
C depends on D.author

Here we can see that what you changed in D needs to be tracked exactly or the change must emit the rebuild signals to A, B, and C every time (although the creation date may be ignored in most cases and the author rarely changes.)