How to Ensure that Visitors Always See Cached Pages in Drupal 7

All the standard caching options of my Drupal 7 site are enabled. I expected those mechanisms to ensure that the access to a cached page would always be quick, both for human visitors and for web spiders like Googlebot. However, for a small site, this is not so. The Drupal caching mechanism does not work as you might expect. For me, there is a clear difference between what I want, and what I get.

What I want: The cached version of a page stays valid indefinitely, until that particular page is either modified, or a comment is added to it.

What I get: Each cron run clears the cache for all pages. Changing any page clears the cache for all pages. Adding a comment to any page clears the cache for all pages (even when it is not posted right away, but only put in the moderation queue).

Now, even with this, the cache is still perfectly able to deal with traffic spikes on popular pages, since almost all requests for those pages will be served from the cache. However, most other pages will almost never be in the cache, and will have a relatively high load time. For a good explanation of this effect, see How Drupal's cron is killing you in your sleep + a simple cache warmer. Here, I expand on the solution from that article, by also dealing with changes to a page and added comments, instead of only cron runs.

Warming the Cache

After each cron run, you can run a script like the following to “warm” the cache (fill it) once it has been cleared. I assume that your web server runs Linux, and that you have an up-to-date sitemap.

wget -q http://example.com/sitemap.xml -O - \
| egrep -o "http://example\.com[^<]+" \
| wget -q -i - -O /dev/null

The linked article contains a similar script. The first wget reads the sitemap.xml file and pipes it into the egrep command. This then cuts each URL from the sitemap file and pipes it into the second wget command. This command then does the brunt of the work by fetching each page. Running this script after each cron run solves the problem of the cron clearing the cache.
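The URL extraction step can also be isolated in a small function. The following is only a sketch, not part of the original script: the function name extract_sitemap_urls is hypothetical, and it assumes that the sitemap puts each page URL in a <loc> element without line breaks inside it. Unlike the domain-based pattern above, it skips other elements such as <image:loc>.

```shell
#!/bin/bash

# Hypothetical helper: print the content of every <loc> element
# that is read from standard input, one URL per line.
# Assumes each <loc>...</loc> pair sits on a single line.
extract_sitemap_urls() {
  grep -o '<loc>[^<]*</loc>' \
  | sed -e 's|^<loc>||' -e 's|</loc>$||'
}
```

It would replace the egrep step, as in wget -q http://example.com/sitemap.xml -O - | extract_sitemap_urls | wget -q -i - -O /dev/null.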

However, this is only a partial solution, since any change whatsoever to the site (see “what I get” above) invalidates the complete cache for all pages, undoing the effect of the script. To avoid this, you must change the “Minimum cache lifetime” in the Performance tab. Setting it to a value larger than zero will prevent the cache from being invalidated at each change. If you set it to the time between cron runs, the cache will always be warm.

Of course, this produces a new problem, since nobody will see changes or comments anymore before the “Minimum cache lifetime” has passed. And you cannot simply rerun the warming script after a change, since that will not regenerate cached pages due to the “Minimum cache lifetime” that is now in effect. What you need to do is first forcibly clear the cache, and then rerun the warming script. So, we need a “clear cache” command that ignores the “Minimum cache lifetime”.

Clearing the Cache

How to do that? Clearing the Drupal cache is more difficult than warming it. As I’ve shown, warming the cache can be done externally, simply by loading each page. The cache can be (forcibly) cleared by clicking the “Clear all caches” button in the Performance tab. However, we then still need a simple way to rerun the warming script. It would also be nice if we could use the same script for the cron and for running manually, so we need a way to clear the cache from a shell script. There are several ways to do this. One solution is to use Drush (Drupal shell). The simple Drush command drush cc all clears the cache. However, I wanted to present a solution that can also be used on shared hosting services.

The following code is based on the standard cron.php script that is installed on every Drupal site. If you put this code in a (new) file clear.php in the root of your Drupal site, you can clear the cache from anywhere (including from a local script that you run through a cron job) with a command like wget -q "http://example.com/clear.php?cron_key=YOUR_CRON_KEY" -O /dev/null. I’ve simply reused the cron key that each Drupal site already has, so that other people can’t clear your cache.

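<?php
// Bootstrap Drupal in the same way as the standard cron.php does.
define('DRUPAL_ROOT', getcwd());
include_once DRUPAL_ROOT . '/includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

// Only clear the cache if the correct cron key is passed in the URL.
if (!isset($_GET['cron_key']) || variable_get('cron_key', 'drupal') != $_GET['cron_key']) {
  watchdog('clear', 'Cache could not be cleared because an invalid key was used.',
           array(), WATCHDOG_NOTICE);
  drupal_access_denied();
}
else {
  // Same effect as the "Clear all caches" button in the Performance tab.
  drupal_flush_all_caches();
}
?>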

Putting It All Together

If we put all these things together in a single script, we get the following.

#!/bin/bash
 
# Run the cron job.
wget -q "http://example.com/cron.php?cron_key=YOUR_CRON_KEY" -O /dev/null

# Clear the cache.
wget -q "http://example.com/clear.php?cron_key=YOUR_CRON_KEY" -O /dev/null
 
# Warm the cache.
wget -q http://example.com/sitemap.xml -O - \
| egrep -o "http://example\.com[^<]+" \
| wget -q -i - -O /dev/null

This script first runs the normal Drupal cron job, then (forcibly) clears the cache using the new clear.php script, and finally warms the cache using the script that I’ve shown before. This script can be called

  1. directly from a cron job,
  2. from the command line after a change to the site (for example, after approving comments), and
  3. remotely, which means that it can also be used on shared hosting services that do not allow command line access.
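Since the same script can be started both by cron and by hand, and a full warm-up can take a while, two runs could overlap. The following is a minimal sketch of one way to guard against that. It is not part of the original script: the function name run_locked and the lock directory are hypothetical. It relies on mkdir failing when the directory already exists.

```shell
#!/bin/bash

# Hypothetical helper: run a command only if no other run holds the lock.
# Usage: run_locked LOCK_DIR COMMAND [ARGS...]
run_locked() {
  local lock_dir="$1"
  local status
  shift
  # mkdir is atomic: it fails if the lock directory already exists.
  if ! mkdir "$lock_dir" 2>/dev/null; then
    echo "Another run is in progress; skipping." >&2
    return 1
  fi
  "$@"
  status=$?
  # Release the lock. A killed script leaves the directory behind,
  # which then has to be removed by hand.
  rmdir "$lock_dir"
  return "$status"
}
```

Assuming that the combined script is saved as warm-cache.sh (a hypothetical name), it could then be started with run_locked /tmp/warm-cache.lock ./warm-cache.sh.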

You could run this script automatically as your cron job, and manually after (significant) changes, making sure that all visitors of your site always hit the cache.
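To check whether a page is really served from the cache, you can look at the response headers. The following sketch assumes that the Drupal 7 page cache adds an X-Drupal-Cache header with the value HIT or MISS to responses for anonymous visitors; a reverse proxy or an alternative cache backend may behave differently. The function name drupal_cache_status is hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: read HTTP response headers from standard input
# and print HIT, MISS, or UNKNOWN (when no X-Drupal-Cache header is found).
drupal_cache_status() {
  local status
  status="$(tr -d '\r' \
    | grep -i '^[[:space:]]*x-drupal-cache:' \
    | tail -n 1 \
    | sed 's/^[^:]*:[[:space:]]*//' \
    | tr '[:lower:]' '[:upper:]')"
  echo "${status:-UNKNOWN}"
}
```

With wget, the headers can be printed with the -S option, for example wget -q -S -O /dev/null http://example.com/some-page 2>&1 | drupal_cache_status. Right after the warming script has finished, this should print HIT for every page in the sitemap.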

Submitted by Tom Roelandts on 21 January 2013

Comments

Hi,
Thanks for the post. I am pulling my hair out over this. Every time cron runs I lose the compressed JavaScript file connection on the home page of my site and it wrecks the banner image rotator. Ugly.

In "Putting it all together"... is that a php script that you should point your cron tab to?
Thanks a lot,
Jim

That's almost right. It's a shell script that you can point your cron tab to. I've added the typical #!/bin/bash line to make this more clear.

Hello,
Will you please add to your article instructions on how specifically to implement this? Like thousands of other Drupal users, I am new to it. As such, I have no idea what wget, shell scripts, or other concepts are. Could you please mention what files and how exactly this can be implemented to make it work on my (and other new users) websites?
Thanks, Reid

I'm afraid that introducing basic concepts is outside of the scope of an article like this, and I assume that people that wish to implement this are familiar with system administration. The exact location of files also depends on the setup of your server, so it is difficult to give precise instructions.

Great article, Thank you!

I did have to make one tweak in order to get this working for me (my site, http://kennyfamily.us is hosted). In the shell script I had to remove the first wget and grep directly from the xml file. Now it is working perfectly and my site is so much faster. Thanks Again.

# Warm the cache.
egrep -o 'http://kennyfamily\.us[^<]+' ~/public_html/sitemap.xml | wget -q -i - -O /dev/null

Tom, first of all, thanks for your efforts and explanation.
However, I think there is no need to clear the cache before running the cron job. The cron job itself will clear pages which have passed the expiration time (minimum_cache_lifetime).

"However, this is only a partial solution, since any change whatsoever to the site (see “what I get” above) invalidates the complete cache for all pages, undoing the effect of the script. To avoid this, you must change the “Minimum cache lifetime” ... What you need to do is first forcibly clear the cache, and then rerun the warming script. So, we need a “clear cache” command that ignores the “Minimum cache lifetime”."

This is not necessary! Saving an individual node only clears the cache for that particular page! So there is no need to 'forcefully' clear the cache, and you can set the cache lifetime to 0 so individual node updates immediately clear the cache for that page.

See
The D7 node_save() code works as you would hope - resets just the page cache for a single node - although even then I don't think you're *always* going to want to clear the cache on every node_save())
https://drupal.org/node/768874#comment-6671222

If not, what makes you think that it does? (Maybe you are confusing this with Drupal 6, which showed this behavior?)

Looking forward to hearing from you
Hyper

Thanks for your detailed comment, I really appreciate that.

I know it seems almost unbelievable, but saving any node or leaving a comment on any node (even if it is just queued for moderation) really clears the entire cache! This is with “Minimum cache lifetime” set to zero, which is the default. Try it! I use the tool that is built into Google Chrome, and the time for fetching the basic html of a page really changes in exactly the same way as when you click the “Clear all caches” button…

This has been discussed at length on the page that I link to in the article. It is after reading the discussion there that I decided to get to the bottom of this thing.

Hi Tom
Finally found some time to get back to this. I really want to get to the bottom of it :) The article you link to is specifically about Drupal 6 behaviour. There, indeed, upon node save the entire cache is cleared. In Drupal 7 this is different.

In Drupal 6 this is what happened:

  1. User saves node
  2. Drupal invokes node_save. See
    https://api.drupal.org/api/drupal/modules!node!node.module/function/node...
  3. node_save invokes cache_clear_all at the end without arguments
    // Clear the page and block caches.
    cache_clear_all();
    "Expire data from the cache. If called without arguments, expirable (read: ALL in drupal 6) entries will be cleared from the cache_page and cache_block tables."
  4. All page/block cache entries have been cleared, thus resetting the ENTIRE PAGE/BLOCK cache upon NODE SAVE
  5. When system_cron runs, Drupal also always clears the ENTIRE cache (all cache tables) since it calls cache_clear_all with 1st arg NULL for all tables. It says expirable items, but probably all items in cache.
    https://api.drupal.org/api/drupal/modules!system!system.module/function/...

In Drupal 7 this happens:

  1. User saves node
  2. Drupal invokes node_save.
  3. This function does not invoke any immediate cache clearing functions. See the difference with D6 node_save! It does however, INVALIDATE the cache FOR THAT NODE.
    // Clear the static loading cache.
    entity_get_controller('node')->resetCache(array($node->nid));
    "If specified, the cache is reset for the entities with the given ids only."
  4. When the system_cron runs, Drupal clears the CACHE if one or more items have been invalidated (on node save).
    So there is a major difference. Drupal 7 will NOT clear the ENTIRE cache until cron.php runs IF items have been INVALIDATED in the cache. Only question remains: will '1 node invalidation' cause drupal 7 to clear the entire cache on cron.php, or only that 1 node? However, there is still a major difference with D6 since it does not happen on node_save.
    https://api.drupal.org/api/drupal/modules!system!system.module/function/...

So Drupal 7 is 'aware' of expiration of a certain node, and only clears the (entire?/that node entry?) cache on cron run when necessary. On node_save nothing is done to clear the cache.
In Drupal 6 all items are removed from the cache: explicitly on node_save, and also on cron run. In Drupal 7 node_save only invalidates the cache, and cron clears the cache items when saves have been done in the last X hours (X = cron interval).

In Drupal 6 functions they also talk about expiration, but it is not implemented. Everything is expired on saving a node.

Best wishes

Upon further investigation I read this:
Where in D7 does the cache get cleared on content change? I see in D6 there's an empty call to cache_clear_all(), at the end of node_save, but nothing of the sort in D7.
...
In Drupal 7 those cache clears were moved to submit handlers, mainly so that things like mass imports of tens of thousands of nodes can handle cache clearing themselves at the end instead of once per node.
...
Indeed, it happens in node_form_submit!
https://api.drupal.org/api/drupal/modules!node!node.pages.inc/function/n...
// Clear the page and block caches.
cache_clear_all();

It appears you are right after all. Glad to have sorted it out to the bottom :) It does clear the cache on saving a node, just like D6.
The question I now have: what does the call to ->resetCache() do in Drupal 7's node_save?

Thanks for your thorough research, Hyper! I think your comments are very useful for people that want to know how Drupal does this cache clearing thing in practice. I didn't delve into the code too deeply when I wrote the article, since I used the tool in Google Chrome to experimentally find out when the cache was being cleared. This also means that I've always known that I was right, even without being able to put my finger on the line of code that actually contains the cache_clear_all()... :-)

Not sure why, but when I run the script with my cron key (D7) from ssh, the command prompt is not returned. I ran it line by line and each line seems to work fine, but the last line, | wget -q -i - -O /dev/null, seems not to return the command prompt. Am I missing something?

nevermind, script just took longer to run than my patience would allow!

Nice explanation of the cache working

Maybe there is something wrong with this, or I don't understand it properly.

Of course, this produces a new problem, since nobody will see changes or comments anymore before the “Minimum cache lifetime” has passed.

What is the aim here? Do you want to clear the cache after each node save? If this is the case, then we can set “Minimum cache lifetime” to zero, let Drupal clear the cache automatically, and warm the cache up again afterwards. We can call the warmer script after a node save, but in this case we don't need the cron run and the cache clear (clear.php) to be included.

If you want to clear the cache at a given interval using cron, then the problem cited above won't be solved, because the batch file will be called by cron and not after a node save, and in this case calling the cron is unnecessary.

And if you want to run the cron tasks of Drupal, then clear.php is unnecessary, because cron.php will clear all caches.

Summarizing, there should be two batch scripts: one to be used after each node save, and the other for cron tasks.

I do not want to have to run the cache warming script after each change to the site, especially since someone posting a comment also clears the cache (even if the comment is held for moderation). I just want to make all pages stay cached for a reasonable time, and the only way to do that is to increase the “Minimum cache lifetime”. The rest of the mechanism follows from that, and the details are in the article. (Note, for example, that increasing the minimum cache lifetime implies that the cron run will no longer clear the cache, so you do need the clear.php script.)

Thanks for writing up this article!

The cache warming script was not working for me but this script did:

wget -q http://www.example.com/sitemap.xml -O - |
egrep -o "http://www\.example\.com[^<]+" |
wget -q -i - --wait 1

This should be run on one line!

Since I also have images in my sitemap.xml, I'm using
wget -q https://www.domain.com/sitemap.xml -O - |
grep -P -o "(?<=<loc>)https:\/\/www\.domain\.com[^<]+" |
wget -q -i - -O /dev/null --wait 1

thanks for this article!
