
GNU Wget is a powerful tool for downloading files from the web or mirroring sites. Its command line features can be daunting and aren't always obvious, but with some experimentation, reading the (fine) manual and some Googling, you can get it to do some pretty neat tricks for you.
And it all works from the command line, which is great if you want to schedule this kind of magic or use it in a script.
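On Windows, for example, you could drop the wget command shown below into a batch file and schedule it with schtasks. A rough sketch, where the task name, schedule and batch file path are just made-up examples:

rem Warm the cache every morning at 06:00.
rem C:\scripts\warm-cache.bat is assumed to contain the wget command below.
schtasks /create /tn "WarmBlogCache" /sc daily /st 06:00 /tr "C:\scripts\warm-cache.bat"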
For example, you might want to warm up your site or WordPress blog so your homepage and all posts linked from it are already in your cache when a visitor arrives. I'm assuming you're using caching on your site, otherwise this is pretty pointless. For WordPress you can use a caching plugin like W3 Total Cache, for example.
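If all you wanted was to warm a single page, a plain download that throws away the result would already do it. Something like this (NUL is the Windows null device; use /dev/null elsewhere):

wget -q -O NUL http://n3wjack.net/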
With Wget, warming the homepage and everything linked from it goes like this:
wget.exe http://n3wjack.net --spider --no-directories --level=1 --recursive --accept-regex=n3wjack.net/20[1-9].*
The command line parameters (in order) mean something like:
- Crawl n3wjack.net.
- Crawl it like a spider (follow the links).
- Don’t create directories for downloads.
- Crawl 1 level deep (so anything linked on the homepage is OK, but don’t go deeper).
- Do this recursively (so it actually goes 1 level deep).
- Follow only links that start with "/201..." up to "/209..." (it's a regular expression).
That last one is a trick to make it follow only links to blog posts, because my URL scheme begins with the year of the post (2015, 2016, …). It's good until 2099, which should do the trick, I guess. :)
This way it also skips all the tag, category and page links.
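If you want to sanity-check the pattern before unleashing the crawl, you can test it against a sample URL. Wget's --accept-regex uses POSIX regular expressions by default, which for a simple pattern like this behaves the same as grep (from Git Bash or WSL, say). The sample URLs here are made up:

echo "http://n3wjack.net/2016/01/some-post/" | grep -E "n3wjack.net/20[1-9]"
# prints the URL, so post links get accepted
echo "http://n3wjack.net/tag/wget/" | grep -E "n3wjack.net/20[1-9]"
# prints nothing, so tag links get skipped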
If your site has a different URL scheme, you'll have to change the accept-regex pattern to match it.
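Say your posts all live under a /blog/ prefix instead (a hypothetical scheme), then the pattern changes accordingly:

wget.exe http://example.com --spider --no-directories --level=1 --recursive --accept-regex=example.com/blog/.*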
You can download Wget from the GNU site. It's open source and available for Windows, Mac and various Unix systems.
For Chocolatey users, there is a wget package available to install it on your system.
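Installing it that way is a one-liner:

choco install wget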