How To Download A Blog

How to Download an Entire WordPress Blog

Sun, Apr 1, 2012 4-minute read

Sometimes, you stumble on a blog that is so chock-full of information that you revel in its every word. And then you realize their archive goes back five years!

I've read a bunch of great posts on Nate Lawson's awesome security blog, and decided that I wanted to read it beginning to end.

If you are the owner of said WordPress blog, the solution is easy – use WordPress' built-in Export feature. There are even handy services that will turn this into a PDF, eBook, or even printed book.

If you're not the owner, things aren't so easy:

1. Sit in front of the computer
2. Go to the oldest month in the archives menu that you haven't yet visited
3. Read that page
4. Click "Next Page" until those links stop appearing
5. Go back to the home page
6. (Repeat steps 2-5 until you're done with the blog)

Oh, and if you intend to take a break at any point in time, add in a few "try to remember where you were, and find that blog post again" entries.

What I was really hoping for was:

Open a PDF on my Kindle, and read the entire thing in chronological order, letting the Kindle software keep track of where I am.

It turns out that the difference between reality and desire is about twelve lines of PowerShell!

PowerShell's recent technology previews (and the Windows 8 consumer and developer previews) include the Invoke-WebRequest cmdlet. Think wget / curl, but with PowerShell's traditional object-based awesome-sauce. For example:

    PS C:\temp> Invoke-WebRequest http://www.leeholmes.com/blog |
    >>     Foreach-Object Links |
    >>     Where-Object InnerText -match "August" |
    >>     Foreach-Object Href

    http://www.leeholmes.com/blog/2011/08/
    http://www.leeholmes.com/blog/2010/08/
    http://www.leeholmes.com/blog/2008/08/
    http://www.leeholmes.com/blog/2007/08/
    http://www.leeholmes.com/blog/2006/08/
    http://www.leeholmes.com/blog/2005/08/

When you look at links to the monthly archives, they all follow the pattern:

    http://www.example.com/url/<number><number><number><number>/<number><number>/
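As a quick offline check of that pattern, the sketch below runs a regular expression over a handful of made-up links. The example.com URLs are hypothetical and exist only for illustration; the expression is the same one the script later in this post uses to spot monthly archive links.

```powershell
## The same pattern the final script uses: /<4 digits>/<2 digits>/ at the end
$archiveLinkPattern = '/\d\d\d\d/\d\d/$'

## Hypothetical sample links (not taken from any real blog)
$sampleLinks = @(
    'http://www.example.com/blog/2011/08/',
    'http://www.example.com/blog/2011/08/page/2/',
    'http://www.example.com/blog/about/',
    'http://www.example.com/blog/2005/06/'
)

## Keep only the links that look like monthly archives
$archiveLinks = @($sampleLinks | Where-Object { $_ -match $archiveLinkPattern })
$archiveLinks
```

Only the two links that end in a year and a month survive the filter. The trailing `$` in the pattern is what keeps paged URLs such as `.../2011/08/page/2/` out of the list.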

When you visit any of these pages, they have another link. The exact text depends on the blog itself – but it may be "Earlier Entries", "Next Page", or similar:

    PS C:\temp> $page = Invoke-WebRequest http://www.leeholmes.com/blog/2005/06/
    PS C:\temp> $page.Links | Where-Object InnerText -match "Earlier Entries" |
    >>     Select-Object -First 1
    >>

    innerHTML : Earlier Entries ?
    innerText : Earlier Entries ?
    outerHTML : <A href="http://www.leeholmes.com/blog/2005/06/page/2/">Earlier Entries ?</A>
    outerText : Earlier Entries ?
    tagName   : A
    href      : http://www.leeholmes.com/blog/2005/06/page/2/

Given that knowledge, we can automate the download of the entire blog, dumping it into an HTML file as we go. As a final step, we print this HTML to PDF, and upload it to our Kindle or other reading device.

Note to purists: this HTML file is brutally malformed. It is a collection of HTML pages packed into the same file, rather than one HTML page with all the important content. It is of course possible to make this a valid HTML file by manipulating the content before writing it – there's just no need to do it if the destination is a PDF anyhow.
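For the purists, here is a minimal sketch of what that manipulation might look like. It is not part of the original script: `Get-BodyContent` is a hypothetical helper, the two sample pages are made up, and pulling the body out with a regular expression is a crude stand-in for a real HTML parser.

```powershell
## Hypothetical helper: return only what sits between <body> and </body>.
## Falls back to the raw text if no body element is found.
function Get-BodyContent([string] $Html)
{
    if($Html -match '(?is)<body[^>]*>(.*)</body>')
    {
        $Matches[1].Trim()
    }
    else
    {
        $Html
    }
}

## Made-up stand-ins for the page content the script would download
$pages = @(
    '<html><head><title>Page 1</title></head><body><p>First post</p></body></html>',
    '<html><head><title>Page 2</title></head><body class="archive"><p>Second post</p></body></html>'
)

## Strip each page down to its body, then wrap everything in one document
$bodies = @($pages | ForEach-Object { Get-BodyContent $_ })
$combined = "<html><head><title>Blog export</title></head><body>`r`n" +
    ($bodies -join "`r`n") +
    "`r`n</body></html>"
$combined
```

The result has a single `<html>` and `<body>` wrapper around the content of every page. Stylesheets and scripts from the original `<head>` elements are dropped, which may or may not matter for a PDF.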

And how about the time and effort? In the end, I had a PDF of the entire blog on my Kindle 20 minutes after first having thought of it.

Here's the PowerShell script that automates this all – cleaned up for your consumption, of course :)

    ## Things you might want to change
    $blogUrl = "http://www.leeholmes.com/blog"
    $archiveLinkPattern = '/\d\d\d\d/\d\d/$'
    $nextPageText = "Earlier Entries"

    ## Get the page
    $r = Invoke-WebRequest $blogUrl

    ## Extract the archives links
    $links = $r.Links |
        Where-Object href -match $archiveLinkPattern |
        Foreach-Object href

    ## Sort the archives in reverse order
    $links = $links[($links.Count - 1)..0]

    ## Go through each archive page
    foreach($link in $links) {

        ## Create a variable to hold the HTML content for this month
        $monthExport = ""

        do
        {
            ## Get the archives for that month
            $month = Invoke-WebRequest $link

            ## Get the page content, and put it at the beginning of the
            ## monthExport variable. That's because "Earlier Entries"
            ## should be placed before the content we just got.
            $monthExport = $month.Content + "`r`n" + $monthExport

            ## Find the link to "Earlier Entries"
            $link = $month.Links |
                Where-Object innertext -match $nextPageText |
                Foreach-Object href |
                Select-Object -First 1

            ## Keep on doing this while we found an "Earlier Entries" link
        }
        while($link)

        ## Now that we're done with the month, put it at the end of the
        ## HTML file (since we're processing months in order)
        $monthExport >> leeholmes.html
    }
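The ordering logic is the only subtle part: pages within a month are prepended, while finished months are appended. The offline sketch below imitates that with made-up data instead of web requests. The `$site` table and its page names are hypothetical stand-ins, assuming a blog that serves the newest page of each month first.

```powershell
## Hypothetical stand-in for the blog: each month maps to its pages,
## listed the way the site would serve them (newest page first).
$site = [ordered] @{
    '2005/06' = @('June page 1 (newest)', 'June page 2 (oldest)')
    '2005/07' = @('July page 1 (newest)', 'July page 2 (oldest)')
}

## Months are visited oldest first, and each finished month is
## emitted after the ones before it (like >> in the script)
$export = foreach($month in $site.Keys)
{
    $monthExport = @()

    foreach($page in $site[$month])
    {
        ## Prepend each page, so that earlier pages end up first
        $monthExport = @($page) + $monthExport
    }

    $monthExport
}

$export
```

With this data, the export lists June's older page, then June's newer page, then July's pages in the same oldest-first order.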