How To Download A Blog
How to Download an Entire WordPress Blog
Sun, Apr 1, 2012 4-minute read
Sometimes, you stumble on a blog that is so chock full of information that you revel in its every word. And then you realize their archive goes back 5 years!
I've read a bunch of great posts on Nate Lawson's awesome security blog, and decided that I wanted to read it beginning to end.
If you are the owner of said WordPress blog, the solution is easy – use WordPress' built-in Export feature. There are even handy services that will turn this into a PDF, eBook, or even printed book.
If you're not the owner, things aren't so easy:
1. Sit in front of the computer
2. Go to the oldest month in the archives menu that you haven't yet visited
3. Read that page
4. Click "Next Page" until those links stop appearing
5. Go back to the home page
6. (Repeat steps 2-5 until you're done with the blog)
Oh, and if you intend to take a break at any point in time, add in a few "try to remember where you were, and find that blog post again" entries.
What I was really hoping for was:
- Open a PDF on my Kindle, and read the entire thing in chronological order, letting the Kindle software keep track of where I am.
It turns out that the difference between reality and desire is about twelve lines of PowerShell!
PowerShell's recent technology previews (and the Windows 8 consumer and developer previews) include the Invoke-WebRequest cmdlet. Think wget / curl, but with PowerShell's traditional object-based awesome-sauce. For example:
PS C:\temp> Invoke-WebRequest http://www.leeholmes.com/blog |
>>     Foreach-Object Links |
>>     Where-Object InnerText -match "August" |
>>     Foreach-Object Href
http://www.leeholmes.com/blog/2011/08/
http://www.leeholmes.com/blog/2010/08/
http://www.leeholmes.com/blog/2008/08/
http://www.leeholmes.com/blog/2007/08/
http://www.leeholmes.com/blog/2006/08/
http://www.leeholmes.com/blog/2005/08/
When you look at links to the monthly archives, they all follow the pattern:
http://www.example.com/url/<number><number><number><number>/<number><number>/
When you visit any of these pages, they have another link. The exact text depends on the blog itself – but it may be "Earlier Entries", "Next Page", or similar:
PS C:\temp> $page = Invoke-WebRequest http://www.leeholmes.com/blog/2005/06/
PS C:\temp> $page.Links | Where-Object InnerText -match "Earlier Entries" |
>>     Select-Object -First 1
>>
innerHTML : Earlier Entries ?
innerText : Earlier Entries ?
outerHTML : <A href="http://www.leeholmes.com/blog/2005/06/page/2/">Earlier Entries ?</A>
outerText : Earlier Entries ?
tagName   : A
href      : http://www.leeholmes.com/blog/2005/06/page/2/
Given that knowledge, we can automate the download of the entire blog, dumping it into an HTML file as we go. As a final step, we print this HTML to PDF, and upload it to our Kindle or other reading device.
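To see how the monthly archive pattern translates into a filter, here is a small sketch that runs without touching the network. The pattern is the one used in the script at the end of this post; the sample URLs are made up for illustration.

```powershell
## Sample hrefs are hypothetical; only the pattern comes from the post.
$archiveLinkPattern = '/\d\d\d\d/\d\d/$'
$hrefs = @(
    'http://www.example.com/blog/2011/08/',
    'http://www.example.com/blog/2010/08/',
    'http://www.example.com/blog/about/',
    'http://www.example.com/blog/2011/08/page/2/'
)

## Keep only links that end in /<four digits>/<two digits>/
$archives = @($hrefs | Where-Object { $_ -match $archiveLinkPattern })

## Assuming the blog lists its archives newest first, reversing the
## array gives chronological order
[array]::Reverse($archives)
$archives
```

The trailing `$` in the pattern matters: without it, paged URLs such as `/2011/08/page/2/` would also count as monthly archives.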
Note to purists: this HTML file is brutally malformed. It is a collection of HTML pages packed into the same file, rather than one HTML page with all the important content. It is of course possible to make this a valid HTML file by manipulating the content before writing it – there's just no need to do it if the destination is a PDF anyhow.
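If you do want something closer to a single valid page, one rough approach is to keep only what sits between the body tags of each download and wrap the result once. This is a sketch, not part of the original script: Get-BodyContent is a hypothetical helper, and a regular expression is a crude stand-in for a real HTML parser.

```powershell
## Hypothetical helper: return the text between <body> and </body>,
## or the input unchanged if no body element is found.
function Get-BodyContent {
    param([string] $Html)

    ## (?is) = ignore case, and let "." match newlines too
    if($Html -match '(?is)<body[^>]*>(.*)</body>') { $matches[1] } else { $Html }
}

## Two made-up pages standing in for downloaded content
$pages = @(
    '<html><head><title>A</title></head><body><p>one</p></body></html>',
    '<html><BODY class="x"><p>two</p></BODY></html>'
)

$bodies = $pages | Foreach-Object { Get-BodyContent $_ }
$combined = "<html><body>" + ($bodies -join "`r`n") + "</body></html>"
$combined
```

In the download script, the same helper could be applied to each page's Content before it is added to the month's export.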
And how about the time and effort? In the end, I had a PDF of the entire blog on my Kindle 20 minutes after first having thought of it.
Here's the PowerShell script that automates this all – cleaned up for your consumption, of course :)
## Things you might want to change
$blogUrl = "http://www.leeholmes.com/blog"
$archiveLinkPattern = '/\d\d\d\d/\d\d/$'
$nextPageText = "Earlier Entries"

## Get the page
$r = Invoke-WebRequest $blogUrl

## Extract the archives links. The @() wrapper keeps the result an
## array even when only one link matches.
$links = @($r.Links | Where-Object href -match $archiveLinkPattern | Foreach-Object href)

## Sort the archives in reverse order, so that the oldest month
## comes first
[array]::Reverse($links)

## Go through each archive page
foreach($link in $links)
{
    ## Create a variable to hold the HTML content for this month
    $monthExport = ""

    do
    {
        ## Get the archives for that month
        $month = Invoke-WebRequest $link

        ## Get the page content, and put it at the beginning of the
        ## monthExport variable. That's because "Earlier Entries"
        ## should be placed before the content we just got.
        $monthExport = $month.Content + " `r`n " + $monthExport

        ## Find the link to "Earlier Entries"
        $link = $month.Links | Where-Object innertext -match $nextPageText |
            Foreach-Object href | Select-Object -First 1

    ## Keep on doing this while we found an "Earlier Entries" link
    } while($link)

    ## Now that we're done with the month, put it at the end of the
    ## HTML file (since we're processing months in order)
    $monthExport >> leeholmes.html
}
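The inner loop's habit of putting each new page in front of what it already has is easier to see with fake data. The sketch below swaps the web requests for a made-up in-memory table ($fakePages, with hypothetical Content and Earlier keys), so it runs offline and shows only the ordering logic.

```powershell
## Hypothetical stand-in for one month of a blog. Each entry holds the
## page's content and the target of its "Earlier Entries" link
## ($null on the last page).
$fakePages = @{
    '/2005/06/'        = @{ Content = 'newest posts'; Earlier = '/2005/06/page/2/' }
    '/2005/06/page/2/' = @{ Content = 'middle posts'; Earlier = '/2005/06/page/3/' }
    '/2005/06/page/3/' = @{ Content = 'oldest posts'; Earlier = $null }
}

$link = '/2005/06/'
$monthExport = ""

do
{
    $page = $fakePages[$link]

    ## Later pages hold earlier posts, so each one goes in front
    $monthExport = $page.Content + "`r`n" + $monthExport
    $link = $page.Earlier
} while($link)

$monthExport
```

The pages come out oldest first, which is the order you want for reading the month from the beginning.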
Source: https://www.leeholmes.com/how-to-download-an-entire-wordpress-blog/
Posted by: lewisvengland.blogspot.com
