Make an Offline Mirror of a Site Using `wget`

Sometimes you want to create an offline copy of a site that you can browse even without internet access. With wget you can make such a copy easily:

wget --mirror --convert-links --adjust-extension --page-requisites \
     --no-parent http://example.org

Explanation of the various flags:

  • --mirror – Makes (among other things) the download recursive.
  • --convert-links – Converts all the links (also to stuff like CSS stylesheets) to relative ones, so the copy is suitable for offline viewing.
  • --adjust-extension – Adds suitable extensions to filenames (.html or .css) depending on their content type.
  • --page-requisites – Downloads things like CSS stylesheets and images required to properly display the page offline.
  • --no-parent – When recursing, do not ascend to the parent directory. It's useful for restricting the download to only a portion of the site (see the example after this list).
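
For example, a hypothetical run restricted to a subtree (the /docs/ path is only an illustration):

# Mirror only what lives under /docs/; --no-parent keeps wget from
# climbing up to http://example.org/ and pulling in the rest of the site.
wget --mirror --convert-links --adjust-extension --page-requisites \
     --no-parent http://example.org/docs/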

Alternatively, the full command above may be shortened:

wget -mkEpnp http://example.org

Note that the last p is part of np (--no-parent), which is why p appears twice in the flags.
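
By default wget stores the mirror in a directory named after the host, so (assuming the site's front page is index.html) you can open the local copy with something like:

# On Linux; macOS users can use `open` instead of `xdg-open`
xdg-open example.org/index.html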

22 thoughts on “Make an Offline Mirror of a Site Using `wget`”

  1. David Wolski

    wget usually doesn’t work very well for complete offline mirrors of websites. Due to its parser there is always something missing, e.g. stylesheets, scripts, or images. It simply isn’t the right tool for this task.
    HTTrack is much slower than wget but has a more powerful parser. It’s GPL and available in most Linux distributions.
    Documentation and source code are available at http://www.httrack.com

  2. bhl

    I second David Wolski’s comment. HTTrack is an outstanding website mirroring tool. I like it because it performs incremental updates. Nothing like sucking down the Washington Post without adverts.
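
    A minimal sketch of such a run (the paths and filter are only an example; re-running the same command against the same output directory is what refreshes the existing mirror):

    # First run creates the mirror; later runs with the same -O directory update it
    httrack "http://example.org/" -O "./example-mirror" "+*.example.org/*"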

  3. Gwyneth Llewelyn

    I saw the comment related to HTTrack only after reading this very useful article (and successfully copying 99% of a website written in ColdFusion, the remaining 1% being embedded JavaScript which had to be done manually; also, moving everything to HTTPS took me a minute or so!).

    Unfortunately, HTTrack made sense in 2014 (when this article was written), but it stopped being developed in 2017 (last commit on GitHub) and has 112 pending issues (a bad sign; it’s probably abandoned by now). One major issue with HTTrack is the apparent lack of support for HTML5 (or at least incomplete support for the new tags).

    wget continues to be actively developed, and, although I haven’t tried it personally (I’m mostly copying ‘legacy’ websites…), it seems to be able to deal with HTML5 tags as long as one ‘forces’ wget to identify itself as a recent version of, say, Chrome or Firefox. If it identifies itself by default, the webserver it connects to may simply think that a very old browser is trying to access the site and ‘simplify’ the HTML being passed back (i.e. ‘downgrade’ it to HTML4 or so). This, of course, is not an issue with wget per se but rather with the way webservers (and web designers!) are getting more and more clever in dealing with a vast variety of users, browsers, and platforms.
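
    A minimal sketch of that trick (the user-agent string below is just an illustration; substitute whatever a current browser reports):

    # Identify as a desktop Firefox so the server serves its full, modern markup
    wget --mirror --convert-links --adjust-extension --page-requisites --no-parent \
         --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0" \
         http://example.org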

    Also, contemporary versions of wget (which means mid-2019 by the time I’m writing this comment!) have no trouble ‘digging deep’ to extract JS and CSS files etc. Obviously it cannot work miracles and doesn’t deal with everything; I had some issues with imagemaps, for instance (something nobody uses these days), as well as with HTML generated on the fly by JavaScript. And of course there is a limit to what it can actually do with very complex and dynamic websites which adjust their content to whatever browser the user has, page by page, especially in those cases where the different versions of the same page all share the same URL (a bad practice IMHO). Still, it remains useful for a lot of situations, and the results are better than what you get out of archive.org…

    I just wanted to point this out since this article is old but still relevant for today’s wget — and sadly HTTrack was abandoned and isn’t an option any longer…
