How To Download All Non-HTML Files From A Website Using Wget?

Published September 26, 2024

Problem: Downloading Non-HTML Files from Websites

Downloading many non-HTML files from a website can take a long time when done by hand. This task becomes harder when working with many files or big websites. An automated method is needed to speed up this process and save time.

The Wget Solution: Customizing Your Download

Introducing Wget

Wget is a command-line tool for downloading files from the web. It supports the HTTP, HTTPS, and FTP protocols, making it useful for many download tasks. Wget can fetch single files, download directories recursively, or mirror complete websites.

Tip: Wget for Beginners

If you're new to Wget, start with simple commands. For example, to download a single file, use:

wget https://example.com/file.zip

This command downloads 'file.zip' from the specified URL to your current directory.

The Command Structure

The basic Wget command format is:

wget [options] [URL]

Key options and parameters include:

  • -A: Sets the file types (suffixes or patterns) to accept
  • -m: Mirrors the website (recursive download with timestamping)
  • -np: Stops Wget from ascending to parent directories
  • -p: Downloads the elements needed to display a page
  • -E: Adds the .html extension to HTML files that lack it
  • -k: Converts links for local viewing
  • -K: Keeps a backup of the original file before converting links

These options let you tailor your download process to focus on specific file types and control how Wget works with the website structure.
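Each short flag above also has a long form, which can make scripts easier to read. The sketch below prints the mapping as a quick reference; the long_option function is a hypothetical helper written for this article, not part of Wget.

```shell
#!/bin/sh
# Map each short Wget flag used in this article to its long-form name.
long_option() {
  case "$1" in
    -A)  printf '%s\n' "--accept" ;;
    -m)  printf '%s\n' "--mirror" ;;
    -np) printf '%s\n' "--no-parent" ;;
    -p)  printf '%s\n' "--page-requisites" ;;
    -E)  printf '%s\n' "--adjust-extension" ;;
    -k)  printf '%s\n' "--convert-links" ;;
    -K)  printf '%s\n' "--backup-converted" ;;
    *)   printf 'unknown option: %s\n' "$1" >&2; return 1 ;;
  esac
}

# Print the mapping as a two-column reference table.
for flag in -A -m -np -p -E -k -K; do
  printf '%-4s %s\n' "$flag" "$(long_option "$flag")"
done
```

Short and long forms can be mixed freely in one command, so you can spell out only the flags you find hard to remember.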

Crafting the Perfect Wget Command

Options for Non-HTML File Download

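To download non-HTML files from a website using Wget, you need to combine a few options. The -A option takes a comma-separated list of file name suffixes or patterns to accept. For example, -A pdf,jpg,png tells Wget to keep only PDF, JPG, and PNG files. The -m option mirrors the website: it turns on recursive downloading and recreates the site's directory structure locally. The -np option prevents Wget from ascending to parent directories, keeping your download focused on the specified directory and its subdirectories. During a recursive download Wget still has to fetch HTML pages to discover links; pages that do not match the accept list are deleted once they have been scanned.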

Example: Downloading specific file types

wget -m -np -A pdf,jpg,png http://example.com/files/

This command will download all PDF, JPG, and PNG files that Wget finds by following links within the specified directory and its subdirectories on example.com.
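To preview which file names an accept list would keep, the matching can be roughly emulated in the shell. This is a simplified sketch, not Wget's own code: it assumes each list entry is a plain suffix with no wildcard characters and that matching is case-sensitive. The accepts function and the sample file names are made up for illustration.

```shell
#!/bin/sh
# Succeed if the file name ($1) ends with one of the comma-separated
# suffixes in $2. Plain, case-sensitive suffix matching is assumed.
accepts() {
  name=$1
  old_ifs=$IFS
  IFS=,
  for suffix in $2; do
    case "$name" in
      *"$suffix") IFS=$old_ifs; return 0 ;;
    esac
  done
  IFS=$old_ifs
  return 1
}

# Check a few sample names against an accept list.
for f in report.pdf photo.JPG index.html archive.tar.gz; do
  if accepts "$f" "pdf,jpg,png,gz"; then
    echo "keep  $f"
  else
    echo "skip  $f"
  fi
done
```

In this sketch photo.JPG is skipped because the suffix jpg is lowercase. If a site mixes upper- and lowercase extensions, list both forms or look at Wget's --ignore-case option.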

Improving the Command

For a more complete download, add the -p option. This downloads all elements needed to display the web pages properly, such as images and stylesheets. The -E option appends the .html extension to downloaded HTML pages whose names lack it, so they open correctly in a browser. Use -k to convert links for local viewing, making the downloaded pages usable offline. The -K option keeps a backup of each original file, with a .orig suffix, before its links are converted.

By combining these options, you can create a Wget command that keeps the files you want inside the website's directory structure. Note that -p, -E, -k, and -K mainly affect HTML pages, so they matter most when html is also on your accept list.

Tip: Using a user agent string

Add the --user-agent option to your Wget command to avoid being blocked by servers that restrict access to web crawlers. For example:

wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36" [other options] http://example.com/

This makes your Wget request appear as if it's coming from a regular web browser.

Executing the Wget Command

The Complete Command

To download non-HTML files from a website using Wget, use this command:

wget -m -np -p -E -k -K -A pdf,jpg,png,gif,mp3,zip http://example.com/files/

Here's what each part of the command does:

  • -m: Mirrors the website, creating a local copy of the directory structure.
  • -np: Stops Wget from going to parent directories.
  • -p: Downloads files needed to display web pages correctly.
  • -E: Adds the .html extension to downloaded HTML pages that lack it.
  • -k: Converts links for local viewing.
  • -K: Makes a backup of the original file before converting links.
  • -A pdf,jpg,png,gif,mp3,zip: Sets the file types to download.
  • http://example.com/files/: The URL of the website directory to download from.
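If you run this command often, it can help to keep the URL and the accept list in variables and print the final command for review before running it. The following is a minimal sketch; the mirror_command function and the variable names are hypothetical, and the values simply repeat the example above.

```shell
#!/bin/sh
# Illustrative values taken from the example command in this article.
URL="http://example.com/files/"
ACCEPT="pdf,jpg,png,gif,mp3,zip"

# Print the Wget command instead of running it (a dry run).
# $1 is the URL, $2 is the comma-separated accept list.
mirror_command() {
  printf 'wget -m -np -p -E -k -K -A %s %s\n' "$2" "$1"
}

mirror_command "$URL" "$ACCEPT"
```

Once the printed line looks right, copy it into your terminal to start the download.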

Customizing File Types

You can change the -A option in the Wget command to pick different file extensions. Here are some common non-HTML file types you might want to include:

  • Documents: pdf, doc, docx, txt, rtf, odt
  • Images: jpg, jpeg, png, gif, bmp, tiff
  • Audio: mp3, wav, ogg, flac
  • Video: mp4, avi, mkv, mov
  • Archives: zip, rar, 7z, tar, gz

To download these file types, change the -A option like this:

wget -m -np -p -E -k -K -A pdf,doc,docx,txt,jpg,png,mp3,mp4,zip http://example.com/files/

You can add or remove file extensions from this list to match your needs. Separate each file extension with a comma, without spaces.
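Typing a long comma-separated list by hand invites stray spaces. One way to avoid them is to build the list from separate words, as in this small sketch. The join_extensions function is a hypothetical helper; stripping a leading dot is only a convenience that matches the style used in this article.

```shell
#!/bin/sh
# Join the given extensions into a comma-separated list with no spaces.
join_extensions() {
  list=""
  for ext in "$@"; do
    ext=${ext#.}                  # drop a leading dot, if present
    list="${list:+$list,}$ext"    # add a comma only between entries
  done
  printf '%s\n' "$list"
}

# Build an accept list and show the command that would use it.
ACCEPT=$(join_extensions pdf doc docx txt jpg png mp3 mp4 zip)
echo "wget -m -np -p -E -k -K -A $ACCEPT http://example.com/files/"
```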

Tip: Excluding file types

To exclude specific file types, use the -R option followed by the file extensions you want to skip. For example:

wget -m -np -p -E -k -K -R html,php,asp http://example.com/files/

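This command will keep all files except those with html, php, or asp extensions. Wget may still fetch rejected HTML pages temporarily in order to follow their links, and then delete them.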

Tip: Setting download limits

You can cap the download speed and the total amount of data downloaded using Wget options. For example:

wget -m -np -p -E -k -K -A pdf,jpg,png --limit-rate=200k --quota=500m http://example.com/files/

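This command limits the download speed to 200 KB/s and sets a total download quota of 500 MB. Wget checks the quota between files, so it finishes the file in progress and the total can end up slightly above the limit.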
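To get a feel for these limits, the arithmetic below estimates the minimum time needed to use up the quota at the capped rate. It is a rough sketch that assumes 1 MB equals 1024 KB and that the transfer runs at the limit the whole time; the variable names are illustrative.

```shell
#!/bin/sh
RATE_KB=200     # value given to --limit-rate, in KB per second
QUOTA_MB=500    # value given to --quota, in MB

# Convert the quota to KB, then divide by the rate to get seconds.
seconds=$(( QUOTA_MB * 1024 / RATE_KB ))
minutes=$(( seconds / 60 ))

printf 'Reaching %s MB at %s KB/s takes at least %s seconds (about %s minutes).\n' \
  "$QUOTA_MB" "$RATE_KB" "$seconds" "$minutes"
```

With these values the estimate is 2560 seconds, or roughly 42 minutes. Real downloads usually take longer because of connection setup and server delays.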