Unix Shell Script to Find Big Unreferenced or Unused Files on Your Website

If you maintain a website, you may want to find big unreferenced files such as images, videos, and PDF documents on the web server so that you can remove them to save disk space. This is especially true when other people maintained the site before you and you have no idea whether they forgot to delete unused files.

First, let's define what an unreferenced file is. Suppose you have a video.mp4 in your video directory, but no webpage on your website uses this file. Then video.mp4 is an unreferenced file, and you can safely remove it without breaking your website.

Let's go over a few Unix commands that can help you identify unreferenced or unused big static files, such as large images and videos, so that you can remove them and reduce disk usage.

We usually target big files for removal as they take up more disk space than small files.

List the base file names of big files recursively

Suppose $1 is the document root, $2 is the size in MB, $3 is the extension of the target files. In our examples, here are the values:

$1 = /usr/share/nginx/
$2 = 5
$3 = mp4

The Unix command to list base names of files with a specific extension above a certain file size in MB looks like the following:

find "$1" -type f -size +"$2"M -name "*.$3" -exec basename {} \; | sort | uniq

The result looks like the following:

video.mp4
video2.mp4

You can append the output to some file, say /tmp/big-files.txt. This example uses the .mp4 extension, which is a common extension for video files. In your case, you may want to run the same command with many extensions such as .jpg, .png, .pdf, .doc, and so on.

Do not include every extension, though. Leave out extensions that are used to serve your website's content, such as .html, .css, .js, and .php, just to name a few.
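To run the same command over several extensions in one go, something like the following sketch could work. The function name list_big_basenames and its argument order are made up for illustration; the find command inside is the one shown above, with the variables quoted.

```shell
# Hypothetical helper: print the unique base names of files larger than
# SIZE_MB for every extension given.
# Usage: list_big_basenames DOCROOT SIZE_MB EXT...
list_big_basenames() {
    docroot=$1
    size_mb=$2
    shift 2
    for ext in "$@"; do
        # "*.$ext" is quoted so the shell passes the pattern to find unexpanded
        find "$docroot" -type f -size +"${size_mb}"M -name "*.$ext" \
            -exec basename {} \;
    done | sort -u
}

# Example call (commented out), appending the result to the list file:
# list_big_basenames /usr/share/nginx/ 5 mp4 jpg png pdf doc >> /tmp/big-files.txt
```

Here sort -u does the same job as sort | uniq.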

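However, checking just the base file names may lead to false results: a page that references a different file with the same base name makes an unused file look referenced. So we need the next step.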

List the relative paths of big files recursively

We are only interested in the relative paths of the big files, not the absolute paths, because your webpages won't reference the absolute paths. For example, suppose one big file's absolute path is:

/usr/share/nginx/project1/video2.mp4

Since the document root is /usr/share/nginx/, chances are your webpage will reference the following text in the HTML markup in some HTML tag such as <video> and <a>:

project1/video2.mp4

Therefore, we want to know the paths relative to document root of the big files, too. The following command will list the relative paths of the big files recursively.

find "$1" -type f -size +"$2"M -name "*.$3" | sed -r "s|^$1||" | sort | uniq

The output looks like this:

project1/video2.mp4
en/project1/video.mp4

You can append the output to the same file, say /tmp/big-files.txt.
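The sed step treats the document root as a regular expression and relies on it ending with a slash. If GNU find is available (an assumption; -printf is not part of POSIX find), a sketch like the following avoids both concerns by letting find print the path relative to its starting point. The function name is hypothetical.

```shell
# Assumes GNU find. The %P directive prints each path with the starting
# point (the document root) and the slash after it removed.
# Usage: list_big_relative_paths DOCROOT SIZE_MB EXT
list_big_relative_paths() {
    find "$1" -type f -size +"$2"M -name "*.$3" -printf '%P\n' | sort -u
}

# Example call (commented out):
# list_big_relative_paths /usr/share/nginx/ 5 mp4 >> /tmp/big-files.txt
```

With this form the document root can be passed with or without the trailing slash.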

Identifying unreferenced files

Now you simply go through each entry in /tmp/big-files.txt and report if the text string does not exist in any of the files that reside in your website's document root. The grep command is particularly useful for this purpose, as follows:

while IFS= read -r f; do
    # -F matches the entry as a fixed string instead of a regular expression
    grep -RF -- "$f" "$1" > /dev/null || echo "$f"
done < /tmp/big-files.txt

In this loop, if the current entry is found in some file in the document root, grep exits with a success status, so the following command returns true:

grep -RF -- "$f" "$1" > /dev/null

Therefore this entry won't be printed. Otherwise, grep exits with a failure status, the above command returns false, and the entry is printed on the screen, which means it was not found anywhere under the document root.

The final output is a list of big files that are not referenced anywhere under your website's document root, but don't delete them yet. Go through them one by one to confirm that they are indeed unreferenced, because false results are possible depending on how your website is written.

For example, if you see the following in the output:

tutorial.mp4
en/video/tutorial.mp4
tc/video/tutorial.mp4

You can rest assured that tutorial.mp4 is not used anywhere on your website, and you can safely remove en/video/tutorial.mp4 and tc/video/tutorial.mp4.

However, if you see the following output instead:

en/video/tutorial.mp4

Then you must double-check, because the base name tutorial.mp4 is missing from the output, which means it is referenced somewhere in your document root.

Simply do a grep to see where tutorial.mp4 is referenced to determine if it can be deleted.
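As a rough illustration of that check, the sketch below wraps grep in a small function (the name find_references is made up here). It prints every file and line number under the document root that mentions the given text.

```shell
# Hypothetical helper: show where a name is mentioned under the document root.
# -R searches recursively, -n adds line numbers, and -F matches the text
# literally, so the dot in "tutorial.mp4" is not treated as a regex wildcard.
# Usage: find_references DOCROOT TEXT
find_references() {
    grep -RnF -- "$2" "$1"
}

# Example call (commented out):
# find_references /usr/share/nginx/ tutorial.mp4
```

Each output line has the form file:line-number:matching-line, which should be enough to decide whether the reference points at the file you are about to delete or at another file with the same name.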

Now you can easily write a script to include these commands. One thing you may have noticed is that the loop is case-sensitive. To make it case-insensitive, simply add the -i option to the grep command.
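One possible way to put the pieces together is sketched below. It is not the only way to arrange them. The function name find_unreferenced, the use of mktemp for the list file instead of /tmp/big-files.txt, and the trailing-slash normalization are all choices made for this example. It also assumes the document root contains no characters that are special to sed, such as | or regular expression metacharacters.

```shell
# Sketch: print big files under DOCROOT that no file under DOCROOT mentions.
# Usage: find_unreferenced DOCROOT SIZE_MB EXT...
find_unreferenced() {
    docroot=${1%/}/              # make sure there is exactly one trailing slash
    size_mb=$2
    shift 2
    list=$(mktemp) || return 1   # candidate list, kept outside the document root

    for ext in "$@"; do
        # Step 1: base names of the big files
        find "$docroot" -type f -size +"${size_mb}"M -name "*.$ext" \
            -exec basename {} \;
        # Step 2: paths of the same files relative to the document root
        find "$docroot" -type f -size +"${size_mb}"M -name "*.$ext" |
            sed "s|^$docroot||"
    done | sort -u > "$list"

    # Step 3: print every entry that grep cannot find under the document root.
    # Add -i to the grep options for a case-insensitive search.
    while IFS= read -r f; do
        grep -RqF -- "$f" "$docroot" || echo "$f"
    done < "$list"

    rm -f "$list"
}

# Example call (commented out):
# find_unreferenced /usr/share/nginx/ 5 mp4 jpg png pdf
```

The -q option makes grep stay silent and stop at the first match, which has the same effect as redirecting its output to /dev/null. The output still needs the manual review described above before anything is deleted.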

Questions? Let me know!