Shell script to move duplicate files

I set up Firefox on macOS to automatically download files to my “Downloads” folder without prompting (under Preferences... > Downloads, select Save files to... instead of Always ask you where to save files). If a file with the same name already exists, Firefox appends (1) to the filename (then (2), (3) and so on). Over time, I end up with files with similar names, which may or may not be identical. This shell script finds duplicate files and moves them to a different folder for subsequent manual deletion.

#!/bin/bash
shopt -s nullglob extglob
moveto=Duplicates
oldifs="$IFS"
IFS="
"

# create folder for duplicates if it does not exist
echo Looking for duplicates and moving them to ./$moveto
[[ ! -d "$moveto" ]] && mkdir "$moveto"

# list all filenames ending with a number, e.g. test(1).txt, test(2).txt...
# get the filename without the numeric suffix, e.g. test.txt, and save it to an array
# finally, make the array only contain unique items
files=()
for f in *\(+([[:digit:]])\)\.*; do
 e=${f/\(+([[:digit:]])\)\./.}
 [[ -e "$e" ]] && files+=("$e")
done
files=($(echo "${files[*]}" | sort -u))

echo - checking ${#files[@]} potential duplicates by filename
tput rmam
trap "tput smam" EXIT

# for each filename e.g. test.txt, look for files named test(1).txt, test(2).txt...
# if they exist, compare test.txt against each of them
# if identical, move the numbered copy to the duplicates folder
i=0
for f in "${files[@]}"; do
 for e in "${f%.*}"\(+([[:digit:]])\)\."${f##*.}"; do
  printf "\r   $e\033[0K"
  diff -q --binary "$f" "$e" > /dev/null 2>&1
  if [[ $? -eq 0 ]]; then
   mv "$e" "./$moveto"
   [[ $? -eq 0 ]] && let i=i+1
  fi 
 done
done
tput smam
printf "\r\033[0K\r"

# show the human-readable file sizes and total size of the folder
echo - moved $i files, ./$moveto folder contains:
cd "./$moveto"
du -cah *

As you can see, there is minimal error checking, so don’t use it as is! Things may go wrong if you have file names like this(2).is.bad.txt, so your mileage will vary! Also, the Duplicates folder is assumed to be empty to start with...

Some interesting points:

shopt -s nullglob makes a glob that matches nothing expand to nothing, so the first for loop is skipped entirely when no files match. Without it, the loop would run once with $f set to the literal, unexpanded pattern.
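A small sketch of the difference, run in a throwaway temporary directory. The count_matches function and the *.nomatch pattern are made up for illustration and are not part of the script above.

```bash
#!/bin/bash
# Hypothetical demo: count loop iterations for a glob that matches nothing.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

count_matches() {
  local n=0 f
  for f in *.nomatch; do
    n=$((n + 1))
  done
  echo "$n"
}

shopt -u nullglob
without=$(count_matches)   # loop runs once, with f set to the literal '*.nomatch'

shopt -s nullglob
with=$(count_matches)      # pattern expands to nothing, loop body never runs

echo "without nullglob: $without iteration(s), with nullglob: $with"
cd - > /dev/null && rm -rf "$tmp"
```

In an empty directory this reports one iteration without nullglob and none with it.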

shopt -s extglob enables extended globbing in Bash. Without this, I can only match filenames with a single digit in brackets, (1). to (9)., using for f in *\([1-9]\).*. With extended globbing, I can match one or more digits with for f in *\(+([[:digit:]])\)\.*. The alternative would be to resort to the (slower) for f in $(ls | grep -E "\([[:digit:]]+\)\."), which effectively does the same thing.
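To see which names the extended pattern picks up, here is a minimal sketch against a few sample files in a temporary directory. The file names and the matched array are assumptions for the demo only.

```bash
#!/bin/bash
# extglob has to be enabled before bash parses any line that uses +( ... )
shopt -s nullglob extglob

tmp=$(mktemp -d)
cd "$tmp" || exit 1
# Hypothetical sample files
touch 'test.txt' 'test(1).txt' 'test(12).txt' 'notes (draft).txt' 'plain(3)'

matched=()
for f in *\(+([[:digit:]])\)\.*; do
  matched+=("$f")
done

# Only names containing "(digits)." match: test(1).txt and test(12).txt
printf '%s\n' "${matched[@]}"
cd - > /dev/null && rm -rf "$tmp"
```

Neither 'notes (draft).txt' (no digits in the brackets) nor 'plain(3)' (no dot after the closing bracket) is matched.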

After I find these files, I get the base filename without the numeric postfix, i.e. I remove (n). Without extended globbing, I would have to use e=$(echo "$f" | sed -E "s/\([[:digit:]]+\)\././"), but with it I can just do e=${f/\(+([[:digit:]])\)\./.}
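The substitution can be tried on its own without touching any files. In this sketch, strip_suffix is a hypothetical wrapper around the same parameter expansion, and the sample names are invented.

```bash
#!/bin/bash
shopt -s extglob   # must come before the function is defined, since the pattern uses +( ... )

# Replace the first "(digits)." in the name with a plain "."
strip_suffix() {
  local f=$1
  echo "${f/\(+([[:digit:]])\)\./.}"
}

for name in 'test(1).txt' 'report(12).pdf' 'plain.txt'; do
  echo "$name -> $(strip_suffix "$name")"
done
```

Names without a numeric suffix, such as plain.txt, pass through unchanged.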

Now, I want to add the base filename, if it exists ([[ -e "$e" ]]), to an array. Because the array is later rebuilt from command output, the Bash internal field separator (IFS) variable needs to be a newline; using a space would break filenames with spaces.

The next problem is then removing duplicates from the array... because Bash v3 on macOS does not support associative arrays! I use a very hacky method to do this, and this is the only time I drop into a subshell, with this line files=($(echo "${files[*]}" | sort -u)).
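Here is that line in isolation, as a sketch with an invented array that contains repeats and a name with a space. It should behave the same on Bash 3.

```bash
#!/bin/bash
# Hypothetical input with repeated entries and a space in one name
files=('b file.txt' 'a.txt' 'b file.txt' 'a.txt' 'c.txt')

oldifs="$IFS"
IFS=$'\n'
# "${files[*]}" joins the items with the first character of IFS (a newline);
# the unquoted $(...) is then split on newlines only, so spaces survive.
files=($(echo "${files[*]}" | sort -u))
IFS="$oldifs"

printf '%s\n' "${files[@]}"
```

The result has three items, with 'b file.txt' kept as a single element. The unquoted expansion is still subject to globbing, so a name containing * or ? could be expanded unexpectedly, and a name containing a newline would be split; wrapping the line in set -f and set +f is one way to guard against the former.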

That explains the first for loop. In the second for loop, I use the base filenames in the array to look for all related files and then compare them properly using diff. For example, test(1).txt will be stripped to the base filename test.txt and then compared to anything that exists like test(1).txt, test(2).txt or even test(12).txt. diff sets the exit code to 0 if the files are identical.
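The exit code check can be sketched with three small files in a temporary directory. The is_identical helper and the file contents are assumptions for the demo; the script above calls diff inline instead.

```bash
#!/bin/bash
tmp=$(mktemp -d)
printf 'same\n'      > "$tmp/test.txt"
printf 'same\n'      > "$tmp/test(1).txt"
printf 'different\n' > "$tmp/test(2).txt"

# Succeeds (exit code 0) only when diff reports no differences
is_identical() {
  diff -q "$1" "$2" > /dev/null 2>&1
}

is_identical "$tmp/test.txt" "$tmp/test(1).txt"; same_status=$?
is_identical "$tmp/test.txt" "$tmp/test(2).txt"; diff_status=$?

echo "test(1).txt: exit $same_status, test(2).txt: exit $diff_status"
rm -rf "$tmp"
```

diff exits with 0 for identical files, 1 when they differ and 2 on trouble (such as a missing file), so testing for exactly 0 treats errors as "not a duplicate". cmp -s is a common alternative for a silent byte-by-byte comparison.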

Since diff takes time for many large files, I wanted to print out the filename being compared. To limit this output to only one (same) line:

  • turn off line wrap with tput rmam (so the filename is truncated to one line only) and turn it on again when done or exiting via trap "tput smam" EXIT,
  • to make each filename output on the same line, reset cursor to start of line (carriage return without new line) and delete to end of line with printf "\r $e\033[0K",
  • and finally, erase the final line with printf "\r\033[0K\r".
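The three steps above can be sketched as a standalone loop. The show_progress function, the sample names and the short sleep are stand-ins for illustration. Unlike the script, this version passes the name as a %s argument so that a % in a file name is not interpreted by printf, and it only changes the wrap setting when output goes to a terminal.

```bash
#!/bin/bash
# \r returns the cursor to column 0, ESC[0K erases from the cursor to end of line
show_progress() {
  printf '\r   %s\033[0K' "$1"
}

# Only change the terminal's wrap mode when stdout is a terminal
if [[ -t 1 ]]; then
  tput rmam 2> /dev/null            # turn off line wrap
  trap 'tput smam 2> /dev/null' EXIT   # turn it back on however the script ends
fi

for name in 'first(1).txt' 'a-much-longer-file-name(2).txt' 'last(3).txt'; do
  show_progress "$name"
  sleep 0.2   # stand-in for a slow comparison
done

printf '\r\033[0K'   # erase the progress line when done
```

Because each name overwrites the previous one and the rest of the line is erased, a short name following a long one leaves no stray characters behind.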

The last few steps are self-explanatory. If a numbered copy is the same as its base file, it is moved to a folder called Duplicates. And finally, du -cah shows the size of each file (-a) and the grand total (-c) of the Duplicates folder in “human-readable” format (-h).

Wow, that was a lot of words to describe a short script...