De-duplicating Files with jdupes!
Part of my day job involves looking at Nagios and checking up on systems that are filling their disks. I was looking at a system with a lot of large files, which are often duplicated, and I thought this would be less of an issue with de-duplication. There are filesystems that support de-duplication, but I recalled the fdupes command, a tool that “finds duplicate files” … if it can find duplicate files, could it perhaps hard-link the duplicates? The short answer is no.
But there is a fork of fdupes called jdupes, which supports de-duplication! I had to try it out.
It turns out your average Hadoop release ships with a healthy number of duplicate files, so I use that as a test corpus.
> du -hs hadoop-3.3.4
1.4G hadoop-3.3.4
> du -s hadoop-3.3.4
1413144 hadoop-3.3.4
> find hadoop-3.3.4 -type f | wc -l
22565
22,565 files in 1.4G, okay. What does jdupes think?
> jdupes -r hadoop-3.3.4 | head
Scanning: 22561 files, 2616 items (in 1 specified)
hadoop-3.3.4/NOTICE.txt
hadoop-3.3.4/share/hadoop/yarn/webapps/ui2/WEB-INF/classes/META-INF/NOTICE.txt
hadoop-3.3.4/libexec/hdfs-config.cmd
hadoop-3.3.4/libexec/mapred-config.cmd
hadoop-3.3.4/LICENSE.txt
hadoop-3.3.4/share/hadoop/yarn/webapps/ui2/WEB-INF/classes/META-INF/LICENSE.txt
hadoop-3.3.4/share/hadoop/common/lib/commons-net-3.6.jar
There are some duplicate files. Let’s take a look.
> diff hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
> ls -l hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
-rwxr-xr-x 1 djh djh 1640 Jul 29 2022 hadoop-3.3.4/libexec/hdfs-config.cmd
-rwxr-xr-x 1 djh djh 1640 Jul 29 2022 hadoop-3.3.4/libexec/mapred-config.cmd
Look identical to me, yes.
> jdupes -r -m hadoop-3.3.4
Scanning: 22561 files, 2616 items (in 1 specified)
2859 duplicate files (in 167 sets), occupying 52 MB
Here, jdupes says it can consolidate the duplicate files and save 52 MB. That is not huge, but I am just testing.
> jdupes -r -L hadoop-3.3.4|head
Scanning: 22561 files, 2616 items (in 1 specified)
[SRC] hadoop-3.3.4/NOTICE.txt
----> hadoop-3.3.4/share/hadoop/yarn/webapps/ui2/WEB-INF/classes/META-INF/NOTICE.txt
[SRC] hadoop-3.3.4/libexec/hdfs-config.cmd
----> hadoop-3.3.4/libexec/mapred-config.cmd
[SRC] hadoop-3.3.4/LICENSE.txt
----> hadoop-3.3.4/share/hadoop/yarn/webapps/ui2/WEB-INF/classes/META-INF/LICENSE.txt
[SRC] hadoop-3.3.4/share/hadoop/common/lib/commons-net-3.6.jar
How about them duplicate files?
> diff hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
> ls -l hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
-rwxr-xr-x 2 djh djh 1640 Jul 29 2022 hadoop-3.3.4/libexec/hdfs-config.cmd
-rwxr-xr-x 2 djh djh 1640 Jul 29 2022 hadoop-3.3.4/libexec/mapred-config.cmd
In the ls output, the “2” in the second column indicates the number of hard links to a file. Before we ran jdupes, each file only linked to itself. After, these two files link to the same spot on disk.
> du -s hadoop-3.3.4
1388980 hadoop-3.3.4
> find hadoop-3.3.4 -type f | wc -l
22566
The directory uses slightly less space, but the file count is the same!
But, be careful!
If you have a filesystem that de-duplicates data, that’s great. If you change the contents of a de-duplicated file, the filesystem will store the new data for the changed file and the old data for the unchanged file. If you de-duplicate with hard links and you edit a deduplicated file, you edit all the files that link to that location on disk. For example:
> ls -l hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
-rwxr-xr-x 2 djh djh 1640 Jul 29 2022 hadoop-3.3.4/libexec/hdfs-config.cmd
-rwxr-xr-x 2 djh djh 1640 Jul 29 2022 hadoop-3.3.4/libexec/mapred-config.cmd
> echo foo >> hadoop-3.3.4/libexec/hdfs-config.cmd
> ls -l hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
-rwxr-xr-x 2 djh djh 1644 Mar 23 16:16 hadoop-3.3.4/libexec/hdfs-config.cmd
-rwxr-xr-x 2 djh djh 1644 Mar 23 16:16 hadoop-3.3.4/libexec/mapred-config.cmd
Both files are now 4 bytes longer! Maybe this is desired, but in plenty of cases, this could be a problem.
Of course, the nature of how you “edit” a file is very important. A file copy utility might replace the files, or it may re-write them in place. You need to experiment and check your documentation. Here is an experiment.
> ls -l hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
-rwxr-xr-x 2 djh djh 1640 Mar 23 16:19 hadoop-3.3.4/libexec/hdfs-config.cmd
-rwxr-xr-x 2 djh djh 1640 Mar 23 16:19 hadoop-3.3.4/libexec/mapred-config.cmd
> cp hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
cp: 'hadoop-3.3.4/libexec/hdfs-config.cmd' and 'hadoop-3.3.4/libexec/mapred-config.cmd' are the same file
The cp command is not having it. What if we replace one of the files?
> cp hadoop-3.3.4/libexec/mapred-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd.orig
> echo foo >> hadoop-3.3.4/libexec/hdfs-config.cmd
> ls -l hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
-rwxr-xr-x 2 djh djh 1644 Mar 23 16:19 hadoop-3.3.4/libexec/hdfs-config.cmd
-rwxr-xr-x 2 djh djh 1644 Mar 23 16:19 hadoop-3.3.4/libexec/mapred-config.cmd
> cp hadoop-3.3.4/libexec/mapred-config.cmd.orig hadoop-3.3.4/libexec/mapred-config.cmd
> ls -l hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
-rwxr-xr-x 2 djh djh 1640 Mar 23 16:19 hadoop-3.3.4/libexec/hdfs-config.cmd
-rwxr-xr-x 2 djh djh 1640 Mar 23 16:19 hadoop-3.3.4/libexec/mapred-config.cmd
When I run the cp command to replace one file, it replaces both files.
Back at work, I found I could save a lot of disk space on the system in question with jdupes -L, but I am also wary of unintended consequences of linking files together. If we pursue this strategy in the future, it will be with considerable caution.