dannyman.toldme.com


FreeBSD, Linux, Mac OS X, Technical

De-duplicating Files with jdupes!

Link: https://dannyman.toldme.com/2023/03/23/de-duplicating-files-with-jdupes/

Part of my day job involves looking at Nagios and checking up on systems that are filling their disks. I was looking at a system with a lot of large files, which are often duplicated, and I thought this would be less of an issue with de-duplication. There are filesystems that support de-duplication, but I recalled the fdupes command, a tool that “finds duplicate files” … if it can find duplicate files, could it perhaps hard-link the duplicates? The short answer is no.

But there is a fork of fdupes called jdupes, which supports de-duplication! I had to try it out.

It turns out your average Hadoop release ships with a healthy number of duplicate files, so I use that as a test corpus.

> du -hs hadoop-3.3.4
1.4G hadoop-3.3.4
> du -s hadoop-3.3.4
1413144 hadoop-3.3.4
> find hadoop-3.3.4 -type f | wc -l
22565

22,565 files in 1.4G, okay. What does jdupes think?

> jdupes -r hadoop-3.3.4 | head
Scanning: 22561 files, 2616 items (in 1 specified)
hadoop-3.3.4/NOTICE.txt
hadoop-3.3.4/share/hadoop/yarn/webapps/ui2/WEB-INF/classes/META-INF/NOTICE.txt

hadoop-3.3.4/libexec/hdfs-config.cmd
hadoop-3.3.4/libexec/mapred-config.cmd

hadoop-3.3.4/LICENSE.txt
hadoop-3.3.4/share/hadoop/yarn/webapps/ui2/WEB-INF/classes/META-INF/LICENSE.txt

hadoop-3.3.4/share/hadoop/common/lib/commons-net-3.6.jar

There are some duplicate files. Let’s take a look.

> diff hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
> ls -l hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
-rwxr-xr-x 1 djh djh 1640 Jul 29 2022 hadoop-3.3.4/libexec/hdfs-config.cmd
-rwxr-xr-x 1 djh djh 1640 Jul 29 2022 hadoop-3.3.4/libexec/mapred-config.cmd

They look identical to me, yes.

> jdupes -r -m hadoop-3.3.4
Scanning: 22561 files, 2616 items (in 1 specified)
2859 duplicate files (in 167 sets), occupying 52 MB

Here, jdupes says it can consolidate the duplicate files and save 52 MB. That is not huge, but I am just testing.

> jdupes -r -L hadoop-3.3.4|head
Scanning: 22561 files, 2616 items (in 1 specified)
[SRC] hadoop-3.3.4/NOTICE.txt
----> hadoop-3.3.4/share/hadoop/yarn/webapps/ui2/WEB-INF/classes/META-INF/NOTICE.txt

[SRC] hadoop-3.3.4/libexec/hdfs-config.cmd
----> hadoop-3.3.4/libexec/mapred-config.cmd

[SRC] hadoop-3.3.4/LICENSE.txt
----> hadoop-3.3.4/share/hadoop/yarn/webapps/ui2/WEB-INF/classes/META-INF/LICENSE.txt

[SRC] hadoop-3.3.4/share/hadoop/common/lib/commons-net-3.6.jar

How about them duplicate files?

> diff hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
> ls -l hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
-rwxr-xr-x 2 djh djh 1640 Jul 29 2022 hadoop-3.3.4/libexec/hdfs-config.cmd
-rwxr-xr-x 2 djh djh 1640 Jul 29 2022 hadoop-3.3.4/libexec/mapred-config.cmd

In the ls output, the “2” in the second column indicates the number of hard links to a file. Before we ran jdupes, each file only linked to itself. After, these two files link to the same spot on disk.
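
Another way to confirm that two paths now share the same data is ls -i, which prints the inode number in the first column; after linking, both paths should report the same inode (your numbers will vary, so I omit the output):

> ls -i hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd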

> du -s hadoop-3.3.4
1388980 hadoop-3.3.4
> find hadoop-3.3.4 -type f | wc -l
22566

The directory uses slightly less space, but the file count is the same!

But, be careful!

If you have a filesystem that de-duplicates data, that’s great. If you change the contents of a de-duplicated file, the filesystem will store the new data for the changed file and the old data for the unchanged file. If you de-duplicate with hard links and you edit a deduplicated file, you edit all the files that link to that location on disk. For example:

> ls -l hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
-rwxr-xr-x 2 djh djh 1640 Jul 29 2022 hadoop-3.3.4/libexec/hdfs-config.cmd
-rwxr-xr-x 2 djh djh 1640 Jul 29 2022 hadoop-3.3.4/libexec/mapred-config.cmd
> echo foo >> hadoop-3.3.4/libexec/hdfs-config.cmd
> ls -l hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
-rwxr-xr-x 2 djh djh 1644 Mar 23 16:16 hadoop-3.3.4/libexec/hdfs-config.cmd
-rwxr-xr-x 2 djh djh 1644 Mar 23 16:16 hadoop-3.3.4/libexec/mapred-config.cmd

Both files are now 4 bytes longer! Maybe this is desired, but in plenty of cases, this could be a problem.

Of course, the nature of how you “edit” a file is very important. A file copy utility might replace the files, or it may re-write them in place. You need to experiment and check your documentation. Here is an experiment.

> ls -l hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
-rwxr-xr-x 2 djh djh 1640 Mar 23 16:19 hadoop-3.3.4/libexec/hdfs-config.cmd
-rwxr-xr-x 2 djh djh 1640 Mar 23 16:19 hadoop-3.3.4/libexec/mapred-config.cmd
> cp hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
cp: 'hadoop-3.3.4/libexec/hdfs-config.cmd' and 'hadoop-3.3.4/libexec/mapred-config.cmd' are the same file

The cp command is not having it. What if we replace one of the files?

> cp hadoop-3.3.4/libexec/mapred-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd.orig
> echo foo >> hadoop-3.3.4/libexec/hdfs-config.cmd
> ls -l hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
-rwxr-xr-x 2 djh djh 1644 Mar 23 16:19 hadoop-3.3.4/libexec/hdfs-config.cmd
-rwxr-xr-x 2 djh djh 1644 Mar 23 16:19 hadoop-3.3.4/libexec/mapred-config.cmd
> cp hadoop-3.3.4/libexec/mapred-config.cmd.orig hadoop-3.3.4/libexec/mapred-config.cmd
> ls -l hadoop-3.3.4/libexec/hdfs-config.cmd hadoop-3.3.4/libexec/mapred-config.cmd
-rwxr-xr-x 2 djh djh 1640 Mar 23 16:19 hadoop-3.3.4/libexec/hdfs-config.cmd
-rwxr-xr-x 2 djh djh 1640 Mar 23 16:19 hadoop-3.3.4/libexec/mapred-config.cmd

When I use cp to overwrite one of the linked files, both files change: by default cp writes over the destination in place rather than replacing it, so the hard link stays intact and both names get the new contents.

Back at work, I found I could save a lot of disk space on the system in question with jdupes -L, but I am also wary of unintended consequences of linking files together. If we pursue this strategy in the future, it will be with considerable caution.
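
One mitigating habit: it is easy to audit which files ended up hard-linked after the fact. This is just a generic find invocation, not anything jdupes-specific:

> find hadoop-3.3.4 -type f -links +1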



Python, Technical, Technology

Adapting Home Climate Control to Climate Change: the “AQI-o-Stat”

Link: https://dannyman.toldme.com/2021/09/07/aqi-o-stat/

We bought our home in Northern California in 2012, which was great timing because that was about the last time after the mortgage crisis that we could reasonably afford a home, at a mere $605,000. At that time, the home had a floor/wall furnace from 1949 that had a hole that made it a carbon monoxide risk. We upgraded to central heating shortly after. Guys came out and ran ducts all over the attic and hooked them all up to an efficient gas furnace with an air filter. Topped it all off with a shiny Nest thermostat. It gets chilly out here on winter nights, and it used to be only a few days in the summer that anyone needed air conditioning, at which point you go to the office during the week or to the mall on the weekend.

In 2016 I added an air conditioner to the system. The local contractors seemed not quite comfortable with heat pumps, and the furnace was new, and we only run the air conditioner, well, now maybe a total of a few weeks each summer. A major construction project across the street involved asbestos mitigation, and we were having a baby, so the ability to shut the windows on bad days had some appeal. (I later gifted our old box unit AC to another expectant couple who had concerns with construction dust.)

Most of the time, we enjoy having windows open, day and night. Most of the time, our climate is blessedly mild—most of the time. The past few years have had a lot more smoke from all the fires in California. 2020 had an apocalyptic vibe when the plague was joined by a daytime sky turned orange. Shut the windows, run the AC, praise the air filter in the HVAC. For the Pandemic, I also set the air to circulate 15 minutes every hour during off-peak energy hours. (We’re on a Time-of-Use plan.) The idea is that if we had COVID-19 in our air, we would filter some of it out and help improve our odds.

This year has been less awful. The winds have been mostly blowing the fire smoke from the hellscape experienced elsewhere in the West, away from the Bay Area. As a result, AQI has stayed mostly under 200. But as I had gotten back in the habit of checking purpleair.com to figure out if the windows need shut, I got curious to better understand the air quality inside our house, so I ponied up $200 for an indoor monitor. It has a bright LED that changes color based on what it measures, and the boys think that’s a pretty great night light.

My first revelation was that indoor AQI was spiking overnight, starting around midnight. Since I first installed it near the dishwasher, I figured that was the culprit. After a week of A/B testing, I had ruled out the dishwasher and figured out that when the wife goes to bed, she likes to run a humidifier, and that the water droplets in the air can look like pollution to a laser. Mystery solved!

The other thing I noticed while keeping an eye on purpleair.com to see if it was time to shut the windows is that our indoor AQI would tend to have a lower (better) score than outdoor sensors nearby. That’s good news. Zooming in, I could see a jaggy pattern where the AQI would drop when the furnace fan circulated our air through the MERV 16 filter in the attic, then it would spike back up. The upshot is that we could have open windows most of the time and cleaner air inside the house, but how to run the fan on an efficient schedule?

Well, it is tied to a thermostat … I could implement an “AQI-o-stat” with a Python script that scrapes the AQI reading and tells the Nest to run the fan. The script took about 3 hours to write. 10 minutes to scrape purpleair.com, 2.5 hours to figure out Google/Nest’s authentication API, and 20 minutes to figure out how to set the Nest fan. The authentication part took only 2.5 hours because Wouter Nieworth posted a bunch of helpful screenshots on his blog.

I implemented the “AQI-o-stat” on the afternoon of Sep 3, and since then CatHouse A has kept AQI around 60 or below, while the neighboring Zinnias outdoor sensor reads in the low hundreds.

There was some tweaking, but I now have a Python script running out of cron that checks the indoor AQI, and if it is above 50, it triggers the timer on the fan. I started polling at 15-minute intervals but found 5-minute intervals made for a steadier outcome. The result is that we can leave the windows open, and the indoor air quality hovers around 60. One less thing to worry about. (There are plenty of things to worry about.) I have been thinking that, in the “New Normal” (which really means there is no “normal” because the climate systems have been thrown into turbulence), having an air sensor as an input to your smart thermostat will probably just become a standard feature.
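
For the curious, the control loop itself is about as simple as it sounds. Here is a minimal sketch, not the actual script: the two helpers are placeholders standing in for the purpleair.com scrape and the Google/Nest API calls described above, and the fan duration is a made-up number.

#!/usr/bin/env python3
# AQI-o-stat sketch: cron runs this every 5 minutes.
# The helpers are stubs -- the real script scrapes purpleair.com for the
# indoor sensor and calls the Google/Nest API after the OAuth dance.

AQI_THRESHOLD = 50   # kick the fan on when indoor AQI goes above this
FAN_MINUTES = 15     # hypothetical fan-timer duration

def read_indoor_aqi():
    # Placeholder: fetch and return the current AQI from the indoor sensor.
    return 42  # pretend reading so the sketch runs end to end

def run_fan_timer(minutes):
    # Placeholder: tell the Nest thermostat to run the HVAC fan.
    print(f"would run the fan for {minutes} minutes")

if __name__ == "__main__":
    if read_indoor_aqi() > AQI_THRESHOLD:
        run_fan_timer(FAN_MINUTES)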

So, “hello” from the near future, I guess.



Biography, Letters to The Man, News and Reaction, Technical, Technology, Testimonials

FCC Should Expand Competition for Internet Service Providers

Link: https://dannyman.toldme.com/2018/08/29/fcc-should-expand-competition-for-internet-service-providers/

At the behest of my ISP, Sonic, I wrote a letter to the FCC, via https://savecompetition.com/:

Dear FCC,

I am a successful IT professional. I got my start in the 90s, answering phones at an independent ISP and getting folks online with their new modems. This was a great age when folks had a choice of any number of Internet SERVICE Providers who could help them get up and running on AT&T’s local telephone infrastructure.

To this very day, I use the DSL option available from the local Internet Service Provider (Sonic) over AT&T’s wires. I use this despite the fiber optic cable AT&T has hung on the pole in front of my house. Fiber would be so, so much faster, but I’m not going to pay for it until I have a CHOICE of providers, like Sonic, who has always been great about answering the phone and taking care of my Internet SERVICE needs.

Competitive services were the foundation of my career in IT. I believe they were a strong foundation to get Americans online in the first place. Competitive services are, in my opinion, REQUIRED, if you want to get Americans on to modern network technology today.



FreeBSD, Linux, Technical

Why Hard Drives Get Slower As You Fill Them Up

Link: https://dannyman.toldme.com/2018/06/06/why-hard-drives-get-slower-as-you-fill-them-up/

I built a new system (from old parts) yesterday. It is a RAID10 with 4x 8TB disks. While the system is up and running, it takes forever for mdadm to complete “building” the RAID. All the same, I got a little smile from this graph:

As the RAID assembles, the R/W operations take longer to complete.

The process should be about 20 hours total. As we move through the job, the read-write operations take longer. Why? I have noticed this before. What happens here is that the disk rotates at a constant speed, but towards the rim you can fit a lot more data than towards the center. The speed with which you access data depends on whether you are talking to the core or to the rim of the magnetic platter.

What the computer does is it uses the fastest part of the disk first, and as the disk gets filled up, it uses the slower parts. With an operation like this that covers the entire disk, you start fast and get slower.

This is part of the reason that your hard drive slows down as it gets filled up: when your disk was new, the computer was using the fast-moving outer rim. As that fills up, it has to read and write closer to the core. That takes longer. Another factor is that the data might be more fragmented, packed into nooks and crannies where space is free, and your Operating System needs to seek those pieces out and assemble them. On a PC, the bigger culprit is probably that you have accumulated a bunch of extra running software that demands system resources.
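
If you want to see the effect for yourself on a spinning disk, you can time a sequential read near the start of the device (outer tracks) against one near the end (inner tracks). The device name and offsets below are only examples, sized for roughly an 8 TB disk; these commands are read-only, but be careful poking at raw devices:

> sudo dd if=/dev/sda of=/dev/null bs=1M count=1024 iflag=direct
> sudo dd if=/dev/sda of=/dev/null bs=1M count=1024 skip=7000000 iflag=direct

dd reports throughput when it finishes; the second read, out near the center of the platters, should be noticeably slower.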



Ganeti, Linux, Technical

Ganeti: Segregate VMs from Running on the Same Hardware

Link: https://dannyman.toldme.com/2017/02/16/ganeti-exclusion-tags/

We have been using this great VM management software called Ganeti. It was developed at Google and I love it for many reasons.

It is frustrating that relatively few people know about and use Ganeti, especially in the Silicon Valley.

Recently I had an itch to scratch. At the Ganeti Conference I heard tell that one could use tags to tell Ganeti to keep instances from running on the same node. This is another excellent feature: if you have two or more web servers, for example, you don’t want them to end up getting migrated to the same hardware.

Unfortunately, the documentation is a little obtuse, so I posted to the ganeti mailing list, and got the clues lined up.

First, you set a cluster exclusion tag, like so:

sudo gnt-cluster add-tags htools:iextags:role

This says “set up an exclusion tag named role.”

Then, when you create your instances, you add, for example: --tags role:prod-www

The instances created with the tag role:prod-www will be segregated onto different hardware nodes.

I did some testing to figure this out. First, as a control, create a bunch of small test instances:

sudo gnt-instance add ... ganeti-test0
sudo gnt-instance add ... ganeti-test1
sudo gnt-instance add ... ganeti-test2
sudo gnt-instance add ... ganeti-test3
sudo gnt-instance add ... ganeti-test4

Results:

$ sudo gnt-instance list | grep ganeti-test
ganeti-test0 kvm snf-image+default ganeti06-29 running 1.0G
ganeti-test1 kvm snf-image+default ganeti06-29 running 1.0G
ganeti-test2 kvm snf-image+default ganeti06-09 running 1.0G
ganeti-test3 kvm snf-image+default ganeti06-32 running 1.0G
ganeti-test4 kvm snf-image+default ganeti06-24 running 1.0G

As expected, some overlap in service nodes.

Next, delete the test instances, set a cluster exclusion tag for “role” and re-create the instances:

sudo gnt-cluster add-tags htools:iextags:role
sudo gnt-instance add ... --tags role:ganeti-test ganeti-test0
sudo gnt-instance add ... --tags role:ganeti-test ganeti-test1
sudo gnt-instance add ... --tags role:ganeti-test ganeti-test2
sudo gnt-instance add ... --tags role:ganeti-test ganeti-test3
sudo gnt-instance add ... --tags role:ganeti-test ganeti-test4

Results?

$ sudo gnt-instance list | grep ganeti-test
ganeti-test0 kvm snf-image+default ganeti06-29 running 1.0G
ganeti-test1 kvm snf-image+default ganeti06-09 running 1.0G
ganeti-test2 kvm snf-image+default ganeti06-32 running 1.0G
ganeti-test3 kvm snf-image+default ganeti06-24 running 1.0G
ganeti-test4 kvm snf-image+default ganeti06-23 running 1.0G

Yay! The instances are allocated to five distinct nodes!

But am I sure I understand what I am doing? Nuke the instances and try another example: 2x “www” instances and 3x “app” instances:

sudo gnt-instance add ... --tags role:prod-www ganeti-test0
sudo gnt-instance add ... --tags role:prod-www ganeti-test1
sudo gnt-instance add ... --tags role:prod-app ganeti-test2
sudo gnt-instance add ... --tags role:prod-app ganeti-test3
sudo gnt-instance add ... --tags role:prod-app ganeti-test4

What do we get?
$ sudo gnt-instance list | grep ganeti-test
ganeti-test0 kvm snf-image+default ganeti06-29 running 1.0G # prod-www
ganeti-test1 kvm snf-image+default ganeti06-09 running 1.0G # prod-www
ganeti-test2 kvm snf-image+default ganeti06-29 running 1.0G # prod-app
ganeti-test3 kvm snf-image+default ganeti06-32 running 1.0G # prod-app
ganeti-test4 kvm snf-image+default ganeti06-24 running 1.0G # prod-app

Yes! The first two instances are allocated to different nodes, then when the tag changes to prod-app, ganeti goes back to ganeti06-29 to allocate an instance.



Linux, Technical, Technology

Duct Tape Ops

Link: https://dannyman.toldme.com/2017/01/27/duct-tape-ops/

Threading specification, via Wikipedia

Yesterday we tried out Slack’s new thread feature, and were left scratching our heads over the utility of that. Someone mused that Slack might be running out of features to implement, and I recalled Zawinski’s Law:

Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can.

I think this is a tad ironic for Slack, given that some people believe that Slack makes email obsolete and useless. Anyway, I had ended up on Jamie Zawinski’s (jwz) Wikipedia entry and there was this comment about jwz’s law:

Eric Raymond comments that while this law goes against the minimalist philosophy of Unix (a set of “small, sharp tools”), it actually addresses the real need of end users to keep together tools for interrelated tasks, even though for a coder implementation of these tools are clearly independent jobs.

This led to The Duct Tape Programmer, which I’ll excerpt:

Sometimes you’re busy banging out the code, and someone starts rattling on about how if you use multi-threaded COM apartments, your app will be 34% sparklier, and it’s not even that hard, because he’s written a bunch of templates, and all you have to do is multiply-inherit from 17 of his templates, each taking an average of 4 arguments … your eyes are swimming.

And the duct-tape programmer is not afraid to say, “multiple inheritance sucks. Stop it. Just stop.”

You see, everybody else is too afraid of looking stupid … they sheepishly go along with whatever faddish programming craziness has come down from the architecture astronauts who speak at conferences and write books and articles and are so much smarter than us that they don’t realize that the stuff that they’re promoting is too hard for us.

“At the end of the day, ship the fucking thing! It’s great to rewrite your code and make it cleaner and by the third time it’ll actually be pretty. But that’s not the point—you’re not here to write code; you’re here to ship products.”

jwz wrote a response in his blog:

To the extent that he puts me up on a pedestal for merely being practical, that’s a pretty sad indictment of the state of the industry.

In a lot of the commentary surrounding his article elsewhere, I saw all the usual chestnuts being trotted out by people misunderstanding the context of our discussions: A) the incredible time pressure we were under and B) that it was 1994. People always want to get in fights over the specifics like “what’s wrong with templates?” without realizing the historical context. Guess what, you young punks, templates didn’t work in 1994.

As an older tech worker, I have found that I am more “fad resistant” than I was in my younger days. There’s older technology that may not be pretty but I know it works, and there’s new technology that may be shiny, but immature, and will take a lot of effort to get working. As time passes, shiny technology matures and becomes more practical to use.

(I am looking forward to trying “Kubernetes in a Can”)



Technical

A Complex Cron Entry Explained

Link: https://dannyman.toldme.com/2017/01/16/crontab-backtick-returncode-conditional-command/

A crontab entry I felt worth explanation, as it illustrates a few Unix concepts:

$ crontab -l
# m h dom mon dow command
*/5 8-18 * * mon-fri grep -q open /proc/acpi/button/lid/LID0/state && mkdir -p /home/djh/Dropbox/webcam/`date +%Y%m%d` && fswebcam -q -d /dev/video0 -r 1920x1080 /home/djh/Dropbox/webcam/%Y%m%d/%H%M.jpg ; file /home/djh/Dropbox/webcam/`date +%Y%m%d/%H%M`.jpg | grep -q 1920x1080 || fswebcam -q -d /dev/video1 -r 1920x1080 /home/djh/Dropbox/webcam/%Y%m%d/%H%M.jpg

That is a huge gob of text. Let me un-pack that into parts:

*/5 8-18 * * mon-fri

Run weekdays, 8AM to 6PM, every five minutes.

grep -q open /proc/acpi/button/lid/LID0/state &&
mkdir -p /home/djh/Dropbox/webcam/`date +%Y%m%d` &&
fswebcam -q -d /dev/video0 -r 1920x1080 /home/djh/Dropbox/webcam/%Y%m%d/%H%M.jpg

Check if the lid on the computer is open by looking for the word “open” in /proc/acpi/button/lid/LID0/state AND
(… if the lid is open) make a directory with today’s date AND
(… if the directory was made) take a timestamped snapshot from the web cam and put it in that folder.

Breaking it down a bit more:

grep -q means grep “quietly” … we don’t need to print that the lid is open, we care about the return code. Here is an illustration:

$ while true; do grep open /proc/acpi/button/lid/LID0/state ; echo $? ; sleep 1; done
state: open
0
state: open
0
# Lid gets shut
1
1
# Lid gets opened
state: open
0
state: open
0

The $? is the “return code” from the grep command. In shell, zero means true and non-zero means false, which allows us to conveniently construct conditional commands. Like so:

$ while true; do grep -q open /proc/acpi/button/lid/LID0/state && echo "open lid: take a picture" || echo "shut lid: take no picture" ; sleep 1; done
open lid: take a picture
open lid: take a picture
shut lid: take no picture
shut lid: take no picture
open lid: take a picture
open lid: take a picture

There is some juju in making the directory:

mkdir -p /home/djh/Dropbox/webcam/`date +%Y%m%d`

First is the -p flag. That would make every part of the path, if needed (Dropbox .. webcam ..), but it also makes mkdir chill if the directory already exists:

$ mkdir /tmp ; echo $?
mkdir: cannot create directory ‘/tmp’: File exists
1
$ mkdir -p /tmp ; echo $?
0

Then there is the backtick substitution. The date command can format output (read man date and man strftime …). You can use backtick substitution to stuff the output of one command into the command line of another.

$ date +%A
Monday
$ echo "Today is `date +%A`"
Today is Monday

Once again, from the top:

grep -q open /proc/acpi/button/lid/LID0/state &&
mkdir -p /home/djh/Dropbox/webcam/`date +%Y%m%d` &&
fswebcam -q -d /dev/video0 -r 1920x1080 /home/djh/Dropbox/webcam/%Y%m%d/%H%M.jpg ;
file /home/djh/Dropbox/webcam/`date +%Y%m%d/%H%M`.jpg | grep -q 1920x1080 ||
fswebcam -q -d /dev/video1 -r 1920x1080 /home/djh/Dropbox/webcam/%Y%m%d/%H%M.jpg

Here is where it gets involved. There are two cameras on this mobile workstation. One is the internal camera, which can do 720p, and there is an external camera, which can do 1080p. I want to use the external camera, but there is no consistency for the device name. (The external device is video0 if it is present at boot, else it is video1.)

Originally, I wanted to do like so:

fswebcam -q -d /dev/video0 -r 1920x1080 || fswebcam -q -d /dev/video1 -r 1920x1080

Unfortunately, fswebcam is a real trooper: if it can not take a picture at 1920×1080, it will take what picture it can and output that. This is why the whole cron entry reads as:

Check if the lid on the computer is open AND
(… if the lid is open) make a directory with today’s date AND
(… if the directory was made) take a timestamped snapshot from web cam 0
Check if the timestamped snapshot is 1920×1080 ELSE
(… if the snapshot is not 1920×1080) take a timestamped snapshot from web cam 1

Sample output from webcam. Happy MLK Day.

Why am I taking these snapshots? I do not really know what I might do with them until I have them. Modern algorithms could analyze “time spent at workstation” and give feedback on posture, maybe identify “mood” and correlate that over time … we’ll see.

OR not.



Technical, Technology

It Wasn’t That Bad

Link: https://dannyman.toldme.com/2016/12/29/it-wasnt-that-bad/

Friend: Dang it Apple my iPhone upgrade bricked the phone and I had to reinstall from scratch. This is a _really_ bad user experience.

Me: If you can re-install the software, the phone isn’t actually “bricked” …

Friend: I had to do a factory restore through iTunes.

Me: That’s not bricked that’s just extremely awful software.

(Someone else mentions Windows.)

Me: Never had this problem with an Android device. ;)

Friend: With Android phones you are constantly waiting on the carriers or handset makers for updates.

Me: That is why I buy my phones from Google.

Friend: Pixel looks enticing, I still like iPhone better. I am a firm believer that people stick with what they know, and you are unlikely to sway them if it works for them.

Me: Yeah just because you have to reinstall your whole phone from scratch doesn’t make it a bad experience.



FreeBSD, Linux, Python, Technical, Technology

VMs vs Containers

Link: https://dannyman.toldme.com/2016/08/24/vms-vs-containers/

I’ve been a SysAdmin for … since the last millennium. Long enough to see certain fads come and go and come again. There was a time when folks got keen on the advantages of chroot jails, but that time faded, then resurged in the form of containers! All the rage!

My own bias is that bare metal systems and VMs are what I am used to: a Unix SysAdmin knows how to manage systems! The advantages of and desire for more contained environments seem to better suit certain types of programmers, and I suspect that the desire for chroot-jail-virtualenv-containers may be a reflection of programming trends.

On the one hand, you’ve got say C and Java … write, compile, deploy. You can statically link C code and put your Java all in a big jar, and then to run it on a server you’ll need say a particular kernel version, or a particular version of Java, and some light scaffolding to configure, start/stop and log. You can just write up a little README and hand that stuff off to the Ops team and they’ll figure out the mysterious stuff like chmod and the production database password. (And the load balancer config..eek!)

On the other hand, if you’re hacking away in an interpreted language: say Python or R, you’ve got a growing wad of dependencies, and eventually you’ll get to a point where you need the older version of one dependency and a bleeding-edge version of another and keeping track of those dependencies and convincing the OS to furnish them all for you … what comes in handy is if you can just wad up a giant tarball of all your stuff and run it in a little “isolated” environment. You don’t really want to get Ops involved because they may laugh at you or run in terror … instead you can just shove the whole thing in a container, run that thing in the cloud, and now without even ever having to understand esoteric stuff like chmod you are now DevOps!

(Woah: Job Security!!)

From my perspective, containers start as a way to deploy software. Nowadays there’s a bunch of scaffolding for containers to be a way to deploy and manage a service stack. I haven’t dealt with either case, and my incumbent philosophy tends to be “well, we already have these other tools” …

Container Architecture is basically just Legos mixed with Minecraft (CC: Wikipedia)

Anyway, as a Service Provider (… I know “DevOps” is meant to get away from that ugly idea that Ops is a service provider …) I figure if containers help us ship the code, we’ll get us some containers, and if we want orchestration capabilities … well, we have what we have now and we can look at bringing up other new stuff if it will serve us better.

ASIDE: One thing that has put me off containers thus far is not so much that they’re reinventing the wheel, so much that I went to a DevOps conference a few years back and it seemed every single talk was about how we have been delivered from the evil sinful ways of physical computers and VMs and the tyranny of package managers and chmod and load balancers and we have found the Good News that we can build this stuff all over in a new image and it will be called Docker or Mesos or Kubernetes but careful the API changed in the last version but have you heard we have a thing called etcd which is a special thing to manage your config files because nobody has ever figured out an effective way to … honestly I don’t know for etcd one way or another: it was just the glazed, fervent stare in the eyes of the guy who was explaining to me the virtues of etcd …

It turns out it is not just me who is a curmudgeonly contrarian: a lot of people are freaked out by the True Believers. But that needn’t keep us from deploying useful tools, and my colleague reports that Kubernetes for containers seems awfully similar to the Ganeti we are already running for VMs, so let us bootstrap some infrastructure and provide some potentially useful services to the development team, shall we?



Technical, Technology, Testimonials, WordPress

Testimonial: SSLMate

Link: https://dannyman.toldme.com/2016/07/19/testimonial-sslmate/

I recently started using sslmate to manage SSL certificates. SSL is one of those complicated things you deal with rarely so it has historically been a pain in the neck.

But sslmate makes it all easy … you install the sslmate command and can generate, sign, and install certificates from the command-line. You then have to check your email when getting a signed cert to verify … and you’re good.

The certificates auto-renew annually, assuming you click the email. I did this for an important cert yesterday. Another thing you do (sslmate walks you through all these details) is set up a cron.

This morning at 6:25am the cron got run on our servers … with minimal intervention (I had to click a confirmation link on an email yesterday) our web servers are now running on renewed certs …. one less pain in the neck.

So … next time you have to deal with SSL I would say “go to sslmate.com and follow the instructions and you’ll be in a happy place.”



Ansible, Linux, Technical

Ansible: Copy Agent Keys to Remote Servers

Link: https://dannyman.toldme.com/2016/07/01/ansible-use-ssh-add-to-set-authorized_key/

Background: you use SSH and ssh-agent and you can get a list of keys you presently have “ready to fire” via:

djh@djh-MBP:~/devops$ ssh-add -l
4096 SHA256:JtmLhsoPoSfBsFnrIsZc6XNScJ3ofghvpYmhYGRWwsU .ssh/id_ssh (RSA)

Aaaand, you want to set up passwordless SSH for the remote hosts in your Ansible. There are lots of examples that involve file lookups for blah blah blah dot pub but why not just get a list from the agent?

A playbook:

- hosts: all
  gather_facts: no
  tasks:
    - name: Get my SSH public keys
      local_action: shell ssh-add -L
      register: ssh_keys

    - name: List my SSH public keys
      debug: msg="{{ ssh_keys.stdout }}"

    - name: Install my SSH public keys on Remote Servers
      authorized_key: user={{lookup('env', 'USER')}} key="{{item}}"
      with_items: "{{ ssh_keys.stdout }}"

This is roughly based on a Stack Overflow answer.

The two tricky bits are:
1) Running a local_action to get a list of SSH keys.
2) Doing with_items to iterate if there are multiple keys.

A bonus tricky bit:
3) You may need to install sshpass if you do not already have key access to the remote servers. Last I knew, the brew command on Mac OS will balk at you for trying to install this.
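
For that first run, before your key is on the remote hosts, you can have Ansible prompt for the SSH password, which is what drags in the sshpass dependency. The inventory and playbook names here are made up:

djh@djh-MBP:~/devops$ ansible-playbook -i hosts --ask-pass copy-agent-keys.yml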



Linux, News and Reaction, Technology

Ubuntu 16.04 Reactions

Link: https://dannyman.toldme.com/2016/04/21/ubuntu-16-04-reactions/

Xerus: an African ground squirrel. CC: Wikipedia

I have misplaced my coffee mug. I’m glad to hear Ubuntu 16.04 LTS is out. “Codenamed ‘Xenial Xerus'” because computer people don’t already come off as a bunch of space cadets. Anyway, an under-caffeinated curmudgeon’s take:

The Linux kernel has been updated to the 4.4.6 longterm maintenance
release, with the addition of ZFS-on-Linux, a combination of a volume
manager and filesystem which enables efficient snapshots, copy-on-write
cloning, continuous integrity checking against data corruption, automatic
filesystem repair, and data compression.

Ah, ZFS! The last word in filesystems! How very exciting that after a mere decade we have stable support for it on Linux.

There’s a mention of the desktop: updates to LibreOffice and “stability improvements to Unity.” I’m not going to take that bait. No sir.

Ubuntu Server 16.04 LTS includes the Mitaka release of OpenStack, along
with the new 2.0 versions of Juju, LXD, and MAAS to save devops teams
time and headache when deploying distributed applications – whether on
private clouds, public clouds, or on developer laptops.

I honestly don’t know what these do, but my hunch is that they have their own overhead of time and headache. Fortunately, I have semi-automated network install of servers, Ganeti to manage VMs, and Ansible to automate admin stuff, so I can sit on the sidelines for now and hope that by the time I need it, Openstack is mature enough that I can reap its advantages with minimal investment.

Aside: My position on containers is the same position I have on Openstack, though I’m wondering if the containers thing may blow over before full maturity. Every few years some folks get excited about the possibility of reinventing their incumbent systems management paradigms with jails, burn a bunch of time blowing their own minds, then get frustrated with the limitations and go back to the old ways. We’ll see.

Anyway, Ubuntu keeps delivering:

Ubuntu 16.04 LTS introduces a new application format, the ‘snap’, which
can be installed alongside traditional deb packages. These two packaging
formats live quite comfortably next to one another and enable Ubuntu to
maintain its existing processes for development and updates.

YES YES YES YES YES YES YES OH snap OH MY LERD YES IF THERE IS ONE THING WE DESPERATELY NEED IT IS YET ANOTHER WAY TO MANAGE PACKAGES I AM TOTALLY SURE THESE TWO PACKAGING FORMATS WILL LIVE QUITE COMFORTABLY TOGETHER next to the CPANs and the CRANs and the PIPs and the … don’t even ask how the R packages work …

Further research reveals that they’ve replaced Python 2 with Python 3. No mention of that in the email announcement. I’m totally sure this will not yield any weird problems.



Technical

Divine the “Changelog” From Git

Link: https://dannyman.toldme.com/2016/04/12/divine-the-changelog-from-git/

I am having a tricky time with Ganeti, and the mailing list is not proving helpful. One factor is that I have two different versions in play. How does one divine the differences between these versions?

Git to the rescue! Along these lines:

git clone git://git.ganeti.org/ganeti.git # Clone the repo ...
cd ganeti
git branch -a                             # See what branches we have
git ls-remote --tags                      # See what tags we have
git checkout tags/v2.12.4                 # Check out the "old" branch/tag
git diff tags/v2.12.6                     # Diff "old" vs "new" branch/tag
# OH WAIT, IT IS EVEN EASIER THAN THIS! (Thanks, candlerb!)
# You don't have to check out a branch, just do this:
git diff v2.12.4 v2.12.6                  # Diff "old" vs "new"

And now I see the “diff” between 2.12.4 and 2.12.6, and the changes seem relevant to my issue.
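
A related trick (not used above): if you want commit messages rather than a raw diff, which is closer to an actual changelog, git log takes the same pair of tags:

git log --oneline v2.12.4..v2.12.6        # Commit messages between "old" and "new"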



Technical, Technology

Technology Journey Back in Time

Link: https://dannyman.toldme.com/2016/03/24/technology-journey-back-in-time/

It started when Tom Limoncelli shared a link to teens reacting to Windows 95.

In my mind, what is most unfortunate about that setup is that they did not get to experience Dial Up Networking via a modem. I think they would have been truly blown away. Alas, the Internet contains wonders, like this guy getting a 50 year old modem to work:

What could be more amazing than that?  How about this guy, with a 50 year old modem and a teletype, browsing the first web site via the first web browser, by means of a punch tape bookmark?

You’re welcome, nerds!



Technical

Load Balancer Config: when “Not Authorized” means “Yes”

Link: https://dannyman.toldme.com/2016/02/25/http-auth-ssl-load-balancer-401/

I wanted to share a clever load balancer config strategy I accidentally discovered. The use case is you want to make a web service available to clients on the Internet. Two things you’ll need are:

1) an authentication mechanism
2) encrypted transport (HTTPS)

You can wrap authentication around an arbitrary web app with HTTP auth. Easy and done.

For encrypted transport of web traffic, I now love sslmate; it is the greatest thing since sliced bread. Why?

1) Inexpensive SSL certs.
2) You order / install the certs from a command line.
3) They feed you the conf you probably need for your software.
4) Then you can put the auto-renew in cron.

So, for example, an nginx set up to answer on port 443, handle the SSL connection, do http auth, then proxy over to the actual service, running on port 12345:

server {
    listen 443;
    server_name example.com;
    location / {
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_set_header   Host      $http_host;
        proxy_pass         http://127.0.0.1:12345;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        auth_basic "Restricted";                    #For Basic Auth
        auth_basic_user_file /etc/nginx/.htpasswd;  #For Basic Auth
    }
    # Sample config from https://sslmate.com/help/buy
    ssl on;
    ssl_certificate_key /etc/sslmate/example.com.key;
    ssl_certificate /etc/sslmate/example.com.chained.crt;
    # Recommended security settings from https://wiki.mozilla.org/Security/Server_Side_TLS
    # &c.
}

The clever load balancer config? The health check is to hit the server(s) in the pool, request / via HTTPS, and expect a 401 response. The load balancer doesn’t know the application password, so if you don’t let it in, you must be doing something right. If someone mucks with the server configuration and disables HTTP AUTH, then the load-balancer will get 200 on its health checks, regard success as an error, and “fail safe” by taking the server out of the pool, thus preventing people from accessing the site without a password.

Tell the load balancer that success is not an acceptable outcome
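
To make the idea concrete, here is roughly how you would express it in HAProxy, as one illustration; the backend name and addresses are invented, and your load balancer's syntax will differ:

backend secure_webapp
    mode http
    # Health check: request / and treat "401 Unauthorized" as the healthy response
    option httpchk GET /
    http-check expect status 401
    server web1 10.0.0.11:443 check ssl verify none
    server web2 10.0.0.12:443 check ssl verify none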

