Hi there,
I'm baffled by this situation (pictures attached below):
- Regular mode: same content (Blake2b-256)
- Same file name, extension
- Same file created, modified
Duplicate Cleaner Pro 5.27.0 64bit
As you can see in the pictures below, files of different sizes were reported as duplicates and put in the same duplicate group.
What's happening?
Prior to this, I had been checking and removing duplicates (hundreds to thousands of them) with byte-to-byte comparison. Hoping to be more efficient, I switched to the hash comparison method and, if I recall correctly, it ran fine at first (each duplicate group contained true duplicates). But when I ran it again, to my surprise, I got this result. I rescanned with the same settings and got the same weird duplicates.
Is it a bug? A problem with hashes cached in memory, or on disk?
Thanks.
Why same content comparison with hash put different files in the same (duplicate) group?
- Kerr
- Posts: 6
- Joined: Thu Jun 29, 2017 7:43 am
- therube
- Posts: 651
- Joined: Tue Jun 28, 2011 4:38 pm
Re: Why same content comparison with hash put different files in the same (duplicate) group?
To me, your settings look correct, but your results look wrong?
(I cannot reproduce anything similar on my end.)
The files you show are .jpg, & while you are running a Regular mode scan (vs. Image), maybe some setting from Image mode is (wrongly) coming into play?
Oh, but you have other file types that were also dup'd. With that, aside from .jpg, did the other file types (.mp3, .mov, .docx, ...) return the expected results, or are all file types returning unexpected results?
If you throw some arbitrary, non-jpg, files into a directory, & scan that directory, do the results still turn out wrong?
Hash & byte-to-byte should be returning the same sets of files.
And in your case, you also enabled Same name, so why are different file names even showing up at all as being "duplicates"?
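To illustrate why hash and byte-to-byte scans should agree, here is a minimal Python sketch. It is not Duplicate Cleaner's code; the function names `blake2b_256`, `groups_by_hash`, and `groups_by_bytes` are made up for this example. It groups files once by (size, BLAKE2b-256 digest) and once by direct content comparison, so the two results can be compared.

```python
import filecmp
import hashlib
from collections import defaultdict
from pathlib import Path


def blake2b_256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the BLAKE2b digest (32 bytes = 256 bits) of a file as hex."""
    h = hashlib.blake2b(digest_size=32)
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def groups_by_hash(paths):
    """Group files by (size, digest); keep only groups with 2+ members."""
    groups = defaultdict(list)
    for p in paths:
        groups[(p.stat().st_size, blake2b_256(p))].append(p)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


def groups_by_bytes(paths):
    """Group files by direct byte-to-byte comparison of their content."""
    groups = []
    for p in paths:
        for g in groups:
            # shallow=False forces a real content comparison.
            if filecmp.cmp(g[0], p, shallow=False):
                g.append(p)
                break
        else:
            groups.append([p])
    return sorted(sorted(g) for g in groups if len(g) > 1)
```

Barring a hash collision (practically impossible with 256-bit digests), both functions should return the same groups for the same list of paths, and neither can ever group files of different sizes.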
- DigitalVolcano
- Site Admin
- Posts: 1924
- Joined: Thu Jun 09, 2011 10:04 am
Re: Why same content comparison with hash put different files in the same (duplicate) group?
Something looks off, as the Hash column is empty. I wonder if the Blake hashing isn't working? Do you get the same results if you change to a different hash type?
- therube
- Posts: 651
- Joined: Tue Jun 28, 2011 4:38 pm
Re: Why same content comparison with hash put different files in the same (duplicate) group?
(I didn't even catch that.)
Are you storing hashes?
Options | General -> Use caching for calculated hashes
Maybe that cache, DuplicateCleaner5_Pro_Cache-Blake2b_256.data, became corrupted & is affecting things?
You could (back it up first, then) remove that file with DC closed, or use 'Clear cache' in the GUI (though that might be overkill?), & see if that changes things. (Renaming the file might be better.)
- Kerr
- Posts: 6
- Joined: Thu Jun 29, 2017 7:43 am
Re: Why same content comparison with hash put different files in the same (duplicate) group?
To answer the questions:
I’m cleaning up scattered backups of misc files (mostly media files) from a smartphone, so I use the most restrictive criteria first, i.e. regular mode, same content byte-to-byte, same name+ext, same modified and created dates.
I’d been doing lots of comparison and removal of duplicates (thousands of files and probably 30-40GB of data). Those worked fine. Then I moved on to a different set of directories with even older backups and changed to hash comparison (for better efficiency?).
Yes, I stored hashes (to make sure they weren’t gone when I closed the app so that I can reuse them for later comparison).
Yes, I have tried with SHA-256 and I get the same output (wrong duplicates and empty hash column). I tested re-doing the scan before closing the app and after restarting the app. Same result.
My cached hash files:
- Blake2b-256 33MB
- SHA256 16KB
Strange size of the SHA256, isn’t it?
Now, I have just deleted the hash cache as suggested and run new scans with hash comparison. The new cached hash files:
- SHA256 16KB
- Blake2b-256 16KB
Both gave the exact same wrong output as before (and empty hash column). It seems like there’s no hashing at all (16KB size).
The old 33MB Blake2b-256 cache file is probably from previous comparisons (lots of files). Somehow the app doesn’t seem to store any hashes anymore (16KB)?
I just ran another scan with byte-to-byte comparison and it showed 0 (zero) duplicates. (I think I'd cleaned up the duplicates from this folder set in previous scans with byte-to-byte comparison.) Changing the comparison mode back to hash and I got the same wrong output.
To summarize the new scan results after restarting the app and deleting the cached hash files:
- a set of folders to compare (593 files 7.22GB)
- byte-to-byte : 0 duplicates (correct output)
- hash (SHA256, Blake2b-256) : 372 duplicates (wrong output).
Hope this helps!
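A reported group can also be checked independently of the app. The following is a small Python sketch under the assumption that you have the file paths of one group at hand; `sha256_of` and `is_true_duplicate_group` are names invented for this example. It rejects a group as soon as the sizes differ, and otherwise compares SHA-256 digests.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a file as hex, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def is_true_duplicate_group(paths) -> bool:
    """True only if every file in the group has the same size and digest."""
    paths = [Path(p) for p in paths]
    if len(paths) < 2:
        return False
    # Files of different sizes can never have identical content.
    if len({p.stat().st_size for p in paths}) != 1:
        return False
    return len({sha256_of(p) for p in paths}) == 1
```

Running this over each of the 372 reported duplicates would show which groups, if any, are genuine; groups with mixed file sizes fail at the size check without any hashing.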
- Kerr
- Posts: 6
- Joined: Thu Jun 29, 2017 7:43 am
Re: Why same content comparison with hash put different files in the same (duplicate) group?
I have just run another test with interesting results.
Test on different set of folders.
Same criteria as before.
- With hash comparison: wrong output again
Change criteria: uncheck the Same file name
- With hash comparison: correct output
- Double check with byte-to-byte: correct output
Change criteria: check the Same file name
- With hash comparison: wrong output again
Change criteria: uncheck the Same file name
- With hash comparison: correct output
So for some reason, checking the Same file name criterion triggers the bug.
Note:
- Hash: SHA256 and Blake2b-256
- I have not tested whether other criteria options are also problematic.
- All folders are on a portable USB disk running Seagate Toolkit.
- The sequence I did above is reproducible (in my system at least).
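For reference, here is how one would expect the Same file name criterion to combine with a content hash, as a minimal Python sketch. This models the expected semantics only and says nothing about how Duplicate Cleaner is implemented; `FileInfo` and `group_duplicates` are hypothetical names, and the digests are placeholder strings. The name is added to the grouping key alongside size and digest, so enabling it can only split groups, never merge files with different content.

```python
from collections import defaultdict
from typing import NamedTuple


class FileInfo(NamedTuple):
    name: str    # file name including extension
    size: int    # size in bytes
    digest: str  # content hash as hex (placeholder values in this sketch)


def group_duplicates(files, same_name: bool):
    """Group files by content, optionally also requiring the same name."""
    groups = defaultdict(list)
    for f in files:
        # Content (size + digest) is always part of the key; the name
        # criterion only narrows it further.
        key = (f.size, f.digest, f.name) if same_name else (f.size, f.digest)
        groups[key].append(f)
    return [g for g in groups.values() if len(g) > 1]
```

With this model, two files called `a.jpg` with different sizes never share a group, whether or not the name criterion is checked. The behaviour reported above looks as if the name match were taking the place of the content match, but that is only a guess.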
- DigitalVolcano
- Site Admin
- Posts: 1924
- Joined: Thu Jun 09, 2011 10:04 am
Re: Why same content comparison with hash put different files in the same (duplicate) group?
I'll test this, but which version are you running?
Also, is it possible to paste the exact settings from your log file?
thanks!