remove duplicate jpegs from folder

Dorian (MJT support)
Automation Wizard
Posts: 1389
Joined: Sun Nov 03, 2002 3:19 am

Post by Dorian (MJT support) » Fri Dec 14, 2007 9:21 pm

Hi Guys,

I have a folder containing 10,000+ jpegs and I want to automatically remove all the duplicates.

Where would I start?

I tried using GetFileList and CompareBitmaps, but of course that only compares .bmp files.

I also looked into the image recognition plugin but that seems intended for recognising images built into something as opposed to in a folder.

stev3n
Junior Coder
Posts: 22
Joined: Thu Nov 01, 2007 1:13 am

Post by stev3n » Sat Dec 15, 2007 2:05 pm

Are the duplicates named similarly or randomly? Also, are the duplicates true byte-for-byte copies? If so, they would have the same size, and you could start by comparing file sizes.
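The file-size idea can be sketched as a pre-filter: group the files by size and keep only the groups with more than one member, since a file with a unique size cannot have an exact duplicate. This is a rough Python illustration rather than Macro Scheduler script; the function names `group_by_size` and `duplicate_candidates` are made up for the example.

```python
import os
from collections import defaultdict

def group_by_size(folder):
    """Map file size in bytes -> list of .jpg/.jpeg paths in `folder`."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path) and name.lower().endswith((".jpg", ".jpeg")):
            groups[os.path.getsize(path)].append(path)
    return groups

def duplicate_candidates(folder):
    """Return only the groups that hold two or more files of equal size."""
    return [paths for paths in group_by_size(folder).values() if len(paths) > 1]
```

Equal size only makes two files candidates; their contents still have to be compared before anything is deleted.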

Phil Pendlebury
Automation Wizard
Posts: 543
Joined: Tue Jan 16, 2007 9:00 am

Post by Phil Pendlebury » Sat Dec 15, 2007 5:42 pm

Presumably they're not named the same otherwise they wouldn't be allowed in the same folder...

I think the only easy way to do this is as stev3n mentioned: using file sizes.

It would take hours upon hours to compare thousands of actual bitmaps one at a time (that's if they were bitmaps).

Surely you don't want to actually open each file:

Code:

Open file 1 in viewer
Take screen capture
Open every OTHER file one at a time comparing each to file 1
If it is the same as file 1 delete it
Start from step one using file 2
Repeat 10000 times :-(

Of course many files will have similar file sizes too. :-(

I wonder if there's a clever way of getting image dimensions - maybe opening in viewer and getting the window size to a variable THEN file size?

Marcus Tettmar
Site Admin
Posts: 7395
Joined: Thu Sep 19, 2002 3:00 pm
Location: Dorset, UK

Post by Marcus Tettmar » Sat Dec 15, 2007 6:56 pm

You could loop through the list of files and get an MD5 hash of each one - populate an array. If the hash is already in the array, delete the file (if the hash is the same the files must be identical).

Use the MD5 hash library on the plugins page.
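As a rough sketch of that loop, here it is in Python using the standard `hashlib` module in place of HashLib.dll. The helper names `file_md5` and `find_duplicates` are hypothetical. The sketch only reports the duplicates and leaves deletion to the caller.

```python
import hashlib
import os

def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(folder):
    """Return paths whose MD5 matches a file seen earlier in `folder`."""
    seen = {}          # digest -> first path that produced it
    duplicates = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        if not name.lower().endswith((".jpg", ".jpeg")):
            continue
        digest = file_md5(path)
        if digest in seen:
            duplicates.append(path)   # same hash as an earlier file
        else:
            seen[digest] = path
    return duplicates
```

Review the returned list before passing each path to `os.remove`. Matching MD5 hashes make identical content overwhelmingly likely for ordinary files, but a byte-for-byte comparison of each pair is a cheap extra check before deleting.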

Post by Dorian (MJT support) » Sat Dec 15, 2007 7:19 pm

Thanks for the ideas, guys. Hmm, MD5. I'll have to investigate that. I haven't even heard of it, so it'll give me the chance to learn something new.

There are quite a few which have similar file sizes, so I'd imagine many of them might easily be the same size. Many are also the same dimensions. They are randomly named.

I'll try the MD5 method (if I can figure it out) and let you know how I get on.

jpuziano
Automation Wizard
Posts: 1085
Joined: Sat Oct 30, 2004 12:00 am

Post by jpuziano » Sat Dec 15, 2007 7:54 pm

Hi Horoscopes2000,

No need to re-invent the wheel, check out http://www.dupemaster.com/ which can do this for you, not just for jpegs but duplicate files of all sorts.

If you have problems with v1.7, try v1.5 which I found works better for me.

However, if you do end up coding a macro solution to this, why not share it with others here on the forum... trying out and extending the scripts of others here on the forums is one of the best ways to learn more.

Me_again
Automation Wizard
Posts: 1101
Joined: Fri Jan 07, 2005 5:55 pm
Location: Somewhere else on the planet

Post by Me_again » Fri Dec 21, 2007 4:54 am

mtettmar wrote: You could loop through the list of files and get an MD5 hash of each one - populate an array. If the hash is already in the array, delete the file (if the hash is the same the files must be identical).

Use the MD5 hash library on the plugins page.

How does one compare MD5s? I'm sure I must be doing something dumb, because I've tried regular If> statements and complex expressions, but I can't get MS to believe they are the same.

Code:

Let>HashLib=c:\hashlib\HashLib.dll
LibFunc>Hashlib,FileMD5,r,c:\mydir\mypic.jpg,buf
Let>hash=r_2
MDL>hash
LibFunc>Hashlib,FileMD5,q,c:\mydir\mypic.jpg,buf2
Let>hash2=q_2
MDL>hash2
If>hash2=hash
Goto>match
Else
Goto>nomatch
Endif
Label>match
MessageModal>match
Goto>theend
Label>nomatch
MessageModal>nomatch
Label>theend

Post by Marcus Tettmar » Fri Dec 28, 2007 2:55 pm

"Functions return length of hash or zero if an error occurred."

Code:

//Get MD5 hash of the first file; r1 receives the hash length (0 on error)
LibFunc>d:\Hashlib,FileMD5,r1,c:\mydir\mypic.jpg,buf1
//Keep only the first r1 characters of the returned buffer
Mid>r1_2,1,r1,hash1

//Get MD5 hash of the second file (the same file here, as a test)
LibFunc>d:\Hashlib,FileMD5,r2,c:\mydir\mypic.jpg,buf2
Mid>r2_2,1,r2,hash2

If>hash1=hash2
  MessageModal>Files are identical
Else
  MessageModal>Files are different
Endif
Marcus Tettmar
http://mjtnet.com/blog/ | http://twitter.com/marcustettmar

Did you know we are now offering affordable monthly subscriptions for Macro Scheduler Standard?

Post by Me_again » Fri Dec 28, 2007 3:24 pm

Well, that works, but why is this Mid> necessary?

Mid>r1_2,1,r1,hash1

My code came directly from HashLib.txt:

//Get MD5 hash of a file
LibFunc>Hashlib,FileMD5,r,c:\somefile.exe,buf
Let>hash=r_2
MessageModal>Hash: %hash%

Post by Marcus Tettmar » Fri Dec 28, 2007 3:28 pm

"Functions return length of hash or zero if an error occurred."

The example just displays the contents in a message box, so any null chars past the end of the string won't matter.

But if you want to compare them, you need to extract just the pertinent data. DLLs pass and return references to memory, not literal strings. So when dealing with strings we pass a buffer, which is a reference to memory, and we need to know how much data was written to that buffer. As HashLib.txt says, "Functions return length of hash or zero if an error occurred." The return value therefore gives you the amount of data in the buffer, and you should use it to retrieve exactly that much. Hence the Mid statement.
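The same effect can be reproduced outside Macro Scheduler. The Python sketch below uses `ctypes` buffers and a made-up `fill_buffer` function as a stand-in for the DLL call (it is not HashLib's API). The second buffer is pre-filled with stale bytes so the difference is visible: comparing whole buffers fails, while comparing only the reported length succeeds.

```python
import ctypes

def fill_buffer(buf, text):
    """Stand-in for a DLL call: write `text` into `buf`, return its length."""
    data = text.encode("ascii")
    ctypes.memmove(buf, data, len(data))
    return len(data)

BUF_SIZE = 64
digest = "d41d8cd98f00b204e9800998ecf8427e"  # MD5 of empty input, 32 chars

buf1 = ctypes.create_string_buffer(BUF_SIZE)               # zero-filled
buf2 = ctypes.create_string_buffer(b"x" * (BUF_SIZE - 1))  # stale contents

n1 = fill_buffer(buf1, digest)
n2 = fill_buffer(buf2, digest)

# Whole buffers differ because of what lies past the written data.
whole_equal = buf1.raw == buf2.raw

# Slicing to the returned length compares only the hash itself.
trimmed_equal = buf1.raw[:n1] == buf2.raw[:n2]

print(whole_equal, trimmed_equal)
```

Slicing with the returned length plays the same role here as the Mid> call.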

Post by Me_again » Fri Dec 28, 2007 3:38 pm

Thanks, that's very clear now :)

Post by Marcus Tettmar » Fri Dec 28, 2007 3:53 pm

No problem. I have also just updated HashLib.txt in the library distribution to use Mid in the examples. So hopefully that will avoid confusion in future.