In this article, we will write a Python script that removes duplicate files.
Introduction:
This Python script removes duplicate files from the directory in which it runs. It first lists all the files in the directory, computes a hash of each file's content, and checks whether a file with the same hash has already been seen. If it has, the script deletes the duplicate.
A duplicate file is a copy of a file on your computer that may be stored in the same folder or in another folder.
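Duplicate files have identical content, and therefore identical size, but might have different file names. This script compares content only, so two files with the same bytes count as duplicates even if their names or extensions differ.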
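To make the definition concrete, here is a minimal sketch that checks whether two files are duplicates by comparing their bytes directly. The helper name same_content and the sample file names are assumptions made for illustration; they are not part of the script built in this article, which uses hashing instead.

```python
import filecmp
import os
import tempfile


def same_content(path_a, path_b):
    """Return True if both files hold exactly the same bytes."""
    # shallow=False forces a comparison of the file contents instead of
    # trusting os.stat() metadata such as size and modification time.
    return filecmp.cmp(path_a, path_b, shallow=False)


# Build three small files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "report.txt")
    backup = os.path.join(tmp, "report_backup.txt")
    other = os.path.join(tmp, "notes.txt")
    for path, data in [(original, b"hello"), (backup, b"hello"), (other, b"world")]:
        with open(path, "wb") as fh:
            fh.write(data)

    print(same_content(original, backup))  # True: same bytes, different names
    print(same_content(original, other))   # False: same size, different bytes
```

Comparing every pair of files this way gets slow as the number of files grows, which is one reason to hash each file once and compare the hashes.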
Project Prerequisites:
This simple Python script uses no external libraries. It needs only two modules from the Python standard library:
os
hashlib
os:
The os module in Python provides functions for creating and removing a directory (folder), fetching its contents, changing and identifying the current directory, and so on. It makes it possible to automate many operating system tasks.
To interact with the underlying operating system, first import the module with the import os statement before using its functions.
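As a small illustration of the os functions this article relies on, the sketch below lists a directory, keeps only the regular files, and removes one of them. It works inside a temporary directory so nothing on your system is touched; the helper name list_regular_files and the file names are assumptions made for the example.

```python
import os
import tempfile


def list_regular_files(directory):
    """Return the sorted names of regular files (not folders) in directory."""
    return sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )


with tempfile.TemporaryDirectory() as tmp:
    # Create one regular file and one sub-directory.
    with open(os.path.join(tmp, "a.txt"), "w") as fh:
        fh.write("sample")
    os.mkdir(os.path.join(tmp, "subdir"))

    print(sorted(os.listdir(tmp)))   # ['a.txt', 'subdir']
    print(list_regular_files(tmp))   # ['a.txt']

    # os.remove() deletes a file permanently.
    os.remove(os.path.join(tmp, "a.txt"))
    print(list_regular_files(tmp))   # []
```

Note that os.listdir() returns bare names, so they are joined with the directory path before being passed to os.path.isfile() or os.remove().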
To learn more about this module, refer to: https://www.tutorialsteacher.com/python/os-module
hashlib:
A hash function takes a sequence of bytes of any length and converts it into a fixed-length sequence. The Python hashlib module is an interface for hashing messages easily. It contains numerous methods for hashing raw bytes. Note that hashing is a one-way operation and is not the same as encryption.
This module implements a common interface to many different secure hash and message digest algorithms. Included are the FIPS secure hash algorithms SHA1, SHA224, SHA256, SHA384, and SHA512 (defined in FIPS 180-2) as well as RSA’s MD5 algorithm (defined in Internet RFC 1321). The terms “secure hash” and “message digest” are interchangeable. Older algorithms were called message digests. The modern term is secure hash.
To learn more about this module, refer to: https://docs.python.org/3/library/hashlib.html
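The two properties that matter for this article are that a digest has a fixed length regardless of input size, and that feeding data to a hasher in pieces gives the same result as hashing it in one call. A minimal sketch, using MD5 because that is the algorithm chosen later in this article (the variable names are illustrative):

```python
import hashlib

# Inputs of very different lengths produce digests of the same length.
short_digest = hashlib.md5(b"a").hexdigest()
long_digest = hashlib.md5(b"a" * 1_000_000).hexdigest()
print(len(short_digest), len(long_digest))  # 32 32

# Updating a hasher piece by piece is equivalent to hashing the
# concatenated bytes in a single call.
hasher = hashlib.md5()
hasher.update(b"hello ")
hasher.update(b"world")
chunked_digest = hasher.hexdigest()
whole_digest = hashlib.md5(b"hello world").hexdigest()
print(chunked_digest == whole_digest)  # True
```

The second property is what makes it possible to hash a large file block by block.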
Code Implementation:
First, we import the os and hashlib modules. The program cannot run without them: os lists the directory contents and deletes files, and hashlib converts file data into hash values.
import hashlib
import os
Next, we define a function that reads a file one block at a time and returns its hash.
def: a function is a logical unit of code, a sequence of statements indented under a name that is introduced with the “def” keyword.
The function hashFile() returns the hash string for the given file name. It reads the file in blocks of 65536 bytes (64 KiB), so even a large file is never loaded into memory all at once.
def hashFile(filename):
    # Reading a large file in one go can exhaust memory, so we read one block at a time
    BLOCKSIZE = 65536
    hasher = hashlib.md5()
    with open(filename, 'rb') as file:
        # Read the first block from the file
        buf = file.read(BLOCKSIZE)
        # An empty result means the end of the file has been reached
        while len(buf) > 0:
            hasher.update(buf)
            buf = file.read(BLOCKSIZE)
    return hasher.hexdigest()
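The same block-by-block idea can be written a little more compactly. The sketch below is a variant, not the function used in this article: the name hash_file and the algorithm and block_size parameters are assumptions added for illustration, and it defaults to SHA-256 on the assumption that you may prefer a hash with a lower risk of collisions than MD5.

```python
import hashlib
import os
import tempfile


def hash_file(filename, algorithm="sha256", block_size=65536):
    """Return the hex digest of a file, reading it block by block."""
    hasher = hashlib.new(algorithm)
    with open(filename, "rb") as file:
        # iter() with a sentinel keeps calling file.read(block_size)
        # until it returns b"", which marks the end of the file.
        for block in iter(lambda: file.read(block_size), b""):
            hasher.update(block)
    return hasher.hexdigest()


# Check the chunked result against hashing the same bytes in one call.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    data = b"x" * 200_000  # larger than one block, so several reads occur
    with open(path, "wb") as fh:
        fh.write(data)
    print(hash_file(path) == hashlib.sha256(data).hexdigest())          # True
    print(hash_file(path, "md5") == hashlib.md5(data).hexdigest())      # True
```

Passing "md5" as the algorithm reproduces the behaviour of hashFile() above.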
The following code uses these functions and methods:
os.path.isfile() checks whether the specified path is an existing regular file.
os.listdir() returns a list of the names of all files and directories in the specified directory. Called without an argument, it lists the current working directory.
append() adds the passed object to the end of an existing list.
os.remove() deletes the file at the specified path.
In this block of code, we loop over the files, use a conditional statement to decide whether each file is a duplicate, and keep a list of the files that were deleted.
if __name__ == "__main__":
    # Dictionary to store the hash and filename
    hashMap = {}
    # List to store deleted files
    deletedFiles = []
    filelist = [f for f in os.listdir() if os.path.isfile(f)]
    for f in filelist:
        key = hashFile(f)
        # If the key already exists, delete the file
        if key in hashMap:
            deletedFiles.append(f)
            os.remove(f)
        else:
            hashMap[key] = f
    if len(deletedFiles) != 0:
        print('Deleted Files')
        for i in deletedFiles:
            print(i)
    else:
        print('No duplicate files found')
When we run the program, it finds the duplicate files in the current directory, deletes them immediately, and prints their names. Of each group of identical files, the one that os.listdir() returns first is kept; that order is arbitrary. If there are no duplicates, the output is "No duplicate files found".
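Because the script deletes files as soon as it finds them, it can be useful to preview what would be removed. The following is a minimal sketch of such a dry run, not part of the original script: the names md5_of_file and find_duplicates are hypothetical, and the directory listing is sorted (an assumption) so that the kept file is predictable.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, block_size=65536):
    """Return the MD5 hex digest of a file, read block by block."""
    hasher = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            hasher.update(block)
    return hasher.hexdigest()


def find_duplicates(directory="."):
    """Map each kept file name to the names that duplicate it. Deletes nothing."""
    first_seen = {}   # hash -> name of the first file with that content
    duplicates = {}   # kept name -> list of duplicate names
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        key = md5_of_file(path)
        if key in first_seen:
            duplicates.setdefault(first_seen[key], []).append(name)
        else:
            first_seen[key] = name
    return duplicates


with tempfile.TemporaryDirectory() as tmp:
    for name, data in [("a.txt", b"same"), ("b.txt", b"same"), ("c.txt", b"other")]:
        with open(os.path.join(tmp, name), "wb") as fh:
            fh.write(data)
    print(find_duplicates(tmp))  # {'a.txt': ['b.txt']}
```

Once the report looks right, the listed duplicates can be passed to os.remove().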
Source code:
Here is the complete source code of our project.
You can copy and run this on your machine.
import hashlib
import os

# Returns the hash string of the given file name
def hashFile(filename):
    # Reading a large file in one go can exhaust memory, so we read one block at a time
    BLOCKSIZE = 65536
    hasher = hashlib.md5()
    with open(filename, 'rb') as file:
        # Read the first block from the file
        buf = file.read(BLOCKSIZE)
        # An empty result means the end of the file has been reached
        while len(buf) > 0:
            hasher.update(buf)
            buf = file.read(BLOCKSIZE)
    return hasher.hexdigest()

if __name__ == "__main__":
    # Dictionary to store the hash and filename
    hashMap = {}
    # List to store deleted files
    deletedFiles = []
    filelist = [f for f in os.listdir() if os.path.isfile(f)]
    for f in filelist:
        key = hashFile(f)
        # If the key already exists, delete the file
        if key in hashMap:
            deletedFiles.append(f)
            os.remove(f)
        else:
            hashMap[key] = f
    if len(deletedFiles) != 0:
        print('Deleted Files')
        for i in deletedFiles:
            print(i)
    else:
        print('No duplicate files found')
We have completed the coding part. Now we run the program to see the output.
Output:
We can run this from the command prompt as follows:
This is the output on my system, which means there are no duplicates in the directory.
Let us also try creating a duplicate file in the same directory.
Here is the output from running the script on Ubuntu.
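To repeat this experiment without risking real files, the sketch below creates a file and a copy of it in a temporary directory and then applies the same hash-and-delete logic. The function names md5_of_file and remove_duplicates and the sample file names are assumptions made for this example. Unlike the script above, it takes the directory as an argument and sorts the listing so the result is predictable.

```python
import hashlib
import os
import shutil
import tempfile


def md5_of_file(path, block_size=65536):
    """Return the MD5 hex digest of a file, read block by block."""
    hasher = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            hasher.update(block)
    return hasher.hexdigest()


def remove_duplicates(directory):
    """Delete files whose content matches an earlier file; return deleted names."""
    seen = {}
    deleted = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        key = md5_of_file(path)
        if key in seen:
            os.remove(path)
            deleted.append(name)
        else:
            seen[key] = name
    return deleted


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "photo.jpg")
    with open(original, "wb") as fh:
        fh.write(b"pretend image bytes")
    # Create the duplicate under a different name.
    shutil.copy(original, os.path.join(tmp, "photo_copy.jpg"))

    deleted = remove_duplicates(tmp)
    remaining = sorted(os.listdir(tmp))
    print(deleted)    # ['photo_copy.jpg']
    print(remaining)  # ['photo.jpg']
```

The temporary directory is removed automatically when the with block ends.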
Congratulations! You have now written a simple Python script that removes duplicate files from a directory.