How to Write a File Comparison Utility in Python

Python is a versatile programming language, and in this post we use it to write a program that finds duplicate files, matching them by contents rather than by name. A naive implementation compares every pair of files, which takes O(n²) comparisons, but we can do much better. First, we record each file's size and count how many files share each size. If only one file has a given size, it can't possibly have a duplicate. A further optimization is to compare hashes of the files instead of the files themselves. With 5 files of 100K each, comparing every pair and reading both files in full each time means reading about 2 MB (10 pairs at 200K per pair), whereas hashing each file once reduces the workload to reading 500K. An additional optimization is to hash only the header of each file first, and to calculate the hash of the full file only if the header hashes match. This example project is typical of the Python assignment help we can provide.
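The first filtering step, discarding every file whose size is unique, can be sketched on its own. This is only an illustration under assumed names (`group_by_size` and `candidates` are hypothetical helpers, not part of the program in this post, which gets the same effect through `__hash__`):

```python
import os
from collections import defaultdict

def group_by_size(paths):
    """Map each file size to the list of paths that have that size."""
    groups = defaultdict(list)
    for path in paths:
        groups[os.path.getsize(path)].append(path)
    return groups

def candidates(paths):
    """Keep only the groups in which two or more files share a size."""
    return [group for group in group_by_size(paths).values() if len(group) > 1]
```

Only the files returned by `candidates` would need to be opened and hashed; a file with a unique size is never read at all.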

import hashlib
import os
import sys
from datetime import datetime

# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # let's read stuff in 64 KB chunks!

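class File:
    """Metadata for one file, plus hash slots that are filled in lazily."""

    def __init__(self, name: str):
        self.name = name
        stat = os.stat(name)
        self.size = stat.st_size
        self.modified = datetime.fromtimestamp(stat.st_mtime)
        self.header = None   # MD5 of the first BUF_SIZE bytes
        self.full = None     # MD5 of the whole file
        self.zeroes = False  # True if hashing saw a chunk made up only of zero bytes

    def __str__(self):
        return "{}({}) {} {} {}".format(self.name, self.size, self.modified, self.header, self.full)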

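class CompareFile:
    """Wraps a File so that equality means "same contents"."""

    def __init__(self, file: File):
        self.file = file

    def __eq__(self, other):
        if not isinstance(other, CompareFile):
            return False
        # Cheapest check first: files of different sizes cannot be duplicates.
        if self.file.size != other.file.size:
            return False
        # Next, compare the hashes of the first BUF_SIZE bytes.
        self.calc()
        other.calc()
        if self.file.header != other.file.header:
            return False
        # Only now hash the complete files.
        self.calc(True)
        other.calc(True)
        return self.file.full == other.file.full

    def _hashfile(self, full):
        """Return (MD5 hex digest, zeroes flag) for the header or the full file."""
        md5 = hashlib.md5()
        with open(self.file.name, "rb") as f:
            first = True
            valid = True  # becomes False once a checked chunk is all zero bytes
            while True:
                data = f.read(BUF_SIZE)
                if not data:
                    break
                # Check the first chunk and every full-sized chunk.
                if valid and (first or len(data) == BUF_SIZE):
                    valid = any(data)
                first = False
                md5.update(data)
                if not full:
                    break  # header only: stop after one chunk
        return md5.hexdigest(), not valid

    def calc(self, full=False):
        """Compute and cache the header hash, or the full hash when full is True."""
        if full:
            if self.file.full:
                return
            self.file.full, self.file.zeroes = self._hashfile(full)
            return
        if self.file.header:
            return
        self.file.header, self.file.zeroes = self._hashfile(full)

    def __hash__(self):
        # Files of equal size get the same hash, so __eq__ makes the final decision.
        return hash(self.file.size)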

def main(args):
    fileA = File(r"E:\comics\Backways 003 (2018) (digital) (Son of Ultron-Empire).cbr")
    fileB = File(r"E:\comics\Backways 003 (2018) (digital) (Son of Ultron-Empire)_2.cbr")
    files = [CompareFile(fileA), CompareFile(fileB)]
    print(fileA)
    print(fileB)
    # Building a set calls __hash__ and then __eq__, which fills in the hashes.
    set(files)
    print(fileA)
    print(fileB)

if __name__ == "__main__":
    main(sys.argv)
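The `set(files)` call in `main` is what triggers the comparison: Python first groups set members by `__hash__` and calls `__eq__` only when two members have the same hash. The sketch below illustrates that behaviour with a hypothetical stand-in class, `Item`, that hashes by a size and compares by content, and counts how often `__eq__` runs. It is not part of the program above.

```python
class Item:
    """Hypothetical stand-in for CompareFile: hashes by size, compares by content."""
    eq_calls = 0

    def __init__(self, size, content):
        self.size = size
        self.content = content

    def __hash__(self):
        return hash(self.size)  # cheap key, like the file size

    def __eq__(self, other):
        Item.eq_calls += 1  # count how often the expensive check runs
        return isinstance(other, Item) and self.content == other.content

# Three items share size 2, and two of those also share their content.
items = {Item(1, "a"), Item(2, "b"), Item(2, "c"), Item(2, "b")}
print(len(items), "distinct items; __eq__ was called", Item.eq_calls, "time(s)")
```

Items whose sizes are all different never reach `__eq__`, which is why a file with a unique size is never opened by the program.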

The two files here are a download that turned out to be corrupt and a second download of the same file. The results are shown below.

E:\comics\Backways 003 (2018) (digital) (Son of Ultron-Empire).cbr(112814076) 2018-07-07 06:53:58 None None
E:\comics\Backways 003 (2018) (digital) (Son of Ultron-Empire)_2.cbr(112814076) 2018-07-07 06:54:12 None None
E:\comics\Backways 003 (2018) (digital) (Son of Ultron-Empire).cbr(112814076) 2018-07-07 06:53:58 55217b468617c2cb2db134815a87ec97 3d300657f2bfe697655292a8a6916b5d
E:\comics\Backways 003 (2018) (digital) (Son of Ultron-Empire)_2.cbr(112814076) 2018-07-07 06:54:12 55217b468617c2cb2db134815a87ec97 8093101eb01d3e374628d7e3aa854cf9
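The last two lines above show matching header hashes and different full hashes. The following sketch reproduces that situation with two small temporary files. It assumes a simplified, hypothetical helper `hash_file` that mirrors `_hashfile` without the zero-byte check, and a hypothetical `make_pair` that writes the test data.

```python
import hashlib
import os
import tempfile

BUF_SIZE = 65536  # same chunk size as in the program above

def hash_file(path, full):
    """MD5 of the first BUF_SIZE bytes, or of the whole file when full is True."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            data = f.read(BUF_SIZE)
            if not data:
                break
            md5.update(data)
            if not full:
                break
    return md5.hexdigest()

def make_pair(directory):
    """Write two files of equal size that differ only after the first chunk."""
    head = b"A" * BUF_SIZE
    a = os.path.join(directory, "a.bin")
    b = os.path.join(directory, "b.bin")
    with open(a, "wb") as f:
        f.write(head + b"good tail")
    with open(b, "wb") as f:
        f.write(head + b"bad tail!")
    return a, b

with tempfile.TemporaryDirectory() as tmp:
    a, b = make_pair(tmp)
    print("headers match:", hash_file(a, False) == hash_file(b, False))
    print("full hashes match:", hash_file(a, True) == hash_file(b, True))
```

The header hash is a cheap filter: it can rule out a match after reading a single chunk, but only the full hash can confirm one.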

As you can see, the program compares the files by their hashes: the hash of the first 64K is identical, but the full hashes differ. Next, we will collect all of the files in a directory and check them for duplicates. The program can then be run from the command line, which gives the results below.

C:\Python35\python.exe C:/course/social/duplicates.py e:\comics\*.rar

Duplicate file found: e:\comics\Captain America v1 441-445 (2).rar(277537904) 2018-07-25 19:32:36 3eae234a3d5703287bb66c95838b136b c7ff4078105c5e1523ec5df7380438c4
Copy of: e:\comics\Captain America v1 441-445.rar(277537904) 2018-07-21 03:51:08 3eae234a3d5703287bb66c95838b136b c7ff4078105c5e1523ec5df7380438c4

Add the following code, which replaces the earlier main function.

from glob import glob

def main(args):
    # args[1] is a glob pattern such as e:\comics\*.rar
    filenames = set(glob(args[1], recursive=True))
    files = {}
    for path in filenames:
        file = CompareFile(File(path))
        if file not in files:
            files[file] = file
        else:
            print("Duplicate file found:", file.file)
            print("Copy of:", files[file].file)
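One detail of `glob` is worth knowing when choosing the pattern to pass on the command line: `recursive=True` only descends into subdirectories when the pattern contains `**`. The sketch below shows the difference on a temporary directory; `find` is a hypothetical helper and the file names are made up for the example.

```python
import os
import tempfile
from glob import glob

def find(pattern):
    """Return the sorted matches for a glob pattern, with recursion enabled."""
    return sorted(glob(pattern, recursive=True))

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for name in ("a.rar", os.path.join("sub", "b.rar")):
        open(os.path.join(root, name), "wb").close()

    flat = find(os.path.join(root, "*.rar"))        # top level only
    deep = find(os.path.join(root, "**", "*.rar"))  # top level and subfolders
    print(len(flat), "match without **,", len(deep), "with **")
```

So a pattern like `e:\comics\*.rar` checks a single folder, while a pattern with a `**` component would include its subfolders as well.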

Keep visiting Programming Homeworks Help: we will extend this example to use a database for persistence, so that each file only ever needs to be scanned once, and we will add HTML output so that the program can display the results and let you delete files. For all your Python homework help, this should be your first choice.