Large suppression files

How to handle massive suppression files appropriately
We recommend against uploading unnecessarily large suppression files directly into your platform.
Many advertisers now have suppression files that have grown to 1GB, 2GB, or even larger. The outcome of uploading files this large is unpredictable and contingent on many factors, including the robustness of your hardware.
Instead, we recommend running a script or tool outside of the platform. Use a tool that can output a file containing only the matches between your contact list and the suppression file, and then upload that output file as a suppression list instead. This is the type of task the free version of ChatGPT 3.5 is well suited for, as shown below.
Here's the prompt I used on the free version of ChatGPT to generate a script that runs on my MacBook Pro and processes a large suppression file in md5sum format:
 
I need to create a python script that can run on osx which takes a plain text file and converts each line to an md5sum, then compares each newly md5sumed line of that file to each line of a second plain text file which already consists of md5sums, I then need the matching lines from the first file to be saved to a file in the lines original format before they were converted to md5 for the match. The pre-existing md5sums file is 812MB and over 23 million lines, so this script needs to handle large file sizes while completing quickly
 
import hashlib
import sys


def main():
    # get the filenames from the command line arguments
    file1 = sys.argv[1]
    file2 = sys.argv[2]
    output_file = "matching_lines.txt"

    # read the md5sums from file2 and store them in a set for constant-time lookups
    md5sums = set()
    with open(file2, "r") as f:
        for line in f:
            md5sums.add(line.strip())

    # hash each line of file1 and compare it to the md5sums in the set;
    # strip the trailing newline before hashing, otherwise no line would ever match
    with open(file1, "r") as f, open(output_file, "w") as out:
        for line in f:
            md5 = hashlib.md5(line.strip().encode()).hexdigest()
            if md5 in md5sums:
                out.write(line)


if __name__ == "__main__":
    main()
 
The above script was tested on millions of records, completed in a few seconds, and produced accurate output.
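Assuming the script is saved as md5_match.py (the filename is just an example), with contacts.txt as your plain-text contact list and suppression_md5.txt as the pre-hashed suppression file, you can run it from Terminal like this:

python3 md5_match.py contacts.txt suppression_md5.txt

The matches are written to matching_lines.txt in their original, unhashed format. One caveat worth noting: many MD5 suppression lists are built from lowercased email addresses, so if the match count looks suspiciously low, try hashing line.strip().lower() instead.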
Here's another example: a one-liner with AWK in the macOS Terminal that can process two files of email addresses in plain, unhashed format:
awk -F, 'FNR==NR {a[$1]; next}; $1 in a' suppression.csv contacts.csv
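A quick note on how this works: FNR==NR is true only while AWK is reading the first file (suppression.csv), so each suppressed address in column one is stored as a key in the array a; for the second file (contacts.csv), the condition $1 in a prints only the contact rows whose first comma-separated field (the separator is set by -F,) also appears in the suppression file. Append > matches.csv to save the output as an upload-ready file.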
 
To export multiple contact lists together as one file and make this process more efficient, add them to a segment and export the segment instead.
