Shared and unique lines from large files. Fastest method?

ChatGPT

When you have two large files and you want to extract shared and unique lines efficiently, you can use common command-line tools or scripting languages. Here are a few options:

  1. Python:
    You can use Python to read both files, store their lines in sets, and then use set operations to get the shared and unique lines. Both files are loaded into memory, so make sure they fit in RAM (a lower-memory variant that streams the larger file is sketched after this list).

    python
    with open('file1.txt', 'r') as f1, open('file2.txt', 'r') as f2:
        # Strip trailing newlines so the last line of each file
        # (which may lack one) compares equal to the others.
        lines1 = set(line.rstrip('\n') for line in f1)
        lines2 = set(line.rstrip('\n') for line in f2)

    shared_lines = lines1.intersection(lines2)   # lines present in both files
    unique_lines_file1 = lines1 - lines2         # lines only in file1
    unique_lines_file2 = lines2 - lines1         # lines only in file2

    print("Shared Lines:", shared_lines)
    print("Unique Lines in File 1:", unique_lines_file1)
    print("Unique Lines in File 2:", unique_lines_file2)
  2. Unix Command Line (Linux/Mac) - comm and sort:
    On Unix-like systems, you can use the comm command to compare two sorted files line by line. Both files must be sorted first, using the same locale/collation settings that comm runs under. Because sort spills to temporary files on disk, this approach handles very large files efficiently without loading everything into memory.

    bash
    sort file1.txt > sorted_file1.txt
    sort file2.txt > sorted_file2.txt
    comm -12 sorted_file1.txt sorted_file2.txt > shared_lines.txt        # lines common to both
    comm -23 sorted_file1.txt sorted_file2.txt > unique_lines_file1.txt  # lines only in file1
    comm -13 sorted_file1.txt sorted_file2.txt > unique_lines_file2.txt  # lines only in file2
  3. Windows Command Prompt - findstr:
    On Windows, you can use the findstr command with /g: to treat one file's lines as search strings against the other file. Add /l (literal) and /x (whole-line match) so the lines are not interpreted as regular expressions or matched as substrings. This method is less efficient than comm on Unix-like systems but can work for smaller files.

    cmd
    rem /l = literal strings, /x = match whole lines only, /v = keep non-matching lines
    findstr /x /l /g:file1.txt file2.txt > shared_lines.txt
    findstr /v /x /l /g:shared_lines.txt file1.txt > unique_lines_file1.txt
    findstr /v /x /l /g:shared_lines.txt file2.txt > unique_lines_file2.txt

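A lower-memory take on the Python approach (the variant referenced in option 1) is to keep only the smaller file in memory as a set and stream the other file against it. This is a minimal sketch, assuming file1.txt is the smaller of the two files; swap the roles if it is not.

python
# Hold only the smaller file (assumed here to be file1.txt) in memory.
with open('file1.txt', 'r') as f1:
    lines1 = set(line.rstrip('\n') for line in f1)

shared_lines = set()
unique_lines_file2 = set()
with open('file2.txt', 'r') as f2:
    # Stream the larger file line by line instead of building a second full set.
    for line in f2:
        line = line.rstrip('\n')
        if line in lines1:
            shared_lines.add(line)
        else:
            unique_lines_file2.add(line)

# Anything in file1 that was never matched by a line of file2 is unique to file1.
unique_lines_file1 = lines1 - shared_lines

print("Shared Lines:", shared_lines)
print("Unique Lines in File 1:", unique_lines_file1)
print("Unique Lines in File 2:", unique_lines_file2)

This produces the same three sets as option 1 while building only one full set, which helps when one file is much smaller than the other.
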
Choose the method that best fits your system and requirements. Keep in mind that the Unix command-line approach with comm is likely to be the most efficient for very large files.