Remove Duplicates from a File on the Linux Command Line
Introduction
So you have a text file with duplicates (words, lines, etc.) and you want to delete the extra ones? How can we do that? This post discusses a couple of methods that can be used to remove duplicates via a Unix (or Mac) terminal. Assume we have the following text file…
Line 5
Line 5
Line 1
Line 2
Line 2
Line 3
Line 3
Line 3
Line 4
Let us explore different ways to clean the file from duplicates…
Using awk command
awk is a standard tool on any Unix-based OS. It is very powerful for text processing and extraction. Removing duplicates with awk can be done in a one-liner. Check this out…
awk '!dup[$0]++' data.txt > data.out
Here is the explanation…
- awk is the command name
- data.txt is the input file with duplicates
- data.out is the output file without duplicates
- $0 evaluates to the current processed line in the file
- dup[] is a dictionary or an associative array where dup is just a name of our choice
- $0 is used as a key in that dictionary
- The corresponding value is the number of occurrences of that line in the file
Every time awk processes a line, the line is used as a key in the dictionary and its count is incremented using the (++) operator. The (!) character indicates a logical NOT. Because (++) is a post-increment, the expression evaluates to the count before incrementing: the first time a line is seen its count is 0, the negation is true, and the line is printed. Every later occurrence has a nonzero count and is skipped. It is a smart trick to remove duplicates while preserving the order of first occurrences.
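To make the one-liner's logic explicit, here is an expanded form of the same idiom. This is just a sketch; the sample file is created inline and the file names are only for illustration…

```shell
# Create a small sample file with duplicate lines
printf 'Line 5\nLine 5\nLine 1\nLine 2\nLine 2\n' > data.txt

# Expanded form of !dup[$0]++ : print a line only the first
# time it is seen, then increment its count
awk '{ if (dup[$0] == 0) print; dup[$0]++ }' data.txt > data.out

cat data.out
```

Both forms print each distinct line exactly once, at the position of its first occurrence.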
Using Perl command line
Utilizing the same trick (dictionary approach), we can also implement a command line Perl code to achieve the same result. Take a look…
perl -ne 'print unless $dup{$_}++;' data.txt > data.out
Note that perl -ne ‘CODE’ is equivalent to…
while (<>) {
    CODE
}
The (e) option tells Perl to execute the code supplied on the command line, and the (n) option adds a loop over the input lines around that code block. This is very convenient in text processing. Let us explain the above Perl syntax…
- $_ holds the current line just like $0 in awk.
- $dup{$_} is an associative array or a dictionary as in the awk example.
It is essentially the same logic: we use the current line as a key into a dictionary, and print it only if its count was zero before incrementing, i.e. the first time we see it.
Using Python command line
We can write a standalone Python script to open the file, remove duplicates then save the output file, however Python can also be used to process text from the terminal. Take a look…
python3 -c "import sys; lines = sys.stdin.readlines(); print(''.join(set(lines)), end='')" < data.txt > data.out
The -c option is followed by the Python code that we need to execute. Note also that line order is not retained in this case, because sets are unordered. Here is a brief explanation of the code…
# Import sys module because we are going to read
# from standard input
import sys

# Read text from standard input using the shell
# redirection feature (<). The content of the
# file is read into a Python list
lines = sys.stdin.readlines()

# Convert the list (lines) into a set because
# sets by definition have no duplicates,
# then join all the elements into a single string
print(''.join(set(lines)), end='')
If you want to write a standalone script, here is an example implementation…
# Read all lines into a set
with open('data.txt') as f:
    lines = set(f.readlines())

# Write the filtered content back to a file
with open('data.out', 'w') as f:
    f.writelines(lines)
Using sort command
If we do not care about line order, we can simply use the sort command to remove duplicate entries as follows…
# Sort and remove duplicates using the -u option
sort -u data.txt > data.out
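A quick demonstration with an inline sample file (the file names are only illustrative)…

```shell
# Unsorted input with duplicates
printf '3\n1\n2\n1\n3\n' > data.txt

# -u drops duplicate lines as part of the sort
sort -u data.txt > data.out

cat data.out
```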
Using uniq command
The uniq command does not remove all duplicates. It only removes adjacent duplicates. Here is an example…
uniq data.txt > data.out
If we apply the above command on the following data file
1
1
1
2
1
The output file should look like…
1
2
1
and not like…
1
2
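Because uniq only collapses adjacent duplicates, it is commonly combined with sort, which groups identical lines together first. The pair is then equivalent in effect to sort -u…

```shell
# Same sample data as the uniq example above
printf '1\n1\n1\n2\n1\n' > data.txt

# Sort first so duplicates become adjacent, then collapse them
sort data.txt | uniq > data.out

cat data.out
```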
Using sort with order
If we need to retain the order and stick to the sort command, we can do that using the following chain of commands…
cat -n data.txt | sort -uk2 | sort -nk1 | cut -f2 > data.out
Here is a breakdown of the piped commands…
# Print the whole file
cat data.txt
Line 5
Line 5
Line 1
Line 1
Line 1
Line 2
Line 3

# Add line numbers
cat -n data.txt
1 Line 5
2 Line 5
3 Line 1
4 Line 1
5 Line 1
6 Line 2
7 Line 3

# Sort the file based on the second column
cat -n data.txt | sort -k2
3 Line 1
4 Line 1
5 Line 1
6 Line 2
7 Line 3
1 Line 5
2 Line 5

# Remove duplicates using -u
cat -n data.txt | sort -uk2
3 Line 1
6 Line 2
7 Line 3
1 Line 5

# Restore the order by sorting again using line numbers
cat -n data.txt | sort -uk2 | sort -nk1
1 Line 5
3 Line 1
6 Line 2
7 Line 3

# Remove line numbers
cat -n data.txt | sort -uk2 | sort -nk1 | cut -f2
Line 5
Line 1
Line 2
Line 3
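The whole pipeline can be exercised end to end on the sample data from the breakdown above…

```shell
# Recreate the sample file used in the breakdown
printf 'Line 5\nLine 5\nLine 1\nLine 1\nLine 1\nLine 2\nLine 3\n' > data.txt

# Number the lines, dedupe on the text column, restore the
# original order numerically, then strip the numbers
cat -n data.txt | sort -uk2 | sort -nk1 | cut -f2 > data.out

cat data.out
```

Note that sort -u keeps the first line of each group of duplicates, which is why the original order of first occurrences survives the round trip.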
That is it for today. Thanks for reading. If you have questions, please use the comments section below.
About Author
Mohammed Abualrob
Software Engineer @ Cisco