Remove Duplicates from a File on the Linux Command Line
Introduction
So you have a text file with duplicates (words, lines, etc.) and you want to delete the extra ones? How can we do that? This post discusses a couple of methods that can be used to remove duplicates via a Unix (or Mac) terminal. Assume we have the following text file…
Line 5
Line 5
Line 1
Line 2
Line 2
Line 3
Line 3
Line 3
Line 4
Let us explore different ways to clean the file from duplicates…
Using awk command
awk is a standard tool on any Unix-based OS. It is very powerful for text processing and extraction. Removing duplicates with awk can be done in a one-liner. Check this out…
awk '!dup[$0]++' data.txt > data.out
Here is the explanation…
- awk is the command name
- data.txt is the input file with duplicates
- data.out is the output file without duplicates
- $0 evaluates to the current processed line in the file
- dup[] is a dictionary or an associative array where dup is just a name of our choice
- $0 is used as a key in that dictionary
- The corresponding value is the number of occurrences of that line in the file
Every time awk processes a line, the line is used as a key in the dictionary and its count is incremented using the (++) operator. The (!) character indicates a logical NOT. Because (++) is a post-increment, the expression evaluates to the count before incrementing: the first time a line is seen its count is 0, the negation is true, and the line is printed. Every later occurrence has a nonzero count and is skipped. It is a smart trick to remove duplicates while preserving the order of first occurrences.
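To make the one-liner's logic explicit, here is an expanded form of the same idiom. This is just a sketch; the sample file is created inline and the file names are only for illustration…

```shell
# Create a small sample file with duplicate lines
printf 'Line 5\nLine 5\nLine 1\nLine 2\nLine 2\n' > data.txt

# Expanded form of !dup[$0]++ : print a line only the first
# time it is seen, then increment its count
awk '{ if (dup[$0] == 0) print; dup[$0]++ }' data.txt > data.out

cat data.out
```

Both forms print each distinct line exactly once, at the position of its first occurrence.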
Using Perl command line
Utilizing the same trick (dictionary approach), we can also implement a command line Perl code to achieve the same result. Take a look…
perl -ne 'print unless $dup{$_}++;' data.txt > data.out
Note that perl -ne ‘CODE’ is equivalent to…
while (<>) {
    CODE
}
The (e) option tells Perl to execute the code supplied on the command line, and the (n) option adds a loop over the input lines around that code block. This is very convenient in text processing. Let us explain the above Perl syntax…
- $_ holds the current line just like $0 in awk.
- $dup{$_} is an associative array or a dictionary as in the awk example.
It is essentially the same logic: we use the current line as a key into a dictionary, and print it only if its count was zero before incrementing, i.e. the first time we see it.
Using Python command line
We can write a standalone Python script to open the file, remove duplicates then save the output file, however Python can also be used to process text from the terminal. Take a look…
python3 -c "import sys; lines = sys.stdin.readlines(); print(''.join(set(lines)), end='')" < data.txt > data.out
The -c option is followed by the Python code that we need to execute. Note also that line order is not retained in this case, because sets are unordered. Here is a brief explanation of the code…
# Import sys module because we are going to read
# from standard input
import sys

# Read text from standard input using the shell
# redirection feature (<). The content of the
# file is read into a Python list
lines = sys.stdin.readlines()

# Convert the list (lines) into a set because
# sets by definition have no duplicates,
# then join all the elements into a single string
print(''.join(set(lines)), end='')
If you want to write a standalone script, here is an example implementation…
# Read all lines into a set
with open('data.txt') as f:
    lines = set(f.readlines())

# Write the filtered content back to a file
with open('data.out', 'w') as f:
    f.writelines(lines)
Using sort command
If we do not care about line order, we can simply use the sort command to remove duplicate entries as follows…
# Sort and remove duplicates using the -u option
sort -u data.txt > data.out
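A quick demonstration with an inline sample file (the file names are only illustrative)…

```shell
# Unsorted input with duplicates
printf '3\n1\n2\n1\n3\n' > data.txt

# -u drops duplicate lines as part of the sort
sort -u data.txt > data.out

cat data.out
```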
Using uniq command
The uniq command does not remove all duplicates. It only removes adjacent duplicates. Here is an example…
uniq data.txt > data.out
If we apply the above command on the following data file
1
1
1
2
1
The output file should look like…
1
2
1
and not like…
1
2
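Because uniq only collapses adjacent duplicates, it is commonly combined with sort, which groups identical lines together first. The pair is then equivalent in effect to sort -u…

```shell
# Same sample data as the uniq example above
printf '1\n1\n1\n2\n1\n' > data.txt

# Sort first so duplicates become adjacent, then collapse them
sort data.txt | uniq > data.out

cat data.out
```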
Using sort with order
If we need to retain the order and stick to the sort command, we can do that using the following chain of commands…
cat -n data.txt | sort -uk2 | sort -nk1 | cut -f2 > data.out
Here is a breakdown of the piped commands…
# Print the whole file
cat data.txt
Line 5
Line 5
Line 1
Line 1
Line 1
Line 2
Line 3

# Add line numbers
cat -n data.txt
1 Line 5
2 Line 5
3 Line 1
4 Line 1
5 Line 1
6 Line 2
7 Line 3

# Sort the file based on the second column
cat -n data.txt | sort -k2
3 Line 1
4 Line 1
5 Line 1
6 Line 2
7 Line 3
1 Line 5
2 Line 5

# Remove duplicates using -u
cat -n data.txt | sort -uk2
3 Line 1
6 Line 2
7 Line 3
1 Line 5

# Restore the order by sorting again using line numbers
cat -n data.txt | sort -uk2 | sort -nk1
1 Line 5
3 Line 1
6 Line 2
7 Line 3

# Remove line numbers
cat -n data.txt | sort -uk2 | sort -nk1 | cut -f2
Line 5
Line 1
Line 2
Line 3
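The whole pipeline can be exercised end to end on the sample data from the breakdown above…

```shell
# Recreate the sample file used in the breakdown
printf 'Line 5\nLine 5\nLine 1\nLine 1\nLine 1\nLine 2\nLine 3\n' > data.txt

# Number the lines, dedupe on the text column, restore the
# original order numerically, then strip the numbers
cat -n data.txt | sort -uk2 | sort -nk1 | cut -f2 > data.out

cat data.out
```

Note that sort -u keeps the first line of each group of duplicates, which is why the original order of first occurrences survives the round trip.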
That is it for today. Thanks for reading. If you have questions, please use the comments section below.
About Author
Mohammed Abualrob
Software Engineer @ Cisco