Remove duplicates from a file using the Linux command line

Introduction

So you have a text file with duplicates (words, lines, etc.) and you want to delete the extra ones. How can we do that? This post discusses a couple of methods that can be used to remove duplicates via a Unix (or Mac) terminal. Assume we have a text file named data.txt with contents like the following (a made-up sample with a few repeated lines)…
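    apple
    orange
    apple
    banana
    orange
    apple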

Let us explore different ways to clean the file of duplicates…

Using awk command

awk is a standard feature on any Unix-based OS. It is very powerful for text processing and extraction. Removing duplicates with awk can be done in a one-liner. Check this out…
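Assuming the input is in data.txt and the output goes to data.out, as described below, the one-liner looks like this:

    awk '!dup[$0]++' data.txt > data.out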

Here is the explanation…

  • awk is the command name
  • data.txt is the input file with duplicates
  • data.out is the output file without duplicates
  • $0 evaluates to the current processed line in the file
  • dup[] is a dictionary or an associative array where dup is just a name of our choice
  • $0 is used as a key in that dictionary
  • The corresponding value is the number of occurrences of that line in the file

Every time awk processes a line, the line is used as a key in the dictionary and its count is incremented with the (++) operator. Since (++) is a post-increment, the expression evaluates the old count before incrementing it: the count is zero the first time a line is seen, so the negation (!) makes the condition true and the line is printed. Any later occurrence has a non-zero count and is skipped. In other words, the current line is not printed to the output file if it already exists in the dictionary. It is a neat trick for removing duplicates.

Using Perl command line

Utilizing the same trick (dictionary approach), we can also implement a command line Perl code to achieve the same result. Take a look…
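A sketch of the equivalent Perl one-liner, with the same file names as before:

    perl -ne 'print unless $dup{$_}++' data.txt > data.out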

Note that perl -ne 'CODE' is equivalent to…
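    while (<>) {
        # CODE runs here once per input line, with the current line in $_
    }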

The (e) option tells Perl to execute the code given on the command line, and the (n) option adds an implicit loop around the code block. This is very convenient for text processing. Let us explain the above Perl syntax…

  • $_ holds the current line just like $0 in awk.
  • %dup is a hash (Perl's associative array), just like dup in the awk example; $dup{$_} is its entry for the current line.

Essentially it is the same logic: we use the current line as a key into a dictionary and print it only the first time it is seen, while its count is still zero.

Using Python command line

We can write a standalone Python script to open the file, remove duplicates, and save the output file; however, Python can also be used to process text directly from the terminal. Take a look…
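One way to do it, using a set to drop duplicates (same file names as before; a sketch, not the only possible one-liner):

    python3 -c "import sys; sys.stdout.writelines(set(open('data.txt')))" > data.out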

The -c option is followed by the Python code to execute. Note also that line order is not retained in this case, because a set does not preserve insertion order. Here is a brief explanation of the code…
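  • open('data.txt') yields the file's lines one at a time, each with its trailing newline
  • set(...) collects those lines, discarding duplicates (and the original order)
  • sys.stdout.writelines(...) writes the surviving lines to standard output, which the shell redirects to data.out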

If you want to write a standalone script instead, here is an example implementation…
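A minimal order-preserving version, assuming the same file names (save it as, say, dedup.py and run it with python3 dedup.py):

    # Remove duplicate lines from data.txt, keeping the first occurrence of each.
    seen = set()
    with open('data.txt') as infile, open('data.out', 'w') as outfile:
        for line in infile:
            if line not in seen:  # first time we see this line
                seen.add(line)
                outfile.write(line)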

Using sort command

If we do not need to preserve line order, we can simply use the sort command to remove duplicate entries as follows…
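Using the -u (unique) flag:

    sort -u data.txt > data.out

Note that the output ends up sorted, not in the original order.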

Using uniq command

The uniq command does not remove all duplicates; it only removes adjacent duplicates. Here is an example…
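    uniq data.txt > data.out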

If we apply the above command to the following data file (a made-up sample with both adjacent and non-adjacent duplicates)…
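    apple
    apple
    orange
    orange
    apple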

The output file should look like…
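    apple
    orange
    apple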

and not like…
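    apple
    orange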

Using sort with order

If we need to retain the order and stick to the sort command, we can do that using the following chain of commands…
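One common recipe (a sketch, using the same file names) is to tag each line with its line number, deduplicate on the text, then restore the original numbering:

    cat -n data.txt | sort -u -k2 | sort -n -k1 | cut -f2- > data.out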

Here is a breakdown of the piped commands…
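  • cat -n prefixes every line with its line number, followed by a tab
  • sort -u -k2 sorts on the text itself (field 2 onwards) and keeps only the first copy of each line
  • sort -n -k1 re-sorts the surviving lines numerically by line number, restoring the original order
  • cut -f2- strips the line-number field, leaving just the original text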

That is it for today. Thanks for reading. If you have questions, please use the comments section below.
