Remove duplicated sequences from an alignment
The omit_duplicated app removes redundant sequences from a sequence collection (aligned or unaligned).
Let’s create sample data with duplicated sequences.
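As a minimal stand-in for such data, assume five short aligned DNA sequences named a to e, in which c and d are identical. The sequence values and the `identical_sets` helper are hypothetical, written in plain Python for illustration, and are not part of the cogent3 API.

```python
from collections import defaultdict

# Hypothetical sample alignment: "c" and "d" are exact duplicates.
data = {
    "a": "ACGT",
    "b": "ACG-",  # same as "a" except for a trailing gap
    "c": "ACGG",  # duplicate of "d"
    "d": "ACGG",  # duplicate of "c"
    "e": "AGTC",  # unique
}


def identical_sets(seqs):
    """Return the sets of names whose sequences are exactly identical."""
    groups = defaultdict(set)
    for name, seq in seqs.items():
        groups[seq].add(name)
    # Only groups with more than one member contain duplicates.
    return [names for names in groups.values() if len(names) > 1]


print([sorted(names) for names in identical_sets(data)])  # [['c', 'd']]
```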
Creating the omit_duplicated app with the argument choose="longest" retains, from each set of duplicated sequences, the one with the fewest gaps and ambiguous characters. In the above example, only one of c and d will be retained.
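The selection rule can be modelled in plain Python as follows. This is a sketch of the behaviour described, not the cogent3 implementation: the `keep_longest` helper, the sample sequences, and the tie-breaking rule (first name seen wins) are all assumptions made for illustration.

```python
from collections import defaultdict

CANONICAL = set("ACGT")  # any other character counts as a gap or ambiguity code

# Hypothetical sample alignment: "c" and "d" are exact duplicates.
data = {"a": "ACGT", "b": "ACG-", "c": "ACGG", "d": "ACGG", "e": "AGTC"}


def keep_longest(seqs):
    """Keep one sequence from each group of exact duplicates.

    Within a group, the sequence with the most canonical characters wins.
    Ties go to the name seen first (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for name, seq in seqs.items():
        groups[seq].append(name)
    keep = set()
    for names in groups.values():
        best = max(names, key=lambda n: sum(ch in CANONICAL for ch in seqs[n]))
        keep.add(best)
    return {name: seq for name, seq in seqs.items() if name in keep}


print(sorted(keep_longest(data)))  # ['a', 'b', 'c', 'e']
```

Because this model groups only exact duplicates, every member of a group has the same count, so the choice between c and d falls to the tie-break.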
Creating the omit_duplicated app with the argument choose=None removes every sequence that has a duplicate, so only unique sequences are retained.
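A plain-Python model of this stricter rule is sketched below. The `keep_unique` helper and the sample sequences are hypothetical and only illustrate the described behaviour. They are not the library's code.

```python
from collections import Counter

# Hypothetical sample alignment: "c" and "d" are exact duplicates.
data = {"a": "ACGT", "b": "ACG-", "c": "ACGG", "d": "ACGG", "e": "AGTC"}


def keep_unique(seqs):
    """Drop every sequence that occurs more than once, keeping only unique ones."""
    counts = Counter(seqs.values())
    return {name: seq for name, seq in seqs.items() if counts[seq] == 1}


print(sorted(keep_unique(data)))  # ['a', 'b', 'e']
```

Both c and d are dropped here, whereas the choose="longest" rule keeps one of them.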
The mask_degen argument specifies how to treat matches between sequences with degenerate characters.
Let’s create sample data that has a DNA ambiguity code.
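As a stand-in, assume sequences s1 and s2 that differ only at one site, where s1 has "C" and s2 has the ambiguity code "Y". The sketch below checks whether two aligned sequences are compatible under the IUPAC nucleotide codes. The sequence values and the `compatible` helper are hypothetical. Whether the library tests compatibility in this way or simply ignores degenerate positions is not established here, so treat this as one possible reading.

```python
# IUPAC nucleotide codes mapped to the bases each one can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Hypothetical sample data containing one ambiguity code.
data = {
    "s1": "ATCG",
    "s2": "ATYG",  # "Y" can stand for "C" or "T"
    "s3": "ATAG",  # "A" is not a pyrimidine
}


def compatible(seq1, seq2):
    """True if, at every site, the two characters share at least one base."""
    if len(seq1) != len(seq2):
        return False
    # Characters outside the table (e.g. "-") only match themselves.
    return all(
        set(IUPAC.get(x, x)) & set(IUPAC.get(y, y)) for x, y in zip(seq1, seq2)
    )


print(compatible(data["s1"], data["s2"]))  # True
print(compatible(data["s3"], data["s2"]))  # False
```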
Since “Y” represents a pyrimidine, meaning the site can be either “C” or “T”, s1 matches s2 and one of them will be removed.