How to generate one hot encoding for DNA sequences?

I would like to generate a one-hot encoding for a set of DNA sequences. For example, the sequence ACGTCCA can be represented in transposed form as shown below, with one row per nucleotide. However, the code below produces the encoding in a horizontal layout, whereas I would prefer the vertical form. Can anyone help me?

ACGTCCA 
1000001 - A
0100110 - C 
0010000 - G
0001000 - T

Example code:

from sklearn.preprocessing import OneHotEncoder
import itertools

# two example sequences
seqs = ["ACGTCCA","CGGATTG"]


# split sequences to tokens
tokens_seqs = [seq.split("\\") for seq in seqs]

# convert list of token-lists to one flat list of tokens
# and then create a dictionary that maps word to id of word,
# like {A: 1, B: 2} here
all_tokens = itertools.chain.from_iterable(tokens_seqs)
word_to_id = {token: idx for idx, token in enumerate(set(all_tokens))}

# convert token lists to token-id lists, e.g. [[1, 2], [2, 2]] here
token_ids = [[word_to_id[token] for token in tokens_seq] for tokens_seq in tokens_seqs]

# convert list of token-id lists to one-hot representation
vec = OneHotEncoder(n_values=len(word_to_id))
X = vec.fit_transform(token_ids)

print X.toarray()

However, the code gives me output:

[[ 0.  1.]
 [ 1.  0.]]

Expected output:

[[1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0.]]

asked Dec 14 '15 09:12

Xiong89

2 Answers

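import numpy as np

def one_hot_encode(seq):
    # map each nucleotide to a row index of the 4x4 identity matrix
    mapping = dict(zip("ACGT", range(4)))
    seq2 = [mapping[i] for i in seq]
    # indexing with a list picks one identity row per position: shape (len(seq), 4)
    return np.eye(4)[seq2]

one_hot_encode("AACGT")

## Output:
array([[1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])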
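
The array above has one row per sequence position. To get the layout asked for in the question (all A indicators first, then C, G and T, flattened into one vector per sequence), one option is to transpose the result and flatten it. This is only a minimal sketch: it assumes every sequence contains nothing but A, C, G and T, and `one_hot_flat` is a helper name made up here for illustration.

```python
import numpy as np

def one_hot_encode(seq):
    # same helper as in the answer above
    mapping = dict(zip("ACGT", range(4)))
    seq2 = [mapping[i] for i in seq]
    return np.eye(4)[seq2]

def one_hot_flat(seq):
    # hypothetical helper: transpose (length, 4) -> (4, length),
    # then flatten row by row so the A block comes first, then C, G, T
    return one_hot_encode(seq).T.reshape(-1)

seqs = ["ACGTCCA", "CGGATTG"]
X = np.array([one_hot_flat(s) for s in seqs])
print(X.shape)  # (2, 28)
```

For the two example sequences, the rows of `X` should match the expected output in the question, as floats rather than integers.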

answered Sep 30 '22 14:09

DrIDK

I suggest doing it a slightly more manual way:

import numpy as np

seqs = ["ACGTCCA", "CGGATTG"]

CHARS = 'ACGT'
CHARS_COUNT = len(CHARS)

maxlen = max(map(len, seqs))
res = np.zeros((len(seqs), CHARS_COUNT * maxlen), dtype=np.uint8)

for si, seq in enumerate(seqs):
    seqlen = len(seq)
    # one character per element, so it can be compared against each base
    # (np.chararray with a str buffer does not work on Python 3)
    arr = np.array(list(seq))
    for ii, char in enumerate(CHARS):
        # block ii of row si marks the positions where `char` occurs
        res[si][ii*seqlen:(ii+1)*seqlen][arr == char] = 1

print(res)

This gives you your desired result:

[[1 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 1 0]]
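
If all sequences have the same length, the two loops can be replaced by a single broadcast comparison. The following is a sketch of that alternative, not the approach from the answer itself; the names `arr`, `bases` and `onehot` are chosen here for illustration.

```python
import numpy as np

seqs = ["ACGTCCA", "CGGATTG"]
CHARS = "ACGT"

# (n_seqs, seq_len) array of single characters; assumes equal-length sequences
arr = np.array([list(s) for s in seqs])
bases = np.array(list(CHARS))

# compare every position against every base: shape (n_seqs, 4, seq_len)
onehot = (arr[:, None, :] == bases[None, :, None]).astype(np.uint8)

# flatten each (4, seq_len) block row by row: shape (n_seqs, 4 * seq_len)
res = onehot.reshape(len(seqs), -1)
print(res)
```

The intermediate `onehot` array keeps the 4 x length matrix from the question for each sequence, so it can be used directly when the unflattened form is more convenient.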

answered Sep 30 '22 13:09

John Zwinck

Tags: python, arrays, itertools, one-hot-encoding, scikit-learn