Question

Randomized Motif Search Input: Integers k and t, followed by a collection of strings Dna. Output:...

Randomized Motif Search

Input: Integers k and t, followed by a collection of strings Dna.

Output: A collection BestMotifs resulting from running RandomizedMotifSearch(Dna, k, t)

1,000 times. Remember to use pseudocounts!

Pseudocode

RandomizedMotifSearch(Dna, k, t):

       Motifs ← empty list

      for each sequence seq in Dna

         add randomly selected k-mer from seq to Motifs

       BestMotifs ← Motifs

       while forever

           Profile ← Profile(Motifs)

           Motifs ← Motifs(Profile, Dna)

            if Score(Motifs) < Score(BestMotifs)

                BestMotifs ← Motifs

            else

                return BestMotifs

SAMPLE DATASET:

Input:

8 5

CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA

GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG

TAGTACCGAGACCGAAAGAAGTATACAGGCGT

TAGATCAAGTTTCAGGTGCACGTCGGTGAACC

AATCCACCAGCTCCACGTGCAATGTTGGCCTA

Output:

TCTCGGGG

CCAAGGTG

TACAGGCG

TTCAGGTG

TCCACGTG

The sample dataset is not actually run on your code.

TEST DATASET 1:

Input:

6 8

AATTGGCACATCATTATCGATAACGATTCGCCGCATTGCC

GGTTAACATCGAATAACTGACACCTGCTCTGGCACCGCTC

AATTGGCGGCGGTATAGCCAGATAGTGCCAATAATTTCCT

GGTTAATGGTGAAGTGTGGGTTATGGGGAAAGGCAGACTG

AATTGGACGGCAACTACGGTTACAACGCAGCAAGAATATT

GGTTAACTGTTGTTGCTAACACCGTTAAGCGACGGCAACT

AATTGGCCAACGTAGGCGCGGCTTGGCATCTCGGTGTGTG

GGTTAAAAGGCGCATCTTACTCTTTTCGCTTTCAAAAAAA

Output:

CGATAA

GGTTAA

GGTATA

GGTTAA

GGTTAC

GGTTAA

GGCCAA

GGTTAA

            This dataset checks if your code has an off-by-one error at the beginning of each

sequence of Dna. Notice that the some of the motifs of the solution occur at the beginning of

their respective sequences in Dna, so if your code did not check the first k-mer in each sequence

of Dna, it would not find these sequences.

TEST DATASET 2:

Input:

6 8

GCACATCATTAAACGATTCGCCGCATTGCCTCGATTAACC

TCATAACTGACACCTGCTCTGGCACCGCTCATCCAAGGCC

AAGCGGGTATAGCCAGATAGTGCCAATAATTTCCTTAACC

AGTCGGTGGTGAAGTGTGGGTTATGGGGAAAGGCAAGGCC

AACCGGACGGCAACTACGGTTACAACGCAGCAAGTTAACC

AGGCGTCTGTTGTTGCTAACACCGTTAAGCGACGAAGGCC

AAGCTTCCAACATCGTCTTGGCATCTCGGTGTGTTTAACC

AATTGAACATCTTACTCTTTTCGCTTTCAAAAAAAAGGCC

Output:

TTAACC

ATAACT

TTAACC

TGAAGT

TTAACC

TTAAGC

TTAACC

TGAACA

This dataset checks if your code has an off-by-one error at the end of each sequence of

Dna. Notice that the some of the motifs of the solution occur at the end of their respective

sequences in Dna, so if your code did not check the last k-mer in each sequence of Dna, it would

not find these sequences.

In Java

Homework Answers

Answer #1

I followed the pseudocode to the best of my knowledge. I hope this is code will help you in solving this task. Here is my code:

def BuildProfileMatrix(dnamatrix):
    ProfileMatrix = [[1 for x in xrange(len(dnamatrix[0]))] for x in xrange(4)]
    indices = {'A':0, 'C':1, 'G': 2, 'T':3}
    for seq in dnamatrix:
    for i in xrange(len(dnamatrix[0])):            
        ProfileMatrix[indices[seq[i]]][i] += 1
    ProbMatrix = [[float(x)/sum(zip(*ProfileMatrix)[0]) for x in y] for y in ProfileMatrix]
    return ProbMatrix
def ProfileRandomGenerator(profile, dna, k, i):
    indices = {'A':0, 'C':1, 'G': 2, 'T':3}
    score_list = []
    for x in xrange(len(dna[i]) - k + 1):
        probability = 1
        window = dna[i][x : k + x]
    for y in xrange(k):
        probability *= profile[indices[window[y]]][y]
    score_list.append(probability)
    rnd = uniform(0, sum(score_list))
    current = 0
    for z, bias in enumerate(score_list):
        current += bias
        if rnd <= current:
            return dna[i][z : k + z]
def score(motifs):
    ProfileMatrix = [[0 for x in xrange(len(motifs[0]))] for x in xrange(4)]
    indices = {'A':0, 'C':1, 'G': 2, 'T':3}
    for seq in motifs:
        for i in xrange(len(motifs[0])):            
            ProfileMatrix[indices[seq[i]]][i] += 1
    score = len(motifs)*len(motifs[0]) - sum([max(x) for x in zip(*ProfileMatrix)])
    return score
from random import randint, uniform    
def GibbsSampler(k, t, N):
     dna = ['CGCCCCTCTCGGGGGTGTTCAGTAACCGGCCA',
    'GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG',
    'TAGTACCGAGACCGAAAGAAGTATACAGGCGT',
    'TAGATCAAGTTTCAGGTGCACGTCGGTGAACC',
    'AATCCACCAGCTCCACGTGCAATGTTGGCCTA']
    Motifs = []
    for i in [randint(0, len(dna[0])-k) for x in range(len(dna))]:
        j = 0
        kmer = dna[j][i : k+i]
        j += 1
        Motifs.append(kmer)
    BestMotifs = []
    s_best = float('inf')
    for i in xrange(N):
        x = randint(0, t-1)
    Motifs.pop(x)
    profile = BuildProfileMatrix(Motifs)
    Motif = ProfileRandomGenerator(profile, dna, k, x)
    Motifs.append(Motif)
    s_motifs = score(Motifs)
    if s_motifs < s_best:
        s_best = s_motifs
        BestMotifs = Motifs
return [s_best, BestMotifs]

k, t, N =8, 5, 100            
best_motifs = [float('inf'), None]

# Repeat the Gibbs sampler search 20 times.
for repeat in xrange(20):
    current_motifs = GibbsSampler(k, t, N)
    if current_motifs[0] < best_motifs[0]:
        best_motifs = current_motifs
# Print and save the answer.
print '\n'.join(best_motifs[1])     
Know the answer?
Your Answer:

Post as a guest

Your Name:

What's your source?

Earn Coins

Coins can be redeemed for fabulous gifts.

Not the answer you're looking for?
Ask your own homework help question
Similar Questions