chaos game for DNA sequences

Tags:

I have tried the mathematica code for making the chaos game for DNA sequences posted in this address: http://facstaff.unca.edu/mcmcclur/blog/GeneCGR.html

which is like this:

Click to copy

genome = Import["c:\data\sequence.fasta", "Sequence"];
genome = StringReplace[ToString[genome], {"{" -> "", "}" -> ""}];
chars = StringCases[genome, "G" | "C" | "T" | "A"];
f[x_, "A"] := x/2;
f[x_, "T"] := x/2 + {1/2, 0};
f[x_, "G"] := x/2 + {1/2, 1/2};
f[x_, "C"] := x/2 + {0, 1/2};
pts = FoldList[f, {0.5, 0.5}, chars];
Graphics[{PointSize[Tiny], Point[pts]}]

the fasta sequence that I have is just a sequence of letters like AACCTTTGATCAAA and the graph to be generated comes like this:

enter image description here

the code works fine with small sequences, but when I want to put a huge sequence, for example almost 40Mb of a chromosome, the program takes a lot of time and only displays a black square so that it is impossible to analyze. Is it possible to improve the aforementioned code, so that the square in which it would be displayed it would be bigger?, by the way the square must be only the square unit. Thanks for your help in advance

971

asked Nov 04 '11 12:11

Layla

1 Answers

Summary of the incremental edits below:

This will give you a considerable speedup in computing the point coordinates by using compiled code (50x excluding computing shifts):

Click to copy

shifts = chars /. {"A" -> {0., 0.}, "T" -> {.5, 0.}, "G" -> {.5, .5}, "C" -> {0, .5}};
fun1d = Compile[{{a, _Real, 1}}, FoldList[#/2 + #2 &, .5, a], CompilationTarget -> "C"]
pts = Transpose[fun1d /@ Transpose[shifts]];

The bottleneck in your code is actually rendering the graphic, we instead of plotting each point, we'll visualize the density of points:

Click to copy

threshold = 1;
With[{size = 300}, 
 Image[1 - UnitStep[BinCounts[pts, 1/size, 1/size] - threshold]]
]

A region will be coloured black if it has at least threshold points. size is the image-dimension. By either choosing a large size or a large threshold you can avoid the "black square problem".

My original answer with more details:

On my rather dated machine, the code is not very slow.

Click to copy

chars = RandomChoice[{"A", "T", "C", "G"}, 800000];

f[x_, "A"] := x/2;
f[x_, "T"] := x/2 + {1/2, 0};
f[x_, "G"] := x/2 + {1/2, 1/2};
f[x_, "C"] := x/2 + {0, 1/2};
Timing[pts = FoldList[f, {0.5, 0.5}, chars];]
Graphics[{PointSize[Tiny], Point[pts]}]

I get a timing of 6.8 seconds, which is usable unless you need to run it lots of times in a loop (if it's not fast enough for your use case and machine, please add a comment, and we'll try to speed it up).

Rendering the graphic unfortunately takes much longer than this (36 seconds), and I don't know if there's anything you can do about it. Disabling antialiasing may help a little bit, depending on your platform, but not much: Style[Graphics[{PointSize[Tiny], Point[pts]}], Antialiasing -> False] (for me it doesn't). This is a long-standing annoyance for many of us.

Regarding the whole graphic being black, you can resize it using your mouse and make it bigger. The next time you evaluate your expression, the output graphic will remember its size. Or just use ImageSize -> 800 as a Graphics option. Considering the pixel density of screens the only other solution that I can think of (that doesn't involve resizing the graphic) would be to represent pixel density using shades of grey, and plot the density.

EDIT:

This is how you can plot the density (this is also much much faster to compute and render than the point-plot!):

Click to copy

With[{resolution = 0.01}, 
 ArrayPlot@BinCounts[pts, resolution, resolution]
]

Play with the resolution to make the plot nice.

For my random-sequence example, this only gives a grey plot. For your genome data it will probably give a more interesting pattern.

EDIT 2:

Here's a simple way to speed up the function using compilation:

First, replace the characters by the shift vectors (has to be done only once for a dataset, then you can save the result):

Click to copy

arr = chars /. {"A" -> {0., 0.}, "T" -> {.5, 0.}, "G" -> {.5, .5}, "C" -> {0, .5}};

Then let's compile our function:

Click to copy

fun = Compile[{{a, _Real, 2}}, FoldList[#/2 + #2 &, {.5, .5}, a], 
 CompilationTarget -> "C"]

Remove CompilationTarget if your version of Mathematica is earlier than 8 or you don't have a C compiler installed.

Click to copy

fun[arr]; // Timing

gives me 0.6 seconds, which is an instant 10x speedup.

EDIT 3:

Another ~5x speedup is possible compared to the above compiled version by avoiding some kernel callbacks in the compiled function (I checked the compilation output using CompilePrint to come up with this version --- otherwise it's not obvious why it's faster):

Click to copy

fun1d = Compile[{{a, _Real, 1}}, FoldList[#/2 + #2 &, .5, a], 
  CompilationTarget -> "C"]

arrt = Transpose[arr];
Timing[result = fun1d /@ arrt;]
pts = Transpose[result];

This runs in 0.11 seconds on my machine. On a more modern machine it should finish in a few seconds even for a 40 MB dataset.

I split off the transpositions into separate inputs because at this point the running time of fun1d starts to get comparable to the running time of Transpose.

143

answered Oct 25 '22 00:10

Szabolcs

Related questions
                            
                                Does Mathematica have a function equivalent to Matlab's "unique" function
                            
                                How do I force Mathematica to include user defined functions in Simplify and FullSimplify?
                            
                                Mathematica - DSolve spits out #1 in output
                            
                                What is the Computational Complexity of Mathematica's CylindricalDecomposition
                            
                                Save Plot Options of Mathematica into a single variable
                            
                                Lexical and dynamic scoping in Mathematica: Local variables with Module, With, and Block
                            
                                Mathematica: branch points for real roots of polynomial
                            
                                Inserting formula into plot
                            
                                How to close initialization cell in Notebook?
                            
                                How to show the animation control by default
                            
                                Arctan Binning, from Plot to Histogram, the tricks
                            
                                How to make external Mathematica functions interruptible?
                            
                                Save Mathematica code in `FullForm` syntax
                            
                                How to change the background style of TabView?
                            
                                How to stop barchart from hiding labels for data value 0 in Mathematica?
                            
                                Fast extraction of elements from nested lists
                            
                                Computer Vision with Mathematica
                            
                                How can I plot a function defined on the unit simplex in Mathematica?
                            
                                Mathematica and CUDA
                            
                                Style dynamic value

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

chaos game for DNA sequences

Tags:

wolfram-mathematica

dna-sequence

fasta

chaos

Layla

People also ask

1 Answers

Szabolcs

Recent Activity

Donate For Us