Implementation of March memory testing algorithm

I am looking for a memory testing algorithm that will help my team verify the design and catch faults during production (bad soldering, cross-connected address/data lines, mismatched impedances, mirroring, etc.).

I've read that March C or something similar is the answer to our prayers, but I haven't yet found an implementation of such an algorithm that we can just borrow.

asked Sep 16 '10 by Christian Madsen

1 Answer

I have been testing boards for over 20 years and tonight is the first time I have heard of this March test or algorithm. Looking at it, it bothers me to see a name applied to common sense, as if that person or group invented common sense.

Anyway, think about the things you said you want to test. Ideally a board level test is testing the solder and the pc board, manufacturing items, NOT design verification and NOT chip verification. The chip should have been tested by the vendor, and you should ideally only need to do a quick functional test. In the case of memories, though, it is common to test each bit cell anyway, time permitting; with sram you probably have the time, but get into gigabytes of dram and you are looking at hours to days per board if you try to test each bit.

So ideally you want to wiggle every pin. Basic functional tests: fill all addresses with all ones (0xFFF...), fill with zeros, fill with 0x5s, fill with 0xAs; 0x6s, 0x9s, 0xCs and 0x3s if you are so inclined. Checkerboard: again with those alternating patterns, fill every other address with 0x5s and every other with 0xAs, etc. Then for crosstalk, the walking ones and walking zeros: 0x00..001, 0x00..002, 0x00..004, etc., then 0xFF..FFE, 0xFF..FFD, etc.
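
For example, a couple of those passes might look like the rough C sketch below (a 32 bit data bus is assumed, and TEST_BASE/TEST_WORDS are just placeholders you would point at the window you want to test):

#include <stdint.h>

/* placeholders, point these at the memory window you want to test */
#define TEST_BASE   ((volatile uint32_t *)0x20000000)
#define TEST_WORDS  1024u

/* fill every word with one pattern, read it back, count miscompares */
static unsigned fill_test(volatile uint32_t *mem, unsigned words, uint32_t pattern)
{
    unsigned i, errors = 0;
    for (i = 0; i < words; i++) mem[i] = pattern;
    for (i = 0; i < words; i++) if (mem[i] != pattern) errors++;
    return errors;
}

/* walking ones and walking zeros on a single location, one data bit at a time */
static unsigned walking_bits_test(volatile uint32_t *mem)
{
    unsigned bit, errors = 0;
    for (bit = 0; bit < 32; bit++) {
        uint32_t one = (uint32_t)1 << bit;
        mem[0] = one;  if (mem[0] != one)  errors++;   /* walking one  */
        mem[0] = ~one; if (mem[0] != ~one) errors++;   /* walking zero */
    }
    return errors;
}

/* the basic fills: all ones, all zeros, 0x5s and 0xAs, plus the walking bits */
static unsigned basic_fills(void)
{
    unsigned errors = 0;
    errors += fill_test(TEST_BASE, TEST_WORDS, 0xFFFFFFFFu);
    errors += fill_test(TEST_BASE, TEST_WORDS, 0x00000000u);
    errors += fill_test(TEST_BASE, TEST_WORDS, 0x55555555u);
    errors += fill_test(TEST_BASE, TEST_WORDS, 0xAAAAAAAAu);
    errors += walking_bits_test(TEST_BASE);
    return errors;
}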

That is all well and good, but it assumes you have working address bits. If, say, all the address lines were broken, most of the above tests would pass. If only the least significant address bit worked, all of the above tests would still pass, and depending on the size of the memory and how you are driving the tests that could be hours wasted.

Another thing you need to know is the size of your data bus. If this is a 32 bit processor but it uses a 16 bit data bus and you are doing a 32 bit walking ones test, you have spent twice as much time as needed; you only need to walk 16 bits, not 32. With a 64 bit data bus on a 32 bit processor (the average 32 bit desktop with a 72 bit memory, for example) you have not covered all the friendly bit combinations. A wide memory interface may not get all of its data lines exercised if all you do are accesses at half or a quarter of the memory bus width.

A common quick address test is to fill each memory location with its own address. You essentially have to put a unique pattern into every address location.
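
A rough sketch of that, taking the same kind of test window pointer as above (the address is truncated to 32 bits, which is still unique per location for any reasonable window):

#include <stdint.h>

/* write each location's own address into it, then read back and compare;
   catches stuck, shorted or swapped address lines that the fill tests miss */
static unsigned address_test(volatile uint32_t *mem, unsigned words)
{
    unsigned i, errors = 0;
    for (i = 0; i < words; i++) mem[i] = (uint32_t)(uintptr_t)&mem[i];
    for (i = 0; i < words; i++) if (mem[i] != (uint32_t)(uintptr_t)&mem[i]) errors++;
    return errors;
}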

The above covers most of the blatant bad solder, lifted pin, floating ball problems (many will apparently just label these the March tests). If the memory pinout supports different size memories you may not have hit all the address bits, but there really isn't anything you can do about that, and it may not matter, because fitting the maximum size memory would involve solder and that means a board retest anyway.

There are a lot of tests above, and if you write and run each of them on each location in the full memory space it can take a while. One easy way to reduce it, assuming the goal is manufacturing test and not inside-the-chip test, is to skip addresses using prime numbers: instead of every memory location use every 257th memory location, for example, and have your tests complete that much faster. A prime stride other than 2 will often wiggle every address bit. For walking ones tests you really only need to test one memory location, not the whole memory, and that can speed it up. A checkerboard needs only two locations (the goal there is to check state changes on the data bus).
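
The prime stride version of a fill is just the same loop with a bigger step; a sketch, using the 257 example from above:

#include <stdint.h>

/* same fill/readback idea, but only touching every 257th word so a pass over
   a large memory completes much faster; 257 is prime so the touched addresses
   still toggle the address lines well */
static unsigned strided_fill_test(volatile uint32_t *mem, unsigned words, uint32_t pattern)
{
    unsigned i, errors = 0;
    for (i = 0; i < words; i += 257u) mem[i] = pattern;
    for (i = 0; i < words; i += 257u) if (mem[i] != pattern) errors++;
    return errors;
}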

These low speed tests won't cover impedance though; that is a tricky one. You need to turn the memory bus speed knob up and down if you can, and create very low level, ideally hand assembled, tests that push the memory bus at max speed, a read or write every clock cycle, sustained as long as you can stand it. If your processor, or the peripherals in your processor (dma, etc.), cannot sustain those rates, then use whatever the fastest thing in the chip is; you need to get that thing doing the longest runs of the fastest bursts it can manage. That won't necessarily cover impedance; you probably cannot fully test impedance without putting a scope on each trace of each board. Going fast may find some of the more blatant impedance problems as well as bulk capacitance issues and things like that. A checkerboard going from all ones to all zeros may help with ground bounce and bulk capacitance as well.
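
In C about the best you can manage is something like the sketch below; a real version of this test would be hand assembled and tuned to the specific bus, but the idea is back to back accesses with as little loop overhead as possible (words is assumed to be a multiple of 4 here):

#include <stdint.h>

/* crude burst test: a run of back to back writes, then a verify pass
   (the verify does not need to be fast, only the burst does) */
static unsigned burst_pass(volatile uint32_t *mem, unsigned words, uint32_t seed)
{
    unsigned i, errors = 0;
    for (i = 0; i < words; i += 4) {      /* unrolled to cut loop overhead */
        mem[i + 0] = seed;
        mem[i + 1] = ~seed;
        mem[i + 2] = seed;
        mem[i + 3] = ~seed;
    }
    for (i = 0; i < words; i += 4) {
        if (mem[i + 0] != seed)  errors++;
        if (mem[i + 1] != ~seed) errors++;
        if (mem[i + 2] != seed)  errors++;
        if (mem[i + 3] != ~seed) errors++;
    }
    return errors;
}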

Also note that going slow is very important. High speed and high volume tests do not cover noise on the board; you could have a bad board or design and easily pass all the memory tests. You might want to have some tests that are intentionally slow, allowing write strobes to glitch for example, even better if you exercise nearby traces but not the memory traces while you wait. Fill memory, wait a bit, read it back, see if some writes snuck in. You mentioned sram; for dram a slow test is important to make sure the refreshes are working: perhaps fill with unique patterns, wait a while, read back, fill with the inverse of the unique patterns to flip every bit, wait a while, read back.
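
A sketch of that dram retention pass; delay_seconds() is just a placeholder for whatever board specific wait you have:

#include <stdint.h>

extern void delay_seconds(unsigned s);   /* placeholder, board specific */

/* fill with a unique per-location pattern, wait, verify, then repeat with the
   inverse so every bit cell gets flipped and has to hold the new value */
static unsigned retention_test(volatile uint32_t *mem, unsigned words, unsigned wait_s)
{
    unsigned i, errors = 0;
    for (i = 0; i < words; i++) mem[i] = (uint32_t)i;
    delay_seconds(wait_s);
    for (i = 0; i < words; i++) if (mem[i] != (uint32_t)i) errors++;
    for (i = 0; i < words; i++) mem[i] = ~(uint32_t)i;
    delay_seconds(wait_s);
    for (i = 0; i < words; i++) if (mem[i] != ~(uint32_t)i) errors++;
    return errors;
}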

I have for the most part abandoned most of the above tests and get a lot of mileage out of pseudo random testing, using an LFSR that creates more unique numbers than the number of memory locations I want to test. For example, this 16 bit one should produce two to the power 16 minus one unique numbers before it repeats. The minus one is because an lfsr won't operate on or generate the all zeros value; remember this when you seed it.

unsigned int fastprand16 ( unsigned int prand )
{
    // 16 bit galois lfsr, taps at bits 16,14,13,11 (maximal length)
    // feed the return value back in as prand to get the next number
    if(prand&1)
    {
        prand>>=1;
        prand^=0x0000B400; // xor in the tap mask when the bit shifted out was a 1
    }
    else
    {
        prand>>=1;
    }
    return(prand&0x0000FFFF);
}

Wikipedia has links to tables of lfsr tap positions that produce the maximum number of patterns before repeating, for various shifter lengths. The one above works but is a bit boring; you want to be flipping more data bits, not just shifting them.

Using your own randomizer is better than using one from a library. The library changes from computer to computer, from operating system to operating system, from compiler to compiler, and between versions of the os or compiler on the same system. Using your own, you ensure the test does not change its properties over time even if the host system driving it does. This is why something like an lfsr is good: it may not be a great random number generator for playing card games against a computer, but for creating repeatable, chaotic looking data patterns on a data bus with a small bit of fast executing code, it is great. Without a homebrew randomizer I would avoid randomizer based testing altogether.

If you need to perform a fast BIST, for example, you can fill memory with prand numbers, read back, fill with the inverse of the same prand numbers, read back. Or perform a prand test first to weed out the blatantly bad boards, then perform the March tests, with perhaps the exception of the address test. Or instead you could perform many/thousands of prand passes, changing the seed each time. Knowing the properties of your lfsr patterns, you might choose to use the next random number in the pattern as the seed for the next pass over memory. Or, to be thorough, you can use a second lfsr to produce the seed, creating every possible seed over time.
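
A sketch of the prand fill/readback pass using fastprand16() from above; the seed must be a nonzero 16 bit value, and note this only puts pseudo random data on the low 16 data lines, so use a wider lfsr for a wider bus:

#include <stdint.h>

unsigned int fastprand16 ( unsigned int prand );   /* defined above */

static unsigned prand_pass(volatile uint16_t *mem, unsigned words, unsigned seed)
{
    unsigned i, p, errors = 0;

    p = seed & 0xFFFFu;                /* fill with the lfsr sequence */
    for (i = 0; i < words; i++) { mem[i] = (uint16_t)p; p = fastprand16(p); }
    p = seed & 0xFFFFu;                /* read back against the same sequence */
    for (i = 0; i < words; i++) { if (mem[i] != (uint16_t)p) errors++; p = fastprand16(p); }

    p = seed & 0xFFFFu;                /* repeat with the inverse, flipping every bit */
    for (i = 0; i < words; i++) { mem[i] = (uint16_t)~p; p = fastprand16(p); }
    p = seed & 0xFFFFu;
    for (i = 0; i < words; i++) { if (mem[i] != (uint16_t)~p) errors++; p = fastprand16(p); }

    return errors;
}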

Caches and cache testing are a nightmare. On-chip caches should follow the same rule: this is not a chip verification test nor a design verification test, it is a manufacturing test. If you have the data cache on and are testing ram on the other side of it, you may be fooling yourself; you may need to perform the write pass a number of times before the read pass. Ideally you want the cache on so your test runs fast, but you want to disable the cache for the memory region under test.

That reminds me that a common mistake is to perform all of these tests only on the memory not being used by the software performing the test (assuming this board has a processor and the sram is the processor's execution memory). In particular, having that software run from zero or low memory means the memory area where most programs spend most of their time running is never tested, and the chunk in the middle between program space and stack, which is used less often, is the most tested; plus you don't test all the address bits because you might slam into the stack. Almost a waste of time bothering to do any memory tests for a system like that, don't you think? If you don't trust the memory enough that you have to test it, you cannot trust the results of a test program running in that memory. Ideally you want to execute from a rom or on-chip scratch memory, so that you can fully test the entire memory bus.

ECC memory is another nightmare. Well designed ecc memory and memory controllers will allow you to address all of the bits, including the ecc tags, letting you test everything as well as the ecc system itself, single and multi bit errors. If you don't have that access, then even for positive testing, if you are bothering to try to test every bit inside the chip, for each row you need to ensure that the suite of memory tests turns every ecc bit on and off at least once, and that each bit in the tag gets tested on while all the other bits are off at some point, and off while each of the other bits is on (not necessarily at the same time). Modern processors with their branch prediction are well within their rights to read any memory location at will, so your test may accidentally read a memory location with an intentionally planted single bit error, causing that bit to be repaired; by the time your test gets around to hitting that location it may fail because you didn't see the expected single bit error, when in fact the system is operating properly. Parity is similar to ecc, just not as bad.

Another thing about board test: say you wanted to test every bit in every chip as well as all the pcb traces, solder joints and cables. It doesn't take long to look through the peripherals, or at the instruction set of the processor itself (if you have one on board), to find that even at say 2GHz you may be looking at tens of billions of years before you even reach the outer pins of the first chip (working from the inside out). You cannot and will not test everything; pick the low hanging fruit, wait for users (hopefully in-house software/bsp developers) to find unforeseen problems, and then create new tests for those specific problems. You may have the perfect march memory test and it still won't find intermittent sram problems. Even with burn-in I have seen parts fail months later, well beyond the expected infant mortality for boards/parts.

Bottom line, there is no one size fits all solution; you have to tune common or popular or personal favorite practices to the features of the specific board/chips and be able to debug and create new tests on the fly. You also need to be proactive in forcing the design engineers to design for test. It is your head (the test engineer's) that will roll before theirs if there is a product recall.

Sorry for the long post, I hope it is useful to someone...

answered Nov 09 '22 by old_timer