Adventures in talking to the QSPI flash

I am getting closer to being able to communicate with the QSPI flash, so that we can have the MEGA65 update its own bitstreams in the field. To recap the current situation:

1. Most of the signals to the QSPI flash are easy to connect to, except the clock, which is normally driven by the FPGA's configuration logic.

2. The FPGA has a facility, the STARTUPE2 component, that allows the running bitstream to take control of this signal.

3. I have managed to achieve (2) in a test bitstream, as confirmed by my new JTAG boundary scan setup.

4. But I haven't got it working for a real bitstream.

To get to this point, from the last blog post, I discovered that the STARTUPE2 component *must* be in the top level of a design.

The question now is why it still isn't working in the real bitstream, even though I have moved it to the top level.

Basically it works in the pixeltext test target, which lacks an M65 computer, but not in the nexys4ddr-widget target. Weirder still, when I removed the M65 computer component from this second target, it still wasn't working.

This makes me suspect that there might be some kind of target setup in the Vivado project that is to blame. There is a "persist" flag that can be used, which causes the configuration clock to remain active on the QSPI clock pin. That could be the problem -- but then I would still be expecting to see the line waggle, which it doesn't seem to.

However, digging further, I did manage to control the line with the M65 computer component taken out of the real bitstream. Now I am trying to put it back in, but with a dedicated 1Hz clock on the pin, so that I can eliminate internal problems in the plumbing of the line to the register I had it hooked up to. Basically I can keep pushing the connection deeper down into the design, until it is in the component where I was controlling it.

Ok, so with the full machine core, and the 1Hz clock in the outer layer, I can control the clock line. The next step is to drive it from inside the sdcardio.vhdl file, where it gets connected, to see if I can toggle it there under automatic control. If that works, then I must have some subtle bug in the register plumbing. If not, then the plumbing problem must be between sdcardio.vhdl and the outer layer of the design. Either way, I will be able to considerably narrow down where the problem can be hiding.

So, the clock toggles, meaning the problem is probably in sdcardio.vhdl somewhere...

Okay.... So, this is one of those funny bug fixes that I really hate. It could well be that I have done something really stupid, but if so, I am ignorant of what it is. But the solution was to create a second register to control the QSPI clock at $D6CD. With that implemented, magically $D6CC works to control the clock. I've had this kind of problem before with VHDL, where possibly something is incorrectly optimising out the ability to write to some signal. Anyway, it is solved for now.

Then I started trying to investigate things, and came to the rapid conclusion that my life would be so much nicer, if I could make my new JTAG boundary scanner produce industry-standard VCD files that I could view in gtkwave, to get a more effective understanding of what is going on. So I did. It wasn't too hard, and now I can produce pretty pictures like this:
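To give a feel for why this isn't too hard, here is a minimal sketch of a VCD writer in Python. It is not the code in the actual boundary scanner; the function name `write_vcd`, the sample format and the chosen timescale are all assumptions made for illustration. The format itself is simple: a header that declares each signal with a short identifier, followed by timestamps and only the values that changed.

```python
import io

def write_vcd(out, signal_names, samples, timescale="1 us"):
    """Write single-bit samples as a VCD file.

    samples: list of (timestamp, {name: 0 or 1}) tuples, in time order.
    (Hypothetical sample format, chosen for this sketch.)
    """
    # Give each signal a short printable identifier code: '!', '"', '#', ...
    ids = {name: chr(33 + i) for i, name in enumerate(signal_names)}

    # Header: timescale, one scope, one $var line per signal.
    out.write(f"$timescale {timescale} $end\n")
    out.write("$scope module pins $end\n")
    for name in signal_names:
        out.write(f"$var wire 1 {ids[name]} {name} $end\n")
    out.write("$upscope $end\n")
    out.write("$enddefinitions $end\n")

    # Body: emit a timestamp only when at least one signal changed.
    last = {}
    for t, values in samples:
        changes = [(n, v) for n, v in values.items() if last.get(n) != v]
        if not changes:
            continue
        out.write(f"#{t}\n")
        for n, v in changes:
            out.write(f"{v}{ids[n]}\n")
            last[n] = v

# Example: three samples of the clock and chip-select pins.
samples = [
    (0, {"qspisck": 0, "qspicsn": 1}),
    (10, {"qspisck": 1, "qspicsn": 1}),
    (20, {"qspisck": 0, "qspicsn": 0}),
]
buf = io.StringIO()
write_vcd(buf, ["qspisck", "qspicsn"], samples)
vcd_text = buf.getvalue()
```

Writing `vcd_text` to a file should give something gtkwave can open directly. Only changes are logged, so a pin that sits still for a long time costs nothing in the file.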

[Image: Bildschirmfoto vom 2020-01-12 03-50-09 1.png]

Which is helpfully showing me that I can waggle the clock line, and also control the CS (chip select) line, but that the data lines are seemingly not doing anything. But I know from prior experimentation that I can indeed control these lines, so this is probably an example of me having an error in my test program. But how nice it is to be able to determine that in just a few seconds :)

Digging through this, I fixed the initial problem, but also found I had the SO and SI lines switched around from the way they should be, so that will need a resynthesis... Well, then I wasn't so sure, so I made it so that the four data lines are open-collector with internal pull-ups in the FPGA. This means that the lines can be either driven low, or float high. It also means I can fiddle with which line is which etc, without having to resynthesise each time.
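For anyone not used to open-collector signalling, the behaviour can be modelled in a few lines of Python. This is only an idealised sketch (the function name `line_level` is made up, and real lines have rise times that this ignores): the line reads low if any attached device pulls it down, and otherwise floats to whatever the pull-up provides.

```python
def line_level(drive_low_flags, has_pullup=True):
    """Resolve the logic level of an idealised open-collector line.

    drive_low_flags: one boolean per attached device, True if that
    device is currently pulling the line to ground.
    Returns 0, 1, or None (undefined when nothing drives the line
    and there is no pull-up).
    """
    if any(drive_low_flags):
        return 0          # any single device can force the line low
    return 1 if has_pullup else None

# FPGA releases the line, flash releases the line: it floats high.
both_released = line_level([False, False])
# FPGA releases the line, but the flash pulls it low.
flash_pulls_low = line_level([False, True])
```

The second case is the interesting one for debugging: the FPGA's control pin says "released", yet the pin itself reads low, so something else on the line must be holding it down.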

However, I am seeing some quite weird things with the data lines when I look at the JTAG traces:

[Image: Bildschirmfoto vom 2020-01-14 22-05-28 2.png]

So let me explain what we have here. Because I was seeing weird things, I made a test program that tries every possible value on the four data lines, CS and clock pins to the QSPI flash. The open-collector operation means that the direction pins (the .ctl pins in the lower half) basically indicate what we *should* be seeing on the actual pins (in the top half). This holds true for QspiDB[2], QspiDB[3], QspiCSn and the clock, but not for QspiDB[1] and QspiDB[0]: these two pins switch a short time later. This would only make real sense if the QSPI flash was pulling those lines down (remember, open-collector outputs "float" high, so any device connected to them can pull them down to ground), or if there is something really fishy going on with the FPGA control of those pins. I now need to try to solve this riddle.

Let's look first at FPGA control of the pins as a potential cause. As the other pins don't exhibit this strange behaviour, and the four DB pins are all controlled in an identical manner, I find it hard to believe that the problem is there. That leaves the QSPI flash as the current primary suspect.

First stop: Check the schematics. Nothing sinister here on the Nexys4DDR boards: the QSPI flash is directly connected to the FPGA, with only some external pull-up resistors, which can't cause this funny problem I am seeing.

So that suggests it is most likely just the way that I am communicating with the QSPI flash.

Poking around, it seems that DB0 only changes (or is only changeable) when CS is high. This makes sense, as when CS is high, the QSPI flash is not active, and so shouldn't be trying to drive any lines. When it is low, then DB1 stays tied low. This makes me 99% sure that DB1 is the line from the QSPI to the FPGA, and DB0 is the command line from the FPGA to the QSPI.

This means, in theory at least, that I should be able to talk to the QSPI flash, if I drive the correct waveform. However, so far at least, there are no signs of active response from the QSPI flash. And looking at the trace, here we see this weird problem again: The DB0 signal stays low for one clock tick longer than it is being pulled low:

[Image: Bildschirmfoto vom 2020-01-14 22-46-55 3.png]

This is really weird. I can slow the clock down even more (it's currently less than 1 kHz anyway) to the point where it looks much better, but this feels altogether wrong: the FPGA can read out its bitstream from this QSPI interface at 66 MHz, so ~660 Hz should be absolutely no problem! The 1.8 kOhm pull-ups should be able to pull these lines high in <1 microsecond, but we are seeing rise (or delay) times of >1 millisecond -- a thousand times slower.
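The sub-microsecond expectation can be sanity-checked with a back-of-envelope RC calculation. The sketch below assumes about 30 pF of combined pin and trace capacitance, which is a guess rather than a measured value for this board, and uses the usual 2.2 x RC rule of thumb for the 10% to 90% rise time of a simple RC network.

```python
R_PULLUP_OHMS = 1.8e3     # from the text: 1.8 kOhm pull-up resistors
C_LINE_FARADS = 30e-12    # ASSUMPTION: ~30 pF of pin + trace capacitance

# Time constant of the pull-up charging the line capacitance.
tau = R_PULLUP_OHMS * C_LINE_FARADS

# Rule of thumb: 10%-90% rise time of a first-order RC network.
rise_10_90 = 2.2 * tau

# Turn the question around: how much capacitance would a 1.8 kOhm
# pull-up need to see before the rise time stretched to 1 ms?
c_needed_for_1ms = 1e-3 / (2.2 * R_PULLUP_OHMS)

print(f"tau = {tau * 1e9:.0f} ns, rise = {rise_10_90 * 1e9:.0f} ns")
print(f"capacitance for a 1 ms rise = {c_needed_for_1ms * 1e9:.0f} nF")
```

Under that assumption the rise time comes out at roughly 0.12 microseconds, and a millisecond-scale rise would need hundreds of nanofarads on the line. So the observed delay is not what a 1.8 kOhm pull-up on a short PCB trace would produce.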

This bizarre delay occurs whether the QSPI flash is selected via the CS line, or not. This would seem to suggest that the QSPI flash is not to blame -- unless it is in some strange mode following the FPGA configuration process.

Ok, looking again at the schematic, there are indeed 1.8K pull-ups on the DB2 and DB3 lines, but not on DB0 or DB1. This means that it is possible that running these lines open-collector might not be practicable. So I resynthesised with the ability to push those lines actively high, as well as pull them low, or tri-state them, as before. Now by actively pushing them, they respond immediately, as expected. So now I can send a byte via the SPI interface, and it all looks right:

[Image: Bildschirmfoto vom 2020-01-15 05-13-04 5.png]

Of course, it still isn't working. But that could be because I just realised I am sending the bits least-significant-bit first, instead of most-significant-bit first. And indeed, that suddenly gets it responding to me!
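The bit-order mistake is easy to make when bit-banging. Here is a small Python simulation of sending one byte most-significant-bit first. It is a sketch only: the real test program runs on the MEGA65 and pokes hardware registers, and the helper names and the `FakePins` class here are invented for illustration. The byte 0x9F is the common JEDEC read-identification opcode; whether the test program sends exactly this byte is an assumption.

```python
def bits_msb_first(byte):
    """Return the 8 bits of byte, most significant bit first."""
    return [(byte >> i) & 1 for i in range(7, -1, -1)]

def spi_send_byte(byte, set_clock, set_data):
    """Bit-bang one byte: set up each bit while the clock is low,
    then raise the clock so the receiver samples it on the rising edge."""
    for bit in bits_msb_first(byte):
        set_clock(0)
        set_data(bit)
        set_clock(1)
    set_clock(0)  # leave the clock idle low

class FakePins:
    """Stand-in for the flash: records the data line on each rising clock edge."""
    def __init__(self):
        self.clock = 0
        self.data = 0
        self.sampled = []

    def set_clock(self, level):
        if level == 1 and self.clock == 0:
            self.sampled.append(self.data)
        self.clock = level

    def set_data(self, level):
        self.data = level

pins = FakePins()
spi_send_byte(0x9F, pins.set_clock, pins.set_data)
# pins.sampled now holds the bits in the order the flash would see them.
```

Replacing `range(7, -1, -1)` with `range(8)` gives the least-significant-bit-first order, which the flash would read as a completely different command.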

[Image: Bildschirmfoto vom 2020-01-15 05-18-49 6.png]

Now we're finally getting somewhere :) Again, I am so glad I implemented this VCD logger and JTAG boundary scan stuff.

Of course I could have just figured out how to do it from in Vivado, but it's so much nicer to have a little light-weight and open-source tool. Also, by having it integrated in monitor_load, I can do multiple things all in one quick action. Here is how I run the test program, and then ask monitor_load to sample those pins -- all in one single command:

make src/tests/qspitest.prg && src/tools/monitor_load -F -4 -r src/tests/qspitest.prg -V log.vcd -J src/vhdl/nexys4ddr-widget.xdc,${HOME}/build/artix7/public/bsdl/xc7a100tl_csg324.bsd,qspisck,qspicsn,qspidb[3],qspidb[2],qspidb[1],qspidb[0]

Okay, so it's a bit of a long command, but that's what pressing the up arrow in a shell is all about, so you can just use it again and again, without having to re-type it.

When that command has logged the pins for long enough, I just hit control-C, and then launch gtkwave on the resulting log.vcd file, with a little tiny script that tells it to automatically show all signals:

gtkwave -S allsigs.tcl log.vcd

So the whole work-flow is now super easy and efficient.

But anyway, back to figuring out why the test program doesn't read the data from the SPI response correctly... It's currently reading all ones, i.e., not noticing when the DB1 line goes low. Adding a short delay fixes this. Not entirely sure why. But with that, I can finally read some useful things out of the chip, and display them:

QSPI DEVICE ID = $2018                 
RDID BYTE COUNT = 77                   
TORS WITH 64KB SECTORS.                
PART FAMILY IS 8000                    
 01 80 30 30 80 FF FF FF               
 FF FF FF FF 51 52 59 02               
 00 40 00 53 46 51 00 27               
 36 00 00 06 08 08 0F 02               
 02 03 03 18 02 01 08 00               
 02 1F 00 10 00 FD 00 00               
 01 FF FF FF FF FF FF FF               
 FF FF FF FF 50 52 49 31                                                    

I confirmed with the data sheet that these data are broadly sensible. So the next step will be to extract all the relevant data out, e.g., the information I need to programme the device, and after that, to implement simple block read, erase and write functions... Which turned out to be remarkably painless, if rather boring internally. The more exciting part will be in the next post, where I (hopefully) actually implement writing of bitstreams to the QSPI flash.
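As a taste of what extracting the relevant data might look like, here is a hedged Python sketch that pulls two fields out of the bytes dumped above. It assumes the standard CFI layout, in which the "QRY" signature sits at CFI address 0x10, the device size exponent at 0x27 and the maximum multi-byte write (page) size exponent at 0x2A. Those offsets should be checked against the data sheet for the actual part before relying on them.

```python
# The 64 bytes exactly as displayed by the test program above.
DUMP = bytes.fromhex(
    "01 80 30 30 80 FF FF FF "
    "FF FF FF FF 51 52 59 02 "
    "00 40 00 53 46 51 00 27 "
    "36 00 00 06 08 08 0F 02 "
    "02 03 03 18 02 01 08 00 "
    "02 1F 00 10 00 FD 00 00 "
    "01 FF FF FF FF FF FF FF "
    "FF FF FF FF 50 52 49 31 "
)

# Locate the CFI query signature rather than hard-coding where the dump starts.
qry = DUMP.find(b"QRY")

# ASSUMPTION: standard CFI layout, addresses relative to "QRY" at 0x10.
CFI_QRY_ADDR = 0x10
CFI_DEVICE_SIZE_ADDR = 0x27   # device size is 2**n bytes
CFI_WRITE_SIZE_ADDR = 0x2A    # max multi-byte write is 2**n bytes

size_exp = DUMP[qry + (CFI_DEVICE_SIZE_ADDR - CFI_QRY_ADDR)]
page_exp = DUMP[qry + (CFI_WRITE_SIZE_ADDR - CFI_QRY_ADDR)]

device_bytes = 1 << size_exp
page_bytes = 1 << page_exp

print(f"device size: {device_bytes // (1024 * 1024)} MiB")
print(f"page size:   {page_bytes} bytes")
```

Under those assumptions the dump decodes to a 16 MiB device with 256-byte pages, which is at least consistent with the $18 capacity byte in the device ID.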