Monday, March 30, 2015

ESP8266 SPI flash performance


Despite the popularity of the ESP8266, I have yet to see a detailed datasheet published.  Nava Whiteford, on his blog, has links to a summary datasheet and a Cadence tensilica core that the chip is based on.  None of this provides any details on how the memory controller pages in data from the SPI flash, nor the speed of the communications.  About all that is clear from the datasheet and chip markings is that it uses a quad-SPI serial flash chip.

I decided to find out the performance of the SPI flash, as well as get an idea of what the cache line fill size of the chip is.  By looking at the pin-out of the flash chip, I determined that pin 6 is the clock.  After some probing and playing with the settings on my scope, I captured the clock burst shown above.

Based on the 500ns horizontal scale, the clock burst lasts a little more than 2uS.  Zooming in shows that the clock is exactly 40Mhz, or half of the ESP8266 80Mhz clock and have of the maximum 80Mhz speed rating of the SPI flash.  Given the burst lasted a little more than 2uS, the total number of clock pulses is in the range of 85-90.  Accounting for the overhead of commands to enable quad SPI mode and address setup, it seems the burst corresponds to reading 32 bytes from the flash, and therefore the cache line size is likely 32 bytes.

Conclusion

The clock signal is clean, and with a rise + fall time of 11.1ns, could be increased to 90Mhz without significant distortion or attenuation.  With documentation on the registers to change the clock speed to 160Mhz, the ESP8266 can be run at double speed without overclocking the SPI flash.

Sunday, March 29, 2015

Fastest AVR software SPI in the West

Most AVR MCUs like the ATmega328p have a hardware I/O shift register, but only on a fixed set of pins.  Arduino's shiftOut function is horribly slow, so a number of faster implementations have been written.  I'll look at how fast they are, and explain an implementation in AVR assembler that's faster than any C implementation, and I'll even claim that it is the fastest software SPI for the AVR.

Adafruit spiWrite

void spiWrite(uint8_t data)
{
 uint8_t bit;
 for(bit = 0x80; bit; bit >>= 1) {
  SPIPORT &= ~clkpinmask;
  if(data & bit) SPIPORT |= mosipinmask;
  else SPIPORT &= ~mosipinmask;
  SPIPORT |= clkpinmask;
 }
}

This code comes from the Adafruit Nokia 5110 LCD library.  It's a bit odd because it doesn't use a loop counting down from 8 to 0 for the bits to be shifted, it shifts a bit through the 8 bits in a byte.  While it is much faster than Arduino's shiftOut from using direct port manipulation instead of the slow digitalWrite, its far from an optimal implementation in C.  I compiled the code using avr-gcc 4.8 with -Os, and disassembled the code using avr-objdump -D:
00000000 <spiWrite>:
   0:   28 e0           ldi     r18, 0x08       ; 8
   2:   30 e0           ldi     r19, 0x00       ; 0
   4:   90 e8           ldi     r25, 0x80       ; 128
   6:   2d 98           cbi     0x05, 5 ; 5
   8:   49 2f           mov     r20, r25
   a:   48 23           and     r20, r24
   c:   11 f0           breq    .+4             ; 0x12
   e:   2c 9a           sbi     0x05, 4 ; 5
  10:   01 c0           rjmp    .+2             ; 0x14
  12:   2c 98           cbi     0x05, 4 ; 5
  14:   2d 9a           sbi     0x05, 5 ; 5
  16:   96 95           lsr     r25
  18:   21 50           subi    r18, 0x01       ; 1
  1a:   31 09           sbc     r19, r1
  1c:   21 15           cp      r18, r1
  1e:   31 05           cpc     r19, r1
  20:   91 f7           brne    .-28            ; 0x6
  22:   08 95           ret

The loop for each bit compiles to 14 instructions, and takes 17 clock cycles for a 0 and 18 for a 1.  Although most AVR instructions take a single cycle, the cbi and sbi instructions for clearing and setting a single bit take two cycles.  Branches, when taken, are also two-cycle instructions.

Generic spi_byte

void spi_byte(uint8_t byte){
    uint8_t i = 8;
    do{
        SPIPORT &= ~mosipinmask;
        if(byte & 0x80) SPIPORT |= mosipinmask;
        SPIPORT |= clkpinmask;  // clk hi
        byte <<= 1;
        SPIPORT &=~ clkpinmask; // clk lo

    }while(--i);
    return;
}

I've seen variants of this code used not just for AVR, but also for PIC MCUs.  It is faster than the Adafruit code in part because the loop counts down to zero, and as experienced coders know, on almost every CPU, counting up from zero to eight is slower than counting down to zero.  The disassembly shows the code to be 40% faster than the Adafruit code, taking 12 cycles for a 0 and 13 for a 1.
00000024 <spi_byte>:
  24:   98 e0           ldi     r25, 0x08       ; 8
  26:   2c 98           cbi     0x05, 4 ; 5
  28:   87 fd           sbrc    r24, 7
  2a:   2c 9a           sbi     0x05, 4 ; 5
  2c:   2d 9a           sbi     0x05, 5 ; 5
  2e:   88 0f           add     r24, r24
  30:   2d 98           cbi     0x05, 5 ; 5
  32:   91 50           subi    r25, 0x01       ; 1
  34:   c1 f7           brne    .-16            ; 0x26
  36:   08 95           ret

AVR optimized in C

Looking at the assembler code, half of the loop time is taken by the cbi and sbi two-cycle instructions.  The key to further speed optimizations is to write code that will compile to single-cyle out instructions instead.  The mosi and clk pins can be cleared by reading the port state before the loop, then writing the 8 bits of the port with mosi and clk cleared:
    uint8_t portbits = (SPIPORT & ~(mosipinmask | clkpinmask) );
    do{
        SPIPORT = portbits;      // clk and data low

This also saves having to clear the clk pin at the end of the loop, for a total savings of 3 cycles.  With this technique, the time per bit can be reduced to 9 cycles.  By using the AVR PIN register, another cycle can be shaved off the loop.  The datasheet does not describe the PIN register in detail, stating little more than, "Writing a logic one to PINxn toggles the value of PORTxn".  What this means, for example, is that writing 0x81 to PINB will toggle the state of bit 0 and bit 7, leaving the rest of the bits unchanged.  Here's the final code:
void spi_byteFast(uint8_t byte){
    uint8_t i = 8;
    uint8_t portbits = (SPIPORT & ~(mosipinmask | clkpinmask) );
    do{
        SPIPORT = portbits;      // clk and data low
        if(byte & 0x80) SPIPIN = mosipinmask;
        SPIPIN = clkpinmask;     // toggle clk
        byte <<= 1;
    }while(--i);
    return;
}

The disassembly shows that although the code size has increased, the loop for transmitting a bit takes only 8 cycles.  More speed can be obtained at the cost of code size by having the compiler unroll the loop (enabled with -O3 in gcc).  This would reduce the time per bit to 5 cycles.
00000050 <spi_byteFast>:
  50:   25 b1           in      r18, 0x05       ; 5
  52:   2f 7c           andi    r18, 0xCF       ; 207
  54:   98 e0           ldi     r25, 0x08       ; 8
  56:   40 e1           ldi     r20, 0x10       ; 16
  58:   30 e2           ldi     r19, 0x20       ; 32
  5a:   25 b9           out     0x05, r18       ; 5
  5c:   87 fd           sbrc    r24, 7
  5e:   43 b9           out     0x03, r20       ; 3
  60:   33 b9           out     0x03, r19       ; 3
  62:   88 0f           add     r24, r24
  64:   91 50           subi    r25, 0x01       ; 1
  66:   c9 f7           brne    .-14            ; 0x5a
  68:   08 95           ret
\

Assembler

I learned to code in assembler (6502) over thirty years ago, and started to learn C a few years after that.  When gcc was first released in 1987, it generated code that was much larger and slower than assembler.  Although it has improved significantly over the years, what surprises me is that it or any other C compiler still rarely matches hand-optimized assembler code.  You might think that there's nothing left to optimize out of  the 7 instructions that make up the the loop above, but by making use of the carry flag, I can eliminate the loop counter.  That saves a register and reduces the loop time to 7 cycles from 8:
spiByte:
    in r18, SPIPORT     ; save port state
    andi r18, ~(mosipinmask | clkpinmask)
    ldi r20, mosipinmask
    ldi r19, clkpinmask
    lsl r24
    ori r24, 0x01       ; 9th bit marks end of byte
spiBit:
    out SPIPORT, r18
    brcc zeroBit
    out SPIPORT-2, r20  ; PORT address -2 is PIN
    lsl r24
    out SPIPORT-2, r19  ; clk hi
    brne spiBit
    ret

When looking for fast software SPI code, the best I could find was 8 cycles per bit.  I read a couple posts on AVRfreaks claiming 7 cycles is possible, but no code was posted.  Unrolled, the above assembler code is still 5 cycles per bit, the same as the optimized C version.  So to back up my claim about the fastest code and hand-optimized assembler being better than the compiler, I need to reduce the timing to 4 cycles per bit.  I can do it using the AVR's T flag, with the bst and bld instructions that transfer a single bit between the T flag and a register.
spiFast:
    in r25, SPIPORT     ; save port state
    andi r25, ~clkpinmask
    ldi r19, clkpinmask
halfByte:
    bst r24, 7
    bld r25, MOSI
    out SPIPORT, r25    ; clk low + data
    out SPIPORT-2, r19  ; clk hi
    bst r24, 6
    bld r25, MOSI
    out SPIPORT, r25    ; clk low + data
    out SPIPORT-2, r19  ; clk hi
    bst r24, 5
    bld r25, MOSI
    out SPIPORT, r25    ; clk low + data
    out SPIPORT-2, r19  ; clk hi
    bst r24, 4
    bld r25, MOSI
    out SPIPORT, r25    ; clk low + data
    out SPIPORT-2, r19  ; clk hi
    swap r24
    eor r1, r19         ; r1 is zero reg
    brne halfByte
    ret

The loop is half unrolled, doing two loops of 4 bits, with the function using a total of 23 instructions.  Fully unrolled the function would use 35 instructions/cycles (plus return), saving 4 cycles over the half unrolled version.

Conclusion

Including overhead, the spiFast assembler code is just under half the speed of  hardware SPI running at full speed (2 cycles per bit).  With the assistance of a hardware timer to generate the clock, and a port dedicated to just the mosi line, it's theoretically possible to output one bit every two cycles using a sequence of lsl and out instructions.  But for a fully software implementation that doesn't modify anything other than the mosi and clk bits, you won't find anything faster than 4 cycles per bit.  Copies of the code are available on my github repo: spi.S and spiWrite.c.

Thursday, March 26, 2015

Yet another esp8266 article

When I received my 246c ESP-01 module, I didn't expect I'd end up writing a blog post about it.  Unless you've spent the last six months on a research mission in Antarctica, you probably have read about these cheap and relatively easy to program Wifi modules.  But as I learned a few things that I haven't read about, I decided to share my knowledge.

The first thing I learned is that the module worked fine getting it's power from a PL2303HX USB-TTL module, despite many claims I've read that the esp8266 modules draw too much power.  This is likely true for FTDI USB-TTL modules, as the internal 3.3V regulator on them is only rated for 50mA.  The the regulator on the PL2303HX is rated for 150mA, and my testing revealed the modules will output more than the ~200mA maximum required for the esp8266 with minimal voltage drop.  I did find the high current draw when I connected power to the ESP-01 sometimes caused the USB-TTL module to reset.  This was resolved by soldering a 22nF capacitor between the 3.3V output and Gnd.

As most guides will tell you, CH_PD/CHIP_EN needs to be high in order for the module to boot, so I soldered a short wire to Vcc.  Some people also say to pull RST high, however, like many MCU's, the esp8266 has an internal pull-up on reset, so no external pullup is required.  Similarly, GPIO0 and GPIO2 need to be high to select boot from SPI flash.  GPIO0 has an internal pull-up on boot, and GPIO2 defaults to high, so nothing needs to be wired to these pins.

An easy way to tell if you ESP is working is to do a Wifi scan.  After I powered it up, I could see an open access point named ESP_XXXXXX, where XXXXXX is part of the MAC address.

To communicate with the ESP-01, I started by using Putty at 9600bps to try using the AT commands.  I wasn't getting any response when I typed, so I reset the module by touching a resistor between Gnd and RST.  This caused the blue LED to flash, and along with a bit of garbage, the following text appeared:
[Vendor:www.ai-thinker.com Version:0.9.2.4]

What I eventually figured out was that the esp8266 requires all AT commands to end with CRLF.  Putty seems to send just CR for [Enter], so I entered <CTRL-M><CTRL-J> in order to send a CRLF.

I decided to try the nodeMCU eLua firmware, so I jumpered GPIO0 low, and ran the nodemcu-flasher.  After the firmware was flashed, I was able to connect (still at 9600bps) with putty and enter elua code.  The next step was to try an IDE for elua development on the esp8266.

I found three different IDEs, LuaLoader, Esplorer, and Lua Uploader.  I tried LuaLoader first, but had problems with serial communications - I wasn't seeing any text coming from the esp8266.  It's handy DTR & RTS control buttons worked, as I could get the module to reset by wiring RTS to reset on the module, then toggling RTS low then back to high.  I tried running Esplorer, but when I found out it needs Java SE installed first, I moved on to trying Lua Uploader rather than waiting for the Java runtime to download and install.

As can be seen from the screen shot above, Lua Uploader is rather spartan, but it worked fine for me.  By default it loads with a blink program that toggles GPIO2.  Communication with the ESP is rather slow at the default 9600bps, so I saved the following init.lua file to the ESP for 115.2kbps communication:
uart.setup(0, 115200, 8, 0, 1, 1)

Conclusion

I'm impressed with the capability of these little modules.  Next I plan to get another module with more GPIO, and try out the C SDK.

Monday, March 16, 2015

Low-power wireless communication options

I've written a number of blog posts about wireless communications, primarily 315/433Mhz ASK/OOK modules and 2.4Ghz GFSK modules.  In this article I'll analyze the pros and cons of each.

UHF ASK/OOK modules

ASK modules are the simplest type of communication modules, both in their interface and in their radio coding.  Single-pin operation allows them to be used with a single GPIO pin, or even with a UART.  The 315 and 433Mhz bands can be received by inexpensive RTL-SDR dongles, making antenna testing relatively easy.  At 78c for a pair of modules, they are a very inexpensive option for wireless communications for embedded systems.  The bands are popular for low-bandwidth keyless entry systems, so fob transmitters are widely available for 315 and 433 Mhz.

For a given power output budget, 315Mhz communications has the best range since free-space loss increases with frequency.  Using 433Mhz means about 3dB more attenuation than 315, and 2.5Ghz has about 15dB more attenuation than 433.  The 315 and 433Mhz bands can be used for low-power control and intermittent data in the US and Canada, but in the UK it looks like only the 433Mhz band is available.  Due to the simple radio encoding involving the presence or absence of a carrier frequency, sensitivity is over 100dB.  Superheterodyne receivers have a sensitivity in the -107 to -111 dB range, while super-regenerative receivers have a sensitivity in the -103 to -106 dB range.

The inexpensive super-regenerative receivers may need tuning, though this can be avoided by using the more expensive superheterodyne receivers which use a crystal for a frequency reference.  High-level protocol requirements such CRC or addressing must be handled in software, although libraries such as VirtualWire can greatly simplify this task.

Although the specs on most transmitters indicate an input power requirement between 3.5 and 12V, the modules I have tested will work on lower voltages.  I tested a 433Mhz transmitter module powered by 2xAA NiMH providing 2.6V.  At that voltage, the transmit power was about 9dB lower than when it is powered at 5V.

As David Cook explains in his LoFi project, it is feasible to use CR2032 coin cells to power projects using these modules.  Depending on the MCU power consumption and the frequency of data transmission, even 5 years of operation on a single coin cell is possible.

2.4Ghz modules

I've written a few blog posts before about 2.4Ghz modules, both Nordic's nRF24l01 and Semitek's SE8R01.  The nRF chip has lower power consumption of 13.5mA in receive mode, than the Semitek module which consumes 19.5mA.  Either module can still be powered from a CR2032 coin cell.  When using a 50-100uF capacitor as recommended by TI, capacitor leakage can add to battery drain, so use a higher voltage capacitor, such as 16 or even 50V as these will have lower leakage.

Nordic's chip has lower power consumption, you might not get a real nRF chip when you purchase a module.  Besides higher power consumption, some clones are not compatible with the genuine nRF this can range from completely unable to communicate with nRF modules like the SE8R01, or simply inability to work with the dynamic payload length feature.  With SE8R01 modules selling for only 41c in small quantity, if you don't require compatibility with the genuine nRF protocol, and aren't planning to try bit-banging bluetooth low energy, then these seem to be the best choice.  One other reason to go with genuine nRF modules (if you can find them for a reasonable price) is for range.  At 250kbps, the nRF has a receiver sensitivity of  -94 dBm.  The SE8R01 doesn't have a 250kbps mode, and at both 500kbps and 1mbps the receiver sensitivity is -86 dBm.

Although the power consumption of 2.4Ghz modules is similar to UHF ASK/OOK modules, the higher data rate permits lower average power.  Take for example an periodic transmission consisting of a total of 10 bytes: 1 sync, 3 address, 4 payload data, and 2 CRC bytes.  At 250kbps the radio would be transmitting for only 320uS and needs to be powered another 130uS for tx settling time, making the total less than half a millisecond.  An ASK/OOK module transmitting at 9600kbps would have the radio powered for a little over 8ms.  For a periodic transmission every 10 seconds, and with sleep power consumption of 4uA, the ASK module would consume an average of 21uA, compared to just 5uA for the nRF module.

Conclusion

For low-speed uni-directional data, a UHF ASK/OOK module will work fine, and is the simplest solution.  For reliable bi-directional communication, or for high bandwidth 2.4Ghz modules are the best choice.

Saturday, February 14, 2015

PL2303HX 3.3V power output


I've recently read comments about pl2303 USB-TTL serial modules stating that their power output is too low for modules like the esp8266.  Although I've written about the modules before, I didn't check the power output on them.  The datasheet indicates a maximum 3.3V output of 150mA, but it may output more (or less) than that before dropping below 3V.

For testing I tried two different modules.  With no load, one output 3.35V and the other was 3.34V.  With a 13Ohm load, both dropped to 3.15V, which computes to an output current of 242mA.  With a 242mA current causing a 0.2V drop, the output resistance of the regulator is approximately 0.83 Ohms.  This means if you want to push it to its limit, it should be able to output around 420mA before it drops below 3.0V.  Pretty good for modules that sell for under $1 including shipping.

Monday, February 9, 2015

picobootSTK500: an Arduino compatible atmega bootloader in 216 bytes


picoboot was the first bootloader I wrote.  Although it hasn't been beat for size, the fact that it uses a proprietary protocol is a big drawback.  I soon found out that despite Avrdude being open source, there is no open and public process for contributing to it.  So I published a version of avrdude with picoboot support, but since I had emailed Joerg with my patches instead of posting to the avrdude mailing list, support for picoboot in the official avrdude release was not going to happen.

With Arduino Pro Mini boards available from China for only $2, there is no longer a significant cost advantage of using individual DIP AVR MCUs.  Many of these boards come with the stock Arduino bootloader, which is 2KB.  Optiboot was what I flashed to my pro mini boards to replace the stock bootloader.  Since optiboot doesn't wait long after reset for a flash download, and my PL-2303hx USB-TTL modules don't break out the DTR line, I developed a zero-wire auto-reset for the boards.  But since I couldn't get my patch added to avrdude, it would be hard for other people to use it.

My solution was to write a tiny bootloader that, like Optiboot, is compatible with the Arduino bootloader already supported by avrdude.  Like picoboot, I decided to write it in AVR assembler.  Optiboot is small enough to fit into the 512-byte minimum boot size of a mega328, but too big for the 256-byte minimum bootsize of a mega168.  Another design goal was for the bootloader to wait indefinitely, to avoid having to juggle the timing of pressing the reset button and uploading code to flash.  I didn't want to use the watchdog reset the way Optiboot does, since it makes it impossible for application code to detect a watchdog reset.  Lastly, if space allowed, I wanted to support eeprom writing (Optiboot only supports flash writes).

After a few months of on-and-off development, I have a working release.  There's no eeprom write support yet, but at 216 bytes of code, I still have 40 bytes to use for eeprom support and still have it fit in 256 bytes.  Besides being the smallest arduino-compatible bootloader, it has a unique reset-handling functionality.  Pressing the reset button once will go into bootloader mode, waiting with the LED dimly lit.  Pressing the reset button a second time will do a normal reset.  Due to delayed re-enabling of the read-while-write (RWW) section, it should be slightly faster than Optiboot.  Like my original picoboot bootloader, a basic blink program is included with picobootSTK500.

I have included two pre-built versions in the repository; one for the mega328 and another for the mega168.  I only have mega328's for testing, so I'd appreciate someone with a mega168 testing it out and dropping a note in the comments.  Lastly I'd like to thank Optiboot maintainer Bill Westfield for his help.  He is one of those pleasant programmers where his abilities exceed his ego - something I may yet achieve in the future. :-)

Saturday, January 17, 2015

USB interfacing for AVR microcontrollers

Since the release of V-USB, dozens of projects have been made that allow an AVR to communicate over USB.  USB data signals are supposed to be in the range of 2.8 to 3.6V, so there are two recommended ways to have an AVR output the correct voltage.  One is to supply the AVR with 3.3V power, and the other is to use 5V power but clip the USB data signal using zener diodes.  Most implementations of V-USB, like USBasp, use the zener diodes.  I'll explain why using a 3.3V supply should be the preferred method.

Pictured above is a capture of D- line on a chinese USBasp, showing one of the 1kHz SE0 idle pulses.  The idle voltage is close to the 2.8V limit because this particular USBasp uses a 2.2K pullup resistor to +5V.  Using a 1.5K pullup as is recommended on the V-USB site puts it closer to 3.2V, which is the high voltage level from the host.

To determine how the signal is read by the AVR, we can look at the ATmega8A input threshold graphs at section 28.8 of the datasheet.
The graph indicates with a 5V supply, any voltage over 2.65V will be read as a logic "1".  The next graph in the datasheet shows any voltage below 2.4V will be read as a logic "0".  Since signals from the host transition between 0 and 3.2V and not 5V, the AVR will interpret low signals as being longer than high signals.  With a fall time of about 115ns, the signal will drop to 2.4V (logic low) in about 29ns.  With a rise time of about 100ns, the signal will rise above 2.65V from 0V in about 83ns.  At 1.5mbps, both high and low bits should last 667ns, but instead a low bit will last 721ns and a high bit will last only 612ns.  This 18% difference in bit times reduces the margin of error V-USB has to work with.

With a 3.3V supply voltage to the AVR, the low signal will last almost 670ns, and the high will last a little more than 663ns, for a timing difference of just under 1%.  Besides significantly improving signal timing, using a 3.3V supply simplifies the USB interface circuit.  Obviously the zener diodes are no longer needed.  Additionally, the 68Ohm resistors which limit the current from the AVR through the zener diodes, are no longer necessary either.

An alternative to using a 3.3V regulator, which is described on the V-USB site, is to use 2 diodes in series to provide a drop of about 1.6V.  My preference is to use a medium-power red LED like the BR5379K, which doubles as a power LED.  At 5mA, the current draw of an ATmega8A running at 12Mhz and 3.6V, the BR5379K has a voltage drop of about 1.65V.  It has a maximum current of 50mA; more than enough to drive something like a nRF24l01+ module in addition to the AVR.