Wednesday, April 1, 2015

A 4Mbps shiftOut for ESP8266/Arduino


Since I finished writing the fastest possible bit-banged SPI for AVR, I wanted to see how fast the ESP8266 is at bit-banging SPI.  The NodeMCU eLua interpreter I initially tested on my ESP-01 has little hope of high performance, since it is at best byte-code compiled.  For a simple way to develop C programs for the ESP8266, I decided to use ESP8266/Arduino, using Jeroen's installer for my existing Arduino 1.6.1 installation.  Starting with a basic shiftOut function that ran at around 640kbps, I was able to write an optimized version that is six times faster, at almost 4Mbps.

I modified the spi_byte AVR C code to use digitalWrite(), and called it twice in loop():
void shiftOut(byte dataPin, byte clkPin, byte data)
{
  byte i = 8;
  do {
    digitalWrite(clkPin, LOW);
    digitalWrite(dataPin, LOW);
    if (data & 0x80) digitalWrite(dataPin, HIGH);  // MSB first
    digitalWrite(clkPin, HIGH);                    // data is stable on the rising edge
    data <<= 1;
  } while (--i);
}

void loop() {
  shiftOut(DATA, CLOCK, 'h'); 
  shiftOut(DATA, CLOCK, 'i'); 
}

Since I don't have a datasheet for the ESP8266 that provides instruction timing, and am just starting to learn lx106 assembler, I used my oscilloscope to measure the timing of the data line.

The time to shift out 8 bits of data is around 12.5µs, for a speed of 640kbps.  Looking at the signal in more detail, I could see that the time between digitalWrite(dataPin, LOW) and digitalWrite(dataPin, HIGH) was 425ns.  Rather than setting the data pin low and then setting it high if the bit to shift out was a 1, I changed the code to do a single digitalWrite based on whether the bit is a 0 or a 1:
void shiftOut(byte dataPin, byte clkPin, byte data)
{
  byte i = 8;
  do {
    digitalWrite(clkPin, LOW);
    // one write to the data pin per bit instead of two
    if (data & 0x80) digitalWrite(dataPin, HIGH);
    else digitalWrite(dataPin, LOW);
    digitalWrite(clkPin, HIGH);
    data <<= 1;
  } while (--i);
}

This change increased the speed slightly, to 770kbps.  Suspecting that the overhead of calling digitalWrite was a large part of the performance limit, I looked at the source for the digitalWrite function.  If I could get the compiler to inline digitalWrite, I figured it would provide a significant speedup.  From my previous investigation of the performance of digitalWrite, I knew gcc's link-time optimization (LTO) could do this kind of global inlining.  I enabled LTO by adding -flto to the compiler options in platform.txt.  Unfortunately, the xtensa-lx106-elf build of gcc 4.8.2 does not yet support LTO.

After looking at the source for the digitalWrite function, I could see that I could replace each digitalWrite with a call to the ESP8266 library's GPIO_REG_WRITE:
void shiftOutFast(byte data)
{
  byte i = 8;
  do {
    // W1TC: writing a 1 clears the pin; W1TS: writing a 1 sets it
    GPIO_REG_WRITE(GPIO_OUT_W1TC_ADDRESS, 1 << CLOCK);
    if (data & 0x80)
      GPIO_REG_WRITE(GPIO_OUT_W1TS_ADDRESS, 1 << DATA);
    else
      GPIO_REG_WRITE(GPIO_OUT_W1TC_ADDRESS, 1 << DATA);
    GPIO_REG_WRITE(GPIO_OUT_W1TS_ADDRESS, 1 << CLOCK);
    data <<= 1;
  } while (--i);
}

This modified version was much faster; the oscilloscope screenshot at the beginning of this article shows the performance of shiftOutFast.  One bit time is 262.5ns, for a speed of 3.81Mbps.  This would be quite adequate for driving a Nokia 5110 black and white LCD, which has a maximum speed of 4Mbps.
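The shiftOutFast loop can also be checked off-target. The following is a minimal host-side sketch in plain C, not code from the ESP8266 SDK. It assumes the W1TS and W1TC registers behave as write-1-to-set and write-1-to-clear, stands in two hypothetical helpers (w1ts, w1tc) for the GPIO_REG_WRITE calls, and uses made-up pin numbers. A simulated receiver samples the data line on each rising clock edge, so loopback() should return the byte that was shifted out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pin numbers, chosen only for this sketch. */
#define DATA_PIN  2
#define CLOCK_PIN 0

static uint32_t gpio_out;   /* host-side stand-in for the GPIO output latch */
static uint8_t  received;   /* byte reassembled by the simulated receiver   */

/* Model of a write to the write-1-to-set register. */
static void w1ts(uint32_t mask)
{
    uint32_t before = gpio_out;
    assert((mask & ~0xFFFFu) == 0);   /* this model only covers GPIO 0-15 */
    gpio_out |= mask;
    /* On a rising clock edge the receiver shifts in the current data level. */
    if (!(before & (1u << CLOCK_PIN)) && (gpio_out & (1u << CLOCK_PIN)))
        received = (uint8_t)((received << 1) | ((gpio_out >> DATA_PIN) & 1u));
}

/* Model of a write to the write-1-to-clear register. */
static void w1tc(uint32_t mask)
{
    assert((mask & ~0xFFFFu) == 0);
    gpio_out &= ~mask;
}

/* Same loop structure as shiftOutFast, with register writes routed to the model. */
static void shift_out_model(uint8_t data)
{
    uint8_t i = 8;
    do {
        w1tc(1u << CLOCK_PIN);
        if (data & 0x80)
            w1ts(1u << DATA_PIN);
        else
            w1tc(1u << DATA_PIN);
        w1ts(1u << CLOCK_PIN);
        data <<= 1;
    } while (--i);
}

/* Shift one byte out and return what the simulated receiver saw. */
uint8_t loopback(uint8_t data)
{
    received = 0;
    shift_out_model(data);
    return received;
}
```

Calling loopback('h') exercises the same bit order as the shiftOut(DATA, CLOCK, 'h') call in loop(). The model says nothing about timing; it only confirms that data is set up before each rising clock edge.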

Conclusion

While 4Mbps is fast enough for a low-resolution LCD display or some LEDs controlled by a shift register like the 74595, it's quite slow compared to the 80MHz clock speed of the ESP8266.  Each bit, at 262.5ns, takes 21 clock cycles.  I doubt the ESP8266 supports modifying an I/O register in a single cycle like the AVR does, but it should be able to do it in two or three cycles.  While I don't have a proper datasheet for the ESP8266, the Xtensa LX data book is a good start.  Combined with disassembling the compiled C, I should be able to further optimize the code, and maybe even figure out how to write the code in lx106 assembler.
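The speed and cycle figures quoted in this post follow from two one-line formulas. As a small worked sketch (the function names are chosen here for illustration and do not come from any library):

```c
#include <assert.h>

/* Bit rate in bits per second, given the time for one bit in nanoseconds. */
double bit_rate_bps(double bit_time_ns)
{
    assert(bit_time_ns > 0.0);
    return 1e9 / bit_time_ns;
}

/* CPU clock cycles spent per bit, given the bit time in nanoseconds
 * and the core clock in MHz (1 MHz = 1 cycle per 1000 ns). */
double cycles_per_bit(double bit_time_ns, double cpu_mhz)
{
    assert(bit_time_ns > 0.0 && cpu_mhz > 0.0);
    return bit_time_ns * cpu_mhz / 1000.0;
}
```

With the measured 262.5ns bit time, bit_rate_bps gives about 3.81Mbps and cycles_per_bit(262.5, 80.0) gives 21. For the first digitalWrite version, 12.5µs per byte is 1562.5ns per bit, which works out to 640kbps and 125 cycles per bit.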

6 comments:

  1. Have you seen the GPIO_OUT_ADDRESS register? It acts like a "port", allowing you to read/write all of GPIO 0-15. GPIO 16 uses a different register.

    Replies
    1. I looked at the code disassembly, and the GPIO_REG_WRITE(GPIO_OUT_W1TC_ADDRESS, X) compiles to a single instruction. Same thing for GPIO_OUT_W1TS_ADDRESS for setting, so I can't see any benefit to doing it differently.

  2. Ralph, see this link for a GPIO toggling example in ASM: http://bbs.espressif.com/viewtopic.php?t=200&p=987#p956

  3. Take a look at this SPI lib https://github.com/MetalPhreak/ESP8266_SPI_Driver.
    I got 16Mbps after optimizing the spi_transaction function (moved the SPI register setup to a second method).

    Replies
    1. 16Mbps is pretty good. I haven't done much low-level ESP8266 work lately, but I think 10Mbps is feasible for bit-banging when running the MCU at 160MHz.
      I find the spi_transaction function unnecessarily complex. Abstracting away the command/address/data and just having data out and data in would be better IMHO.

  4. Hi Ralph, I have also been doing some work on ESP8266 and ESP32 with regard to bit-banging and toggling pins, and I found that cutting out every single wasted clock cycle really helps. For instance, you have several instances of "1 << CLOCK" etc.; if you replaced these with a hardcoded value or something #defined, you would see small improvements in speed. I have also found that if you clear both DATA and CLOCK at the same time (rather than in two separate writes), you then only need to set the data if (data & 0x80) is true, which will save even more time! Keep up the good work. Bob

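Bob's suggestion in comment 4 can be sketched as follows. This is a host-side model in plain C under stated assumptions, not measured ESP8266 code: w1ts and w1tc are hypothetical stand-ins for GPIO_REG_WRITE to GPIO_OUT_W1TS_ADDRESS and GPIO_OUT_W1TC_ADDRESS, the pin masks are made-up #defined constants, and a simulated receiver samples data on each rising clock edge. The model counts register writes so the variant can be compared with the three writes per bit used by shiftOutFast.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical precomputed pin masks (GPIO2 for data, GPIO0 for clock). */
#define DATA_MASK  (1u << 2)
#define CLOCK_MASK (1u << 0)

static uint32_t gpio_out;      /* host-side stand-in for the GPIO output latch */
static unsigned write_count;   /* number of modeled register writes            */
static uint8_t  received;      /* byte seen by the simulated receiver          */

/* Model of a write-1-to-set register write. */
static void w1ts(uint32_t mask)
{
    uint32_t before = gpio_out;
    gpio_out |= mask;
    write_count++;
    /* On a rising clock edge the receiver shifts in the current data level. */
    if (!(before & CLOCK_MASK) && (gpio_out & CLOCK_MASK))
        received = (uint8_t)((received << 1) | ((gpio_out & DATA_MASK) ? 1u : 0u));
}

/* Model of a write-1-to-clear register write. */
static void w1tc(uint32_t mask)
{
    gpio_out &= ~mask;
    write_count++;
}

/* Clear clock and data in one write, then set data only for 1 bits. */
static void shift_out_combined(uint8_t data)
{
    uint8_t i = 8;
    do {
        w1tc(CLOCK_MASK | DATA_MASK);
        if (data & 0x80)
            w1ts(DATA_MASK);
        w1ts(CLOCK_MASK);
        data <<= 1;
    } while (--i);
}

/* Shift one byte; report the received byte and the number of writes used. */
uint8_t shift_and_receive(uint8_t data, unsigned *writes)
{
    assert(writes != 0);
    received = 0;
    write_count = 0;
    shift_out_combined(data);
    *writes = write_count;
    return received;
}
```

In this model a byte costs 16 writes plus one per 1 bit, so 'h' (0x68, three 1 bits) takes 19 writes instead of 24. The data line briefly drops low between consecutive 1 bits, but only while the clock is low. Whether the saved writes translate into a shorter bit time on real hardware would need to be confirmed on the oscilloscope.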