Sunday, September 18, 2016

Advanced Tonga BIOS editing


I recently decided to spend some time to figure out some of the low-level details of how the BIOS works on my R9 380 cards.  A few months ago I had found Tonga Bios Editor, but hadn't done anything more than modify the memory frequency table so the card would default to 1500Mhz instead of 1375.  My goal was to modify the memory timing and to reduce power usage.

The card I decided to test the memory timing mods on was a Club3D 4GB R9 380 with Elpida W4032BABG-60-F RAM.  Although the RAM is rated for 6Gbps/1.5Ghz, the default memory clock is 1475Mhz.  In my previous testing I found that the card was stable with the memory overclocked well above 1.5Ghz, but the mining performance was actually slower at 1.6Ghz compared to 1.5Ghz.  Unfortunately Tonga Bios Reader does not provide a way to edit the memory timings aka straps, so I'd have to use a hex editor.


I've highlighted the 1500Mhz memory timing in the screen shot above.  I found it by searching for the string F0 49 02, which you first have to convert from little-endian to get 249F0, and then from hex to get 150,000, which is expressed in increments of .01Mhz.  The timing for up to 1625Mhz (C4 7A 02) comes after it, and then 1750Mhz (98 AB 02).  The Club3D BIOS actually has 2 sets of timings, one for memory type 01 (the number after F0 49 02), as and for memory type 02 (not shown).  This is so the same BIOS can be used on a card that can be made with different memory.  Obviously one type of memory the BIOS supports is Elpida, and from comparing BIOS images from other cards, I determined that memory type 02 is for Hynix.

To reduce the chance of bricking my card, the first time I modified only the 1625Mhz memory timing.  Since the default memory timing is 1475Mhz, my modified timing would only be used when overclocking the memory over 1500Mhz.  So if the the card crashed on the 1625Mhz timing, it would be back to the safe 1500Mhz timing after a reboot.  To actually make the change I copied the 1500Mhz timing (starting with 77 71) to the 1625Mhz timing.  After the change, the BIOS checksum is invalid, so I simply loaded the BIOS in Tonga Bios Reader and re-saved it in order to update the checksum.

I used Atiflash 2.71 to flash the BIOS since I have found no DOS or Linux flash utilities for Tonga GPUs.  After flashing the updated BIOS, I overclocked the RAM to 1625Mhz, and my eth mining speed went from just under 21Mh to about 22.5Mh.  To get even faster timings, I copied the 1375Mhz timings from a MSI R9 380 with Elpida RAM to the Club3d 1625Mhz memory timing.  That boosted my mining speed at 1625Mhz to slightly over 23Mh

I then tried a number of ways to improve the timing beyond 1625Mhz, but I found nothing that was both stable and faster at 1700Mhz.  Different cards may overclock better, depending on both the GPU asic and the memory.  Hynix memory seems to overclock a bit better than Elpida, while Samsung memory, which seems rather rare on R9 380 cards, tends to overclock the best.  The memory controller on the GPU also needs to be able overclock from 1475Mhz.  Unlike the simple voltage modding the Hawaii BIOS, there is no easy way to modify the memory controller voltage (VDDCI) on Tonga.  The ability to over-volt the memory controller would make it easier to overclock the memory speed beyond 1625Mhz.

Since the Club3D BIOS supports both Elpida and Hynix memory, I improved the timing for both memory types.  This allows me to use a single BIOS image for cards that have either Elpida or Hynix memory.  It's also dependent on the card having a NCP81022 voltage controller, but all my R9 380 cards have the same voltage controller.  I've shared it on my google drive as 380NR.ROM if you want to try it (at the possible risk of bricking your card).  Atiflash checks the subsystem ID of the target card against the BIOS to be flashed, so it is necessary to use the command-line version of atiflash with the "-fs" option:
atiflash -p 0 380RN.ROM -fs

In addition to improving memory speeds, I wanted to reduce power usage of my 380 cards.  On Windows it is possible to use a tool like MSI Afterburner to reduce the core voltage (VDDC), but on Linux there is no similar tool.  To reduce the voltage in the BIOS, modify value0 in Voltage Table2 for the different DPM states.  After a lot of experimenting, I made two different BIOSes with different voltage levels since some cards under-volt better than others.  The first one has 975, 1050, and 1100 mV for dpm 5, 6, & 7, while the other has 1025, 1100, & 1150 mV.  These are also shared on my google drive as 380NR1100.ROM and 380NR1150.ROM.

With the faster RAM timing and voltage modifications I've improved my eth mining hashrates by about 10%, without any material change in power use.  I've tried my custom ROM on four different cards.  Although two of them seem to be OK with 900/1650Mhz clocks, I'm playing it safe and running all four at 885/1625Mhz.  If you are lucky and have a card that is stable at 925/1700Mhz, you can mine eth at almost 25Mh/s.  With most cards you can expect to get between 23 and 24Mh/s.

Sunday, September 11, 2016

Hawaii BIOS voltage modding


When using Hawaii GPUs such as the R9 290 on Linux, aticonfig does not provide the ability to modify voltages.  Even under windows, utilities such as MSI Afterburner usually have limits on how much the GPU voltage can be increased or decreased.  In order to reduce power consumption I decided to create a custom BIOS with lower voltages for my MSI R9 290X.

The best tool I have found for Hawaii BIOS mods is Hawaii Bios Reader.  For reading and writing the BIOS to Hawaii cards, I use ATIFlash 2.71.  It woks from DOS, so I can use the FreeDOS image included in SystemRescueCD.

In the screen shot above, I've circled two voltages.  The first, VDDCI, is the memory controller voltage.  Reducing it to 950mV gives a slight power reduction.

The second voltage is the DPM0 GPU core voltage.  DPM0 is the lowest power state, when the GPU is clocked at 300Mhz, and powered at approximately 968mV.  I say approximately because the actual voltage seems to be close to the DPM0 value, but not always exact.  This may be related to the precision of the voltage regulator on the card, or the BIOS may be using more than just the DPM0 voltage table to control the voltage.  The rest of the DPM values are not voltages, but indexes into a table that has a formula for the BIOS to calculate the increase in voltage based on the leakage characteristics of the GPU.  I do not change them.


For reasons I have not yet figured out, the DPM0 voltage in each of the limit tables has to match the PowerPlay table.  After modifying the four limit tables, the BIOS can be saved and flashed to the card.

I've created modified BIOS files for a MSI R9 290X 4GB card with DM0 voltages of 868, 825, and 775.  With the 775mV BIOS I was able to reduce power consumption by over 20% compared to 968mV.

https://drive.google.com/drive/folders/0BwLnDyLLT3WkRUQ4VU5kVm5qM0k

Wednesday, September 7, 2016

Monero mining on Linux


With Monero's recent jump in price to over $10, it's the new hot coin for GPU mining.  Monero has been around for a couple years now, so there are a couple options for mining.  There's a closed-source miner from Claymore, and the open-source miner from Wolf that I used.

I used the same Ubuntu/AMD rig that I set up for eth mining.  Building the miner took a couple updates compared to building ethminer.  First, since stdatomic.h is missing from gcc 4.8.4, you need to use gcc 5 or 6.  Second, jansson needs to be installed.  On Ubuntu the required package is libjansson-dev.  The default makefile uses a debug build with no optimization, so I modified the makefile to use O3 and LTO "OPT = -O3 -flto".  I've shared the compiled binary on my google drive.

To mine with all the GPUs on your system, you'll have to edit the xmr.conf file and add to the "devices" list.  The "index" is the card number from the output of "aticonfig --lsa".  Although the miner supports setting GPU clock rates and fan speeds, I prefer to use my aticonfig scripts instead.  It is also necessary to modify  "rawintensity" and "worksize" for optimal performance.  The xmr.conf included in the tgz file has the settings that I found work best for a R9 380 card clocked at 1050/1500.  For a R7 370 card, I found a rawintensity setting of 640 worked best, giving about 400 hashes per second.

Although Monero was more profitable to mine than ethereum for a few days, the difficulty increase associated with more miners has evened it out.  Dwarfpool has a XMR calculator that seems accurate.  The pool I used was monerohash.com, and instead of running the monero client, I created an account online using mymonero.com.

Sunday, August 21, 2016

Ethereum mining on Ubuntu Linux


For a couple months, I've been intending to do a blog post on mining with Ubuntu.  Now that I've been able do make a static build of Genoil's ethminer, that process has become much easier.  Since I have no Nvidia GPUs, this post will only cover how to mine with AMD GPUs like the R7 and R9 series.

The first step is to download a 64-bit Ubuntu 14.04 desktop release.  I use the desktop distribution since it includes X11, although it is possible to use Ubuntu server and then install the X11 packages separately.  I recommend installing Ubuntu without any GPU cards installed (use your motherboard's iGPU), in order to confirm the base system is working OK.  Follow the installation instructions, and at step 7, choose "Log in automatically".  This will make it easier to have your rig start mining automatically after reboot.

After the initial reboot, I recommend installing ssh server.  It can be installed from the shell (terminal) with: "sudo apt-get install openssh-server -y".  Ubuntu uses mDNS, so if you chose 'rig1' as the computer name during the installation, you can ssh to 'rig1.local' from other computers on your LAN.

Shutdown the computer and install the first GPU card, and plug your monitor into the GPU card instead of the iGPU video port.  Most motherboards will default to using the GPU card when it is installed, and if not, there should be a BIOS setup option to choose between them. If you do not even see a boot screen, try plugging the card directly into the motherboard instead of using a riser.  Also double-check your card's PCI-e power connections.

Once you are successfully booting into the Ubuntu desktop, edit /etc/init/gpu-manager.conf to keep gpu manager from modifying /etc/X11/xorg.conf.  Then install the AMD fglrx video drivers: "sudo apt-get install fglrx -y".  If the fglrx drivers installed successfully, running "sudo aticonfig --lsa" will show your installed card.  Next, to set up your xorg.conf file, run "sudo rm /etc/X11/xorg.conf" and "sudo aticonfig --initial --adapter=all".

After rebooting, if the computer does not boot into the X11 desktop, ssh into the computer and verify that /etc/modprobe.d/fglrx-core.conf was created when the fglrx driver was installed.  This keeps Ubuntu from loading the open-source radeon drivers, which will conflict with the proprietary fglrx drivers.  For additional debugging, look at the /var/log/Xorg.0.log file.

Continue with installing the rest of your cards one at a time.  Re-initialize your xorg.conf each time, by executing "sudo rm /etc/X11/xorg.conf" and "sudo aticonfig --initial --adapter=all".  Reboot one more time, and then exeucte, "aticonfig --odgc --adapter=all".  This will display all the cards and their core/memory clocks.  If you are connecting remotely via ssh, you need to run, "export DISPLAY=:0" or you will get the "X needs to be running..." error.  You can use aticonfig to change the clock speeds on your card.  For example, "aticonfig --od-enable --adapter=2 --odsc 820,1500" will set card #2 to 820Mhz core and 1500Mhz memory (a good speed for most R9 380 cards).  To simplify setting clock speeds on different cards, I created a script which reads a list of card types and clock rates from a clocks.txt file.

Once your cards are installed and configured, you can use my ethminer build:
wget github.com/nerdralph/ethminer-nr/raw/110/releases/ethminer-1.1.9nr-OCL.tgz
tar xzf ethminer-1.1.9nr-OCL.tgz
cd ethminer-nr
./mine.sh

Once you've confirmed that ethminer is working, you can edit the mine.sh script to use your own mining pool account.  If you want your rig to start mining automatically on boot-up, edit your .bashrc and add "cd ethminer-nr" and "./mine.sh" to the end of the file.

Sunday, July 31, 2016

Improving Genoil's ethminer


In my last post about mining ethereum, I explained why I preferred Genoil's fork of the Ethereum Foundation's ethminer.  After that post, I started having stability problems with one of the newer releases of Genoil's miner.  I suspected the problem was likely deadlocks with mutexes that had been added to the code.  They had been added to reduce the chance of the miner submitting stale or invalid shares, but in this case the solution was worse than the problem, since there is no harm in submitting a small number of invalid shares to a pool.  After taking some time to review the code and discuss my ideas with the author, I decided to make some improvements.  The result is ethminer-nr.

A description of some of the changes can be found on the issues tracker for Genoil's miner, since I expect most of my changes to be merged upstream.  The first thing I did was remove the mutexes.  This does open the possibility of a rare race condition that could cause an invalid share submit when one thread processes a share from a GPU while another thread processes a new job from the pool.  On Linux the threads can be serialized using the taskset command to pin the process to a single CPU.  On a multi-CPU system, use "taskset 1 ./ethminer ..."  to pin the process to the first CPU.

As described in the issues tracker, I added per-GPU reporting of hash rate.  I also reduced the stats output to accepted (A) and rejected (R), including stales, since I have never seen a pool submit fail, and only some pools will report a rejected share.  The more compact output helps the stats still fit on a single line, even with hashrate reporting from multiple GPUs:
  m  13:28:46|ethminer  15099 24326 15099 =54525Khs A807+6:R0+0

To help detect when a pool connection has failed, instead of trying to manage timeouts in the code, I decided to rely on the TCP stack.  The first thing I did was enable TCP keepalives on the stratum connection to the pool.  If the pool server is alive but just didn't have any new jobs for a while, the socket connection will remain open.  If the network connection to the pool fails, there will be no keepalive response and the socket will be closed.  Since the default timeouts are rather long, I reduced them to make network failure detection faster:
sudo sysctl -w net.ipv4.tcp_keepalive_time=30
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=5
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=3

I wasn't certain if packets sent to the server will reset the keepalive timer, even if there is no response (even an ACK) from the server.  Therefore I also reduced the default TCP retransmission count to 5, so the pool connection will close after a packet is sent (i.e. share submit) 5 times without an acknowledgement.
sudo sysctl -w net.ipv4.tcp_retries2=5

I was also able to make a stand-alone linux binary.  Until now the Linux builds had made extensive use of shared libraries, so the binary could not be used without first installing several shared library dependencies like boost and json.  I had to do some of the build manually, so to make your own static linked binary you'll have to wait a few days for some updates to the cmake build scripts.  If you want to try it now anyway, you can add "-DETH_STATIC=1" to the cmake command line.

As for future improvements, since I've started learning OpenCL, I'm hoping to optimize the ethminer OpenCL kernel to improve hashing performance.  Look for something in late August or early September.

Sunday, July 17, 2016

Diving into the OpenCL deep end


Programs for mining on GPUs are usually written in OpenCL.  It's based on C, which I know well, so a few weeks ago I decided to try to improve some mining OpenCL code.  My intention was to both learn OpenCL and better understand mining algorithims.

I started with simple changes to the OpenCL code for Genoil's ethminer.  I then spent a lot of time reading GCN architecture and instruction set documents to understand how AMD GPUs run OpenCL code.  Since I recently started mining Sia, I took a look at the gominer kernel code, and thought I might be able to optimize the performance.  I tested with the AMD fglrx drivers under Ubuntu 14.04 (OpenGL version string: 4.5.13399) with a r9 290 card.

The first thing I tried was replacing the rotate code in the ror64 function to use amd_bitalign.  The bitalign instruction (v_alignbit_b32) can do a 32-bit rotate in a single cycle, much like the ARM barrel shifter.  I was surprised that the speed did not improve, which suggests the AMD OpenCL drivers are optimized to use the alignbit instruction.  What was worse was that the kernel would calculate incorrect hash values.  After double and triple-checking my code, I found a post indicating a bug with amd_bitalign when using values divisible by 8.  I then tried amd_bytealign, and that didn't work either.  I was able to confirm the bug when I found that a bitalign of 21 followed by 3 worked (albeit slower), while a single bitalign of 24 did not.

It would seem there is no reason to use the amd_bitalign any more.  Relying on the driver to optimize the code makes it portable to other platforms.  I couldn't find any documentation from AMD saying the bitalign and other media ops are deprected, but I did verify that the pragmas make no difference in kernel:
#pragma OPENCL EXTENSION cl_amd_media_ops : enable
#pragma OPENCL EXTENSION cl_amd_media_ops : disable


After finding a post stating the rotate() function is optimized to use alignbit, I tried changing the "ror64(x, y)" calls to "rotate(x, 64-y)".  The code functioned properly but was  actually slower.  By using AMD_OCL_BUILD_OPTIONS_APPEND=-save-temps, I was able to view the assember .isa files, and could tell that the calls to rotate with 64-bit values were using v_lshlrev_b32, v_lshrrev_b64, and v_or_b32 instead of a pair of v_alignbit_b32 instructions.  Besides using 1 additional instruction, the 64-bit shift instructions apparently take 2 or even 4 times longer to execute on some platforms.

In the end, I wasn't able to improve the kernel speed.  I think re-writing the kernel in GCN assembler is probably the best way to get the maximum hashing performance.

Monday, July 11, 2016

Mining Sia coin on Ubuntu


Sia is a hot crypto-currency for miners.  Just a week ago, the sia network hashrate was 6.5 Th/s, and the only way to mine was solo as there were no public pools.  In the last three days, sia.nanopoool.org, and siamining.com started up and the network hashrate grew to 14.7 Th/s, with the two pools making up 80% of the total network hashrate.

Mining on Windows is relatively easy, with nanopool posting a binary build of siamining's gominer fork.  For Ubuntu, you need to build it from the source.  For that, you'll need to install go first.  If you type 'go' in Ubuntu 14.04, you'll get the following message:
The program 'go' is currently not installed. You can install it by typing:
apt-get install gccgo-go

I tried the similar package 'gccgo', which turned out to be a rabbit hole.  The version 1.4.2 referred to in the gominer readme is a version of the package 'golang'.  Neither gccgo-go or gccgo have the latest libraries needed my gominer.  And the most recent version of golang in the standard Ubuntu repositories is 1.3.3.  However the Ethereum foundation publishes a 1.5.1 build of golang in their ppa.

Even with the golang 1.5.1, building gominer wasn't as simple as "go get github.com/SiaMining/gominer".  The reason is that the gominer modifications to support pooled mining are in the "poolmod3" branch, and there is no option to install directly from a branch.  So I made my own fork of the poolmod3 branch, and added detailed install instructions for Ubuntu:
add-apt-repository -y ppa:ethereum/ethereum
sudo apt-get update
apt-get install -y git ocl-icd-libopencl1 opencl-headers golang
go get github.com/nerdralph/gominer-nr
Once I got it running on a single GPU, I wanted to find out if it was worthwhile to switch my eth mining rigs to sia.  I couldn't find a good sia mining calculator, so I pieced together some information about mining rewards and used the Sia Pulse calculator.  I wanted to compare a single R9 290 clocked at 1050/1125, which gets about 29Mh/s mining eth, earning $2.17/day.  For Sia, the R9 290 gets about 1100Mh, which if you put that into the Sia Pulse calculator along with the current difficulty of 4740Th, it will calculate daily earnings of 6015 SC/day.  Multiplying by the 62c/1000SC shown on sia.nanopool.org will give you a total of $3.73/d, but that will be wrong.  The Sia Pulse calculator defaults to a block reward of 300,000, but that goes down by 1 for each block.  So at block 59,900, the block reward is 240,100. and the actual earnings would be $2.99/d.

Since the earnings are almost 40% better than eth, I decided to switch my mining rigs from eth to sia.  I had to adjust the overclocking settings, as sia is a compute-intensive algorithm instead of a memory-intensive algorithm like ethereum.  After reducing the core clock of a couple cards from 1050 to 1025, the rigs were stable.  When trying out nanopool, I was getting a lot of "ERROR fetching work;" and "Error submitting solution - Share rejected" messages.  I think their servers may have been getting overloaded, as it worked fine when I switched to siamining.com.  I also find siamining.com has more detailed stats, in particular % of rejected shares (all below 0.5% for me).

I may end up switching back to eth in the near future, since a doubling in network hashare for sia will eventually mean a doubling of the difficulty, cutting the amount of sia mined in half.  In the process I'll at least have learned a bit about golang, and I can easily switch between eth and sia when one is more profitable than the other.