## Saturday, March 3, 2018

### Fast small prime checker in golang

Anyone who does any crypto coding knows that the ability to generate and test prime numbers is important.  A search through the golang crypto packages didn't turn up any function to check if a number is prime.  The "math/big" package has a ProbablyPrime function, but the documentation is unclear on what value of n to use so that it is "100% accurate for inputs less than 2⁶⁴".  For the Ethereum miner I am writing, I need a function to check numbers smaller than 26 bits, so I decided to write my own.

Since uint32 is large enough for the biggest number I'll be checking, and 32-bit integer division is usually faster than 64-bit, even on 64-bit platforms, I wrote my prime checking function to take a uint32.  A basic prime checking function will usually test odd divisors up to the square root of N, skipping all even numbers (multiples of two).  My prime checker is slightly more optimized because it also skips all multiples of 3, so the only divisors tried are of the form 6k±1.  Here's the code:

```go
// i32Prime reports whether n is prime.  It assumes n >= 5: the check
// for 2 and 3 is commented out, so those two are reported as composite.
func i32Prime(n uint32) bool {
	// if n == 2 || n == 3 { return true }
	if n%2 == 0 || n%3 == 0 {
		return false
	}
	sqrt := uint32(math.Sqrt(float64(n))) // needs import "math"
	// Try 5, 7, 11, 13, ...: the divisors of the form 6k-1 and 6k+1.
	for i := uint32(5); i <= sqrt; i += 6 {
		if n%i == 0 || n%(i+2) == 0 {
			return false
		}
	}
	return true
}
```

My code will never call i32Prime with small numbers, so I have the first line that checks for two or three commented out.  In order to test and benchmark the function, I wrote prime_test.go.  Run the tests with "go test prime_test.go -bench=. test".  For numbers up to 22 bits, i32Prime is one to two orders of magnitude faster than ProbablyPrime(0).  In absolute terms, on a Celeron G1840 using a single core, BenchmarkPrime reports 998 ns/op.  I considered further optimizing the code to skip multiples of 5, but I don't think the ~20% speed improvement is worth the extra code complexity.
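As a rough illustration of the kind of correctness check such a test can make (this is a sketch, not the contents of prime_test.go), the trial-division checker can be compared against ProbablyPrime(0) over a range of inputs.  The `mismatches` helper and the range chosen are made up for this example, and the range starts at 5 because the checker does not handle 2 and 3.

```go
package main

import (
	"fmt"
	"math"
	"math/big"
)

// i32Prime is a copy of the trial-division checker described above.
// It is only meant for n >= 5.
func i32Prime(n uint32) bool {
	if n%2 == 0 || n%3 == 0 {
		return false
	}
	sqrt := uint32(math.Sqrt(float64(n)))
	for i := uint32(5); i <= sqrt; i += 6 {
		if n%i == 0 || n%(i+2) == 0 {
			return false
		}
	}
	return true
}

// mismatches (hypothetical helper) counts the inputs in [lo, hi] where
// i32Prime and big.Int.ProbablyPrime(0) disagree.
func mismatches(lo, hi uint32) int {
	bad := 0
	for n := lo; n <= hi; n++ {
		if i32Prime(n) != big.NewInt(int64(n)).ProbablyPrime(0) {
			bad++
		}
	}
	return bad
}

func main() {
	// Arbitrary example range: every value from 5 up to 2^20.
	fmt.Println("mismatches:", mismatches(5, 1<<20))
}
```

A count of zero over the range of interest is a reasonable sanity check before trusting the benchmark numbers.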

## Saturday, February 24, 2018

### Let's get going!

You might be asking if this is just one more of the many blog posts about go that can be found all over the internet.  I don't want to duplicate what other people have written, so I'll mostly be writing about crypto functions such as sha3/keccak in go.

Despite a brief experiment with go almost two years ago, I had not done any serious coding in go.  That all changed when early this year I decided to write an ethereum miner from scratch.  After maintaining and improving https://github.com/nerdralph/ethminer-nr, I decided I would like to try something other than C++.  My first attempt was with D, and while it fixes some of the things I dislike about C++, 3rd-party library support is minimal.  After working with it for about a week, I decided to move on.  After some prototyping with python/cython, I settled on go.

After eight years of development, go is quite mature.  As I'll explain later in this blog post, my concerns about code performance were proven to be unwarranted.  Although it is quite mature, I've found it's still new enough that there is room for improvements to be made in go libraries.

Since I'm writing an ethereum miner, I need code that can perform keccak hashing.  Keccak is the same as the official sha-3 standard with a different pad (aka domain separation) byte.  The crypto/sha3 package internally supports the ability to use arbitrary domain separation bytes, but the functionality is not exported.  Therefore I forked the repository and added functions for keccak-256 and keccak-512.  A common operation in crypto is XOR, and the sha3 package includes an optimized XOR implementation.  This function is not exported either, so I added a fast XOR function as well.
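To make the "different pad byte" concrete, here is a small sketch of the sponge padding step on its own.  It is not code from the sha3 package or from my fork; the `pad` function and constant names are invented for illustration.  The byte values are the standard ones: 0x01 for original Keccak and 0x06 for the SHA-3 fixed-output functions, with the top bit of the last byte of the block set in both cases.

```go
package main

import "fmt"

const (
	dsKeccak byte = 0x01 // original Keccak padding
	dsSHA3   byte = 0x06 // FIPS 202 SHA-3 hash functions
)

// pad (hypothetical helper) appends the domain separation byte and sets
// the final 0x80 bit so the result is a whole number of rate-sized blocks.
func pad(msg []byte, rate int, ds byte) []byte {
	padLen := rate - len(msg)%rate // always between 1 and rate bytes
	out := make([]byte, len(msg)+padLen)
	copy(out, msg)
	out[len(msg)] = ds
	out[len(out)-1] |= 0x80 // may land on the same byte as ds
	return out
}

func main() {
	const rate256 = 136 // rate in bytes for 256-bit output (1088 bits)
	msg := []byte("abc")
	k := pad(msg, rate256, dsKeccak)
	s := pad(msg, rate256, dsSHA3)
	fmt.Printf("keccak: pad byte 0x%02x, last byte 0x%02x\n", k[len(msg)], k[len(k)-1])
	fmt.Printf("sha3:   pad byte 0x%02x, last byte 0x%02x\n", s[len(msg)], s[len(s)-1])
}
```

Everything after the padding (the permutation and squeezing) is identical, which is why exposing the domain separation byte is enough to get keccak-256 and keccak-512.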

Ethereum's proof-of-work uses a DAG of about 2GB that is generated from a 32MB cache.  The cache and the DAG change and grow slightly every 30,000 blocks (about 5 days).  Using my modified sha3 library and based on the description from the ethereum wiki, I wrote a test program that connects to a mining pool, gets the current seed hash, and generates the DAG cache.  The final hex string printed out is the last 32 bytes of the cache.  I created an internal debug build of ethminer-nr that also outputs the last 32 bytes of the cache in order to verify that my code works correctly.
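The size arithmetic behind those figures can be sketched as follows.  This assumes the constants given in the Ethash description on the ethereum wiki (30,000-block epochs, a 16MB cache growing by 128KB per epoch, a 1GB dataset growing by 8MB per epoch), with each size trimmed until the number of items is prime.  It is not the code from my test program, and the function names are made up.

```go
package main

import "fmt"

// Constants as I understand them from the Ethash specification.
const (
	epochLength      = 30000
	hashBytes        = 64  // size of one cache item
	mixBytes         = 128 // size of one dataset item
	cacheBytesInit   = 1 << 24
	cacheBytesGrowth = 1 << 17
	dataBytesInit    = 1 << 30
	dataBytesGrowth  = 1 << 23
)

// isPrime is plain trial division over 6k±1 candidates.
func isPrime(n uint64) bool {
	if n < 2 {
		return false
	}
	if n < 4 {
		return true
	}
	if n%2 == 0 || n%3 == 0 {
		return false
	}
	for i := uint64(5); i*i <= n; i += 6 {
		if n%i == 0 || n%(i+2) == 0 {
			return false
		}
	}
	return true
}

// trimmedSize grows linearly with the epoch, then steps down by two
// items at a time until the item count is prime.
func trimmedSize(init, growth, unit, block uint64) uint64 {
	sz := init + growth*(block/epochLength) - unit
	for !isPrime(sz / unit) {
		sz -= 2 * unit
	}
	return sz
}

func cacheSize(block uint64) uint64 {
	return trimmedSize(cacheBytesInit, cacheBytesGrowth, hashBytes, block)
}

func datasetSize(block uint64) uint64 {
	return trimmedSize(dataBytesInit, dataBytesGrowth, mixBytes, block)
}

func main() {
	// 5,000,000 is an arbitrary example block number.
	for _, b := range []uint64{0, 5000000} {
		fmt.Printf("block %d: cache %d bytes, dataset %d bytes\n", b, cacheSize(b), datasetSize(b))
	}
}
```

Under these assumptions the values handed to the prime check are item counts rather than byte counts, so they stay small even for a dataset of a few gigabytes.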

When it comes to performance, I had read some old benchmarks that show gcc-go generating much faster code than the stock go compiler (gc).  Things have obviously changed, as the go compiler was much faster in my tests.  My ETH cache generation test program takes about 3 seconds to run when using the standard go compiler versus 8 seconds with gcc-go using -O3 -march=native.  This is on an Intel G1840 comparing go version go1.9.2 linux/amd64 with go1.6.1 gccgo.  The versions chosen were the latest pre-packaged versions for Ubuntu 16 (golang-1.9 and gccgo-6).  At least for compute-heavy crypto functions, I don't see any point in using gcc-go.

## Sunday, February 4, 2018

### Ethereum mining pool comparisons

Since I started mining ethereum, the focus of my optimizations has been on mining software and hardware tuning.  While overclocking and software mining tweaks are the major factors in maximizing earnings, choosing the best mining pool can make a measurable difference as well.

I tested the top three pools with North American servers: Ethermine, Mining Pool Hub, and Nanopool.  I tested mining on each pool, and wrote a small program to monitor pools.  Nanopool came out on the bottom, with Ethermine and Mining Pool Hub both performing well.

I think the biggest difference between pool earnings has to do with latency.  For someone in North America, using a pool in Asia with a network round-trip latency of 200-300ms will result in lower earnings than a North American pool with a network latency of 30-50ms.  The reason is higher latency causes a higher stale share rate.  If it takes 150ms for a share submission to reach the pool, with Ethereum's average block time of 15 seconds, the latency will add 1% to your stale share rate.  How badly that affects your earnings depends on how the pool rewards stale shares, something that is unfortunately not clearly documented on any of the three pools.

When I first started mining I would do simple latency tests using ping.  Following Ethermine's recent migration of their servers to AWS, they no longer respond to ping.  What really matters is not ping response time, but how quickly the pool forwards new jobs and processes submitted shares.  What further complicates an evaluation of different pools is that they often have multiple servers for one host name.  For example, here are the IP addresses for us-east1.ethereum.miningpoolhub.com from dig:

```
us-east1.ethereum.miningpoolhub.com. 32 IN A   192.81.129.199
us-east1.ethereum.miningpoolhub.com. 32 IN A   45.56.112.78
us-east1.ethereum.miningpoolhub.com. 32 IN A   45.33.104.156
us-east1.ethereum.miningpoolhub.com. 32 IN A   45.56.113.50
```

Even though 45.56.113.50 has a ping time about 40ms lower than 192.81.129.199, the 192.81.129.199 server usually sent new jobs faster than 45.56.113.50.  The difference between the first and last server to send a job was usually 200-300ms.  With nanopool, the difference was much more significant, with the slowest server often sending a new job 2 seconds (2000ms) after the fastest.  Recent updates posted on nanopool's site suggest their servers have been overloaded, such as changing their static difficulty from 5 billion to 10 billion.  Even with miners submitting shares at half the rate, it seems they are still having issues with server loads.

Less than a week ago, us1.ethermine.org resolved to a few different IPs, and now it resolves to a single AWS IP: 18.219.59.155.  I suspect there are at least two different servers using load balancing to respond to requests for the single IP.  By making multiple simultaneous stratum requests and timing the new jobs received, I was able to measure variations of more than 100ms between some jobs.  That seems to confirm my conclusion that there are likely multiple servers with slight variations in their performance.

In order to determine if the timing performance of the pools was actually having an impact on pool earnings, I looked at stats for blocks and uncles mined from etherscan.io.
Those stats show that although Nanopool produces about half as many blocks as Ethermine, it produces more uncles.  Since uncles receive a reward of at most 2.625 ETH vs 3 ETH for a regular block, miners should receive higher payouts on Ethermine than on Nanopool.  Based solely on uncle rate, payouts on Ethermine should be slightly higher than MPH.  Eun, the operator of MPH, has been accessible and responsive to questions and suggestions about the pool, while the Ethermine pool operator is not accessible.  As an example of that accessibility, three days ago I emailed MPH about 100% rejects from one of their pool servers.  Thirty-five minutes later I received a response asking me to verify that the issue was resolved after they rebooted the server.

In conclusion, either Ethermine or MPH would make reasonable choices for someone mining in North America.  This pool comparison has also opened my eyes to optimization opportunities in mining software in how pools are chosen.  Until now mining software has done little more than switch pools when a connection is lost or no new jobs are received for a long period of time.  My intention is to have my mining software dynamically switch to mining jobs from the most responsive server instead of requiring an outright failure.

## Thursday, December 14, 2017

### Mining with AMDGPU-PRO 17.40 on Linux

A 17.40 beta was released on October 16, with a final release following on October 30th.  There have been some issues with corrupt versions of the final release, but I think they are resolved now.  I encountered lots of problems with this release, which was much of the motivation for making this post.

Until earlier this year, the AMDGPU-PRO drivers were targeted at the new Polaris cards, and support for even relatively recent Tonga was lacking.  Because of this, I was using the fglrx drivers for Tonga and Pitcairn cards.  The primary reason for upgrading now is for large page support, which improves performance on algorithms that use a large amount (2GB or more) of memory.  With the promise of better performance, and since fglrx is no longer being maintained, I decided to upgrade.

I've been using AMDGPU-PRO with kernel 4.10.5 for my Rx 470 cards, so I decided to use the same kernel.  I can't say whether there are any problems with using a newer kernel like 4.10.17 or even 4.14.5, so they might work just as well.  I left the on-board video enabled (i915), so I would not have to be connecting and disconnecting video cables when testing the GPUs.  After installing Ubuntu 16.04.3, I updated the kernel and rebooted.  For installing the AMDGPU-PRO drivers, I used the px option (amdgpu-pro-install --px), as it is supposed to support mixed iGPU/dGPU use.

My normal procedure for bringing up a multi-GPU machine is to start with a single GPU in the 16x motherboard slot, as this avoids potential issues with flaky risers.  Even with just one R9 380 card in the 16x slot, I was having problems with powerplay.  When it is working, pp_dpm_sclk will show the current clock rate with an asterisk, but this was not happening.  After two days of troubleshooting, I concluded there is a bug with powerplay and some motherboards when using the 16x slot.  When using only the 1x slots, powerplay works fine.
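Checking for the asterisk is easy to automate.  The sketch below parses text in the usual "index: clock" layout of pp_dpm_sclk and reports the active level, if any.  The `activeLevel` function is invented for this example and the sample values are made up; on a real system the text would be read from the card's sysfs directory (typically /sys/class/drm/card0/device/pp_dpm_sclk, though the card number varies).

```go
package main

import (
	"fmt"
	"strings"
)

// activeLevel returns the line marked with "*" (with the marker removed)
// and whether any line was marked at all.
func activeLevel(text string) (string, bool) {
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasSuffix(line, "*") {
			return strings.TrimSpace(strings.TrimSuffix(line, "*")), true
		}
	}
	return "", false
}

func main() {
	// Sample contents only; real levels depend on the card.
	working := "0: 300Mhz\n1: 500Mhz\n2: 850Mhz *\n"
	broken := "0: 300Mhz\n1: 500Mhz\n2: 850Mhz\n"
	for _, text := range []string{working, broken} {
		if level, ok := activeLevel(text); ok {
			fmt.Println("active level:", level)
		} else {
			fmt.Println("no active level marked; powerplay is probably not working")
		}
	}
}
```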

Since I wasn't able to use the 16x motherboard slot, testing card and riser combinations was more difficult.  Normally when I have a problem with a card and riser, I'll move the card to the 16x slot.  If the problems go away, I'll mark the riser as likely defective.  Mining algorithms like ethash use little bandwidth between the CPU and GPU, so there is no performance loss to using 1x risers.  Even the slowest PCIe 1.1 transfer rate is sufficient for mining.  Using "lspci -vv",  I could see the link speed was 5.0GT/s (LnkSta:), which is PCIe gen2 speed.  Reducing the speed to gen1 would mean lower quality risers could be used without encountering errors.

My first thought was to try to set the PCIe speed in the motherboard BIOS.  Setting gen1 in the chipset options made no difference, so perhaps it is only the speed used during boot-up before the OS takes over control of the PCIe bus.  Next, using "modinfo amdgpu", I noticed some module options related to PCIe.  Adding "amdgpu.pcie_gen2=0" had no effect.  Apparently the module no longer supports that option.  I could not find any documentation for the "pcie_gen_cap" option, but luckily the open-source amdgpu module supports the same module parameter.  By looking at amd_pcie.h in the kernel source code, I determined "0x10001" will limit the link to gen1.  I added "pcie_gen_cap=0x10001" to /etc/default/grub, ran update-grub, and rebooted.  With lspci I was able to see that all the GPUs were running at 2.5GT/s.
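For anyone wanting to try other values, here is a small decoder for the mask.  The bit layout is an assumption based on my reading of amd_pcie.h: the upper 16 bits describe the speeds supported on the platform side and the lower 16 bits those supported by the ASIC, with bit 0 of each half meaning gen1, bit 1 gen2, and bit 2 gen3.  Check the header in your own kernel source before relying on it; the `decodeGenCap` function is made up for this example.

```go
package main

import "fmt"

// decodeGenCap splits a pcie_gen_cap value into the PCIe generations
// enabled in the upper half (platform) and the lower half (ASIC).
// The layout is assumed, not taken from any documentation.
func decodeGenCap(v uint32) (platform, asic []int) {
	for gen := 1; gen <= 3; gen++ {
		bit := uint32(1) << uint(gen-1)
		if (v>>16)&bit != 0 {
			platform = append(platform, gen)
		}
		if v&bit != 0 {
			asic = append(asic, gen)
		}
	}
	return platform, asic
}

func main() {
	for _, v := range []uint32{0x10001, 0x30003} {
		p, a := decodeGenCap(v)
		fmt.Printf("0x%05x: platform gens %v, asic gens %v\n", v, p, a)
	}
}
```

Under that layout, 0x10001 leaves only gen1 enabled in both halves, which matches the 2.5GT/s link speed I saw afterwards.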

For clock control and monitoring, I use ROC-smi, which I've previously written about.

```
====================    ROCm System Management Interface    ====================
================================================================================
GPU  DID    Temp     AvgPwr   SCLK     MCLK     Fan      Perf    OverDrive  ECC
3   6938   66.0c    100.172W 858Mhz   1550Mhz  44.71%   manual    0%       N/A
1   6939   64.0c    112.21W  846Mhz   1550Mhz  42.75%   manual    0%       N/A
4   6939   62.0c    118.135W 839Mhz   1500Mhz  47.84%   manual    0%       N/A
2   6939   77.0c    123.78W  839Mhz   1550Mhz  64.71%   manual    0%       N/A
GPU[0]          : PowerPlay not enabled - Cannot get supported clocks
GPU[0]          : PowerPlay not enabled - Cannot get supported clocks
0   0402   N/A      N/A      N/A      N/A      None%              N/A      N/A
================================================================================
====================           End of ROCm SMI Log          ====================
```

I also use Kristy's utility to set specific clock rates:

```
ohgodatool -i 1 --mem-state 3 --mem-clock 1550
```

Unfortunately ethminer-nr doesn't work with this setup.  I suspect the new driver doesn't support some old OpenCL option, so the fix should be relatively simple, once I make the time to debug it.

## Wednesday, December 6, 2017

### Powering GPU mining rigs

Since I started mining ethereum almost two years ago, I have found that power distribution is important not just for equipment safety, but also for system stability.  When I started mining I thought my rigs should be fine as long as I used a robust server PSU to power the GPUs, with heavy 16 or 18AWG cables.  After frying one motherboard and more than a couple ATX PSUs, I've learned a lot of careful design and testing is required.

Using Dell, IBM, or HP server power supplies for mining rigs is not a new idea, so I won't go into too much detail about them.  I do recommend making an interlock connector so the server PSU turns on at the same time as the motherboard.  I also recommend only connecting the server PSU to power the GPU PCIe power connectors, as they are isolated from the 12V supply for the motherboard.  If you try to power ribbon risers, the 12V from the ATX and server PSUs will be interconnected and can lead to feedback problems.  Server PSUs are very robust and unlikely to be harmed, but I have killed a cheap 450W ATX PSU this way.  If you use USB risers, they are isolated from the motherboard's 12V supply, and therefore can be safely powered from the server PSU.

In the photo above, you might notice the grounding wire connecting all the cards, which then connects to a server PSU.  I recently added this to the rig after measuring higher current flowing through two of the ground wires connected to the 6-pin PCIe power plugs.  As I mentioned in my post about GPU PCIe power connections, there are only two ground pins, with the third ground wire being connected to the sense pin.  With two ground pins and three power pins, the ground wires carry 50% more current than the 12V wires.  Although the ground wires weren't heating up from the extra current, the connector was.  Adding the ground bypass wire reduced the connector temperature to a reasonable level.
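The 50% figure follows from splitting the same total current across fewer pins.  A minimal sketch of that arithmetic, using an arbitrary example current and a made-up helper name:

```go
package main

import "fmt"

// perPinAmps splits a total current evenly across a number of pins.
func perPinAmps(totalAmps float64, pins int) float64 {
	return totalAmps / float64(pins)
}

func main() {
	const totalAmps = 12.0 // hypothetical draw through one 6-pin connector

	power := perPinAmps(totalAmps, 3)  // three 12V pins
	ground := perPinAmps(totalAmps, 2) // two ground pins
	fmt.Printf("12V pin: %.1fA, ground pin: %.1fA, ratio %.2f\n", power, ground, ground/power)
}
```

The ratio is 3/2 regardless of the total, and since connector heating scales roughly with the square of the current, the ground pins run noticeably hotter than the 12V pins.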

For ATX PSUs, I've used a few of the EVGA 500B, and do not recommend them.  While even my cheap old 300W power supplies use 18AWG wire for the hard drive power connectors, the SATA and molex power cables on the 500B are only 20AWG.  Powering more than one or two risers with a 20AWG cable is a recipe for trouble.  I burned the 12V hard drive power wire on two 500B supplies before I realized this.  I recently purchased a Rosewill 500W 80plus gold PSU that was on sale at Newegg, and it is much better than the EVGA 500B.  The Rosewill uses 18AWG wire in the hard drive cables, and it also has a 12V sense wire in the ATX power connector.  This allows it to compensate for the voltage drop in the cable from the PSU to the motherboard.  The sense wire is the thinner yellow wire in the photo below.

Speaking of voltage drop, I recommend checking the voltage at the PCIe power connector to ensure it is close to 12V.  Most of my cards do not have a back plate, so I can use a multi-meter to measure at the 12V pins of the power connector where they are soldered to the GPU PCB.  I also recommend checking the temperature of power connectors since good quality low-resistance connectors are just as important as heavy gauge wires.  Warm connectors are OK, but if they are so hot that you can't hold your fingers to them, that's a problem.

My last recommendation is for people in North America (and some other places) where 120V AC power is the norm.  Wire up the outlets for your mining rigs for 240V instead of 120V.  Power supplies are slightly more efficient at 240V, and will draw half as much current compared to 120V.  Lower current draw means less line loss going to the power supply and therefore less heat generated in power cords and plugs.  Properly designed AC power cables and plugs should never overheat below 10-15 Amps; however, I have seen melted and burned connectors at barely over 10A of steady current draw.
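To put numbers on that, here is a small sketch comparing current and resistive loss at the two voltages.  The 1200W load and the 0.1 ohm cord-and-plug resistance are made-up example values, and the small efficiency gain at 240V is ignored.

```go
package main

import "fmt"

// current returns the amps drawn by a load of the given watts.
func current(watts, volts float64) float64 {
	return watts / volts
}

// lineLoss returns the watts dissipated in a resistance carrying amps.
func lineLoss(amps, ohms float64) float64 {
	return amps * amps * ohms
}

func main() {
	const watts = 1200.0 // hypothetical rig power draw
	const ohms = 0.1     // hypothetical cord and plug resistance

	for _, volts := range []float64{120, 240} {
		amps := current(watts, volts)
		fmt.Printf("%.0fV: %.1fA, %.1fW lost in the cord\n", volts, amps, lineLoss(amps, ohms))
	}
}
```

Halving the current cuts the heat generated in the cord and plugs to a quarter, since the loss goes with the square of the current.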

## Friday, June 23, 2017

### Server PSU interlock

On my multi-GPU rigs, I use server PSUs like the Dell N750P to provide the 12V power to the PCI-E connectors.  These PSUs do not have power switches, so initially I would just pull the power cord out when I wanted to power them down.  After experimenting with the PSU control pins, I realized they have an active low "power on" pin.  Instead of using a jumper to connect it to ground, I decided to use an electronic switch to power the server PSU when the motherboard powers up.

The switch I used is a common, cheap model 817 optocoupler (pdf datasheet).  When current flows from pin 1 to 2, the optocoupler is turned on, creating a short from pin 4 to pin 3.  For my small circuit shown above, pin 4 is connected to the PS_ON signal, and pin 3 is connected to ground on the server PSU.  Pin 1 is connected to 12V (from the 4-pin 3.5" floppy drive power connector), and pin 2 is connected to ground.  On the back of the board is a 1K current-limiting resistor in series with the red LED which is a power on indicator.

I also made an even simpler interlock using only an optocoupler with the pins straightened and 0.1" header pins:
I connect pins 1 and 2 to the motherboard's power LED pins, which would normally light up an LED when the motherboard powers up.  The motherboard already has a current-limiting resistor for the power LED, which typically limits the current to around 10mA.

## Friday, May 12, 2017

### Dummy plugs for headless GPU rigs

I've read about people claiming they needed to plug a monitor (or dummy plug) into one GPU card or else they couldn't use the card.  I had never encountered any problems with either fglrx or AMDGPU-Pro drivers until recently.  I moved a 4GB R9 380 card from an Ubuntu 14.04/fglrx rig to an Ubuntu 16.04/AMDGPU-Pro rig.  The remaining cards are 2GB R7 370 cards, and I started getting memory allocation errors for the primary card.  After checking with "ethminer --list-devices", I noticed the first card had about half the maximum memory allocation limit of the others:

```
Genoil's ethminer 0.9.41-genoil-1.2.0nr
=====================================================================
Forked from github.com/ethereum/cpp-ethereum
CUDA kernel ported from Tim Hughes' OpenCL kernel
With contributions from nicehash, nerdralph, RoBiK and sp_

ETH: 0xeb9310b185455f863f526dab3d245809f6854b4d

[OPENCL]:
Listing OpenCL devices.
FORMAT: [deviceID] deviceName
[0] Pitcairn
CL_DEVICE_TYPE: GPU
CL_DEVICE_GLOBAL_MEM_SIZE: 1920991232
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 970981376
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
[1] Pitcairn
CL_DEVICE_TYPE: GPU
CL_DEVICE_GLOBAL_MEM_SIZE: 2095054848
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1868562432
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
```

I have an old VGA LCD monitor that I connected using an HDMI-VGA adapter.  After connecting the monitor, nearly the full amount became available:

```
Genoil's ethminer 0.9.41-genoil-1.2.0nr
=====================================================================
Forked from github.com/ethereum/cpp-ethereum
CUDA kernel ported from Tim Hughes' OpenCL kernel
With contributions from nicehash, nerdralph, RoBiK and sp_

ETH: 0xeb9310b185455f863f526dab3d245809f6854b4d

[OPENCL]:
Listing OpenCL devices.
FORMAT: [deviceID] deviceName
[0] Pitcairn
CL_DEVICE_TYPE: GPU
CL_DEVICE_GLOBAL_MEM_SIZE: 1969225728
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1750073344
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
[1] Pitcairn
CL_DEVICE_TYPE: GPU
CL_DEVICE_GLOBAL_MEM_SIZE: 1968177152
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1750073344
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
```

I also found the monitor doesn't have to be plugged in, just the HDMI-VGA adapter.  While there might be a way to configure fglrx so that the full memory is available without the adapter, I'm more interested in learning more about AMDGPU-Pro.