One of the basic parameters of processors is the word length of a single arithmetic operation. This value can range from 8 bits for simple micro-controllers through 32 bits for 80386 derivatives to the most popular 64-bit length today. In addition, processors may have extended instruction sets (such as MMX, SSE, etc.) that can extend the processor’s capabilities to 128-bit (or larger) words.
The word lengths provided are therefore too short for the implementation of cryptographic algorithms. For RSA, the suggested (in terms of security) key length is today (year 2019) 4096 bits, which is many times more than standard processors provide. For elliptic curves, the length of the prime number in the suggested curve is 521 bits (no, it’s not an error: is one of Mersenne prime numbers).
So, if we want to create an implementation of elliptic curves, we must have implementations of large integer numbers. We can use an existing one (mathematical libraries usually implement not only ‘large numbers’ but also the elliptic curves themselves) or write it yourself (it is enough to implement addition, subtraction, multiplication, modulo and division). The presented examples are written in Python, which has built-in ‘large numbers’ arithmetic, but when writing an implementation in a typical embedded we do not have this comfort – we have to implement everything in C / C ++.
Let’s check the performance of the implementation presented in the previous part, but let’s use the more demanding curve:
ECPoint 521-Bit
Time: 0.146764039993
Time: 0.149250984192
So, one point multiplication on one Intel G4560 3.50GHz core amounted to over 400 ms, and at the time of establishing the connection you need to perform at least 4 multiplications (key negotiation and authentication), i.e. mathematical calculations alone would take almost 2 seconds. For processors whose speeds are measured not in GHz, but in MHz, the results would be measured in minutes, which is completely unacceptable. So we can try to optimize several implementation elements:
- large numbers arithmetic,
- operations in the field of
,
- operations in the EC group (
),
- EC point multiplication (
).
We can optimize the arithmetic of large numbers by standard in terms of computational complexity, conditional jumps, data copy elimination, etc. In our examples we assume for the purposes of our examples that in Python this implementation is well optimized.
Operations in the field of use the arithmetic of large numbers with additional requirements, which results in the need to use operations: modulo and element inversion algorithm. Division and modulo are the most demanding operations for the processor (division is at least 30 times slower than adding). In the implementation of part 3, each operation required the use of modulo. In addition, one could replace modulo operation by comparing to the number p and subtraction, but the conditional instruction would break the instruction piping in the processor, so that despite the prediction of jumps, the gain could not be large. For multiplication, however, there is a way to get rid of the expensive modulo operation – Montgomery’s reduction algorithm.
To be able to use the Montgomery algorithm, we must first calculate the residual which is the smallest approximation greater than p (
).
The operation of converting the number to a residuum form looks like this:
The operation of the return conversion from the residuum form looks like this:
The operation of adding numbers in the form of residuum looks identical to the usual addition in :
because:
The operation of multiplying numbers in the form of residuum looks as follows:
because:
However, the above multiplication formula is much less optimal than multiplying in . Therefore, the Montgomery algorithm is used to optimize the multiplication of numbers in the form of residuum, for which it is necessary to calculate
The algorithm looks like this:
- If
then
else
In the above algorithm we have 3 demanding operations: twice modulo and once dividing. However, the basis for modulo is the number , so we can simplify these operations:
[bit mask]
[shifting bits to the right]
So our implementation of arithmetic using the Montgomery algorithm might look like this:
ECPoint 521-Bit
Time: 0.161954164505
Time: 0.163055896759
And … it’s worse 🙂
My first version of this code was even slower. Therefore, the above optimization makes sense only in implementations of large numbers in which operations on bits (shifting, masking) are efficient – it seems that in Python these operations are not. So you have to take my word for it that in C / C ++ implementation the above algorithm noticeably speeds up operations on the field.
In the next part there will be presented optimization, which will significantly accelerate the operations on the curves: Jacobian coordinates.
ec_part4 (full code without optimization)
ec_part4m (full code with Montgomery multiplication)