{"id":148,"date":"2020-01-18T11:09:00","date_gmt":"2020-01-18T10:09:00","guid":{"rendered":"http:\/\/blog.himitsu.eu\/?p=148"},"modified":"2020-01-18T11:15:55","modified_gmt":"2020-01-18T10:15:55","slug":"elliptic-curves-and-how-to-use-them-part-four","status":"publish","type":"post","link":"https:\/\/blog.himitsu.eu\/?p=148","title":{"rendered":"Elliptic curves and how to use them (part four)"},"content":{"rendered":"<p><a href=\"http:\/\/blog.himitsu.eu\/?p=118\">Go to part 3<\/a><\/p>\n<p>One of the basic parameters of processors is the word length of a single arithmetic operation. This value can range from 8 bits for simple micro-controllers through 32 bits for 80386 derivatives to the most popular 64-bit length today. In addition, processors may have extended instruction sets (such as MMX, SSE, etc.) that can extend the processor&#8217;s capabilities to 128-bit (or larger) words.<br \/>\nThe word lengths provided are therefore too short for the implementation of cryptographic algorithms. For RSA, the suggested (in terms of security) key length is today (year 2019) 4096 bits, which is many times more than standard processors provide. For elliptic curves, the length of the prime number in the suggested curve is 521 bits (no, it&#8217;s not an error: <img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=2%5E%7B521%7D-1\" alt=\"2^{521}-1\" align=\"absmiddle\" \/> is one of Mersenne prime numbers).<br \/>\nSo, if we want to create an implementation of elliptic curves, we must have implementations of large integer numbers. We can use an existing one (mathematical libraries usually implement not only &#8216;large numbers&#8217; but also the elliptic curves themselves) or write it yourself (it is enough to implement addition, subtraction, multiplication, modulo and division). The presented examples are written in Python, which has built-in &#8216;large numbers&#8217; arithmetic, but when writing an implementation in a typical embedded we do not have this comfort &#8211; we have to implement everything in C \/ C ++.<br \/>\nLet&#8217;s check the performance of the implementation presented in the previous part, but let&#8217;s use the more demanding curve:<\/p>\n<div class=\"code-embed-wrapper\"> <pre class=\"language-python code-embed-pre\"  data-start=\"1\" data-line-offset=\"0\"><code class=\"language-python code-embed-code\">#3.3.  521-Bit Random ECP Group\nprint &quot;ECPoint 521-Bit&quot;\np = (2**521)-1\ntv = FieldElement(p, 1)\nneutral = ECPoint(p, -3, 0x0051953EB9618E1C9A1F929A21A0B68540EEA2DA725B99B315F3B8B489918EF109E156193951EC7E937B1652C0BD3BB1BF073573DF883D2C34F1EF451FD46B503F00)\ngx = FieldElement.factoryMethod(tv, 0x00C6858E06B70404E9CD9E3ECB662395B4429C648139053FB521F828AF606B4D3DBAA14B5E77EFE75928FE1DC127A2FFA8DE3348B3C1856A429BF97E7E31C2E5BD66)\ngy = FieldElement.factoryMethod(tv, 0x011839296A789A3BC0045C8A5FB42C7D1BD998F54449579B446817AFBD17273E662C97EE72995EF42640C550B9013FAD0761353C7086A272C24088BE94769FD16650)\ng = ECPoint.factoryMethod(neutral, gx, gy)\n\n#8.3.  521-Bit Random ECP Group\ni = 0x0037ADE9319A89F4DABDB3EF411AACCCA5123C61ACAB57B5393DCE47608172A095AA85A30FE1C2952C6771D937BA9777F5957B2639BAB072462F68C27A57382D4A52\nr = 0x0145BA99A847AF43793FDD0E872E7CDFA16BE30FDC780F97BCCC3F078380201E9C677D600B343757A3BDBF2A3163E4C2F869CCA7458AA4A4EFFC311F5CB151685EB9\nstart = monotonic.time.time()\ngi = g*i\nstop = monotonic.time.time()\nprint &quot;Time: %s&quot;%(stop-start)\nstart = monotonic.time.time()\ngr = g*r\nstop = monotonic.time.time()<\/code><\/pre> <div class=\"code-embed-infos\"> <\/div> <\/div>\n<blockquote><p>ECPoint 521-Bit<br \/>\nTime: 0.146764039993<br \/>\nTime: 0.149250984192<\/p><\/blockquote>\n<p>So, one point multiplication on one Intel G4560 3.50GHz core amounted to over 400 ms, and at the time of establishing the connection you need to perform at least 4 multiplications (key negotiation and authentication), i.e. mathematical calculations alone would take almost 2 seconds. For processors whose speeds are measured not in GHz, but in MHz, the results would be measured in minutes, which is completely unacceptable. So we can try to optimize several implementation elements:<\/p>\n<ul>\n<li>large numbers arithmetic,<\/li>\n<li>operations in the field of <img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=F_p\" alt=\"F_p\" align=\"absmiddle\" \/>,<\/li>\n<li>operations in the EC group (<img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=F_p\" alt=\"F_p\" align=\"absmiddle\" \/>),<\/li>\n<li>EC point multiplication (<img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=F_p\" alt=\"F_p\" align=\"absmiddle\" \/>).<\/li>\n<\/ul>\n<p style=\"text-align: left;\">We can optimize the arithmetic of large numbers by standard in terms of computational complexity, conditional jumps, data copy elimination, etc. In our examples we assume for the purposes of our examples that in Python this implementation is well optimized.<br \/>\nOperations in the field of <img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=F_p\" alt=\"F_p\" align=\"absmiddle\" \/> use the arithmetic of large numbers with additional requirements, which results in the need to use operations: modulo and element inversion algorithm. Division and modulo are the most demanding operations for the processor (division is at least 30 times slower than adding). In the implementation of part 3, each operation required the use of modulo. In addition, one could replace modulo operation by comparing to the number p and subtraction, but the conditional instruction would break the instruction piping in the processor, so that despite the prediction of jumps, the gain could not be large. For multiplication, however, there is a way to get rid of the expensive modulo operation &#8211; Montgomery&#8217;s reduction algorithm.<br \/>\nTo be able to use the Montgomery algorithm, we must first calculate the residual\u00a0 <img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=R%3D2%5Ek\" alt=\"R=2^k\" align=\"absmiddle\" \/> which is the smallest approximation greater than p (<img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=2%5E%7Bk-1%7D%20%5C%3B%20%5Cleq%20%5C%3B%20p%20%5C%3B%20%5Clt%20%5C%3B%202%5Ek\" alt=\"2^{k-1} \\; \\leq \\; p \\; \\lt \\; 2^k\" align=\"absmiddle\" \/>).<br \/>\nThe operation of converting the number to a residuum form looks like this:<\/p>\n<p style=\"text-align: center;\"><img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=a'%20%5Cequiv%20a%5Cast%20R%20%5C%3B(mod%5C%3B%20p)\" alt=\"a' \\equiv a\\ast R \\;(mod\\; p)\" align=\"absmiddle\" \/><\/p>\n<p style=\"text-align: left;\">The operation of the return conversion from the residuum form looks like this:<\/p>\n<p style=\"text-align: center;\"><img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=a%20%5Cequiv%20a'%20%5Cast%20R%5E%7B-1%7D%5C%3B(mod%5C%3Bp)\" alt=\"a \\equiv a' \\ast R^{-1}\\;(mod\\;p)\" align=\"absmiddle\" \/><\/p>\n<p style=\"text-align: left;\">The operation of adding numbers in the form of residuum looks identical to the usual addition in <img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=F_p\" alt=\"F_p\" align=\"absmiddle\" \/>:<\/p>\n<p style=\"text-align: center;\"><img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=c'%20%5Cequiv%20a'%20%2B%20b'%5C%3B(mod%5C%3Bp)\" alt=\"c' \\equiv a' + b'\\;(mod\\;p)\" align=\"absmiddle\" \/><\/p>\n<p style=\"text-align: left;\">because:<\/p>\n<p style=\"text-align: left;\"><img class=\"mathtex-equation-editor aligncenter\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=a'%2Bb'%20%5Cequiv%20a%20%5Cast%20R%20%2B%20b%20%5Cast%20R%20%5Cequiv%20%20(a%2Bb)%20%5Cast%20R%20%5Cequiv%20c%20%5Cast%20R%20%5Cequiv%20c'%20%20%20%20%20%20%20%20%20%5C%3B(mod%5C%3Bp)\" alt=\"a'+b' \\equiv a \\ast R + b \\ast R \\equiv (a+b) \\ast R \\equiv c \\ast R \\equiv c' \\;(mod\\;p)\" align=\"absmiddle\" \/><\/p>\n<p style=\"text-align: left;\">The operation of multiplying numbers in the form of residuum looks as follows:<\/p>\n<p style=\"text-align: left;\"><img class=\"mathtex-equation-editor aligncenter\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=c'%20%5Cequiv%20a%20%20%5Cast%20b%20%5Cast%20R%5E%7B-1%7D%20%20%20%5C%3B(mod%5C%3Bp)\" alt=\"c' \\equiv a \\ast b \\ast R^{-1} \\;(mod\\;p)\" align=\"absmiddle\" \/><\/p>\n<p style=\"text-align: left;\">because:<\/p>\n<p style=\"text-align: left;\"><img class=\"mathtex-equation-editor aligncenter\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=a'%5Cast%20b'%20%5Cequiv%20a%20%5Cast%20R%20%5Cast%20b%20%5Cast%20R%20%5Cequiv%20%20(a%20%5Cast%20b)%20%5Cast%20R%5E2%20%5Cequiv%20c%20%5Cast%20R%5E2%20%5Cequiv%20c'%20%5Cast%20R%20%20%20%20%20%20%20%20%20%5C%3B(mod%5C%3Bp)\" alt=\"a'\\ast b' \\equiv a \\ast R \\ast b \\ast R \\equiv (a \\ast b) \\ast R^2 \\equiv c \\ast R^2 \\equiv c' \\ast R \\;(mod\\;p)\" align=\"absmiddle\" \/><\/p>\n<p>However, the above multiplication formula is much less optimal than multiplying in <img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=F_p\" alt=\"F_p\" align=\"absmiddle\" \/>. Therefore, the Montgomery algorithm is used to optimize the multiplication of numbers in the form of residuum, for which it is necessary to calculate <img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=inv_p%20%3D%20%5Cfrac%7BR%20%5Cast%20R%5E%7B-1%7D%20-%201%7D%7B%20p%20%7D\" alt=\"inv_p = \\frac{R \\ast R^{-1} - 1}{ p }\" align=\"absmiddle\" \/><\/p>\n<p>The algorithm looks like this:<\/p>\n<ol>\n<li><img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=T%3Da'%20%5Cast%20b'\" alt=\"T=a' \\ast b'\" align=\"absmiddle\" \/><\/li>\n<li><img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=m%3D%20((T%20%5C%3B%20mod%20%5C%3BR)%20%5Cast%20inv_p%20%5C%3B%20)mod%20%5C%3B%20R\" alt=\"m= ((T \\; mod \\;R) \\ast inv_p \\; )mod \\; R\" align=\"absmiddle\" \/><\/li>\n<li><img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=t%3D%5Cfrac%7BT%2Bm%20%5Cast%20p%7D%7BR%7D\" alt=\"t=\\frac{T+m \\ast p}{R}\" align=\"absmiddle\" \/><\/li>\n<li>If <img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=t%5Cgeq%20p\" alt=\"t\\geq p\" align=\"absmiddle\" \/><br \/>\nthen <img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=c'%3Dt-p\" alt=\"c'=t-p\" align=\"absmiddle\" \/><br \/>\nelse <img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=c'%3Dt\" alt=\"c'=t\" align=\"absmiddle\" \/><\/li>\n<\/ol>\n<p>In the above algorithm we have 3 demanding operations: twice modulo and once dividing. However, the basis for modulo is the number <img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=R%3D2%5Ek\" alt=\"R=2^k\" align=\"absmiddle\" \/>, so we can simplify these operations:<\/p>\n<p style=\"text-align: center;\"><img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=c%20%5C%3B%20mod%20%5C%3B%20R%3Dc%5Cwedge(R%5E%7B-1%7D)\" alt=\"c \\; mod \\; R=c\\wedge(R^{-1})\" align=\"absmiddle\" \/> [bit mask]<br \/>\n<img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=%5Cfrac%7Bc%7D%7BR%7D%3Dc%5Cgg%20k\" alt=\"\\frac{c}{R}=c\\gg k\" align=\"absmiddle\" \/> [shifting bits to the right]<\/p>\n<p>So our implementation of <img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=F_p\" alt=\"F_p\" align=\"absmiddle\" \/> arithmetic using the Montgomery algorithm might look like this:<\/p>\n<div class=\"code-embed-wrapper\"> <pre class=\"language-python code-embed-pre\"  data-start=\"1\" data-line-offset=\"0\"><code class=\"language-python code-embed-code\">#!\/usr\/bin\/python\n#...\nclass FieldElement:\n    def __init__(self, p, rvalue, Rmask, k, invR, invp):\n        self.p = p\n        self.rvalue = rvalue\n        self.Rmask = Rmask\n        self.k = k\n        self.invR = invR\n        self.invp = invp\n        \n    def __str__(self):\n        if self.rvalue &lt; 0:\n            return &quot;-0x%x&quot;%(self.convertFromResiduum()*(-1))\n        return &quot;0x%x&quot;%self.convertFromResiduum()\n    \n    def __add__(self, other):\n        if type(other) == int or type(other) == long:\n            return self.add((other &lt;&lt; self.k)%p)\n        else:\n            return self.add(other.rvalue)\n    \n    def add(self, rvalue):\n        rvalue += self.rvalue\n        if rvalue &gt;= self.p:\n            rvalue -= self.p\n        return FieldElement(self.p, rvalue, self.Rmask, self.k, self.invR, self.invp)\n    \n    def __sub__(self, other):\n        if type(other) == int or type(other) == long:\n            return self.sub((other &lt;&lt; self.k)%p)\n        else:\n            return self.sub(other.rvalue)\n    \n    def sub(self, rvalue):\n        res = self.rvalue\n        if rvalue &gt; res:\n            res += self.p\n        res -= rvalue\n        return FieldElement(self.p, res, self.Rmask, self.k, self.invR, self.invp)\n    \n    def __mul__(self, other):\n        if type(other) == int or type(other) == long:\n            return self.mul((other &lt;&lt; self.k)%p)\n        else:\n            return self.mul(other.rvalue)\n    \n    def mul(self, rvalue):\n        T = rvalue * self.rvalue\n        #m = ( (T &amp; self.Rmask) * self.invp) &amp; self.Rmask\n        m = ( T * self.invp) &amp; self.Rmask\n        t = (T + m * self.p) &gt;&gt; self.k\n        if t &gt;= self.p:\n            rvalue = t - self.p\n        else:\n            rvalue = t\n        return FieldElement(self.p, rvalue, self.Rmask, self.k, self.invR, self.invp)\n    \n    def __eq__(self, other):\n        return (self.rvalue == other.rvalue)\n    \n    def __ne__(self, other):\n        return (self.rvalue != other.rvalue)\n\n    def convertFromResiduum(self):\n        return ((self.rvalue*self.invR)%self.p)\n    \n    @staticmethod\n    def createMethod(value, p):\n        R = 1\n        k = 0\n        while R &lt;= p:\n            R &lt;&lt;= 1\n            k += 1\n        invR = invers(R, p)\n        invp = ((R * invR) - 1) \/ p\n        rvalue = (value * R)%p\n        return FieldElement(p, rvalue, R - 1, k, invR, invp)\n    \n    @staticmethod\n    def factoryMethod(otherFieldElement, value):\n        rvalue = (value &lt;&lt; otherFieldElement.k)%otherFieldElement.p\n        return FieldElement(otherFieldElement.p, rvalue, otherFieldElement.Rmask, otherFieldElement.k, otherFieldElement.invR, otherFieldElement.invp)\n#...<\/code><\/pre> <div class=\"code-embed-infos\"> <\/div> <\/div>\n<blockquote><p>ECPoint 521-Bit<br \/>\nTime: 0.161954164505<br \/>\nTime: 0.163055896759<\/p><\/blockquote>\n<p>And &#8230; it&#8217;s worse \ud83d\ude42<br \/>\nMy first version of this code was even slower. Therefore, the above optimization makes sense only in implementations of large numbers in which operations on bits (shifting, masking) are efficient &#8211; it seems that in Python these operations are not. So you have to take my word for it that in C \/ C ++ implementation the above algorithm noticeably speeds up operations on the <img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=F_p\" alt=\"F_p\" align=\"absmiddle\" \/> field.<br \/>\nIn the next part there will be presented optimization, which will significantly accelerate the operations on the curves: Jacobian coordinates.<\/p>\n<p><a href=\"http:\/\/blog.himitsu.eu\/wp-content\/uploads\/2020\/01\/ec_part4.doc\">ec_part4<\/a> (full code without optimization)<br \/>\n<a href=\"http:\/\/blog.himitsu.eu\/wp-content\/uploads\/2020\/01\/ec_part4m.doc\">ec_part4m<\/a> (full code with Montgomery multiplication)<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Go to part 3 One of the basic parameters of processors is the word length of a single arithmetic operation. This value can range from 8 bits for simple micro-controllers through 32 bits for 80386 derivatives to the most popular 64-bit length today. In addition, processors may have extended instruction sets (such as MMX, SSE, &hellip; <a href=\"https:\/\/blog.himitsu.eu\/?p=148\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Elliptic curves and how to use them (part four)<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[4],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.himitsu.eu\/index.php?rest_route=\/wp\/v2\/posts\/148"}],"collection":[{"href":"https:\/\/blog.himitsu.eu\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.himitsu.eu\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.himitsu.eu\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.himitsu.eu\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=148"}],"version-history":[{"count":6,"href":"https:\/\/blog.himitsu.eu\/index.php?rest_route=\/wp\/v2\/posts\/148\/revisions"}],"predecessor-version":[{"id":157,"href":"https:\/\/blog.himitsu.eu\/index.php?rest_route=\/wp\/v2\/posts\/148\/revisions\/157"}],"wp:attachment":[{"href":"https:\/\/blog.himitsu.eu\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=148"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.himitsu.eu\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=148"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.himitsu.eu\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=148"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}