Salsa20 in pure Python 2.5

version p3.2

Daniel J. Bernstein's Salsa20, is a provisionally well respected stream cipher. Larry Bugbee's pySalsa20 C extension wrapper has been around for a while, but no pure Python version, despite occasional forlorn calls on the net.

I wrote pureSalsa20 for a program that doesn't need raw speed, but which I want to distribute in such a way that users don't have to compile C code or install a binary library.

Raw speed pureSalsa20 has not got. On my old G3 Mac, it can encrypt around 24KB/sec., and this is pureSalsa20's niche: when you need portability but not speed or real security. More about speed below. Meanwhile,

Get, install, test, use.

Download the file pureSalsa20_p3_2.zip. It will unzip into a pureSalsa20_p3_2 directory, which will include In the typical Python way, you put
    from pureSalsa20 import Salsa20
at the top of your program, and either have pureSalsa20.py in the current directory when you run it, or pre-install pureSalsa20 in your Python's standard location with the (Linux or Mac OS) command
    sudo python setup.py install
pureSalsa20 follows the API of Bugbee's pySalsa20, and I included Bugbee's use notes wholesale in the comments section of pureSalsa20.py. There are also comments there about my implementation and about testSalsa20.py .

Appendix: Speed

Assuming I haven't missed a Python trick, the reason pureSalsa20 is so slow has to do with how the 32-bit add and rotate operations that are basic to the Salsa20 algorithm, and are very fast on modern processors, are not simply and speedily supported in Python. If you express them simply in Python, Python wants to switch to its variable-precision "long" integers, which take longer to calculate, and if you're careful not to generate longs, the code is maybe twice as fast, but more complex.

If you're a Python geek, you might like to try your hand at the problem by writing three functions--

Answers below.

The crazy (but of course perfectly sane) thing is the sign bit. Python has no unsigned ints. The top bit of a 32-bit int is not 2**31 or 1<<31, but -(1<<31), or -0x80000000. Here's trunc32() from my code:

def trunc32( w ):
    """ Return the bottom 32 bits of w as a Python int.
        This may create a long temporarily, but returns an int. """
    w = int( ( w & 0x7fffFFFF ) | -( w & 0x80000000 ) ) 
    assert type(w) == int
    return w
trunc32() creates a couple longs temporarily, but I only use it in non- timing-critical testing, so I didn't avoid "longing". But I did avoid it in these two:
def add32( a, b ):
    """ Add two 32-bit words discarding carry above 32nd bit,
        and without creating a Python long.
        Timing shouldn't vary.
    lo = ( a & 0xFFFF ) + ( b & 0xFFFF )
    hi = ( a >> 16 ) + ( b >> 16 ) + ( lo >> 16 )
    return ( -(hi & 0x8000) | ( hi & 0x7FFF ) ) << 16 | ( lo & 0xFFFF )

def rot32( w, nLeft ):
    """ Rotate 32-bit word left by nLeft or right by -nLeft
        without creating a Python long.
        Timing depends on nLeft but not on w.
    nLeft &= 31  # which makes nLeft >= 0
    if nLeft == 0:
        return w

    # Note: now 1 <= nLeft <= 31.
    #     RRRsLLLLLL   There are nLeft RRR's, (31-nLeft) LLLLLL's,
    # =>  sLLLLLLRRR   and one s which becomes the sign bit.
    RRR = ( ( ( w >> 1 ) & 0x7fffFFFF ) >> ( 31 - nLeft ) )
    sLLLLLL = -( (1<<(31-nLeft)) & w ) | (0x7fffFFFF>>nLeft) & w
    return RRR | ( sLLLLLL << nLeft )

(This rot32() deals with nLeft < 0 or nLeft > 31 by taking nLeft mod 32, which is more than my spec above, but it converts the problem to that problem.)

All in all, it's like Homer Simpson juggling nuclear waste in a glove box.... in slow motion.

PureSalsa20 revisions:

      p3.2   Fixed bug that initialized the output buffer with plaintext!
             Saner ramping of nreps in speed test.
             Minor changes and print statements.
      p3.1   Took timing variability out of add32() and rot32().
             Made the internals more like pySalsa20/libsalsa .
             Put the semicolons back in the main loop!
             In encryptBytes(), modify a byte array instead of appending.
             Used subclasses instead of patches in testSalsa20.py .
             Fixed speed calculation bug.
             Added 64k-byte messages to speed test to be fair to pySalsa20.
      p3     First version, intended to parallel pySalsa20 version 3.

--Steve Witham
Up to my home page.