> From: Brian Warner <
warner-tahoe@...>
> Date: June 3, 2009 5:22:23 AM MDT
> To:
tahoe-dev@...
> Subject: [tahoe-dev] pycryptopp vs ARM
> Reply-To:
tahoe-dev@...
>
>
> I did some digging into the mysterious pycrypto++ corruption
> failures on
> Zandr's ARM-based linkstation. The executive summary: crypto+
> +-5.5.2 is
> broken, 5.6.0 probably fixes it, the problem won't appear on x86 or
> most
> other processors, there's a workaround but it's too ugly to
> contemplate.
>
>
> == The problem with crypto++-5.5.2 and ARM ==
>
> The main workhorse of AES-CTR_Mode is in strcipr.cpp,
> AdditiveCipherTemplate<S>::ProcessData. This is responsible for
> each call to
> AES.ProcessData(), which means it is effectively generating a
> stream cipher
> that looks like:
>
> 16 bytes of AES.encode(00000) ("00000" means 128 bits of zeros)
> 16 bytes of AES.encode(00001) (127 bits of zeros and a single
> "one" bit)
> 16 bytes of AES.encode(00002)
> etc..
>
> and XORing that stream cipher with the input data to produce the
> output data.
> Each call to ProcessData() can involve an arbitrary amount of data,
> but
> always works with adjacent spans of the stream, so it is
> responsible for
> keeping track of how many bytes have been processed already: an
> offset from
> the beginning of the stream.
>
> It remembers this offset in two pieces. The high-order piece is the
> number of
> 16-byte blocks which have been encoded, stored in "m_counterArray",
> as a
> 128-bit/16-byte number (remember that we're using AES here, so the
> keysize
> can be either 128-bits or 256-bits, but the blocksize is always
> 128-bits/16-bytes).
>
> The low-order piece is stored backwards in "m_leftOver": upon entry to
> ProcessData, if we're sitting at offset=0,16,32, then m_leftOver=0,
> but if
> we're sitting at offset=1,17,33 then m_leftOver=15, and if we're at
> offset=15,31,47 then m_leftOver=1.
>
> The AES object has a buffer named "m_buffer" which is used to hold the
> leftover stream cipher blocks between calls to ProcessData. It
> always holds
> exactly 16 bytes, and always updates these bytes as a unit. m_buffer
> [0] holds
> the stream byte that corresponds to offset=0,16,32 . m_buffer[1]
> holds the
> byte that is used for offset=1,17,33.
>
> m_leftOver then tells you which of the right-most bytes in m_buffer
> [] are
> waiting to be used. If m_leftOver=3, then m_buffer[13..15] are
> waiting to be
> XORed to provide offset=13..15, or offset=29..31 .
>
> ProcessData() has three phases:
> 1: use up any leftover stream data sitting in m_buffer from last time
> 2: process as many full 16-byte blocks as possible
> 3: process a trailing partial block, if any, putting the leftover
> stream in
> m_buffer
>
> Phase 1 happens if m_leftOver>0 and works by just XORing the right
> offset in
> m_buffer with the input data, and incrementing all the pointers. It
> does this
> one byte at a time: not as efficient as you might like, but it's
> never being
> used for more than 15 bytes per operation.
>
> Phase 2 has two forms: some operations can handle multple blocks at
> once very
> efficiently (remember that ProcessData() is pretty generic and is
> used by
> lots of different codepaths), so it sets up the underlying
> operation (i.e.
> AES256) to encrypt-and-XOR a couple of blocks and says Go. The
> operation code
> that is passed in to OperateKeystream() includes a note that says
> whether the
> input and output string pointers are aligned, but many operations
> (including
> modes.cpp:CTR_ModePolicy::OperateKeystream) ignore this note.
>
> If the first form couldn't be used, it falls back to the second
> form, which
> has a while(length>=bufferByteSize) loop and explictly writes the
> keystream
> into m_buffer, XORs it with the input string into the output
> string, and
> increments all the pointers. This XOR loop will run one byte at a
> time if its
> arguments aren't aligned.
>
> Phase 3 (which only happens if the remaining length is nonzero)
> writes the
> keystream into m_buffer, XORs just the partial block (one byte at a
> time),
> then sets m_leftOver so that the next time through Phase 1 will use
> the
> previously generated keystream. A given keystream block is only
> generated
> once.
>
> Note that Phase 1 and Phase 3 (and the second form of Phase 2, but
> not the
> first) all use byte-at-a-time loops. Also note that Phase 2 is used
> for the
> bulk of the data, so it wants to be as fast as possible.
>
> == The Gun On The Mantle In Act One ==
>
> All full-size CPUs like big fat memory reads, for efficiency:
> they've got 32-
> or 64- bit memory busses, and read whole cache lines at a time.
> They prefer
> aligned memory reads: x=*((int32_t*)0x0), because unaligned reads like
> x=*((int32_t*)0x1) or x=*((int32_t*)0x2) actually require the
> processor to
> perform two reads (the first of 0x0-0x3, the second of 0x4-0x7) and
> then
> shuffle bytes around to get them into the right place. Memory
> writes behave
> similarly.
>
> (microcontrollers, at least those with an 8-bit bus, don't care about
> alignment. In fact, many of them don't even have 16- or 32- bit
> operations,
> so the issue never comes up).
>
> Some processors will begrudgingly perform unaligned reads for you, but
> they'll be slow and grouchy about it. Some will throw a hardware
> exception,
> which might be caught by the kernel and emulated (meaning it'll
> work, but now
> the kernel is grouchy too, and things are really really slow), or
> might be
> delivered as a SIGILL to your program (which will probably kill it).
>
> The ARM processor has an exciting quirk. In certain configurations
> (which
> depend upon what the kernel wants to do), it uses a third mode, in
> which
> unaligned reads cause the specific byte to be loaded correctly but the
> remaining bytes get shifted around. This effectively corrupts most
> of the
> loaded word.
http://www.aleph1.co.uk/chapter-10-arm-structured-
> alignment-faq
> has a good writeup on the issue: their basic example is (remember
> these are
> little-endian):
>
> Address: 0 1 2 3 4 5 6 7
> Value : 10 21 66 23 ab 5e 9c 1d
>
> Using *(unsigned long*)2 would give:
>
> on x86: 0x5eab2366
> on ARM: 0x21102366
>
> Similar things happen if you try to do an unaligned write: instead of
> modifying bytes [2..5], you'll modify bytes [0..3] with some
> surprisingly
> rotated version of what you were writing. Byte [2] might get the
> right thing,
> but byte[0] will be clobbered.
>
> == The Gun On The Mantle In Act Two ==
>
> The function in Phase 2 which generates the stream data bottoms out in
> rijndael.cpp's Rijndael::Enc::ProcessAndXorBlock, which takes three
> (byte *)
> pointers (plus the internal key, already prepared and turned into
> subkeys):
> the input block (i.e. the counter), the XOR block (i.e. the
> plaintext), and
> the output block (i.e. the ciphertext). Most of the code in
> rijndael.cpp is
> x86 assembly code, but when that's disabled (hint: use M-x
> cpp-highlight-buffer to make emacs colorize or hide code inside
> specific
> conditionals, like !defined(CRYPTOPP_X86_ASM_AVAILABLE) ) the part
> that's
> left is the regular C AES implementation. It shuffles a bunch of
> data around
> internally and then does something like this:
>
> word32 *const obw = (word32 *)outBlock;
> const word32 *const xbw = (const word32 *)xorBlock;
>
> obw[0] = tbw[0] ^ xbw[0] ^ rk[0];
>
> That means it's taking the xorBlock plaintext pointer (a byte*),
> pretending
> it's really a 4-byte (word32*), and then reading from it one word
> at a time.
> If xorBlock was not actually a multiple of 4 bytes, then this will
> be an
> unaligned read. The write is doing the same thing, using outBlock
> and casting
> it to a word32* and then writing words into it, so you can get an
> unaligned
> write from this. This doesn't occur on x86 because the assembly
> version gets
> used instead of the C version (I assume that either the assembly
> handles
> misalignment explicitly, or that the x86 behaves correctly in the
> face of
> unaligned accesses).
>
> == The Gun Is Fired In Act Three ==
>
> The sequence of operations that triggers a pycryptopp failure on
> ARM is when
> an AES encryption instance is used to process two chunks of data in
> which the
> first chunk is less than 16 bytes long and the two chunks combined
> are at
> least 32 bytes long. The python code looks something like this:
>
> expected = AES(key).process(plaintext)
> e2 = AES(key)
> got = e2.process(plaintext[:9]) + e2.process(plaintext[9:9+23])
> assert expected == got # fails
>
> In particular, on ARM we see something like:
> 09,23: 0551d7974ff1d9e29c
> 9d883746f146a6,ffc9ce5ecaa9abd890da470e46000000 <--GOT
> 09,23: 0551d7974ff1d9e29c
> 9d883746080343,f15abafbc9ca5ac6a9a7eeaada790a42 <--EXPECTED
>
> (where the space is the break between the 9-byte call and the 23-
> byte call,
> and the comma is where the 16-byte blocksize lands).
>
> We also see failures with other length combinations: 9+24, 10+22, 10
> +23,
> 10+24, 15+17, etc.
>
> Let's look specifically at 9+23. The first call to process() will
> skip Phase
> 1 (since this is the first time we've used this object), will skip
> Phase 2
> (since we're not processing a full block), and only run Phase 3.
> Phase 3 will
> generate the stream data for offset=0..15 and store it in m_buffer,
> then XOR
> the first 9 bytes of that to produce the ciphertext, and store it
> in the
> output pointer. Note that pycryptopp/Python allocates a new string
> for each
> call to ProcessData(), and those strings are always 4-byte aligned.
> Phase 3
> sets m_leftOver to 7, since we've only used 9 bytes out of the 16 in
> m_buffer. This works fine. The 9 bytes of ciphertext
> (0551d7974ff1d9e29c)
> returned by process() are correct. The output buffer so far:
>
> 09,23: 0551d7974ff1d9e29c
>
> The second call to process() wants to process 23 bytes of
> plaintext. It
> allocates a Python string for the result, let's pretend it gets
> address
> 0x1000 (it will generally be 4-byte aligned, so it could just as
> easily be at
> 0x1004 or 0x1008 or 0x100c). Phase 1 will see that m_leftOver=7, so
> it does
> byte-wise XORs the remaining stream data from m_buffer into the
> output and
> gets 9d883746080343. So far, so good. Phase 1 finishes by
> incrementing the
> output pointer, in this case 0x1000+7=0x1007, and decrementing the
> remaining
> length to be processed, 23-7=16. The output buffer now has:
>
> 09,23: 0551d7974ff1d9e29c 9d883746080343
>
> Phase 2 wants to work with whole blocks, and since length>=16, it
> gets to
> run. It calls OperateKeystream() and tells it to encrypt-and-XOR
> data using a
> ciphertext target pointer (outblock) of... 0x1007. Here is the
> problem. That
> 32-bit cast and obw[0]= dereference causes an unaligned access, and
> on ARM
> you wind up with writes to [0x1004-0x1007] instead of
> [0x1007-0x100a]. This
> clobbers the bytes which were already written, giving us:
>
> 09,23: 0551d7974ff1d9e29c 9d883746080343
> xxxxxxxx <- expected write
> f146a6,ff <- actual write
> 09,23: 0551d7974ff1d9e29c 9d883746f146a6,ff <- resulting data
>
> The rest of Phase 2 does more damage. I think the unalignedness of the
> plaintext pointer (xorBlock) is affecting stuff too, which is why the
> corrupted ciphertext looks both shifted and bit-flipped from the
> expected
> value.
>
> I instrumented the ProcessData() code in cryptopp-5.5.2 to confirm
> that the
> last byte of the output buffer was correct (0x43) at the end of
> Phase 1 and
> then clobbered (0xa6) at the end of Phase 2.
>
> This corruption doesn't happen if Phase 1 didn't run, since then
> the pointers
> are still equal to the original (4-byte aligned) Python string
> buffer, which
> means that if your call to ProcessData() leaves it on a 16-byte
> boundary,
> then you're fine.
>
> It also doesn't happen unless Phase 2 runs, which requires that
> there be at
> least 16 bytes left to process (so it can do a full block). And it
> only
> happens on ARM where aligned accesses give you this weird
> corruption behavior
> (on other chips you might get a trap or a hard-to-spot performance
> problem).
>
> Calling ProcessData() only once per SigningKey instance won't
> trigger it.
> Always calling ProcessData() with short strings (<16 bytes) won't
> trigger it,
> and always calling it with multiples of 16-byte strings won't
> trigger it.
>
> == Episode 4: A New Hope ==
>
> cryptopp-5.6.0 changes the AES implementation considerably. I
> haven't traced
> through it enough to be sure, but I suspect that they've fixed the
> problem,
> because of a new #define CRYPTOPP_ALLOW_UNALIGNED_DATA_ACCESS
> (which is
> conservatively only set on x86, x64, and PowerPC). In addition, the
> vital
> write in rijndael.cpp now looks like:
>
> Block::Put(xorBlock, outBlock)(tbw[0]^rk[0])(tbw[1]^rk[1])(tbw[2]
> ^rk[2])(tbw[3]^rk[3]);
>
> and is done without casting xorBlock or outBlock to word32*. I
> can't find
> where Block::Put is defined, but I'm encouraged that this will fix the
> problem. The terse release notes for 5.6.0 don't mention
> intentionally fixing
> any problem like this.
>
> I've got a library compiling now to test this. I'll have to let it run
> overnight.. compiling libcrypto++ on Zandr's linkstation takes
> about 3 hours.
> (incidentally, it runs a lot faster if you edit the makefile to use
> -O0).
>
> What this means is that libcrypto++-5.5.2 has a grave bug on ARM,
> which is
> probably fixed in the current libcryptopp++-5.6.0 . If we really
> wanted to,
> pycryptopp could work around the bug, by keeping track of how many
> bytes we'd
> submitted to ProcessData() and splitting up the input data (at most
> one split
> per call to .process) such that we either tell ProcessData() to
> start at a
> 16-byte boundary, or we give it less than 16 bytes of data to work
> with.
> Something like:
>
> def process(self, plaintext):
> if len(plaintext) < 16 or self.counter%16==0:
> self.counter += len(plaintext)
> return self.e.ProcessData(plaintext)
> gap = 16 - (self.counter%16)
> return self.process(plaintext[:gap]) + self.process
> (plaintext[gap:])
>
> That ought to sound crazy to you.
>
> Lenny has 5.5.2, and while we should develop a minimal (C++)
> demonstration
> case and file a "serious" or "grave" debian/lenny bug on libcrypto++7
> (specific to arm/armel), I think it's unlikely that we could
> convince them to
> issue a security fix and upgrade stable all the way to 5.6.0 for
> just one
> architecture. So ARM boxes that want to use pycryptopp with
> --disable-embedded-cryptopp will need to upgrade to sid (which has
> 5.6.0).
> Fortunately, pycryptopp.deb isn't going to go into Lenny anyways
> (since it's
> already shipped), so this is less of an issue for the
> get-pycryptopp-into-Debian effort, just for people who want to use
> pycryptopp
> in the syslib form on a lenny/stable ARM box. We may want to consider
> building a backported cryptopp-5.6.0 for these folks.
>
> The embedded-cryptopp form should work (assuming my hopes for
> Block::Put are
> justified), because Zooko recently upgraded the embedded copy to
> 5.6.0 .
>
> We should also look at crypto++'s test suite and make sure there's
> a case
> which uses a single encryption object to two ProcessData operations
> (with
> various lengths) so this case is being exercised right from the
> source. I've
> attached the python equivalent below.
>
> cheers,
> -Brian
>
>
> #! /usr/bin/python
>
> from pycryptopp.cipher.aes import AES
> from binascii import a2b_hex, b2a_hex
>
> key = a2b_hex
> ("54b0789aeeddb6b3351e6d7b8d79d8d489582376ab1b322dd3362ccbfdb82f7a")
> assert len(key) == 32
> plaintext = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
> assert len(plaintext) == 52
> expected_ciphertext = AES(key=key).process(plaintext)
> #print b2a_hex(expected_ciphertext[:16]), b2a_hex
> (expected_ciphertext[16:32]), b2a_hex(expected_ciphertext[32:48])
> expected_s =
> "0551d7974ff1d9e29c9d883746080343"+"f15abafbc9ca5ac6a9a7eeaada790a42"+
> "73a45de485a97ecd52d79fe702a07a67"+"9b8d73ad"
> assert b2a_hex(expected_ciphertext) == expected_s
>
> errors = []
> for i in range(1, 25):
> for j in range(1, 25):
> p = AES(key=key)
> a = p.process(plaintext[:i])
> assert len(a) == i
> b = p.process(plaintext[i:i+j])
> assert len(b) == j
> c = a+b
> if c != expected_ciphertext[:i+j]:
> errors.append((i,j,a,b,expected_ciphertext
> [:i],expected_ciphertext[i:i+j]))
> for (i,j,a,b,ea,eb) in errors:
> print "%02d,%02d: %s %s <--GOT" % (i, j, b2a_hex(a), b2a_hex(b))
> print "%02d,%02d: %s %s <--EXPECTED" % (i, j, b2a_hex(ea),
> b2a_hex(eb))
> print
> if not errors:
> print "no errors detected"
> _______________________________________________
> tahoe-dev mailing list
>
tahoe-dev@...
>
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-devYou received this message because you are subscribed to the "Crypto++ Users" Google Group.