Fwd: [tahoe-dev] pycryptopp vs ARM

View: New views
3 Messages — Rating Filter:   Alert me  

Parent Message unknown Fwd: [tahoe-dev] pycryptopp vs ARM

by zooko :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


Begin forwarded message:

> From: Brian Warner <warner-tahoe@...>
> Date: June 3, 2009 5:22:23 AM MDT
> To: tahoe-dev@...
> Subject: [tahoe-dev] pycryptopp vs ARM
> Reply-To: tahoe-dev@...
>
>
> I did some digging into the mysterious pycrypto++ corruption  
> failures on
> Zandr's ARM-based linkstation. The executive summary: crypto+
> +-5.5.2 is
> broken, 5.6.0 probably fixes it, the problem won't appear on x86 or  
> most
> other processors, there's a workaround but it's too ugly to  
> contemplate.
>
>
> == The problem with crypto++-5.5.2 and ARM ==
>
> The main workhorse of AES-CTR_Mode is in strcipr.cpp,
> AdditiveCipherTemplate<S>::ProcessData. This is responsible for  
> each call to
> AES.ProcessData(), which means it is effectively generating a  
> stream cipher
> that looks like:
>
>  16 bytes of AES.encode(00000) ("00000" means 128 bits of zeros)
>  16 bytes of AES.encode(00001) (127 bits of zeros and a single  
> "one" bit)
>  16 bytes of AES.encode(00002)
>  etc..
>
> and XORing that stream cipher with the input data to produce the  
> output data.
> Each call to ProcessData() can involve an arbitrary amount of data,  
> but
> always works with adjacent spans of the stream, so it is  
> responsible for
> keeping track of how many bytes have been processed already: an  
> offset from
> the beginning of the stream.
>
> It remembers this offset in two pieces. The high-order piece is the  
> number of
> 16-byte blocks which have been encoded, stored in "m_counterArray",  
> as a
> 128-bit/16-byte number (remember that we're using AES here, so the  
> keysize
> can be either 128-bits or 256-bits, but the blocksize is always
> 128-bits/16-bytes).
>
> The low-order piece is stored backwards in "m_leftOver": upon entry to
> ProcessData, if we're sitting at offset=0,16,32, then m_leftOver=0,  
> but if
> we're sitting at offset=1,17,33 then m_leftOver=15, and if we're at
> offset=15,31,47 then m_leftOver=1.
>
> The AES object has a buffer named "m_buffer" which is used to hold the
> leftover stream cipher blocks between calls to ProcessData. It  
> always holds
> exactly 16 bytes, and always updates these bytes as a unit. m_buffer
> [0] holds
> the stream byte that corresponds to offset=0,16,32 . m_buffer[1]  
> holds the
> byte that is used for offset=1,17,33.
>
> m_leftOver then tells you which of the right-most bytes in m_buffer
> [] are
> waiting to be used. If m_leftOver=3, then m_buffer[13..15] are  
> waiting to be
> XORed to provide offset=13..15, or offset=29..31 .
>
> ProcessData() has three phases:
>  1: use up any leftover stream data sitting in m_buffer from last time
>  2: process as many full 16-byte blocks as possible
>  3: process a trailing partial block, if any, putting the leftover  
> stream in
>     m_buffer
>
> Phase 1 happens if m_leftOver>0 and works by just XORing the right  
> offset in
> m_buffer with the input data, and incrementing all the pointers. It  
> does this
> one byte at a time: not as efficient as you might like, but it's  
> never being
> used for more than 15 bytes per operation.
>
> Phase 2 has two forms: some operations can handle multple blocks at  
> once very
> efficiently (remember that ProcessData() is pretty generic and is  
> used by
> lots of different codepaths), so it sets up the underlying  
> operation (i.e.
> AES256) to encrypt-and-XOR a couple of blocks and says Go. The  
> operation code
> that is passed in to OperateKeystream() includes a note that says  
> whether the
> input and output string pointers are aligned, but many operations  
> (including
> modes.cpp:CTR_ModePolicy::OperateKeystream) ignore this note.
>
> If the first form couldn't be used, it falls back to the second  
> form, which
> has a while(length>=bufferByteSize) loop and explictly writes the  
> keystream
> into m_buffer, XORs it with the input string into the output  
> string, and
> increments all the pointers. This XOR loop will run one byte at a  
> time if its
> arguments aren't aligned.
>
> Phase 3 (which only happens if the remaining length is nonzero)  
> writes the
> keystream into m_buffer, XORs just the partial block (one byte at a  
> time),
> then sets m_leftOver so that the next time through Phase 1 will use  
> the
> previously generated keystream. A given keystream block is only  
> generated
> once.
>
> Note that Phase 1 and Phase 3 (and the second form of Phase 2, but  
> not the
> first) all use byte-at-a-time loops. Also note that Phase 2 is used  
> for the
> bulk of the data, so it wants to be as fast as possible.
>
> == The Gun On The Mantle In Act One ==
>
> All full-size CPUs like big fat memory reads, for efficiency:  
> they've got 32-
> or 64- bit memory busses, and read whole cache lines at a time.  
> They prefer
> aligned memory reads: x=*((int32_t*)0x0), because unaligned reads like
> x=*((int32_t*)0x1) or x=*((int32_t*)0x2) actually require the  
> processor to
> perform two reads (the first of 0x0-0x3, the second of 0x4-0x7) and  
> then
> shuffle bytes around to get them into the right place. Memory  
> writes behave
> similarly.
>
> (microcontrollers, at least those with an 8-bit bus, don't care about
> alignment. In fact, many of them don't even have 16- or 32- bit  
> operations,
> so the issue never comes up).
>
> Some processors will begrudgingly perform unaligned reads for you, but
> they'll be slow and grouchy about it. Some will throw a hardware  
> exception,
> which might be caught by the kernel and emulated (meaning it'll  
> work, but now
> the kernel is grouchy too, and things are really really slow), or  
> might be
> delivered as a SIGILL to your program (which will probably kill it).
>
> The ARM processor has an exciting quirk. In certain configurations  
> (which
> depend upon what the kernel wants to do), it uses a third mode, in  
> which
> unaligned reads cause the specific byte to be loaded correctly but the
> remaining bytes get shifted around. This effectively corrupts most  
> of the
> loaded word. http://www.aleph1.co.uk/chapter-10-arm-structured- 
> alignment-faq
> has a good writeup on the issue: their basic example is (remember  
> these are
> little-endian):
>
>         Address: 0  1  2  3  4  5  6  7
>         Value  : 10 21 66 23 ab 5e 9c 1d
>
>   Using *(unsigned long*)2 would give:
>
>         on x86: 0x5eab2366
>         on ARM: 0x21102366
>
> Similar things happen if you try to do an unaligned write: instead of
> modifying bytes [2..5], you'll modify bytes [0..3] with some  
> surprisingly
> rotated version of what you were writing. Byte [2] might get the  
> right thing,
> but byte[0] will be clobbered.
>
> == The Gun On The Mantle In Act Two ==
>
> The function in Phase 2 which generates the stream data bottoms out in
> rijndael.cpp's Rijndael::Enc::ProcessAndXorBlock, which takes three  
> (byte *)
> pointers (plus the internal key, already prepared and turned into  
> subkeys):
> the input block (i.e. the counter), the XOR block (i.e. the  
> plaintext), and
> the output block (i.e. the ciphertext). Most of the code in  
> rijndael.cpp is
> x86 assembly code, but when that's disabled (hint: use M-x
> cpp-highlight-buffer to make emacs colorize or hide code inside  
> specific
> conditionals, like !defined(CRYPTOPP_X86_ASM_AVAILABLE) ) the part  
> that's
> left is the regular C AES implementation. It shuffles a bunch of  
> data around
> internally and then does something like this:
>
> word32 *const obw = (word32 *)outBlock;
> const word32 *const xbw = (const word32 *)xorBlock;
>
> obw[0] = tbw[0] ^ xbw[0] ^ rk[0];
>
> That means it's taking the xorBlock plaintext pointer (a byte*),  
> pretending
> it's really a 4-byte (word32*), and then reading from it one word  
> at a time.
> If xorBlock was not actually a multiple of 4 bytes, then this will  
> be an
> unaligned read. The write is doing the same thing, using outBlock  
> and casting
> it to a word32* and then writing words into it, so you can get an  
> unaligned
> write from this. This doesn't occur on x86 because the assembly  
> version gets
> used instead of the C version (I assume that either the assembly  
> handles
> misalignment explicitly, or that the x86 behaves correctly in the  
> face of
> unaligned accesses).
>
> == The Gun Is Fired In Act Three ==
>
> The sequence of operations that triggers a pycryptopp failure on  
> ARM is when
> an AES encryption instance is used to process two chunks of data in  
> which the
> first chunk is less than 16 bytes long and the two chunks combined  
> are at
> least 32 bytes long. The python code looks something like this:
>
>  expected = AES(key).process(plaintext)
>  e2 = AES(key)
>  got = e2.process(plaintext[:9]) + e2.process(plaintext[9:9+23])
>  assert expected == got # fails
>
> In particular, on ARM we see something like:
>  09,23: 0551d7974ff1d9e29c  
> 9d883746f146a6,ffc9ce5ecaa9abd890da470e46000000 <--GOT
>  09,23: 0551d7974ff1d9e29c  
> 9d883746080343,f15abafbc9ca5ac6a9a7eeaada790a42 <--EXPECTED
>
> (where the space is the break between the 9-byte call and the 23-
> byte call,
> and the comma is where the 16-byte blocksize lands).
>
> We also see failures with other length combinations: 9+24, 10+22, 10
> +23,
> 10+24, 15+17, etc.
>
> Let's look specifically at 9+23. The first call to process() will  
> skip Phase
> 1 (since this is the first time we've used this object), will skip  
> Phase 2
> (since we're not processing a full block), and only run Phase 3.  
> Phase 3 will
> generate the stream data for offset=0..15 and store it in m_buffer,  
> then XOR
> the first 9 bytes of that to produce the ciphertext, and store it  
> in the
> output pointer. Note that pycryptopp/Python allocates a new string  
> for each
> call to ProcessData(), and those strings are always 4-byte aligned.  
> Phase 3
> sets m_leftOver to 7, since we've only used 9 bytes out of the 16 in
> m_buffer. This works fine. The 9 bytes of ciphertext  
> (0551d7974ff1d9e29c)
> returned by process() are correct. The output buffer so far:
>
>  09,23: 0551d7974ff1d9e29c
>
> The second call to process() wants to process 23 bytes of  
> plaintext. It
> allocates a Python string for the result, let's pretend it gets  
> address
> 0x1000 (it will generally be 4-byte aligned, so it could just as  
> easily be at
> 0x1004 or 0x1008 or 0x100c). Phase 1 will see that m_leftOver=7, so  
> it does
> byte-wise XORs the remaining stream data from m_buffer into the  
> output and
> gets 9d883746080343. So far, so good. Phase 1 finishes by  
> incrementing the
> output pointer, in this case 0x1000+7=0x1007, and decrementing the  
> remaining
> length to be processed, 23-7=16. The output buffer now has:
>
>  09,23: 0551d7974ff1d9e29c 9d883746080343
>
> Phase 2 wants to work with whole blocks, and since length>=16, it  
> gets to
> run. It calls OperateKeystream() and tells it to encrypt-and-XOR  
> data using a
> ciphertext target pointer (outblock) of... 0x1007. Here is the  
> problem. That
> 32-bit cast and obw[0]= dereference causes an unaligned access, and  
> on ARM
> you wind up with writes to [0x1004-0x1007] instead of  
> [0x1007-0x100a]. This
> clobbers the bytes which were already written, giving us:
>
>  09,23: 0551d7974ff1d9e29c 9d883746080343
>                                           xxxxxxxx  <- expected write
>                                    f146a6,ff   <- actual write
>  09,23: 0551d7974ff1d9e29c 9d883746f146a6,ff   <- resulting data
>
> The rest of Phase 2 does more damage. I think the unalignedness of the
> plaintext pointer (xorBlock) is affecting stuff too, which is why the
> corrupted ciphertext looks both shifted and bit-flipped from the  
> expected
> value.
>
> I instrumented the ProcessData() code in cryptopp-5.5.2 to confirm  
> that the
> last byte of the output buffer was correct (0x43) at the end of  
> Phase 1 and
> then clobbered (0xa6) at the end of Phase 2.
>
> This corruption doesn't happen if Phase 1 didn't run, since then  
> the pointers
> are still equal to the original (4-byte aligned) Python string  
> buffer, which
> means that if your call to ProcessData() leaves it on a 16-byte  
> boundary,
> then you're fine.
>
> It also doesn't happen unless Phase 2 runs, which requires that  
> there be at
> least 16 bytes left to process (so it can do a full block). And it  
> only
> happens on ARM where aligned accesses give you this weird  
> corruption behavior
> (on other chips you might get a trap or a hard-to-spot performance  
> problem).
>
> Calling ProcessData() only once per SigningKey instance won't  
> trigger it.
> Always calling ProcessData() with short strings (<16 bytes) won't  
> trigger it,
> and always calling it with multiples of 16-byte strings won't  
> trigger it.
>
> == Episode 4: A New Hope ==
>
> cryptopp-5.6.0 changes the AES implementation considerably. I  
> haven't traced
> through it enough to be sure, but I suspect that they've fixed the  
> problem,
> because of a new #define CRYPTOPP_ALLOW_UNALIGNED_DATA_ACCESS  
> (which is
> conservatively only set on x86, x64, and PowerPC). In addition, the  
> vital
> write in rijndael.cpp now looks like:
>
>   Block::Put(xorBlock, outBlock)(tbw[0]^rk[0])(tbw[1]^rk[1])(tbw[2]
> ^rk[2])(tbw[3]^rk[3]);
>
> and is done without casting xorBlock or outBlock to word32*. I  
> can't find
> where Block::Put is defined, but I'm encouraged that this will fix the
> problem. The terse release notes for 5.6.0 don't mention  
> intentionally fixing
> any problem like this.
>
> I've got a library compiling now to test this. I'll have to let it run
> overnight.. compiling libcrypto++ on Zandr's linkstation takes  
> about 3 hours.
> (incidentally, it runs a lot faster if you edit the makefile to use  
> -O0).
>
> What this means is that libcrypto++-5.5.2 has a grave bug on ARM,  
> which is
> probably fixed in the current libcryptopp++-5.6.0 . If we really  
> wanted to,
> pycryptopp could work around the bug, by keeping track of how many  
> bytes we'd
> submitted to ProcessData() and splitting up the input data (at most  
> one split
> per call to .process) such that we either tell ProcessData() to  
> start at a
> 16-byte boundary, or we give it less than 16 bytes of data to work  
> with.
> Something like:
>
>     def process(self, plaintext):
>         if len(plaintext) < 16 or self.counter%16==0:
>             self.counter += len(plaintext)
>             return self.e.ProcessData(plaintext)
>         gap = 16 - (self.counter%16)
>         return self.process(plaintext[:gap]) + self.process
> (plaintext[gap:])
>
> That ought to sound crazy to you.
>
> Lenny has 5.5.2, and while we should develop a minimal (C++)  
> demonstration
> case and file a "serious" or "grave" debian/lenny bug on libcrypto++7
> (specific to arm/armel), I think it's unlikely that we could  
> convince them to
> issue a security fix and upgrade stable all the way to 5.6.0 for  
> just one
> architecture. So ARM boxes that want to use pycryptopp with
> --disable-embedded-cryptopp will need to upgrade to sid (which has  
> 5.6.0).
> Fortunately, pycryptopp.deb isn't going to go into Lenny anyways  
> (since it's
> already shipped), so this is less of an issue for the
> get-pycryptopp-into-Debian effort, just for people who want to use  
> pycryptopp
> in the syslib form on a lenny/stable ARM box. We may want to consider
> building a backported cryptopp-5.6.0 for these folks.
>
> The embedded-cryptopp form should work (assuming my hopes for  
> Block::Put are
> justified), because Zooko recently upgraded the embedded copy to  
> 5.6.0 .
>
> We should also look at crypto++'s test suite and make sure there's  
> a case
> which uses a single encryption object to two ProcessData operations  
> (with
> various lengths) so this case is being exercised right from the  
> source. I've
> attached the python equivalent below.
>
> cheers,
>  -Brian
>
>
> #! /usr/bin/python
>
> from pycryptopp.cipher.aes import AES
> from binascii import a2b_hex, b2a_hex
>
> key = a2b_hex
> ("54b0789aeeddb6b3351e6d7b8d79d8d489582376ab1b322dd3362ccbfdb82f7a")
> assert len(key) == 32
> plaintext = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
> assert len(plaintext) == 52
> expected_ciphertext = AES(key=key).process(plaintext)
> #print b2a_hex(expected_ciphertext[:16]), b2a_hex
> (expected_ciphertext[16:32]), b2a_hex(expected_ciphertext[32:48])
> expected_s =  
> "0551d7974ff1d9e29c9d883746080343"+"f15abafbc9ca5ac6a9a7eeaada790a42"+
> "73a45de485a97ecd52d79fe702a07a67"+"9b8d73ad"
> assert b2a_hex(expected_ciphertext) == expected_s
>
> errors = []
> for i in range(1, 25):
>     for j in range(1, 25):
>         p = AES(key=key)
>         a = p.process(plaintext[:i])
>         assert len(a) == i
>         b = p.process(plaintext[i:i+j])
>         assert len(b) == j
>         c = a+b
>         if c != expected_ciphertext[:i+j]:
>             errors.append((i,j,a,b,expected_ciphertext
> [:i],expected_ciphertext[i:i+j]))
> for (i,j,a,b,ea,eb) in errors:
>     print "%02d,%02d: %s %s <--GOT" % (i, j, b2a_hex(a), b2a_hex(b))
>     print "%02d,%02d: %s %s <--EXPECTED" % (i, j, b2a_hex(ea),  
> b2a_hex(eb))
>     print
> if not errors:
>     print "no errors detected"
> _______________________________________________
> tahoe-dev mailing list
> tahoe-dev@...
> http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the "Crypto++ Users" Google Group.
To unsubscribe, send an email to cryptopp-users-unsubscribe@....
More information about Crypto++ and this group is available at http://www.cryptopp.com.
-~----------~----~----~----~------~----~------~--~---


Re: [tahoe-dev] pycryptopp vs ARM

by zooko :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


Dear Wei Dai:

It looks like you specifically fixed this bug, judging by the name of  
the macro: "CRYPTOPP_ALLOW_UNALIGNED_DATA_ACCESS".  But there is no  
mention of this issue in the Readme.txt about release 5.6.0, so I'm  
guessing that you forgot to mention it or decided not to mention it  
in the release notes.  Is that right?

Thanks,

Zooko

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the "Crypto++ Users" Google Group.
To unsubscribe, send an email to cryptopp-users-unsubscribe@....
More information about Crypto++ and this group is available at http://www.cryptopp.com.
-~----------~----~----~----~------~----~------~--~---


Re: [tahoe-dev] pycryptopp vs ARM

by Wei Dai :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


In 5.6.0 I did make a change to the BlockTransformation interface, so that
classes (such as AES) that implement it no longer assume that input and
output are aligned. In 5.5.x and earlier this alignment assumption was
supposed to be enforced by the Mode classes, but apparently (as Brian's work
shows) there was a bug that manifested itself on ARM. I made that change in
part because I thought the alignment enforcement code was too complicated,
but I wasn't specifically aware of this bug at that point.

I'll add a note to the change history about the modification to the
BlockTransformation interface, as well as this bug.

--------------------------------------------------
From: "Zooko Wilcox-O'Hearn" <zooko@...>
Sent: Thursday, June 04, 2009 3:06 PM
To: "Crypto++ Users" <cryptopp-users@...>; "Brian Warner"
<warner-tahoe@...>
Subject: Re: [tahoe-dev] pycryptopp vs ARM

>
> Dear Wei Dai:
>
> It looks like you specifically fixed this bug, judging by the name of
> the macro: "CRYPTOPP_ALLOW_UNALIGNED_DATA_ACCESS".  But there is no
> mention of this issue in the Readme.txt about release 5.6.0, so I'm
> guessing that you forgot to mention it or decided not to mention it
> in the release notes.  Is that right?
>
> Thanks,
>
> Zooko
>
> >
>

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the "Crypto++ Users" Google Group.
To unsubscribe, send an email to cryptopp-users-unsubscribe@....
More information about Crypto++ and this group is available at http://www.cryptopp.com.
-~----------~----~----~----~------~----~------~--~---