EncryptionUtil should expose Cipher.getOutputSize()

by Jonathan Harlap

Hi James,

I wrote up some quick test code to determine how accurate  
getOutputSize() is.  Rather than test on a selection of files, I just  
generated messages (byte arrays) of different sizes, created a Cipher,  
used getOutputSize() to predict the output size, and then ran the
encryption and measured the output size to see if the prediction was
good.  One of the issues that may come up, based on the documentation,  
is that the prediction quality may be dependent on which algorithm is  
being used.  I've run the test on messages sized from 100 bytes to  
10000 bytes using 1 byte increments and the cipher types  
PBEWITHSHA256AND256BITAES-CBC-BC (using the Bouncy Castle provider)  
and PBEWithMD5AndDES and PBEWithMD5AndTripleDES (both using Sun's  
provider), and the prediction is accurate every time.
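
The test described above can be sketched roughly as follows. This is a minimal reconstruction, not the attached testCipherOutputSize.java: the class name, the fixed salt, the iteration count and the password are all placeholders chosen for illustration, and it defaults to PBEWithMD5AndDES so that no extra provider is needed.

```java
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.PBEParameterSpec;

class OutputSizeCheck {
    // Hypothetical fixed salt and iteration count, for illustration only.
    private static final byte[] SALT = {1, 2, 3, 4, 5, 6, 7, 8};
    private static final int ITERATIONS = 1000;

    /** Builds a fresh encrypting Cipher for the given PBE algorithm. */
    static Cipher newCipher(String algorithm) {
        try {
            SecretKey key = SecretKeyFactory.getInstance(algorithm)
                .generateSecret(new PBEKeySpec("password".toCharArray()));
            Cipher cipher = Cipher.getInstance(algorithm);
            cipher.init(Cipher.ENCRYPT_MODE, key,
                new PBEParameterSpec(SALT, ITERATIONS));
            return cipher;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns how many message sizes in [from, to] were predicted wrongly. */
    static int countMismatches(String algorithm, int from, int to) {
        int mismatches = 0;
        for (int size = from; size <= to; size++) {
            // A new Cipher per message, so nothing is left buffered.
            Cipher cipher = newCipher(algorithm);
            int predicted = cipher.getOutputSize(size);
            int actual;
            try {
                actual = cipher.doFinal(new byte[size]).length;
            } catch (GeneralSecurityException e) {
                throw new IllegalStateException(e);
            }
            if (predicted != actual) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        String algorithm = args.length > 0 ? args[0] : "PBEWithMD5AndDES";
        System.out.println("Mismatches: "
            + countMismatches(algorithm, 100, 10000));
    }
}
```

Passing a different algorithm name as the first argument repeats the run for another cipher, provided its provider is registered.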

Based on the documentation, I believe the case where the prediction  
will be inaccurate is if input data is put into the cipher but the  
output data is not retrieved, as in this case getOutputSize() will add  
the currently buffered output data to the prediction.  However, this  
is easily avoidable either by creating a new Cipher object for each  
encryption or just making sure to always read everything out of the  
encryption buffer before requesting another prediction.
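
A small sketch of that buffering effect, assuming PBEWithMD5AndDES from the default provider (the class name, salt, iteration count and password are illustrative placeholders). It asks for the same prediction three times: on a fresh cipher, after feeding in data that leaves a partial block behind, and after doFinal() has drained the cipher.

```java
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.PBEParameterSpec;

class BufferedPredictionDemo {
    /** Returns {fresh, withBufferedData, afterDrain} predictions for 4 input bytes. */
    static int[] predictions() {
        try {
            String algorithm = "PBEWithMD5AndDES";
            SecretKey key = SecretKeyFactory.getInstance(algorithm)
                .generateSecret(new PBEKeySpec("password".toCharArray()));
            Cipher cipher = Cipher.getInstance(algorithm);
            // Hypothetical salt and iteration count, for illustration only.
            cipher.init(Cipher.ENCRYPT_MODE, key,
                new PBEParameterSpec(new byte[] {1, 2, 3, 4, 5, 6, 7, 8}, 1000));

            int fresh = cipher.getOutputSize(4);

            // 12 is not a multiple of the 8-byte DES block, so part of this
            // input stays buffered inside the cipher after update().
            cipher.update(new byte[12]);
            int withBufferedData = cipher.getOutputSize(4);

            // doFinal() flushes the buffer and resets the cipher to its
            // initialised state.
            cipher.doFinal();
            int afterDrain = cipher.getOutputSize(4);

            return new int[] {fresh, withBufferedData, afterDrain};
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] p = predictions();
        System.out.println("fresh=" + p[0] + " buffered=" + p[1]
            + " afterDrain=" + p[2]);
    }
}
```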

As to your suggested API, my only additional desire would then be to  
be able to choose on a file-by-file basis whether to use compression  
before the encryption or not, as some of my data files are already  
compressed while others are highly compressible - so I generally  
prefer not to re-compress already compressed large files, while I  
generally do compress the data stream on uncompressed files...

I attach my test code so my method is clearly visible, and in case  
anyone wants to do other experiments with it.  Note that the code, as  
is, explicitly adds the BouncyCastle provider and defaults to testing  
with the PBEWITHSHA256AND256BITAES-CBC-BC algorithm, so the bcprov jar  
needs to be in your classpath.

Cheers,
Jon


> Hi Jonathan,
> Your suggestion is very interesting. I wasn't aware of the  
> Cipher.getOutputSize() method, and so have never attempted to do  
> proper streamed compression of content straight to S3. As you point  
> out, the current implementation transforms input streams into  
> temporary files, performing any requested compression and/or  
> encryption, before uploading the temporary file to S3.
> Ideally, we could use the getOutputSize() method to determine the  
> size of an encrypted file before doing the actual encryption,  
> allowing JetS3t to encrypt the data on-the-fly and bypass the need  
> for temporary files altogether.
> My current thinking is that the best approach would be to provide a  
> utility method in EncryptionUtil that takes a bucket name, object  
> key, and source File as inputs. This method would return an S3Object  
> with a CipherInputStream on top of the file's input stream, and with  
> the Content-Length value pre-set according to the value provided by  
> Cipher.getOutputSize() based on the file's size. The resulting  
> S3Object could be uploaded to S3 directly.
> One potential problem is this line in the API documentation for the  
> method: "The actual output length of the next update or doFinal  
> call may be smaller than the length returned by this method."  
> If the method does not provide an accurate estimate of the final  
> result, this will cause S3 uploads to fail at the very end of the  
> upload process due to an over-run or under-run of data being  
> provided by JetS3t. Do you have much experience with this method and  
> its accuracy? It's certainly worth trying, but may not be workable  
> in all situations.
> If this approach works reliably it could be applied for all the  
> JetS3t tools, allowing them to stream uploads to S3 in cases where  
> encryption is performed as they already do when a file's data is not  
> transformed.
> I'm currently on Christmas holidays and away from my computer so I  
> won't be able to try this out immediately. Perhaps as a preparatory  
> experiment you could try the getOutputSize()  
> method on a number of files to see whether it gives the right result  
> in all, or at least most, cases.
> If the getOutputSize() method does not prove to be reliable enough,  
> another work-around is still available. A file's data could be read  
> through a cipher input stream prior to performing the upload just to  
> find out how much data will be in the result, then the input stream  
> could be reset and re-read for the actual upload. This would save  
> the storage overhead of creating temporary files, at the expense of  
> extra processing to encrypt each file twice (once to count the  
> bytes, a second time to send the bytes to S3). Depending on your  
> circumstances, the double-processing approach may be preferable to  
> the temporary-file approach. I opted to use temporary files in  
> JetS3t as storage is generally more plentiful than CPU resources.
> Hmm, this email has become something of a tome. Let me know if you  
> have further thoughts. I'll do some experimentation when I get back  
> home in a couple of weeks.
> Cheers,
> James
>
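
The two options discussed in the quoted message can be compared with a JDK-only sketch along these lines. It does not use any JetS3t classes; the class and method names, the PBEWithMD5AndDES algorithm, the salt, the iteration count and the password are all assumptions made for illustration. predictedLength() stands in for the value that would be pre-set as Content-Length, and countBytes() is the fallback that reads the whole encrypted stream once just to measure it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.PBEParameterSpec;

class StreamedEncryptionSketch {
    /** Builds a fresh encrypting Cipher (hypothetical key material). */
    static Cipher newCipher() {
        try {
            String algorithm = "PBEWithMD5AndDES";
            SecretKey key = SecretKeyFactory.getInstance(algorithm)
                .generateSecret(new PBEKeySpec("password".toCharArray()));
            Cipher cipher = Cipher.getInstance(algorithm);
            cipher.init(Cipher.ENCRYPT_MODE, key,
                new PBEParameterSpec(new byte[] {1, 2, 3, 4, 5, 6, 7, 8}, 1000));
            return cipher;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Option 1: predict the encrypted length up front, without encrypting. */
    static long predictedLength(int plainLength) {
        return newCipher().getOutputSize(plainLength);
    }

    /** Wraps a plain stream so that reads return encrypted bytes. */
    static InputStream encrypt(InputStream plain) {
        return new CipherInputStream(plain, newCipher());
    }

    /** Option 2: drain the stream once purely to count its bytes. */
    static long countBytes(InputStream in) {
        try (InputStream stream = in) {
            long total = 0;
            byte[] buffer = new byte[8192];
            int read;
            while ((read = stream.read(buffer)) != -1) {
                total += read;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[123457];
        long predicted = predictedLength(data.length);
        long counted = countBytes(encrypt(new ByteArrayInputStream(data)));
        System.out.println("predicted=" + predicted + " counted=" + counted);
    }
}
```

With the counting approach the source has to be opened a second time, and encrypted again, for the real upload.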

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...

Attachment: testCipherOutputSize.java (5K)

Re: EncryptionUtil should expose Cipher.getOutputSize()

by James Murty-2

Hi Jon,

Thanks for testing that. I've done a little rudimentary testing myself and the estimate seems accurate for the default JetS3t encryption as well.

I have added the method getEncryptedOutputSize() to the EncryptionUtil class, which returns the result of getOutputSize() called on a new instance of the chosen cipher. This is probably a better approach than over-complicating things by bringing in file-based utility methods. The method does a little extra work if the input size is large, as Cipher.getOutputSize() only accepts an int input length (and doesn't work well as data sizes approach the maximum integer limit).
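
One way such a long-sized estimate could be computed is sketched below. This is not the JetS3t implementation: the class and method names are hypothetical, and it assumes a padded block cipher mode (such as CBC with PKCS5 padding) in which every whole input block produces exactly one output block, so only the final partial block needs to go through Cipher.getOutputSize(). The cipher passed in is assumed to be freshly initialised, with nothing buffered.

```java
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.PBEParameterSpec;

class LongOutputSize {
    /** Estimates the encrypted size for input lengths beyond the int range. */
    static long encryptedOutputSize(Cipher cipher, long inputLength) {
        int blockSize = cipher.getBlockSize();
        if (blockSize == 0) {
            // Not a block cipher: fall back to the int-only API, which
            // throws ArithmeticException if the length does not fit.
            return cipher.getOutputSize(Math.toIntExact(inputLength));
        }
        // Whole blocks keep their size; ask the cipher only about the tail.
        long tail = inputLength % blockSize;
        return (inputLength - tail) + cipher.getOutputSize((int) tail);
    }

    /** Builds a fresh encrypting Cipher (hypothetical key material). */
    static Cipher newCipher() {
        try {
            String algorithm = "PBEWithMD5AndDES";
            SecretKey key = SecretKeyFactory.getInstance(algorithm)
                .generateSecret(new PBEKeySpec("password".toCharArray()));
            Cipher cipher = Cipher.getInstance(algorithm);
            cipher.init(Cipher.ENCRYPT_MODE, key,
                new PBEParameterSpec(new byte[] {1, 2, 3, 4, 5, 6, 7, 8}, 1000));
            return cipher;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        long fiveGigabytes = 5L * 1024 * 1024 * 1024;
        System.out.println("Estimated: "
            + encryptedOutputSize(newCipher(), fiveGigabytes));
    }
}
```

For lengths that do fit in an int, the result can be checked directly against Cipher.getOutputSize() on the same fresh cipher.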

I have done some rudimentary testing of this method with the code below using files from 700MB to 4.5GB and the results were promising:

        // We assume the first parameter is a file name
        EncryptionUtil util = new EncryptionUtil("bodgy");
        java.io.File file = new java.io.File(args[0]);

        // Predict the encrypted size from the plain file length alone
        long estimate = util.getEncryptedOutputSize(file.length());
        System.out.println("=== Estimated: " + estimate);

        // Encrypt the file as a stream and count the bytes actually produced
        long outputSize = 0;
        CipherInputStream cis = util.encrypt(new java.io.BufferedInputStream(
            new java.io.FileInputStream(file)));
        try {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = cis.read(buffer)) != -1) {
                outputSize += read;
            }
        } finally {
            // Close the stream even if a read fails part-way through
            cis.close();
        }
        System.out.println("=== Actual: " + outputSize);
        System.out.println("=== Matches? " + (estimate == outputSize));

Let me know if this works for you.

James


On 12/24/07, Jonathan Harlap <jharlap@...> wrote:
> [...]