Reading PDF documents

Reading PDF documents

by Patrick O. Thurman

I have a source PDF file that I want to burst into several other PDF files.
I can burst it using PdfCopy. The problem is that I need to derive the names
of the burst PDF files from the text in the source PDF file.


The source PDF file pages look like the following:

                                Account Statement Report
CLIFTON JONES
                                                        SFA STATION
                                                        PO BOX 13009
                                                        NACOGDOCHES, TX 75962-0001 USA
                                                        XXXX-XXXX-9999-1234

                Posting Date: 08/07/2007 Thru 08/10/2007

Posting Date Transaction Date Description Location Country Original Amount

08/08/2007 08/07/2007 FINANCIAL MANAGEMENT A 999-999-9999, FL UNITED STATES

In the above example, the file name that I need is built from the 8 digits
after the "XXXX-XXXX-". In other words, in the above example the file name
would be "bsr99991234.pdf".
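
The naming rule above can be written down as a small helper. This is only a
sketch: the class and method names (StatementNames, fileNameFor) are made up
for illustration, and it assumes the text of a page is already available as a
plain String, which is the hard part of the question. It also assumes the mask
characters are always "X" or "x".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StatementNames {
    // Masked account number: two masked groups, then two groups of four digits.
    // Assumption: the mask characters are 'X' or 'x'.
    private static final Pattern ACCOUNT =
            Pattern.compile("[Xx]{4}-[Xx]{4}-(\\d{4})-(\\d{4})");

    /** Returns e.g. "bsr99991234.pdf", or null if no masked account number is found. */
    public static String fileNameFor(String pageText) {
        Matcher m = ACCOUNT.matcher(pageText);
        if (!m.find()) {
            return null;
        }
        // Join the two digit groups without the dash between them.
        return "bsr" + m.group(1) + m.group(2) + ".pdf";
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor("PO BOX 13009\nXXXX-XXXX-9999-1234\n"));
    }
}
```

Returning null for a page without an account number leaves it to the caller to
decide what to do with such pages.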

How do I do this??  My code so far is the following:

  private void buttonOpen_mouseClicked(MouseEvent e) {
    Document document = null;
    // Create a file chooser
    final JFileChooser fc = new JFileChooser();
    // In response to a button click:
    if (e.getSource() == buttonOpen) {
      int returnVal = fc.showOpenDialog(basFrame.this);
      if (returnVal == JFileChooser.APPROVE_OPTION) {
        File file = fc.getSelectedFile();
        log.append("Opening: " + file.getName() + "." + newline);
        String outPath = null;
        try {
          String inPath = "C:\\Documents and Settings\\thurmanpatri\\Desktop\\";
          outPath = "C:\\Documents and Settings\\thurmanpatri\\My Documents\\aaapdf\\";
          PdfReader reader = new PdfReader(inPath + file.getName());
          int endPage = reader.getNumberOfPages();
          String fileName = null;
          // Write each page of the source file to its own one-page PDF
          for (int currentPage = 1; currentPage <= endPage; currentPage++) {
            document = new Document(reader.getPageSizeWithRotation(currentPage));
            fileName = "bsr" + currentPage + ".pdf";
            File newPdf = new File(outPath, fileName);
            FileOutputStream fos = new FileOutputStream(newPdf);
            PdfCopy copy = new PdfCopy(document, fos);
            document.open();
            PdfImportedPage page = copy.getImportedPage(reader, currentPage);
            copy.addPage(page);
            log.append(document.toString() + newline);
            document.close();
            fos.close();
          }
        } catch (Exception ex) {
          ex.printStackTrace();
          log.append("9 -- Failed");
        }
        String vRetCode = doFileRename(outPath, log);
        log.append("0 -- Successful");
      } else {
        log.append("Open command cancelled by user." + newline);
      }
    }
  }

This will burst the source PDF file into 152 PDF files of one page each.  I
need to name them properly.

Pat,

Patrick O. Thurman
Stephen F. Austin State University
Information Technology Services
Data Base Administrator
Phone:  (936) 468-1074
Fax:    (936) 468-1117



-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >>  http://get.splunk.com/
_______________________________________________
iText-questions mailing list
iText-questions@...
https://lists.sourceforge.net/lists/listinfo/itext-questions
Buy the iText book: http://itext.ugent.be/itext-in-action/

Re: Reading PDF documents

by Paulo Soares

You need some other program to extract the text.

Paulo

----- Original Message -----
From: "Patrick O. Thurman" <pthurman@...>
To: <itext-questions@...>
Sent: Thursday, August 30, 2007 9:42 PM
Subject: [iText-questions] Reading PDF documents





Re: Reading PDF documents

by Bruno Lowagie (iText)

Paulo Soares wrote:
> You need some other program to extract the text.

That's true; however I have the impression the PDF the OP is
talking about is generated in an automated process. If a tool
similar to iText is used, it might be possible to use a hack
to find out the specific String.

You could try something like this:

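byte[] streamBytes = reader.getPageContent(1);
String contentStream = new String(streamBytes);
String marker = "XXXX-XXXX-";
// skip past the marker itself (10 characters)
int pos = contentStream.indexOf(marker) + marker.length();
// "9999-1234" is 9 characters long
String name = contentStream.substring(pos, pos + 9);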

Of course: if "XXXX-XXXX-" isn't a String recurring
on every page, you'll have a hard time finding the rest
of the String. Also, it heavily depends on the tool
that was used to create the original PDF document
whether or not this hack will work.

Note that I generally don't advise workarounds like
this, because they aren't always watertight.
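
Given those caveats, a more defensive variant of the hack could check that the
marker is really there before cutting out the digits. The following is a
minimal sketch, not a tested solution for the OP's file: the class and method
names (ContentStreamNames, nameFor) are invented for illustration, and it
assumes the account number sits in the content stream as one unbroken literal
string, which depends entirely on the tool that produced the PDF. When the
marker is missing or the characters after it don't look like "9999-1234", it
falls back to the page-number name used in the original loop.

```java
import java.nio.charset.StandardCharsets;

public class ContentStreamNames {
    private static final String MARKER = "XXXX-XXXX-";

    /** Builds a file name from a page's content stream, or falls back to the page number. */
    public static String nameFor(byte[] streamBytes, int pageNumber) {
        // ISO-8859-1 maps every byte to exactly one char, so nothing is lost in decoding.
        String content = new String(streamBytes, StandardCharsets.ISO_8859_1);
        String fallback = "bsr" + pageNumber + ".pdf";

        int start = content.indexOf(MARKER);
        if (start < 0) {
            return fallback; // marker not found on this page
        }
        int pos = start + MARKER.length();
        if (pos + 9 > content.length()) {
            return fallback; // stream ends before the digits
        }
        String tail = content.substring(pos, pos + 9);
        if (!tail.matches("\\d{4}-\\d{4}")) {
            return fallback; // not the expected "9999-1234" shape
        }
        return "bsr" + tail.replace("-", "") + ".pdf";
    }

    public static void main(String[] args) {
        // Hypothetical content stream fragment, for illustration only.
        byte[] sample = "BT /F1 10 Tf (XXXX-XXXX-9999-1234) Tj ET"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(nameFor(sample, 1)); // prints "bsr99991234.pdf"
    }
}
```

In the burst loop, the call would take the place of the line that builds the
name from the page number, for example
fileName = ContentStreamNames.nameFor(reader.getPageContent(currentPage), currentPage);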

br,
Bruno
