« Return to Thread: [jira] Created: (TIKA-244) Missing Header/Footer text for Word'97 documents

[jira] Created: (TIKA-244) Missing Header/Footer text for Word'97 documents

by JIRA jira@apache.org :: Rate this Message:

Reply to Author | View in Thread

Missing Header/Footer text for Word'97 documents
------------------------------------------------

                 Key: TIKA-244
                 URL: https://issues.apache.org/jira/browse/TIKA-244
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.3
            Reporter: Maxim Valyanskiy
         Attachments: tika-patch

Tika output lacks header/footer text for Word'07 document. This patch fixes this problem:

diff -u -r apache-tika-0.3/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java apache-tika-0.3-modified/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java
--- apache-tika-0.3/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java 2009-02-14 03:07:51.000000000 +0300
+++ apache-tika-0.3-modified/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java 2009-06-09 13:24:56.000000000 +0400
@@ -75,9 +75,14 @@
             } else if ("WordDocument".equals(name)) {
                 setType(metadata, "application/msword");
                 WordExtractor extractor = new WordExtractor(filesystem);
+
+                xhtml.element("p", extractor.getHeaderText());
+
                 for (String paragraph : extractor.getParagraphText()) {
                     xhtml.element("p", paragraph);
                 }
+
+                xhtml.element("p", extractor.getFooterText());
             } else if ("PowerPoint Document".equals(name)) {
                 setType(metadata, "application/vnd.ms-powerpoint");
                 PowerPointExtractor extractor =


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

 « Return to Thread: [jira] Created: (TIKA-244) Missing Header/Footer text for Word'97 documents