[jira] Created: (TIKA-285) Update media type registry to the latest httpd mime type database

Update media type registry to the latest httpd mime type database
-----------------------------------------------------------------

Key: TIKA-285
URL: https://issues.apache.org/jira/browse/TIKA-285
Project: Tika
Issue Type: Improvement
Components: mime
Reporter: Jukka Zitting

The MIME type database included in the Apache HTTP Server is one of the more complete and accurate media type and file extension resources out there.

Their magic byte settings don't seem to be as complete as the ones in Tika, but it would be good to check also those settings for extra information.

... and we should contribute any of the recent Tika settings back to httpd where they don't already know of those details.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (TIKA-285) Update media type registry to the latest httpd mime type database

[ https://issues.apache.org/jira/browse/TIKA-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760023#action_12760023 ]

Ken Krugler commented on TIKA-285:
----------------------------------

The "file" command line utility also has a pretty good set of magic byte settings - we'd looked at it when working on Krugle. FWIR, it also has a slightly more sophisticated method for processing magic bytes than what Nutch (and I guess now Tika) has.

One of the issues we'd run into was the need to be able to use a regex against the header bytes to determine true file type, versus fixed offsets/values.

> Update media type registry to the latest httpd mime type database
> -----------------------------------------------------------------
>
> Key: TIKA-285
> URL: https://issues.apache.org/jira/browse/TIKA-285
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Reporter: Jukka Zitting
>
> The MIME type database included in the Apache HTTP Server is one of the more complete and accurate media type and file extension resources out there.
> Their magic byte settings don't seem to be as complete as the ones in Tika, but it would be good to check also those settings for extra information.
> ... and we should contribute any of the recent Tika settings back to httpd where they don't already know of those details.
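The difference Ken describes between fixed offset/value matching and regex matching on the header bytes can be illustrated with a minimal Python sketch. This is not Tika's, Nutch's, or file(1)'s implementation; the rule tables and the `detect` function are hypothetical and exist only to show the two styles side by side.

```python
import re

# Hypothetical rule tables, for illustration only.
# Fixed rules: (byte offset, expected bytes, media type).
FIXED_RULES = [
    (0, b"%PDF-", "application/pdf"),
    (0, b"\x89PNG\r\n\x1a\n", "image/png"),
    (257, b"ustar", "application/x-tar"),
]

# Regex rules: the signature may sit at a variable position,
# for example after leading whitespace.
REGEX_RULES = [
    (re.compile(rb"\A\s*<\?xml\b"), "application/xml"),
    (re.compile(rb"\A#![ \t]*\S*/(?:ba)?sh\b"), "application/x-sh"),
]


def detect(header: bytes) -> str:
    """Return a media type guess for the first bytes of a file."""
    # Fixed offset/value checks: compare a slice against known bytes.
    for offset, magic, media_type in FIXED_RULES:
        if header[offset:offset + len(magic)] == magic:
            return media_type
    # Regex checks: tolerate variable padding before the signature.
    for pattern, media_type in REGEX_RULES:
        if pattern.search(header):
            return media_type
    return "application/octet-stream"
```

A fixed rule cannot recognise an XML declaration preceded by an unknown amount of whitespace, while the regex rule can; the trade-off is that regex rules are typically slower and easier to get subtly wrong.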
[jira] Resolved: (TIKA-285) Update media type registry to the latest httpd mime type database

[ https://issues.apache.org/jira/browse/TIKA-285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-285.
--------------------------------

Resolution: Fixed
Fix Version/s: 0.5
Assignee: Jukka Zitting

Yes, the file(1) command comes with a pretty impressive set of magic byte patterns. I'll file a separate issue for getting those included in Tika.

Meanwhile I've now updated the Tika type registry to contain everything included in the mime.types and magic files in the latest Apache HTTP Server trunk. The summary is pretty impressive:

* The media type registry in Tika was synchronized with the MIME type configuration in the Apache HTTP Server. Tika now knows about 1274 different media types and can detect 672 of those using 927 file extension and 280 magic byte patterns. (TIKA-285)

> Update media type registry to the latest httpd mime type database
> -----------------------------------------------------------------
>
> Key: TIKA-285
> URL: https://issues.apache.org/jira/browse/TIKA-285
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Fix For: 0.5
>
>
> The MIME type database included in the Apache HTTP Server is one of the more complete and accurate media type and file extension resources out there.
> Their magic byte settings don't seem to be as complete as the ones in Tika, but it would be good to check also those settings for extra information.
> ... and we should contribute any of the recent Tika settings back to httpd where they don't already know of those details.
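For readers unfamiliar with the httpd `mime.types` file mentioned above: each active line holds a media type followed by zero or more file extensions, and `#` starts a comment. The following Python sketch shows how such a file could be turned into an extension lookup table. It is an illustration only, not the conversion used for this Tika change; `parse_mime_types` is a hypothetical helper.

```python
def parse_mime_types(text: str) -> dict:
    """Map file extensions to media types from mime.types-style text."""
    ext_to_type = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # First token is the media type, the rest are extensions.
        media_type, *extensions = line.split()
        for ext in extensions:
            # Assumption: the first mapping seen for an extension wins.
            ext_to_type.setdefault(ext.lower(), media_type)
    return ext_to_type
```

Types listed without any extension contribute nothing to the lookup table, which is one reason a registry can know about more media types than it can detect by file name.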