extractor metadata and XML/RDF

by Andreas Harth

Hello,

I'm working on SWSE [1], a Semantic Web Search Engine.  The aim
is to collect arbitrary content from the Web and make the metadata
available for search and query.

Extractor looks like exactly the right tool for extracting metadata
from legacy formats.  However, the resulting metadata are name-value
pairs, which makes post-processing difficult.

Do you have (or are there efforts in that direction) a more formal
way of returning metadata? I can see XML or better RDF fitting there.
I'd like to add some terms from standard ontologies (such as Dublin
Core and Friend of a Friend) to the output, probably using sed
scripts in the beginning if there is currently nothing else available.
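
As a rough illustration of the kind of mapping meant here, the following
Python sketch turns name-value pairs into N-Triples that use Dublin Core
element URIs.  It assumes each pair arrives as one "name - value" line; that
layout, the TYPE_TO_DC table and the function names are hypothetical and not
taken from libextractor.  Only the Dublin Core namespace URI is a fixed fact.

```python
# Dublin Core Metadata Element Set, version 1.1 namespace.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical mapping from extractor type names to Dublin Core elements.
# The names on the left are assumptions made for this sketch.
TYPE_TO_DC = {
    "title": "title",
    "author": "creator",
    "subject": "subject",
    "description": "description",
    "publisher": "publisher",
    "date": "date",
    "format": "format",
    "language": "language",
}


def escape_literal(value):
    """Escape a string for use inside a double-quoted N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r")
                 .replace("\t", "\\t"))


def pairs_to_ntriples(document_uri, lines):
    """Convert 'name - value' lines into a list of N-Triples statements."""
    triples = []
    for line in lines:
        name, sep, value = line.partition(" - ")
        if not sep:
            continue  # not a name-value line, skip it
        element = TYPE_TO_DC.get(name.strip())
        if element is None:
            continue  # no Dublin Core equivalent in this sketch
        triples.append('<%s> <%s%s> "%s" .' % (
            document_uri, DC, element, escape_literal(value.strip())))
    return triples


if __name__ == "__main__":
    sample = ["title - An example document", "split - example"]
    for triple in pairs_to_ntriples("http://example.org/doc", sample):
        print(triple)
```

Types without an entry in the table are dropped silently, which is a choice
made to keep the sketch short rather than a recommendation.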

Regards,
Andreas.

[1] http://swse.org/


_______________________________________________
libextractor mailing list
libextractor@...
http://lists.gnu.org/mailman/listinfo/libextractor

Re: extractor metadata and XML/RDF

by Christian Grothoff

On Monday 09 July 2007 10:07, Andreas Harth wrote:
> Hello,
>
> I'm working on SWSE [1], a Semantic Web Search Engine.  The aim
> is to collect arbitrary content from the Web and make the metadata
> available for search and query.
>
> Extractor looks like exactly the right tool for extracting metadata
> from legacy formats.  However, the resulting metadata are name-value
> pairs, which makes post-processing difficult.

I don't see how it makes post-processing difficult.  It is pretty much the
simplest format possible.  Now, certainly having data in highly standardized
format (such as dates, numbers, etc.) would help certain forms of
post-processing.  However, given that some of the file-formats are a bit
vague in how they encode the data in the first place, I don't see how it
would be possible to always achieve this.

> Do you have (or are there efforts in that direction) a more formal
> way of returning metadata? I can see XML or better RDF fitting there.
> I'd like to add some terms from standard ontologies (such as Dublin
> Core and Friend of a Friend) to the output, probably using sed
> scripts in the beginning if there is currently nothing else available.

The metadata types used by LE were motivated by Dublin Core.  Additional terms
are added as needed by particular formats.  Improvements in the set of
available metadata types are welcome but should be driven by adding plugins or
modifying existing ones to produce better terms, not by just adding terms
that will never be extracted.  I am not aware of any effort to add support
for RDF or XML.

Best regards,

Christian



Re: extractor metadata and XML/RDF

by Christian Grothoff

On Wednesday 11 July 2007 06:00, you wrote:

> Hi,
>
> Christian Grothoff wrote:
> >> Extractor looks like exactly the right tool for extracting metadata
> >> from legacy formats.  However, the resulting metadata are name-value
> >> pairs, which makes post-processing difficult.
> >
> > I don't see how it makes post-processing difficult.  It is pretty much
> > the simplest format possible.  Now, certainly having data in highly
> > standardized format (such as dates, numbers, etc.) would help certain
> > forms of post-processing.  However, given that some of the file-formats
> > are a bit vague in how they encode the data in the first place, I don't
> > see how it would be possible to always achieve this.
>
> Using a standardised format (even XML) would facilitate parsing,
> e.g. newline handling, Unicode, etc.  You're right about dates and
> numbers, haven't thought about that one.

I think your problems arise from using the "extract" command-line tool, not
from using the library.  When you use the library directly, you get the
metadata as a linked list with type + char*-content. So there are no problems
with newlines in that case.  As for Unicode, internally LE defines that
everything should be UTF-8.  When given to the console (by extract), extract
converts UTF-8 to whatever the locale is (which may result in some losses if
the locale lacks ways to represent certain characters).  But again, this
problem goes away once you use the library and not the command-line tool --
at that point, you do not need to do any parsing.
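
To make the difference concrete, here is a small self-contained Python model
of the idea.  It is not the libextractor API: the Keyword class, walk() and
to_console() are hypothetical names invented for this sketch.  It only shows
why a list of (type, content) items needs no parsing, and how converting
UTF-8 text to a narrower encoding can lose characters.

```python
class Keyword:
    """One metadata item: a type, its text content, and a link to the next item."""

    def __init__(self, keyword_type, content, next_item=None):
        self.keyword_type = keyword_type
        self.content = content
        self.next_item = next_item


def walk(head):
    """Yield (type, content) for every node of the linked list."""
    node = head
    while node is not None:
        yield node.keyword_type, node.content
        node = node.next_item


def to_console(text, encoding):
    """Mimic converting text to a locale encoding, replacing unmappable characters."""
    return text.encode(encoding, errors="replace").decode(encoding)


# A value with an embedded newline stays a single item in the list,
# whereas line-oriented console output would spread it over two lines.
head = Keyword("title", "Line one\nLine two",
               Keyword("author", "J\u00fcrgen"))
items = list(walk(head))

# Lossy conversion: an ASCII-only locale cannot represent the umlaut.
lossy = to_console("J\u00fcrgen", "ascii")

if __name__ == "__main__":
    for keyword_type, content in items:
        print(repr(keyword_type), repr(content))
    print(lossy)
```

The replacement character used for unmappable input is Python's default here;
what extract actually prints in that situation is not stated in this thread.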

> Also, using defined tags (maybe with URIs) helps to understand
> what the data is about (what's template, split, format)?

Template is the template used to create the document (see office products).
Split are keywords generated by splitting metadata (see splitextractor).
Format is the document format.

> At the moment I'm ok with the tool.  We only use the extracted
> title, and don't care too much about Unicode and newlines.  IMHO
> having a more formal metadata format would be extremely helpful
> in the long term though.

I agree that good (verbose) documentation of what the different LE keyword
types mean would be nice.  However, just
saying what each type means is not enough -- we would also have to check all
of the plugins to make sure that the types are used consistently according to
those updated definitions.  If someone wants to do this, I'd be very happy
about it.

Christian

