Writer - Word Frequency?

by Harold Fuchs-6

Is there an extension (or other software) that will produce a word
frequency table in Writer (2.4.1 or 3.x)? Where, please?
Note: I do not mean a word count but a list of the number of times each
word is used in a document.

--
Harold Fuchs
London, England
Please reply *only* to dev@...


Re: Writer - Word Frequency?

by Marcin Miłkowski

Save the document as a text file and run this awk script on it from the
command line (gawk -f <scriptfile> <filename.txt>):

----------
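  # Print a list of word frequencies.
  # For each input line, count every whitespace-separated field.
      {
          for (i = 1; i <= NF; i++)
              freq[$i]++    # $i is the i-th word; $0 would be the whole line
      }

  # After all input is read, print each word and its count.
      END {
          for (word in freq)
              printf "%s\t%d\n", word, freq[word]
      }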

--------------

To get better results you could remove all punctuation with a simple
search and replace before saving as a text file. An extension would be
easy to write, but it would be a nightmare in a language without hash
tables like the ones awk uses.
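
The same counting idea can be sketched outside awk. The Python version below is only an illustration, not something from this thread: the function name `word_frequencies` and the decision to lowercase every word are assumptions made for the example. It strips leading and trailing ASCII punctuation from each whitespace-separated field, which stands in for the manual search and replace step.

```python
import string
from collections import Counter


def word_frequencies(text):
    """Count whitespace-separated words, ignoring surrounding punctuation."""
    freq = Counter()
    for line in text.splitlines():
        for field in line.split():  # same field splitting as awk's $1..$NF
            # string.punctuation is ASCII only; typographic quotes are kept.
            word = field.strip(string.punctuation).lower()
            if word:  # skip fields that consisted only of punctuation
                freq[word] += 1
    return freq


# Small demonstration on an inline sample instead of a saved text file.
sample = "The cat sat. The dog, too!"
for word, count in word_frequencies(sample).most_common():
    print(f"{word}\t{count}")
```

For a real document, the text exported from Writer would be read from a file and passed to the function in place of `sample`.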

Best
Marcin


Harold Fuchs wrote:
> Is there an extension (or other software) that will produce a word
> frequency table in Writer (2.4.1 or 3.x)? Where, please?
> Note: I do not mean a word count but a list of the number of times each
> word is used in  a document.
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@...
For additional commands, e-mail: dev-help@...


Re: Writer - Word Frequency?

by Harold Fuchs-6

Thanks, but it's not exactly what I had in mind. As far as I know,
extensions to OOo can be written in Java, which, again as far as I know,
can handle the associative array you used in your awk example. So, for
someone familiar with the ODF structure and API, writing such an
extension should be quite simple. Or not?

In addition, OOo can already produce a word *count* so it knows what a
"word" is ...

Harold Fuchs
London, England
Please reply *only* to dev@...



On 06/01/2009 02:18, Marcin Miłkowski wrote:

> Save as text file, and run this awk script on it from command line
> (gawk -f <scriptfile> <filename.txt>):


Re: Writer - Word Frequency?

by Marcin Miłkowski

Well, as I'm developing LanguageTool, I know too well that this is not
so trivial, as we painfully found out :( You would have to traverse the
whole document (including tables and footnotes) to find the text. We
didn't find a proper way to do it and were happy to abandon this as soon
as the new API was available. Actually, exporting to a text document and
running a script would be an easier option from the developer's point of
view. Or even using a standalone Java program that parses ODF as an XML file.
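
As a rough illustration of the "parse ODF as XML" route (in Python rather than Java, purely for brevity), the sketch below opens an .odt file as a ZIP archive, reads its `content.xml`, and collects the text under paragraph (`text:p`) and heading (`text:h`) elements. The helper names `_walk` and `odt_text` are made up for this example. It is a simplification: elements such as `text:s`, `text:tab`, and `text:line-break` are not expanded, and a footnote embedded in a paragraph is emitted at the point where it is anchored.

```python
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
BLOCK_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def _walk(element, parts):
    """Append the text under element to parts, breaking at paragraphs and headings."""
    is_block = element.tag in BLOCK_TAGS
    if is_block:
        parts.append("\n")  # keep words from adjacent paragraphs apart
    if element.text:
        parts.append(element.text)
    for child in element:
        _walk(child, parts)
        if child.tail:  # text that follows the child inside this element
            parts.append(child.tail)
    if is_block:
        parts.append("\n")


def odt_text(path_or_file):
    """Return the text of an .odt file, each paragraph or heading on its own line."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    parts = []
    _walk(root, parts)
    return "".join(parts)
```

The returned string can then be fed to any word counting script, the awk one included.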

If you have to create frequency lists very frequently, then maybe it
could make sense to create the extension you describe.
What would the frequency list be used for? I simply cannot see a
realistic usage scenario in a non-scripting environment.

Regards
Marcin



Re: Writer - Word Frequency?

by brandelune


On Wednesday, 7 Jan. 2009, at 08:42, Marcin Miłkowski wrote:

> If you have to create frequency lists very frequently, then maybe it  
> could make some sense to create such an extension that you describe.  
> What would be the use of the frequency list?

Glossary creation. Translators (in general) use that. Such a list
would be created for any and all documents that need to be translated.





Jean-Christophe Helary

------------------------------------
http://mac4translators.blogspot.com/




Re: Writer - Word Frequency?

by Marcin Miłkowski

Jean-Christophe Helary wrote:
>
> On Wednesday, 7 Jan. 2009, at 08:42, Marcin Miłkowski wrote:
>
>> If you have to create frequency lists very frequently, then maybe it
>> could make some sense to create such an extension that you describe.
>> What would be the use of the frequency list?
>
> Glossary creation. Translators (in general) use that. Such a list would
> be created for any and all documents that needs to be translated.

I'd rather extract repeated phrases and word clusters than a single-word
frequency list. You'd also need to filter out stopwords to get
sensible results for that.
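
A minimal Python sketch of both ideas, stopword filtering and word clusters, might look like the following. Everything in it is an assumption made for illustration: the tiny `STOPWORDS` set, the function names, and the use of a simple regular expression in place of real word breaking. Clusters here are plain runs of adjacent words (word n-grams) and are allowed to cross sentence boundaries.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one is language-specific and much longer.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokens(text):
    """Return lowercased word tokens found by a simple regular expression."""
    return re.findall(r"\w+", text.lower())


def content_word_counts(text):
    """Single-word frequencies with stopwords removed."""
    return Counter(w for w in tokens(text) if w not in STOPWORDS)


def cluster_counts(text, size=2):
    """Frequencies of runs of `size` adjacent words."""
    words = tokens(text)
    return Counter(
        " ".join(words[i:i + size]) for i in range(len(words) - size + 1)
    )


def repeated_clusters(text, size=2):
    """Clusters that occur more than once, most frequent first."""
    return [(c, n) for c, n in cluster_counts(text, size).most_common() if n > 1]
```

For glossary work, one would probably also discard clusters that begin or end with a stopword; that refinement is left out to keep the sketch short.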

I think the developer of Anaphraseus could do something like that, as
this is the only CAT system that runs directly under OOo (though it has
problems with 3.x, as I've heard). I suppose he has solved the problem
of traversing the text effectively.

Regards
Marcin



Re: Writer - Word Frequency?

by ge-7

Harold,

In that case you should file an enhancement request
through the standard OOo bug/enhancement channel.

-Eleonora

Re: [lingu-dev] Writer - Word Frequency?
 (Harold Fuchs, Wed Jan  7 00:18:12 2009)
Thanks but it's not exactly what I had in  mind. As far as I know
extensions to OOo can be written in Java which, again as far as I know,
can handle the associative array you used in your awk example. So, for
someone familiar with the ODF structure and API, writing such an
extension should be quite simple. Or ???

In addition, OOo can already produce a word *count* so it knows what a
"word" is ...

Harold Fuchs
London, England
Please reply *only* to dev@...





Re: Writer - Word Frequency?

by Olivier R.

Hi Harold,

The Linguist extension can list all words in a document, list all unrecognized words in a document, calculate the number of sentences, etc.
http://extensions.services.openoffice.org/project/Linguist

I modified the extension to count the number of occurrences of each word:
Linguist-1.2.2-wordscount.oxt

Hope this helps. ;)

Regards,
Olivier R.

Harold Fuchs-6 wrote:
Is there an extension (or other software) that will produce a word
frequency table in Writer (2.4.1 or 3.x)? Where, please?
Note: I do not mean a word count but a list of the number of times each
word is used in  a document.

Re: Writer - Word Frequency?

by Mathias Bauer

Hi Marcin,

Marcin Miłkowski wrote:

> Well, as I'm developing LanguageTool, I know too well that this is not
> so trivial - as we painfully found out :( You would have to traverse the
> whole document (including tables and footnotes) to find the text - we
> didn't find a proper way to do it and were happy to abandon this as soon
> as new API was available.
The new "Proof reading" feature in OOo uses a new API that provides the whole
document in so-called "flat paragraphs". It works on Writer documents
only, but perhaps it is exactly the API you need for your traversal.

This API will give you the whole text of the document, including tables,
text frames, footnotes, etc., separated into paragraphs.

You still need to parse this text yourself to find the word breaks,
but I think you can use the break iterator or any other code that is
able to do that.
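
Once the text arrives paragraph by paragraph, only word breaking and counting remain. The Python sketch below shows that last step under stated assumptions; it is not UNO code and does not call the flat paragraph API. `paragraphs` is any iterable of strings standing in for what that API would deliver, and the regular expression `WORD` is a rough substitute for a locale-aware break iterator.

```python
import re
from collections import Counter

# Letters, optionally joined by an apostrophe or hyphen ("don't", "well-known").
# [^\W\d_] matches any Unicode letter.
WORD = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def count_words(paragraphs):
    """Count lowercased words across an iterable of paragraph strings."""
    freq = Counter()
    for paragraph in paragraphs:
        freq.update(m.group(0).lower() for m in WORD.finditer(paragraph))
    return freq
```

Because each paragraph is tokenized separately, words from a table cell or footnote never merge with words from the neighbouring paragraph.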

Regards,
Mathias

--
Mathias Bauer (mba) - Project Lead OpenOffice.org Writer
OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
Please don't reply to "nospamformba@...".
I use it for the OOo lists and only rarely read other mails sent to it.



Re: Writer - Word Frequency?

by Marcin Miłkowski

Mathias Bauer wrote:

> Hi Marcin,
>
> Marcin Miłkowski wrote:
>
>> Well, as I'm developing LanguageTool, I know too well that this is not
>> so trivial - as we painfully found out :( You would have to traverse the
>> whole document (including tables and footnotes) to find the text - we
>> didn't find a proper way to do it and were happy to abandon this as soon
>> as new API was available.
> The new "Proof reading" in OOo uses a new API that provides the whole
> document in so called "flat paragraphs". It works on Writer documents
> only but perhaps is exactly the API you might need for your traversing.

Yes, I didn't think of using the *new* API for counting words. With the new
API, it's trivially simple, as the new API is quite good :)

Regards
Marcin
