Feature Request for less noisy output

Feature Request for less noisy output

by Bugzilla from udippel@gmail.com

Here we deal with plenty of documents that contain lots of white space.
A common headache is patches (lines) of streaky noise in between valid
text lines.
I have now started to write a filter: I run ocrad with -x, extract the line
numbers from the exported file, and store the numbers of the lines with low
height. Then I split the OCR-ed text into its lines and purge those lines
from the text.
Cumbersome, I thought. And then I had the impression that this might be done
much more easily within ocrad, with an option somewhat like
ocrad -h <height>
that simply suppresses the output of lines with a height of <height> or
lower.
In my humble opinion, the file created with -x should still show everything.
But the text output would be much cleaner of dirt and dust if any 'line' with
a height below a certain threshold were simply dropped when using this option.
I am well aware that this would also drop a straight horizontal line, but
that would not matter in our case. We fight much more with patches and dots
of dirt on the scanner surface, which usually screw up the inter-line white
space by adding dots, dashes and underscores to the text.

Uwe
_______________________________________________
Bug-ocrad mailing list
Bug-ocrad@...
http://lists.gnu.org/mailman/listinfo/bug-ocrad
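
The workaround described in the first message (export with -x, pick out the
low lines, purge them from the OCR-ed text) could be sketched roughly as
below. This is only an illustration, not the poster's actual filter. It
assumes that the exported file contains one record per text line of the form
"line <n> chars <m> height <h>", and that the i-th such record belongs to the
i-th non-blank line of the text output; both points should be checked against
the installed ocrad version. The names line_heights and drop_low_lines are
made up for this sketch.

```python
import re

# Assumed shape of a line record in the file written by 'ocrad -x'.
LINE_RECORD = re.compile(r"^line\s+(\d+)\s+chars\s+(\d+)\s+height\s+(\d+)")


def line_heights(orf_text):
    """Return the height of every line record, in file order."""
    heights = []
    for row in orf_text.splitlines():
        match = LINE_RECORD.match(row)
        if match:
            heights.append(int(match.group(3)))
    return heights


def drop_low_lines(ocr_text, heights, max_noise_height):
    """Drop text lines whose recorded height is max_noise_height or lower.

    Assumption: the i-th non-blank text line belongs to the i-th line
    record. Blank lines are passed through unchanged, and text lines
    without a matching record are kept.
    """
    kept = []
    index = 0
    for text_line in ocr_text.splitlines():
        if not text_line.strip():
            kept.append(text_line)
            continue
        height = heights[index] if index < len(heights) else None
        index += 1
        if height is not None and height <= max_noise_height:
            continue  # treated as scanner dirt
        kept.append(text_line)
    return "\n".join(kept)
```

With the export in orf_text and the recognized text in ocr_text, a call such
as drop_low_lines(ocr_text, line_heights(orf_text), 5) would keep only the
lines higher than 5 pixels. The threshold has to be tuned per scanner and
resolution.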

Re: Feature Request for less noisy output

by Gareth-14

I wrote a little processing script which you might find useful.
See http://www.gbnetwork.co.uk/mailscanner/gbpgmdiff/

On Fri, 2007-05-25 at 15:50, Uwe Dippel wrote:

> [...]




Re: Feature Request for less noisy output

by Bugzilla from ant_diaz@teleline.es

Hello Uwe,

I think I can implement this as a filter in ocrad. Something like
"--filter line_height=MIN[,MAX]", but it will surely need some
reorganization of the code because all currently implemented filters are
character filters, not line filters. Also, the current filters affect the
'-x' output.


Regards,
Antonio.
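
To make the proposed "line_height=MIN[,MAX]" argument concrete, here is a
small sketch of how such a value could be parsed and applied to a line
height. It is purely illustrative: the option is only a proposal in this
thread, the function names are invented, and it assumes that both bounds are
inclusive, which the message does not specify.

```python
def parse_line_height(spec):
    """Parse 'line_height=MIN[,MAX]' into (low, high); high is None if absent."""
    name, _, value = spec.partition("=")
    if name != "line_height" or not value:
        raise ValueError(f"unsupported filter: {spec!r}")
    parts = value.split(",")
    if len(parts) > 2:
        raise ValueError(f"too many values in: {spec!r}")
    low = int(parts[0])
    high = int(parts[1]) if len(parts) == 2 else None
    if low < 0 or (high is not None and high < low):
        raise ValueError(f"invalid range in: {spec!r}")
    return low, high


def keep_line(height, low, high):
    """Return True if a line of this height passes the filter.

    Assumption: MIN and MAX are both inclusive bounds.
    """
    return height >= low and (high is None or height <= high)
```

For example, parse_line_height("line_height=8,40") gives the bounds 8 and 40,
and keep_line(4, 8, 40) then rejects a 4-pixel streak while
keep_line(20, 8, 40) accepts an ordinary text line.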



Re: Feature Request for less noisy output

by Bugzilla from udippel@gmail.com

Antonio Diaz Diaz wrote:

> I think I can implement this as a filter in ocrad. Something like
> "--filter line_height=MIN[,MAX]", but it will surely need some
> reorganization of the code because all currently implemented filters are
> character filters, not line filters.

Yes, that would be useful here. Thanks for looking into it.

> Also the current filters affect the
> '-x' output.

I am no longer so sure about what I wrote with respect to this. What do you,
and what do the others, think? Should -x output all characters or only those
above the limit?

Uwe

