FYI - same books - new files

FYI - same books - new files

by Greg Price-3

Greetings!

To those who downloaded the May 2008 OCR version of
the OS/VS2 MVS Overview manual, this is to let you know
that the page images have been reOCRed with what I think
is a better end result, so you may feel like downloading
the newer file.

To those who downloaded the April 2007 OCR version of
the OS/360 PL/I LRM, this is to let you know that a new
PDF has been cut with JPEG image compression, which
reduces the 25 million bytes to 6.75 million without any
other noticeable changes, so you may deem a new download
also opportune.

Both of these PDFs have been linearized for fast web view,
and can be downloaded from
http://www.prycroft6.com.au/misc/
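
For anyone wanting to confirm that a downloaded copy is linearized, here is a minimal sketch. It relies on the fact that a linearized PDF carries a parameter dictionary containing a /Linearized key within the first 1024 bytes of the file. The function names and the sample bytes are illustrative assumptions, and this is only a quick heuristic rather than a full validation of the file.

```python
def looks_linearized(head: bytes) -> bool:
    """Heuristic check on the first bytes of a PDF.

    A linearized ("fast web view") PDF starts with a linearization
    parameter dictionary that contains a /Linearized key and sits
    within the first 1024 bytes of the file.
    """
    return head.startswith(b"%PDF-") and b"/Linearized" in head[:1024]


def file_looks_linearized(path: str) -> bool:
    """Apply the same heuristic to a file on disk (path is hypothetical)."""
    with open(path, "rb") as f:
        return looks_linearized(f.read(1024))


# Hand-made sample headers, not taken from the real manuals.
sample_linear = b"%PDF-1.4\n1 0 obj\n<< /Linearized 1 /L 6750000 >>\nendobj\n"
sample_plain = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
print(looks_linearized(sample_linear), looks_linearized(sample_plain))
```

A True result only says the marker is present; it does not prove that the rest of the file is laid out correctly for page-at-a-time download.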

Cheers,
Greg




Re: FYI - same books - new files

by halfmeg

> Greg Price wrote:

<snip>

> To those who downloaded the April 2007 OCR version of
> the OS/360 PL/I LRM, this is to let you know that a new
> PDF has been cut with JPEG image compression, which
> reduces the 25 million bytes to 6.75 million without any
> other noticeable changes, so you may deem a new download
> also opportune.
 
> Both of these PDFs have been linearized for fast web view,
> and can be downloaded from
> http://www.prycroft6.com.au/misc/

There were two reasons DjVu was chosen for softlib.org: file size and fast web view.  You mention both of these above, but for PDF files.

My scans of manuals are at 600 DPI, and I am curious how DjVu compares with the method you mention above.  Do you have a reference?  Would you like a test TIFF bundle of a manual?

Phil



Re: Re: FYI - same books - new files

by Greg Price-3

On Mon, 24 Aug 2009 01:46:54 -0000, halfmeg wrote:

>There were 2 reasons DjVu was chosen for softlib.org, size of files and fast web view. You mention
>both these reasons above, but concerning PDF files.

>My scans of manuals are at 600 DPI and I am curious as to the comparison between DjVu and the
>method you mention above. Do you have a reference? Want a test Tiff bundle of a manual?

>Phil

Hi Phil,

I may have waffled on about this before, so apologies
if this is all a repeat.

I am interested in having searchable manuals which look
(on the screen) much like the original.  Ideally some magic
piece of software would digest the scanned images of the
original pages, and be able to reproduce them in character
form in the correct fonts and matching page layout.

Using the text with appropriate display attributes means
that the data can be stored in much less space than
pictures of each page could be.

The first manual I wanted searchable was the PL/I LRM
so I could easily find out which bits of the current PL/I
were in the OS/360 version and which bits weren't.

Although I could OCR the PL/I LRM, the re-rendering
of the recognized text by the package I used was so
unimpressive that I decided the only usable form was
to keep the original page images for the display and
hide the text words behind them.

Hence the original 25 million byte searchable version.
If you just clicked on the link the whole 25 million
bytes had to be downloaded before the first page
was shown.  Yuk.  But the plan was that interested
folks could "Save linked file as...".

With a new release of the OCR package I can now
save the same book in about a quarter of the space
and it looks the same to me, and the first page is
now shown quickly if accessed off the web.  So, I
decided to issue a FYI although I still expect downloads
for subsequent local reference rather than using the
web each time the book is to be accessed.
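
As a quick check on "about a quarter", the sizes quoted at the start of the thread (25 million bytes before, 6.75 million after) can be run through a couple of lines of Python. The variable names are illustrative, and the figures are the rounded ones from the announcement rather than exact file sizes.

```python
# Sizes as quoted in the announcement (rounded, in bytes).
original_bytes = 25_000_000
recompressed_bytes = 6_750_000

# Fraction of the original size that remains, and the fraction saved.
ratio = recompressed_bytes / original_bytes
saving = 1 - ratio

print(f"{ratio:.0%} of the original size, {saving:.0%} saved")
```

That works out to a little over a quarter of the original size.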

The 370 manuals seem to render much better than
the 360 manuals, so the Overview manuals are not
"pictures over text" but just the actual text, meaning
that images (other than non-text figures) are not
stored in the PDF.

I also wanted a system codes manual that was searchable.
I got it from your DjVu hosting and used that DLL (under
Cygwin?) to get the TIFFs which I then OCRed.  The
result is the one on my site which is again just text
and not bitmaps (in the main).
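
The tool used for the DjVu-to-TIFF step is not identified above beyond "that DLL". As one possible way to do the same thing, here is a sketch that assumes the ddjvu decoder from DjVuLibre is installed and on the PATH; the file names and the helper functions are hypothetical.

```python
import subprocess


def ddjvu_tiff_command(djvu_path, tiff_path, page=None):
    """Build a ddjvu command line that decodes a DjVu file to TIFF.

    Assumes DjVuLibre's ddjvu; -format=tiff selects TIFF output and
    -page=N restricts decoding to one page when given.
    """
    cmd = ["ddjvu", "-format=tiff"]
    if page is not None:
        cmd.append(f"-page={page}")
    cmd += [djvu_path, tiff_path]
    return cmd


def export_tiff(djvu_path, tiff_path, page=None):
    """Run the command; raises CalledProcessError if ddjvu fails."""
    subprocess.run(ddjvu_tiff_command(djvu_path, tiff_path, page), check=True)


# Hypothetical file names, shown only to illustrate the command built.
print(" ".join(ddjvu_tiff_command("syscodes.djvu", "syscodes.tiff")))
```

The resulting TIFFs are decoded from the DjVu encoding, so they will not be bit-identical to the original scans.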

I do agree that the DjVu accessing of manuals is
faster than accessing PDFs (i.e. smaller files), but I
don't think the softlib.org books are searchable by
text content, hence my modest effort at OCRing.

If the scan quality is not good or the original book
had flaws I have to spend time editing bitmaps
to reconstruct the text characters, or to make the
pictures black if the scans are in greyscale.  Quite
a tedious process really.

I don't know which manuals are the most important
to have as searchable.  I have Supervisor SPL and
maybe even System Messages on my plate.  After
that I could trawl softlib to see what I think might
be good to get the TIFFs for.  Of course, smaller
manuals take less time to process than larger ones,
but then usually the need for text searching is
proportionally smaller.

And then I got a ton of paper books which aren't
even scanned yet...

So, what's the most important manual to have in
a searchable form?

Cheers,
Greg




Re: FYI - same books - new files

by halfmeg

> Greg Price wrote:

><snip>
 
> I may have waffled on about this before, so apologies
> if this is all a repeat.

A refresher helps, at least for me it does.
 
> I am interested in having searchable manuals which look
> (on the screen) much like the original.  Ideally some magic
> piece of software would digest the scanned images of the
> original pages, and be able to reproduce them in character
> form in the correct fonts and matching page layout.

> Using the text with appropriate display attributes means
> that the data can be stored in much less space than
> pictures of each page could be.
 
> The first manual I wanted searchable was the PL/I LRM
> so I could easily find out which bits of the current PL/I
> were in the OS/360 version and which bits weren't.
 
> Although I could OCR the PL/I LRM the re-rendering
> of the recognized text by the package I used was so
> unimpressive that I decided the only usable form was
> to keep the original page images for the display and
> hiding the text words behind it.
 
> Hence the original 25 million byte searchable version.
> If you just clicked on the link the whole 25 million
> bytes had to be downloaded before the first page
> was shown.  Yuk.  But the plan was that interested
> folks could "Save linked file as...".
 
> With a new release of the OCR package I can now
> save the same book in about a quarter of the space
> and it looks the same to me, and the first page is
> now shown quickly if accessed off the web.  So, I
> decided to issue a FYI although I still expect downloads
> for subsequent local reference rather than using the
> web each time the book is to be accessed.
 
> The 370 manuals seem to render much better than
> the 360 manuals, so the Overview manuals are not
> "pictures over text" but just the actual text meaning
> that images (other than non-text figures) are not
> stored in the PDF.
 
> I also wanted a system codes manual that was searchable.
> I got it from your DjVu hosting and used that DLL (under
> Cygwin?) to get the TIFFs which I then OCRed.  The
> result is the one on my site which is again just text
> and not bitmaps (in the main).
 
> I do agree that the DjVu accessing of manuals is
> faster than accessing PDFs (i.e. smaller files), but I
> don't think the softlib.org books are searchable by
> text content, hence my modest effort at OCRing.

There is a method, which I haven't mastered yet, of using ABBYY FineReader (which I purchased) to OCR the scans and provide the DjVu files with searchable text.

> If the scan quality is not good or the original book
> had flaws I have to spend time editing bitmaps
> to reconstruct the text characters, or to make the
> pictures black if the scans are in greyscale.  Quite
> a tedious process really.
 
> I don't know which manuals are the most important
> to have as searchable.  I have Supervisor SPL and
> maybe even System Messages on my plate.  After
> that I could trawl softlib to see what I think might
> be good to get the TIFFs for.  Of course, smaller
> manuals take less time to process than larger ones,
> but then usually the need for text searching is
> proportionally smaller.

Let me know, on list or off, which TIFFs you might want to work with.  The round trip from TIFF to DjVu and back to TIFF does introduce differences, because the encoding is done in 'lossy' mode.
 
> And then I got a ton of paper books which aren't
> even scanned yet...
 
> So, what's the most important manual to have in
> a searchable form?

Unknown currently.

Phil