Poster: Bri-Elma-NY Date: Aug 22, 2018 10:07pm
Forum: faqs Subject: Re: PDF WITH TEXT now appears, but 99% is not searchable text; large high-res images used to create PDF upload

Thank you for the reply.

I appreciate that there are many search results in the link you've posted, but the PDF WITH TEXT has only a small percentage of searchable text. That is, the vast majority of the pages are mostly image-based. Also, in the link you sent, the search-word "Buffalo" should've garnered many, many more hits, I assume.

I know I'm easily confused, but I still believe there's an issue with the searchable PDF WITH TEXT being mostly images for this document.

https://ia801508.us.archive.org/14/items/1888BuffaloNYIndustrialFairPaper/1888%20Buffalo%20NY%20Industrial%20Fair%20Paper_text.pdf

I've had many people ask me for access to this rather significant Buffalo-NY historical document, and they've asked for a searchable version. I don't have the capability to create a searchable PDF for upload here.

Again, thank you, and apologies for being so overly verbose, wordy, and long-winded.

/Bri in Elma NY USA

Poster: TwoNucker Date: Aug 23, 2018 2:53am
Forum: faqs Subject: Re: PDF WITH TEXT now appears, but 99% is not searchable text; large high-res images used to create PDF upload

Mr Kaplan,
Your search example misses many 'hits' on the word "Buffalo". For instance it shows none between page n39 and n44 yet each page header has that word.
This has been a problem with many uploaded PDFs. But thanks for the help.

A side note is the 1880's seem to be a time when beards were very popular with industrialists.

Poster: Jeff Kaplan Date: Aug 23, 2018 11:36am
Forum: faqs Subject: Re: PDF WITH TEXT now appears, but 99% is not searchable text; large high-res images used to create PDF upload

we can't speak to the text pdf. that is created only because the source pdf that was uploaded has a text layer.

Poster: Bri-Elma-NY Date: Aug 23, 2018 1:51pm
Forum: faqs Subject: There is no text layer in the uploaded PDF

Thank you for the reply. The PDF which was uploaded was completely image-based. JPG files were used to create the PDF using the link below. There should be zero text in the uploaded PDF.
https://smallpdf.com/jpg-to-pdf
The attached image is of the fonts window in the PDF from its Properties window. As can be seen:
#1) There are NO FONTS, and therefore, I assume,
#2) There is NO TEXT LAYER
Thank you in advance for solving this issue of lack of a properly searchable PDF.
/Bri in Elma NY USA
Document being discussed: https://archive.org/details/1888BuffaloNYIndustrialFairPaper


Attachment: 1888_Bflo_Industrial_Fair_Paper_properties_fonts_screen_image.JPG

Poster: Jeff Kaplan Date: Aug 23, 2018 7:17pm
Forum: faqs Subject: Re: There is no text layer in the uploaded PDF

i'll need to consult an engineer. in general, afaik we do not create a text pdf unless there is a text layer. did you scan and create the pdf? and, we would not add a text layer based on the OCR.
This post was modified by Jeff Kaplan on 2018-08-24 02:17:44

Poster: Bri-Elma-NY Date: Aug 23, 2018 9:58pm
Forum: faqs Subject: Does this reply answer the question posed in the last reply?

Thank you for the reply
The uploaded PDF was created using the website link mentioned before (repeated below). The website took the 64 JPG files, and it created the PDF without any problems. That exact same PDF was uploaded herein without any modifications.
https://smallpdf.com/jpg-to-pdf
Given that there are small portions of the PDF WITH TEXT file which contain text, it would seem that the problem is at Archive. That is, if there was some sort of text layer issue, then wouldn't there be absolutely zero searchable text?
If I have not properly answered the question posed, then please rephrase it, and then I will try answering again. As before, thank you in advance for resolving this issue.
/Brian in Elma NY USA
PS, BTW, will someone please explain what a text layer is, and how it affects this issue? Or, could someone add a link to a simple explanation of what a text layer is, and how it affects the OCR process performed on an image-based PDF? Thx!

Poster: Jeff Kaplan Date: Aug 24, 2018 2:26am
Forum: faqs Subject: Re: Does this reply answer the question posed in the last reply?

well, we never modify the originally uploaded file. if you open the text pdf in acrobat you'll see an image load and then a moment later the text will load on top of it. i'll check with an engineer but i believe that indicates there was a text layer in the original, which is why the text pdf is created by our system.
This post was modified by Jeff Kaplan on 2018-08-24 09:26:36

Poster: Bri-Elma-NY Date: Aug 24, 2018 5:21am
Forum: faqs Subject: QED: text which appears in the PDF WITH TEXT file was generated at Archive

Again, thank you for your reply.

The small amount of searchable text which does appear in the PDF WITH TEXT file came from the original JPG whole-page image files.

QED: the text which appears in the PDF WITH TEXT file must have been generated at Archive.

Thank you in advance for fixing this issue.

/Bri in Elma NY USA

Poster: Jeff Kaplan Date: Aug 24, 2018 9:52am
Forum: faqs Subject: Re: QED: text which appears in the PDF WITH TEXT file was generated at Archive

hi. thanks for your patience. here's what i learned. it's the opposite of what i suggested. if there is no text layer in the pdf we create one with a text layer that is based on the ocr. that way it is searchable. but, if ocr is imperfect, which is often the case, then the text search in the textpdf will also be less than perfect. so i learned something here as well. hope this helps.

Poster: Bri-Elma-NY Date: Aug 24, 2018 10:04am
Forum: faqs Subject: Thank you, so what is the solution now that we know more?

Again, thanks for the reply

So, what is the solution ("fix") to this issue now that we know more?

Do I need to do something for the fix? If so, what do I need to do?

Does Archive need to do something for this fix?
If so, what is the fix at Archive?
And, in what timeframe will the fix be implemented at Archive?

Thank you in advance for your detailed reply.

/Bri in Elma NY USA

Poster: Jeff Kaplan Date: Aug 24, 2018 10:13am
Forum: faqs Subject: Re: Thank you, so what is the solution now that we know more?

there is no fix. this is the way it is. the text layer we insert into the pdf we generate will exactly match the abbyy.gz ocr results; that may or may not exactly match the images in the pdf, depending on the accuracy of our ocr.
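
To make the point above concrete, here is a minimal sketch of why a search over an OCR-derived text layer can miss a word that is plainly printed on the page. The page strings are made up for illustration (they are not taken from the actual abbyy.gz output), and `pages_matching` is a hypothetical helper, not part of any Archive tooling.

```python
import re

def pages_matching(ocr_pages, word):
    """Return 1-based page numbers whose OCR text contains `word` (case-insensitive)."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    return [n for n, text in enumerate(ocr_pages, start=1) if pattern.search(text)]

# Hypothetical OCR output for three page headers that all read "BUFFALO" in print.
ocr_pages = [
    "THE BUFFALO INDUSTRIAL FAIR",   # recognized correctly
    "THE BUFPALO INDUSTRIAL FAIR",   # one misread letter
    "THE BU FFALO INDUSTRIAL FAIR",  # stray space inserted by the OCR
]

hits = pages_matching(ocr_pages, "Buffalo")
print(hits)  # [1]
```

Only the first page is found, even though a human reader sees the word on all three. The search is exact against whatever the OCR produced, so every misread character costs a hit.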

Poster: Bri-Elma-NY Date: Aug 24, 2018 4:48pm
Forum: faqs Subject: Ignoring a problem really hard doesn't make it magically go away

Thank you for the reply, but your reply is wholly unacceptable.

There has to be a reason why this particular PDF cannot be properly OCR'ed. That is, what is different about this PDF such that it cannot be properly OCR'ed? To repeat, there has to be a reason that this is happening to this particular PDF and, most likely, to others like it.

Just because the cause of the faulty OCR system is not fully understood is no reason to abandon a search for the true cause. I say that because if you cannot explain why this OCR issue is happening to this PDF, then you cannot fully understand the cause. Likewise, there is no excuse to abandon a search for a solution.

A wise man once said: "ignoring a problem really hard doesn't make it magically go away."

Thank you in advance for doing right by members here by pursuing a solution to the faulty OCR system.

/Bri in Elma NY USA

Poster: StarbriteScanz Date: Aug 24, 2018 7:07pm
Forum: faqs Subject: Re: Ignoring a problem really hard doesn't make it magically go away

Can I offer a possible path to solving this problem, if you don't mind me chipping in that is?

I downloaded the original high-res PDF of this publication and imported it into the OCR software I use. It's ABBYY 9, and while the latest version is currently 14, I find it's pretty solid and always gives good results. The thing is, ABBYY complains about the pages being too high a resolution to handle - in some cases 840dpi, which is a really high figure - and suggests dropping the resolution down to 300dpi.

If you do this and then run the OCR on the file I would say that 95%+ of the entire text content gets auto-recognized/read correctly, and I have included a sample page of output so you can see for yourself.

It MAY be that your very high dpi is throwing off the OCR software on this site, in which case it will be easy to test by reducing the dpi to 300 and resubmitting it. Failing that, can I suggest you get hold of a decent OCR package like ABBYY (mine came free with a scanner), convert the file yourself, and see if you get better results.

If all else fails I can always post the full document I have here to a file hosting site for you to grab but the comparative file size will be larger due to the different compression methods used. Also according to the FAQ you won't be able to replace the auto-generated one here with it. SB

Attachment: Buffalo_Page_16.pdf

Poster: Bri-Elma-NY Date: Aug 24, 2018 7:48pm
Forum: faqs Subject: Thank you thank you thank you! (+!!!!!)

I've assumed all along that this particular document had to have some obvious issue that drove the Archive OCR program bonkers. A couple of years ago I had posted this 1888 document for sharing on Google Drive, but I had to seriously compress it, and ended up with text which couldn't be read readily.

I assume the problem might stem from the fact that these are HUGE newspaper pages (it's a special edition printed on high-quality paper), and they have very-fine print. The firm which performed the scanning used a 400 dpi setting (the maximum that their large-format scanner offers). The original (uncompressed) scans run between 10 and 30 MB. I assumed I needed to compress the document in order to keep the PDF at a manageable size.

I don't have much editing software other than Picasa (which we all know will never be upgraded, nor have all its myriad of issues repaired), and Paint 2D (don't get me started about all the issues with the 3D version).

I don't like the way www.SmallPDF.com compresses PDFs (too fuzzy). So, I used Picasa to compress (well, they call it "Export") the JPG files. I used settings in Picasa's export app of page-size "original size" and "50%" for Image Quality (which seems to be less than their "Minimum" setting).

As mentioned previously, I used the SmallPDF website to create the PDF, which made the PDF's page dimensions a somewhat smaller size than the originals. There are few settings options for creating the PDF at that website.

Because the old newspaper sheets are a unique size (i.e., a non-standard paper size) I had to use SmallPDF.com. Otherwise, my PDF converting software (Wondershare PDF Converter Pro, or CutePDF as a backup) defaults to 8½ by 11 inches.

For all my other documents posted herein, I use MS Word to hold the JPGs for creating the PDF. Believe me, I've tried every possible setting to prevent that page-size change to 8½ by 11 inches. Only preset page sizes in Word allow converting to a PDF page size which is not 8½ by 11. That is, if I use a unique, non-pre-set page size in MS Word, I'll always end up with a PDF with 8½ by 11 inch pages, the apparent default. So, that's why I used an online JPG-to-PDF converter. Am I making sense? If I am that's a first.

Given that I've compressed the JPG files already, and they are at the limit of text fuzziness, I don't see where I can reduce the dpi count. By the way, all the JPG files I started with show 96 dpi, which is confusing (that happens a lot, it really does). So I don't understand why the ABBYY OCR program is seeing a much-higher dpi setting on the PDF pages. Given all the JPG file-compression, it just doesn't seem correct that the PDF pages are ~840 dpi.

To me, it seems like I'm using the wrong means of compressing this PDF, that is, it's not right to compress the JPG files and then create the PDF. Maybe I should just create the PDF with the existing JPG files, and then compress the PDF. But I have no options with respect to fine-tuning that process at SmallPDF.com. That is, I get what I get, and that's all I get.

So, the issue I see is that I have an existing document posted here which won't OCR with the Archive software. I have to create a new document here which works with the Archive OCR, and then ask for the old document to be deleted.

Is there a test section here where I can test PDF uploads to see how well they OCR? I've used test uploads before, but they never OCR-ed.

Thank you in advance for any help offered.

/Brian in Elma NY

PS-BTW, a lot of people have asked for this 1888 special edition as a searchable document, so I need to make this happen. Thx.

Poster: StarbriteScanz Date: Aug 25, 2018 10:45am
Forum: faqs Subject: Re: Thank you thank you thank you! (+!!!!!)

As far as I've experienced there's nothing in that document, at least at the bitmap resolution you uploaded, that is problematic. It has a neat, regimented column layout that makes first-pass area recognition by (at least) my OCR software quite accurate. One thing I did find is that OCR'ing the pages as-is, without changing the dpi, doesn't result in smoother area recognition; quite the opposite really, as there's more background noise that is picked up as false positives.

I ran a second OCR process on your original 692MB PDF but first I extracted all the pages as bitmaps and did some pre-processing on them. I changed the black and white levels to clear up some of the greyness and strengthen the lettering. I then applied a light Gaussian blur followed by a mild sharpen to clean up some of the edges, changed the dpi to 300 and saved them out as JPGs at 50% compression. This gave a total size for all the pages of 530MB.
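
For anyone wondering what changing the "black and white levels" amounts to, the sketch below shows the usual idea in plain Python: pick a black point and a white point, then stretch everything in between across the full 0-255 range. The thresholds and pixel values are hypothetical; the post does not say which software or settings were actually used, so treat this as an illustration of the general technique only.

```python
def adjust_levels(pixels, black_point, white_point):
    """Linearly stretch 8-bit grayscale values so black_point -> 0 and white_point -> 255."""
    if not 0 <= black_point < white_point <= 255:
        raise ValueError("need 0 <= black_point < white_point <= 255")
    span = white_point - black_point
    out = []
    for v in pixels:
        scaled = (v - black_point) * 255 / span
        # Values outside the chosen range are clipped to pure black or pure white.
        out.append(max(0, min(255, round(scaled))))
    return out

# Hypothetical row of pixels: dark-grey ink (60), mid-grey smudge (150), greyish paper (200).
row = [60, 150, 200]
print(adjust_levels(row, black_point=60, white_point=200))  # [0, 164, 255]
```

The ink becomes true black and the paper true white, which is what "clearing up the greyness" looks like numerically. A real tool applies the same mapping to every pixel of every page.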

The JPGs were loaded into ABBYY 9 and auto-recognized, then I manually removed all the graphics, tidied up some of the text boxes, and processed everything. I created two PDFs: one was saved out at 200dpi with 50% JPEG compression for the graphics, at 335MB; the other was text only, no graphics, which came to only 2.5MB. So you can see how accurate the text recognition was, I've attached the text-only PDF here. If you have or can get access to a program that can merge layers then you can add this PDF to an existing one of the same size as a text layer and bypass the whole process, but I don't believe there's any free software that will currently do this.

With regard to the 335MB version, I think it's quite readable on my HD monitor up to a magnification factor of 800%. Reducing the JPG compression level results in more artifacting/blurring at that level, while reducing the dpi (to, say, 72dpi) lowers the resolution of the image regardless of the compression level. It's always a balancing act to produce the smallest sized PDF with the clearest readability, and there is software available that can take a PDF and let you change both the dpi and compression levels, as well as the compression method, until you get the balance you feel comfortable with before saving it off.

Having checked the resolution of the bitmaps extracted above, all my software is telling me that they are 72dpi, so where ABBYY gets 840dpi from I don't know. I can only guess that it's decided the pages are a certain size (perhaps related to the problems you had with forced page sizing) and done the maths to come up with some OTT figure. Then again, at 72dpi, if you printed a page out it would be 73x100 inches, or roughly 6 foot by 8 foot, so something is amiss somewhere in the original file(s).
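
The arithmetic behind those figures can be sketched as follows. A dpi value only relates a pixel count to a physical size, so the same bitmap reports a very different dpi depending on how large the page is declared to be. The pixel dimensions here are back-calculated from the rounded "73x100 inches at 72dpi" figure, so they are an estimate, not a measurement of the actual files.

```python
def print_size_inches(width_px, height_px, dpi):
    """Physical size of a bitmap when printed at the given dpi."""
    return width_px / dpi, height_px / dpi

def page_width_for_dpi(width_px, dpi):
    """Page width (inches) at which a bitmap of width_px would report the given dpi."""
    return width_px / dpi

# Estimated pixel size implied by 73 x 100 inches at 72 dpi.
width_px, height_px = 73 * 72, 100 * 72  # 5256 x 7200 pixels

w72, h72 = print_size_inches(width_px, height_px, 72)
w400, h400 = print_size_inches(width_px, height_px, 400)
print(round(w72, 2), round(h72, 2))    # 73.0 100.0
print(round(w400, 2), round(h400, 2))  # 13.14 18.0
print(round(page_width_for_dpi(width_px, 840), 2))  # 6.26
```

Under these assumptions the same pixels would be about 13 by 18 inches at the 400 dpi the scans were reportedly made at, and a page declared only about 6.3 inches wide would be enough to make software report roughly 840 dpi. That fits the guess about forced page sizing, but it is only arithmetic, not a diagnosis of this particular file.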

While all this does not directly address the problems you are having converting your file, I hope it shows that things are not as bleak as they seem. If you still have the original scans I would suggest batch-changing the dpi of all of them to 300dpi (you can use a free program like IrfanView or ImBatch to do this), creating the PDF yourself again with a free program such as Compulsivecode's 'Image To PDF or XPS', and resubmitting it. Alternatively you can create a CBR file by .RAR compressing the bitmap files, changing the suffix to CBR and uploading that instead. OCR will still be performed on it. I don't know if either approach will invoke the OCR if you upload them to the Test section, but it's worth trying with a small subset of pages just to see if it does and how accurate the output is.
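
As a rough guide to what "changing the dpi to 300" means in pixels, the sketch below assumes the change is a true resample that keeps the physical page size the same. Merely rewriting the dpi tag would leave the pixel count untouched, and the post does not say which of the two a given batch tool does. The 5200 x 7200 example scan is hypothetical.

```python
def resampled_size(width_px, height_px, source_dpi, target_dpi=300):
    """Pixel dimensions after resampling so the physical page size stays the same."""
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("dpi values must be positive")
    scale = target_dpi / source_dpi
    return round(width_px * scale), round(height_px * scale)

# Hypothetical 400 dpi scan of a 13 x 18 inch sheet: 5200 x 7200 pixels.
print(resampled_size(5200, 7200, source_dpi=400))  # (3900, 5400)
```

Going from 400 to 300 dpi keeps three quarters of the pixels along each edge, so a little over half the total pixel count. That is where most of the file-size saving comes from before any JPEG compression is applied.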

SB

Attachment: Amended_Buffalo_Text_Only.pdf

Poster: Bri-Elma-NY Date: Aug 25, 2018 11:53am
Forum: faqs Subject: Thank you thank you thank you! (+!!!!!)

Again, a thousand thanks for the reply.

Some of what you said above (well, OK, a LOT of it) is over my head. I'm just a guy who has some old documents who wants to share them with the world. Mainly I deal with medical issues day-to-day, so this document stuff is only a part-time part-time pastime. I very occasionally write some articles for the local steam-show newsletter, but that's another very-seldom-time pastime.

I tried the post-PDF compression after creating a PDF using the original 400-dpi scans, and sure enough you can't read the text (i.e., it's too compressed). It'd be nice if they offered different compression levels at SmallPDF.com.

I believe you may've hit the head on the nail with the concept of lightening up of the pages. I usually do that on my other documents, but these pages were scanned off-site, and the folks there tweaked the pages post-scan.

So, I lazily left the JPGs alone because I'd been asked repeatedly for this document in more-legible form, so I was shooting for expeditiousness (gee, whut a surprize, I initiallly spelted that werd wrongedly). For the record, the original scans were 400-dpi TIFFs, which I converted using Picasa. Now I can't find the original TIFFs, not that that'd help.

I'll keep trying. Again, thank you for the much-detailed reply.

/Bri in Elma NY USA

Poster: StarbriteScanz Date: Aug 25, 2018 4:04pm
Forum: faqs Subject: Re: Thank you thank you thank you! (+!!!!!)

I'll keep this one short. If you don't want to continue trying with this then I've created a new PDF of this publication and put it on a 30-day hosting site for you to do with what you want. The URL for it is https://ufile.io/js2gp. The PDF is 83MB but it's the same dimensions as the original, 300dpi, and the image quality is a lot better. There are likely a few spelling errors due to the OCR but the whole thing is fully searchable. I think it was said that if you upload it here the included text layer (which BTW is just the OCR'ed text, which can either be invisible, i.e. "below" the page, or visible and overwriting the original page) means it won't be re-processed. Hope this helps you out.

SB

Poster: Bri-Elma-NY Date: Aug 25, 2018 4:34pm
Forum: faqs Subject: Re: Thank you thank you thank you! (+!!!!!)

Thank you, but I believe my 30 days are already up... "file not found -- 404." I'm sorry to be a pest.

/Bri...

Poster: Bri-Elma-NY Date: Aug 25, 2018 4:37pm
Forum: faqs Subject: Re: Thank you thank you thank you! (+!!!!!)

Nevermind... I copy & pasted the text, and took out the period. It seems to be working... to wit: "slow download for free yada yada."

Bri...

Poster: Bri-Elma-NY Date: Aug 25, 2018 4:52pm
Forum: faqs Subject: Re: Thank you thank you thank you! (+!!!!!)

I've attached a screen shot of what I'm attempting. Maybe it'll work.

I tried uploading your PDF to replace the PDF WITH TEXT file.

/Bri...

Attachment: Uploading_the_new_1888_file_screen_shot.JPG

Poster: StarbriteScanz Date: Aug 25, 2018 6:25pm
Forum: faqs Subject: Re: Thank you thank you thank you! (+!!!!!)

I don't think it will let you add your own OCR'ed file, not according to this - https://archive.org/about/faqs.php#1165

The best you can probably achieve is to replace the main PDF file and hope that since it already contains a text layer a new derive won't produce a second OCR'ed file. I think that's what was said by a mod earlier in this Thread?

Poster: Bri-Elma-NY Date: Aug 25, 2018 7:23pm
Forum: faqs Subject: Re: Thank you thank you thank you! (+!!!!!)

Uh, I believe the replacement "took." Check it out. PDF-page 04 is rotated (as in the file you provided), and the text is highlightable (is that even a word?). Link is below.

https://ia801508.us.archive.org/14/items/1888BuffaloNYIndustrialFairPaper/1888%20Buffalo%20NY%20Industrial%20Fair%20Paper_text.pdf

A wise man once said: "Never tell a stupid person he can't do something because he'll somehow, someway, and quite ineptly, make it happen, often with little effort or issue."

But, will it stay that way? Inquiring nerds want to know.

Again, a thousand thanks.

/Bri in Elma NY USA

Poster: Bri-Elma-NY Date: Aug 25, 2018 10:31pm
Forum: faqs Subject: Re: Thank you thank you thank you! (+!!!!!)

If I may mail you a free t-shirt (which advertises my website) as a very-small thanks for the work you performed so kindly, please send shirt size and mailing address to my webmaster email address.

/Bri in Elma NY USA
>webmaster of www.BuffaloPitts.com

Poster: Adeelkhann Date: Oct 2, 2020 1:32am
Forum: faqs Subject: Re: Does this reply answer the question posed in the last reply?

You can also check out this tool when you want to convert JPG to PDF. This is the easiest one I have ever used. https://jpgtopdf.com/

Poster: Bri-Elma-NY Date: Oct 2, 2020 4:14am
Forum: faqs Subject: Thank you for the site link

Thank you. I've since started using a website which seems very similar with respect to features (link below). I just wish they'd both get the fact that nobody wants a margin much of the time, and so they should make "no margin" the default, along with "fit image."

https://www.convert-jpg-to-pdf.net/