Poster: | Bri-Elma-NY | Date: | Aug 22, 2018 10:07pm |
Forum: | faqs | Subject: | Re: PDF WITH TEXT now appears, but 99% is not searchable text; large high-res images used to create PDF upload |
I appreciate that there are many search results in the link you've posted, but the PDF WITH TEXT has only a small percentage of searchable text. That is, the vast majority of the pages are mostly image-based. Also, in the link you sent, the search-word "Buffalo" should've garnered many, many more hits, I assume.
I know I'm easily confused, but I still believe there's an issue with the searchable PDF WITH TEXT being mostly images for this document.
https://ia801508.us.archive.org/14/items/1888BuffaloNYIndustrialFairPaper/1888%20Buffalo%20NY%20Industrial%20Fair%20Paper_text.pdf
I've had many people ask me for access to this rather significant Buffalo-NY historical document, and they've asked for a searchable version. I don't have the capability to create a searchable PDF for upload here.
Again, thank you, and apologies for being so overly verbose, wordy, and long-winded.
/Bri in Elma NY USA
Poster: | TwoNucker | Date: | Aug 23, 2018 2:53am |
Forum: | faqs | Subject: | Re: PDF WITH TEXT now appears, but 99% is not searchable text; large high-res images used to create PDF upload |
Your search example misses many 'hits' on the word "Buffalo". For instance it shows none between page n39 and n44 yet each page header has that word.
This has been a problem with many uploaded PDFs. But thanks for the help.
As a side note, the 1880s seem to have been a time when beards were very popular with industrialists.
Poster: | Jeff Kaplan | Date: | Aug 23, 2018 11:36am |
Forum: | faqs | Subject: | Re: PDF WITH TEXT now appears, but 99% is not searchable text; large high-res images used to create PDF upload |
Poster: | Bri-Elma-NY | Date: | Aug 23, 2018 1:51pm |
Forum: | faqs | Subject: | There is no text layer in the uploaded PDF |
https://smallpdf.com/jpg-to-pdf
The attached image is of the fonts window in the PDF from its Properties window. As can be seen:
#1) There are NO FONTS, and therefore, I assume,
#2) There is NO TEXT LAYER
Thank you in advance for solving this issue of lack of a properly searchable PDF.
/Bri in Elma NY USA
Document being discussed: https://archive.org/details/1888BuffaloNYIndustrialFairPaper
Attachment: 1888_Bflo_Industrial_Fair_Paper_properties_fonts_screen_image.JPG
Poster: | Jeff Kaplan | Date: | Aug 23, 2018 7:17pm |
Forum: | faqs | Subject: | Re: There is no text layer in the uploaded PDF |
This post was modified by Jeff Kaplan on 2018-08-24 02:17:44
Poster: | Bri-Elma-NY | Date: | Aug 23, 2018 9:58pm |
Forum: | faqs | Subject: | Does this reply answer the question posed in the last reply? |
The uploaded PDF was created using the website link mentioned before (repeated below). The website took the 64 JPG files, and it created the PDF without any problems. That exact same PDF was uploaded herein without any modifications.
https://smallpdf.com/jpg-to-pdf
Given that there are small portions of the PDF WITH TEXT file which contain text, it would seem that the problem is at Archive. That is, if there was some sort of text layer issue, then wouldn't there be absolutely zero searchable text?
If I have not properly answered the question posed, then please rephrase it, and then I will try answering again. As before, thank you in advance for resolving this issue.
/Brian in Elma NY USA
PS, BTW, will someone please explain what a text layer is, and how it affects this issue? Or, could someone add a link to a simple explanation of what a text layer is, and how it affects the OCR process performed on an image-based PDF? Thx!
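For readers following along: a text layer is text stored inside the PDF itself, usually drawn invisibly over the page image, so that a viewer can search and select it. Any text in a PDF has to reference a font, which is why an empty Fonts list suggests there is no text layer. The sketch below is a crude, stdlib-only Python heuristic and not how Archive's derive process checks anything: it scans the raw bytes for a `/Font` name token. The function name and the sample byte strings are hypothetical, and the scan can miss fonts stored in compressed object streams, so a negative result is only a hint.

```python
import re

# Matches the PDF name token /Font when it is not part of a longer name
# such as /FontDescriptor or /FontFile.
_FONT_TOKEN = re.compile(rb"/Font(?![A-Za-z0-9])")


def looks_like_it_has_fonts(pdf_bytes: bytes) -> bool:
    """Crude hint: True if the raw PDF bytes mention a /Font resource."""
    return _FONT_TOKEN.search(pdf_bytes) is not None


# Hypothetical fragments standing in for real files (assumptions for illustration).
image_only = b"%PDF-1.4\n1 0 obj << /Type /XObject /Subtype /Image >> endobj\n"
with_text = b"%PDF-1.4\n1 0 obj << /Type /Font /Subtype /Type1 >> endobj\n"

print(looks_like_it_has_fonts(image_only))
print(looks_like_it_has_fonts(with_text))
```

For a real file, the bytes could be read with `open(path, "rb").read()` and passed to the same function; a PDF viewer's Properties window remains the more reliable check.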
Poster: | Jeff Kaplan | Date: | Aug 24, 2018 2:26am |
Forum: | faqs | Subject: | Re: Does this reply answer the question posed in the last reply? |
This post was modified by Jeff Kaplan on 2018-08-24 09:26:36
Poster: | Bri-Elma-NY | Date: | Aug 24, 2018 5:21am |
Forum: | faqs | Subject: | QED: text which appears in the PDF WITH TEXT file was generated at Archive |
The small amount of searchable text which does appear in the PDF WITH TEXT file came from the original JPG whole-page image files.
QED: the text which appears in the PDF WITH TEXT file must have been generated at Archive.
Thank you in advance for fixing this issue.
/Bri in Elma NY USA
Poster: | Jeff Kaplan | Date: | Aug 24, 2018 9:52am |
Forum: | faqs | Subject: | Re: QED: text which appears in the PDF WITH TEXT file was generated at Archive |
Poster: | Bri-Elma-NY | Date: | Aug 24, 2018 10:04am |
Forum: | faqs | Subject: | Thank you, so what is the solution now that we know more? |
So, what is the solution ("fix") to this issue now that we know more?
Do I need to do something for the fix? If so, what do I need to do?
Does Archive need to do something for this fix?
If so, what is the fix at Archive?
And, in what timeframe will the fix be implemented at Archive?
Thank you in advance for your detailed reply.
/Bri in Elma NY USA
Poster: | Jeff Kaplan | Date: | Aug 24, 2018 10:13am |
Forum: | faqs | Subject: | Re: Thank you, so what is the solution now that we know more? |
Poster: | Bri-Elma-NY | Date: | Aug 24, 2018 4:48pm |
Forum: | faqs | Subject: | Ignoring a problem really hard doesn't make it magically go away |
There has to be a reason why this particular PDF cannot be properly OCR'ed. That is, what is different about this PDF such that it cannot be properly OCR'ed? To repeat, there has to be a reason that this is happening to this particular PDF and, most likely, to others like it.
Just because the cause of the faulty OCR system is not fully understood is no reason to abandon the search for the true cause. I say that because if you cannot explain why this OCR issue is happening to this PDF, then you do not fully understand the cause. Likewise, there is no excuse to abandon the search for a solution.
A wise man once said: "ignoring a problem really hard doesn't make it magically go away."
Thank you in advance for doing right by members here by pursuing a solution to the faulty OCR system.
/Bri in Elma NY USA
Poster: | StarbriteScanz | Date: | Aug 24, 2018 7:07pm |
Forum: | faqs | Subject: | Re: Ignoring a problem really hard doesn't make it magically go away |
I downloaded the original high-res PDF of this publication and imported it into the OCR software I use. It's ABBYY 9, and while the latest version is currently 14, I find it's pretty solid and always gives good results. The thing is, ABBYY complains about the pages being too high a resolution to handle (in some cases 840dpi, which is a really high figure) and suggests dropping the resolution down to 300dpi.
If you do this and then run the OCR on the file, I would say that 95%+ of the entire text content gets auto-recognized/read correctly. I have included a sample page of output so you can see for yourself.
It MAY be that your very high dpi is throwing off the OCR software on this site, in which case it will be easy to test by reducing the dpi to 300 and resubmitting it. Failing that, can I suggest you get hold of a decent OCR package like ABBYY (mine came free with a scanner), convert the file yourself, and see if you get better results?
If all else fails I can always post the full document I have here to a file hosting site for you to grab but the comparative file size will be larger due to the different compression methods used. Also according to the FAQ you won't be able to replace the auto-generated one here with it. SB
Attachment: Buffalo_Page_16.pdf
Poster: | Bri-Elma-NY | Date: | Aug 24, 2018 7:48pm |
Forum: | faqs | Subject: | Thank you thank you thank you! (+!!!!!) |
I assume the problem might stem from the fact that these are HUGE newspaper pages (it's a special edition printed on high-quality paper), and they have very fine print. The firm which performed the scanning used a 400 dpi setting (the maximum that their large-format scanner offers). The original (uncompressed) scans run between 10 and 30 MB. I assumed I needed to compress the document in order to keep the PDF at a manageable size.
I don't have much editing software other than Picasa (which we all know will never be upgraded, nor have all its myriad of issues repaired), and Paint 2D (don't get me started about all the issues with the 3D version).
I don't like the way www.SmallPDF.com compresses PDFs (too fuzzy). So, I used Picasa to compress (well, they call it "Export") the JPG files. I used settings in Picasa's export app of page-size "original size" and "50%" for Image Quality (which seems to be less than their "Minimum" setting).
As mentioned previously, I used the SmallPDF website to create the PDF, which made the PDF's page dimensions somewhat smaller than the originals. There are few settings options for creating the PDF at that website.
Because the old newspaper sheets are a unique size (i.e., a non-standard paper size) I had to use SmallPDF.com. Otherwise, my PDF converting software (Wondershare PDF Converter Pro, or CutePDF as a backup) defaults to 8½ by 11 inches.
For all my other documents posted herein, I use MS Word to hold the JPGs for creating the PDF. Believe me, I've tried every possible setting to prevent that page-size change to 8½ by 11 inches. Only preset page sizes in Word allow converting to a PDF page size which is not 8½ by 11. That is, if I use a unique, non-pre-set page size in MS Word, I'll always end up with a PDF with 8½ by 11 inch pages, the apparent default. So, that's why I used an online JPG-to-PDF converter. Am I making sense? If I am that's a first.
Given that I've compressed the JPG files already, and they are at the limit of text fuzziness, I don't see where I can reduce the dpi count. By the way, all the JPG files I started with show 96 dpi, which is confusing (that happens a lot, it really does). So I don't understand why the ABBYY OCR program is seeing a much higher dpi setting on the PDF pages. Given all the JPG file-compression, it just doesn't seem correct that the PDF pages are ~840 dpi.
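On the 96 dpi puzzle: the dpi shown for a JPG is only a number stored in the file header, and it does not change how many pixels the image has. It may simply be a default written by the exporting software (that is an assumption, not something confirmed in this thread). The Python sketch below reads that number from a JFIF header using only the standard library. The function name and the sample header bytes are hypothetical, and JPGs that carry their resolution only in EXIF data are not handled.

```python
import struct

_UNITS = {0: "none (aspect ratio only)", 1: "dots per inch", 2: "dots per cm"}


def jfif_density(jpeg_bytes: bytes):
    """Return (x_density, y_density, units) from a JFIF header, or None."""
    # Expect the SOI marker followed directly by an APP0 segment.
    if len(jpeg_bytes) < 18 or jpeg_bytes[:4] != b"\xff\xd8\xff\xe0":
        return None
    # The APP0 payload must start with the "JFIF\0" identifier at offset 6.
    if jpeg_bytes[6:11] != b"JFIF\x00":
        return None
    # Offset 13: units byte, then two big-endian 16-bit density values.
    units, x_density, y_density = struct.unpack(">BHH", jpeg_bytes[13:18])
    return x_density, y_density, _UNITS.get(units, "unknown")


# Hypothetical header claiming 96 dpi (made up for illustration).
sample = (
    b"\xff\xd8"          # SOI
    b"\xff\xe0"          # APP0 marker
    b"\x00\x10"          # segment length (16)
    b"JFIF\x00"          # identifier
    b"\x01\x01"          # version 1.01
    b"\x01"              # units: dots per inch
    b"\x00\x60\x00\x60"  # X and Y density = 96
    b"\x00\x00"          # no thumbnail
)
print(jfif_density(sample))  # (96, 96, 'dots per inch')
```

Whatever value is stored there, OCR software is free to ignore it and work out its own figure from the pixel count and the page size recorded in the PDF.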
To me, it seems like I'm using the wrong means of compressing this PDF, that is, it's not right to compress the JPG files and then create the PDF. Maybe I should just create the PDF with the existing JPG files, and then compress the PDF. But I have no options with respect to fine-tuning that process at SmallPDF.com. That is, I get what I get, and that's all I get.
So, the issue I see is that I have an existing document posted here which won't OCR with the Archive software. I have to create a new document here which works with the Archive OCR, and then ask for the old document to be deleted.
Is there a test section here where I can test PDF uploads to see how well they OCR? I've used test uploads before, but they never OCR'ed.
Thank you in advance for any help offered.
/Brian in Elma NY
PS-BTW, a lot of people have asked for this 1888 special edition as a searchable document, so I need to make this happen. Thx.
Poster: | StarbriteScanz | Date: | Aug 25, 2018 10:45am |
Forum: | faqs | Subject: | Re: Thank you thank you thank you! (+!!!!!) |
I ran a second OCR process on your original 692MB PDF, but first I extracted all the pages as bitmaps and did some pre-processing on them. I changed the black and white levels to clear up some of the greyness and strengthen the lettering. I then applied a light Gaussian blur followed by a mild sharpen to clean up some of the edges, changed the dpi to 300 and saved them out as JPGs at 50% compression. This gave a total size for all the pages of 530MB.
The JPGs were loaded into ABBYY 9 and auto-recognized; then I manually removed all the graphics, tidied up some of the text boxes, and processed everything. I created two PDFs: one was saved out at 200dpi with 50% JPEG compression for the graphics, at 335MB; the other was text only, no graphics, which came to only 2.5MB. So you can see how accurate the text recognition was, I've attached the text-only PDF here. If you have or can get access to a program that can merge layers, then you can add this PDF to an existing one of the same size as a text layer and bypass the whole process, but I don't believe there's any free software that will currently do this.
With regard to the 335MB version, I think it's quite readable on my HD monitor up to a magnification factor of 800%. Lowering the JPG quality percentage results in more artifacting/blurring at that level, while reducing the dpi (to say 72dpi) lowers the resolution of the image regardless of the compression level. It's always a balancing act to produce the smallest sized PDF with the clearest readability, and there is software available that can take a PDF and let you change both the dpi and compression levels, as well as the compression method, until you get the balance you feel comfortable with before saving it off.
Having checked the resolution of the bitmaps extracted above, all my software is telling me that they are 72dpi, so where ABBYY gets 840dpi from I don't know. I can only guess that it's decided the pages are a certain size (perhaps related to the problems you had with forced page sizing) and done the maths to come up with some OTT figure. Then again, at 72dpi a printed page would be 73x100 inches, or roughly 6 foot by 8 foot, so something is amiss somewhere in the original file(s).
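The maths in question is just pixels divided by dpi. As a rough Python sketch: the pixel dimensions below are derived from the "73x100 inches at 72dpi" figure rather than measured from the file, and the function name is hypothetical.

```python
def printed_size_inches(width_px, height_px, dpi):
    """Physical size in inches of an image printed at the given dpi."""
    return width_px / dpi, height_px / dpi


# Pixel dimensions implied by 73x100 inches at 72dpi (derived, not measured).
width_px, height_px = 73 * 72, 100 * 72  # 5256 x 7200 pixels

# Same pixels, two different dpi assumptions.
print(printed_size_inches(width_px, height_px, 72))   # (73.0, 100.0)
w, h = printed_size_inches(width_px, height_px, 840)
print(round(w, 2), round(h, 2))                       # 6.26 8.57
```

So the same pixels would come out at about 840dpi if the software believed each page was roughly 6.3 by 8.6 inches. Whether that is what actually happened here is only a guess.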
While all this does not directly address the problems you are having converting your file, I hope it shows that things are not as bleak as they seem. If you still have the original scans, I would suggest batch-changing the dpi of all of them to 300dpi (you can use a free program like Irfanview or ImBatch to do this), creating the PDF yourself again with a free program such as Compulsivecode's 'Image To PDF or XPS', and resubmitting it. Alternatively, you can create a CBR file by .RAR compressing the bitmap files, changing the suffix to CBR and uploading that instead. OCR will still be performed on it. I don't know if either approach will invoke the OCR if you upload to the Test section, but it's worth trying with a small subset of pages just to see if it does and how accurate the output is.
SB
Attachment: Amended_Buffalo_Text_Only.pdf
Poster: | Bri-Elma-NY | Date: | Aug 25, 2018 11:53am |
Forum: | faqs | Subject: | Thank you thank you thank you! (+!!!!!) |
Some of what you said above (well, OK, a LOT of it) is over my head. I'm just a guy who has some old documents and wants to share them with the world. Mainly I deal with medical issues day-to-day, so this document stuff is only a part-time part-time pastime. I very occasionally write some articles for the local steam-show newsletter, but that's another very-seldom-time pastime.
I tried the post-PDF compression after creating a PDF using the original 400-dpi scans, and sure enough you can't read the text (i.e., it's too compressed). It'd be nice if they offered different compression levels at SmallPDF.com.
I believe you may've hit the head on the nail with the concept of lightening up of the pages. I usually do that on my other documents, but these pages were scanned off-site, and the folks there tweaked the pages post-scan.
So, I lazily left the JPGs alone because I'd been asked repeatedly for this document in more-legible form, so I was shooting for expeditiousness (gee, whut a surprize, I initiallly spelted that werd wrongedly). For the record, the original scans were 400-dpi TIFFs, which I converted using Picasa. Now I can't find the original TIFFs, not that that'd help.
I'll keep trying. Again, thank you for the much-detailed reply.
/Bri in Elma NY USA
Poster: | StarbriteScanz | Date: | Aug 25, 2018 4:04pm |
Forum: | faqs | Subject: | Re: Thank you thank you thank you! (+!!!!!) |
SB
Poster: | Bri-Elma-NY | Date: | Aug 25, 2018 4:34pm |
Forum: | faqs | Subject: | Re: Thank you thank you thank you! (+!!!!!) |
/Bri...
Poster: | Bri-Elma-NY | Date: | Aug 25, 2018 4:37pm |
Forum: | faqs | Subject: | Re: Thank you thank you thank you! (+!!!!!) |
Bri...
Poster: | Bri-Elma-NY | Date: | Aug 25, 2018 4:52pm |
Forum: | faqs | Subject: | Re: Thank you thank you thank you! (+!!!!!) |
I tried uploading your PDF to replace the PDF WITH TEXT file.
/Bri...
Attachment: Uploading_the_new_1888_file_screen_shot.JPG
Poster: | StarbriteScanz | Date: | Aug 25, 2018 6:25pm |
Forum: | faqs | Subject: | Re: Thank you thank you thank you! (+!!!!!) |
The best you can probably achieve is to replace the main PDF file and hope that since it already contains a text layer a new derive won't produce a second OCR'ed file. I think that's what was said by a mod earlier in this Thread?
Poster: | Bri-Elma-NY | Date: | Aug 25, 2018 7:23pm |
Forum: | faqs | Subject: | Re: Thank you thank you thank you! (+!!!!!) |
https://ia801508.us.archive.org/14/items/1888BuffaloNYIndustrialFairPaper/1888%20Buffalo%20NY%20Industrial%20Fair%20Paper_text.pdf
A wise man once said: "Never tell a stupid person he can't do something because he'll somehow, someway, and quite ineptly, make it happen, often with little effort or issue."
But, will it stay that way? Inquiring nerds want to know.
Again, a thousand thanks.
/Bri in Elma NY USA
Poster: | Bri-Elma-NY | Date: | Aug 25, 2018 10:31pm |
Forum: | faqs | Subject: | Re: Thank you thank you thank you! (+!!!!!) |
/Bri in Elma NY USA
>webmaster of www.BuffaloPitts.com
Poster: | Adeelkhann | Date: | Oct 2, 2020 1:32am |
Forum: | faqs | Subject: | Re: Does this reply answer the question posed in the last reply? |
Poster: | Bri-Elma-NY | Date: | Oct 2, 2020 4:14am |
Forum: | faqs | Subject: | Thank you for the site link |
https://www.convert-jpg-to-pdf.net/