Simple PDF text extraction

```python
import pdftotext

# Load your PDF
with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# If it's password-protected
with open("secure.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, "secret")

# How many pages?
print(len(pdf))

# Iterate over all the pages
for page in pdf:
    print(page)

# Read some individual pages
print(pdf[0])
print(pdf[1])

# Read all the text into one string
print("\n\n".join(pdf))
```

These instructions assume you're using Python 3 on a recent OS. They may differ for Python 2 or for an older OS.

What's wrong with this PDF to text conversion
by echan00 » Sun 9:17 pm

Xpdf has been the best tool I've found among all the PDF libraries. I've been using the pdftotext tool to convert many PDFs in English and Chinese to TXT format.

XPDF file path: path to the XPDF bin64 folder. A txt file with the PDF content should have been created at the same location.

Just so you understand the scope of what you're getting into, "basic editing" of PDF content is nearly always non-trivial. Page content in PDF is represented by short RPN programs that paint on the page. It's a small language similar to PostScript in semantics, but without looping structures or function definitions (so there is no halting problem).

In a sane world, your text on the page is going to be represented by something like this:

```
BT /F1 12 Tf 72 720 Td (this is a text in a pdf document) Tj ET
```

Translated into something more familiar, that reads as a sequence of calls: BeginText(), then selecting font F1 at size 12, moving to (72, 720), then ShowText("this is a text in a pdf document"), and finally ending the text object.

So in this case, you have to transform this into something like this:

```
BT /F1 12 Tf 72 720 Td (this is a ) Tj /F2 12 Tf (text) Tj /F1 12 Tf
```

with the stream then continuing in /F1 for the rest of the sentence. To do that:

1. You have to extract out the page and all its resources (non-trivial).
2. You have to generate a new page, inserting new resources (you're adding a new font), embedding the font if allowable.
3. Alter the content stream of the page to include your changed content.

And 3 is where you're going to get hung up, because there are an infinite number of ways to generate a page that has the content you describe, and even with a decent library you're going to have a hard time handling maybe 70% of them.

Let me briefly describe why this is as bad as it sounds. There are PDF generation programs (I'm looking at you, troff) that lay all the plain text on a page first, then lay all the italic text, then all the bold text. Some programs want to lay text down very precisely, so if you're lucky, they'll use the TJ operator, which lays out text with specific kerning. If you're not lucky (which is most of the time), they'll instead lay out the text with a set of moves before every single glyph on the page. What about the cases where someone subtly changes the font size for a greater distinction between upper and lower case, or simulates small caps? And what if your text is laid out on a curve or in an unusual orientation (maps, ads)?

This is why, when I wrote the find text tool for Acrobat 1.0, it took me two months of sweat to handle as many of the edge cases as I could. And that is not editing text, it's just trying to find a single word or phrase.

I'm not going to recommend a library for you, sorry. I gave xpdf a brief look over and it's not clear whether it has PDF generation capabilities or is simply a consumer of PDF. PdfLib, which is a commercial product, appears to be able to generate PDF, although it's not clear if it can consume it, but you could certainly get both sides by gluing them together.

If it were me, I would use tools that I've developed and I'd still be a little shy of this task. My library is being used by Atalasoft, the company I work for, to generate PDFs from whole cloth and to do editing within a very limited domain (annotations, document metadata). The hardest part is that we do our very best to hide the complexity of PDF from our customers. In general, our customers want us to understand the spec instead of them and make the rest easy, but tasks like this (redaction is another one) are really hard to do without understanding the depth of the PDF specification.

If you start entering the library world of PDF manipulation, you should start with reading the spec, especially chapter 8 (Graphics) and chapter 9 (Text), and you'll get a better understanding of what you're going to have to do with the library.
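The point about page content being short RPN programs can be made concrete with a small sketch. The Python below is not taken from any PDF library: `TOKEN`, `text_runs` and `painted_text` are hypothetical names, and the tokenizer assumes a tiny subset of the syntax (no hex strings, no escapes or nested parentheses, no dictionaries or inline images). It runs a content stream as a stack machine and records which font painted which piece of text. The last segment of the `SPLIT` stream and the offsets in `REORDERED` are made up for illustration.

```python
import re

# Tokens for a tiny subset of content-stream syntax. Assumed away: hex strings,
# escapes and nested parentheses in strings, dictionaries, inline images.
TOKEN = re.compile(r"""
    \((?P<string>[^()]*)\)        # literal string, e.g. (text)
  | (?P<name>/[^\s()\[\]/]+)      # name, e.g. /F1
  | (?P<number>[-+]?\d*\.?\d+)    # number, e.g. 12 or -20
  | (?P<open>\[)                  # array start
  | (?P<close>\])                 # array end
  | (?P<op>[A-Za-z'"*]+)          # operator, e.g. BT, Tf, Td, Tj, TJ, ET
""", re.VERBOSE)


def text_runs(stream):
    """Run the stream as a stack machine; return [(font, text), ...] in paint order."""
    stack, marks, runs, font = [], [], [], None
    for m in TOKEN.finditer(stream):
        kind, value = m.lastgroup, m.group(m.lastgroup)
        if kind == "number":
            stack.append(float(value))
        elif kind in ("string", "name"):
            stack.append(value)
        elif kind == "open":
            marks.append(len(stack))
        elif kind == "close":
            start = marks.pop()
            stack[start:] = [stack[start:]]  # fold the items into one array operand
        else:  # an operator uses the operands that were pushed before it
            if value == "Tf":    # /Font size Tf
                font = stack[-2]
            elif value == "Tj":  # (string) Tj
                runs.append((font, stack[-1]))
            elif value == "TJ":  # [(str) kern (str) ...] TJ
                runs.append((font, "".join(x for x in stack[-1] if isinstance(x, str))))
            stack.clear()
    return runs


def painted_text(stream):
    """Concatenate the shown strings in the order the stream paints them."""
    return "".join(text for _, text in text_runs(stream))


SANE = "BT /F1 12 Tf 72 720 Td (this is a text in a pdf document) Tj ET"
# The final segment and ET are an assumed completion, not taken from the text above.
SPLIT = ("BT /F1 12 Tf 72 720 Td (this is a ) Tj /F2 12 Tf (text) Tj "
         "/F1 12 Tf ( in a pdf document) Tj ET")
# The same sentence shown with one TJ array and a kerning adjustment.
KERNED = "BT /F1 12 Tf 72 720 Td [(this is a te) -20 (xt in a pdf document)] TJ ET"
# Plain text first, second-font text last; the Td offsets are made up.
REORDERED = ("BT /F1 12 Tf 72 720 Td (this is a ) Tj 80 0 Td ( in a pdf document) Tj "
             "/F2 12 Tf -25 0 Td (text) Tj ET")

for label, stream in [("sane", SANE), ("split", SPLIT),
                      ("kerned", KERNED), ("reordered", REORDERED)]:
    print(label, text_runs(stream))

# A phrase search over paint order misses text that was laid down out of order.
print("a text in" in painted_text(REORDERED))
```

The first three streams paint the same sentence, yet each hands a search or edit tool a different sequence of runs. The reordered one defeats a naive phrase search entirely, which is why a real tool has to work from glyph positions rather than from stream order.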