Converting a PDF file into an HTML or XML file is made easy by a small, useful utility called PDFTOHTML. PDFTOHTML is an Xpdf-based tool that converts PDF files to HTML or XML format. It also supports encrypted files, and it handles images in the PDF file by converting them to PNG image files.
Install PDFTOHTML
PDFTOHTML can be installed on openSUSE 11.0, openSUSE 10.3, and openSUSE 10.2 using the 1-click install feature.
To install PDFTOHTML on openSUSE 11.0, open the following 1-click install link:
http://software.opensuse.org/ymp/home:lrupp/openSUSE_11.0/pdftohtml.ymp
NOTE: On openSUSE 10.2, the 1-click install feature has to be enabled first.
This should download the YaST metapackage file (.ymp) for the 1-click install of PDFTOHTML and automatically open it in the YaST package manager. Click Next on the window showing the repository selection, then click Next again on the window showing the package selection. This should add the required repositories and install pdftohtml along with its dependencies. Click Finish when the installation completes.
This installs the pdftohtml binary under /usr/bin/:
opensuse11:~ # which pdftohtml
/usr/bin/pdftohtml
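If the same check is needed inside a script, something like the following could be used. It is only a small sketch: have_tool is a helper name made up for this example, and it relies on the standard shell built-in `command -v` rather than on anything specific to pdftohtml.

```shell
#!/bin/sh
# have_tool is a hypothetical helper (not part of pdftohtml).
# It returns success (0) if the named program is found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool pdftohtml; then
    echo "pdftohtml found at: $(command -v pdftohtml)"
else
    echo "pdftohtml is not installed or not on PATH" >&2
fi
```

A script can then stop early with a clear message instead of failing halfway through a conversion.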
To convert a PDF file, the syntax is:
pdftohtml <source.pdf> <dest.html>
where source.pdf is the PDF file to be converted and dest.html is the HTML file to create.
For instance,
opensuse11:~ # pdftohtml -i demo1.pdf demo2.html
Page-1
Page-2
Page-3
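When there is more than one PDF to convert, the same command can be wrapped in a loop. The following is a minimal sketch, not a feature of pdftohtml itself: convert_dir and the PDFTOHTML override variable are names invented for this example, and the output file name is simply the input name with .pdf replaced by .html.

```shell
#!/bin/sh
# PDFTOHTML is a hypothetical override variable for this sketch;
# by default it points at the real pdftohtml binary.
PDFTOHTML="${PDFTOHTML:-pdftohtml}"

# convert_dir (hypothetical helper): convert every *.pdf in a directory,
# writing one .html file next to each PDF.
convert_dir() {
    dir="$1"
    for pdf in "$dir"/*.pdf; do
        # If the directory has no PDFs the pattern stays literal; skip it.
        [ -e "$pdf" ] || continue
        html="${pdf%.pdf}.html"
        echo "Converting $pdf -> $html"
        "$PDFTOHTML" "$pdf" "$html" || echo "Failed: $pdf" >&2
    done
}

# Usage: convert_dir /path/to/pdfs
```

A failed conversion is reported on standard error and the loop carries on with the next file, so one bad PDF does not stop the whole batch.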
To convert to an XML file:
opensuse11:~ # pdftohtml -xml demo1.pdf demo2.xml
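The XML output is meant for post-processing, so a quick way to get a feel for it is to count how often each font is used. The sketch below rests on an assumption about the file layout that should be checked against your own output: that each piece of text is written as a `<text ... font="N">` element on its own line. font_usage is a helper name made up for this example.

```shell
#!/bin/sh
# font_usage (hypothetical helper): count <text> elements per font id.
# Assumes one <text ... font="N"> element per line in the XML file.
font_usage() {
    sed -n 's/.*<text [^>]*font="\([0-9][0-9]*\)".*/\1/p' "$1" \
        | sort -n \
        | uniq -c \
        | awk '{ print "font " $2 ": " $1 " text lines" }'
}

# Usage: font_usage demo2.xml
```

A font id that appears on only a handful of lines is often a heading font, which gives a rough starting point for rebuilding structure from the flat output.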
To ignore images in the PDF file:
opensuse11:~ # pdftohtml -i demo1.pdf demo2.html
For more options,
opensuse11:~ # pdftohtml
or
opensuse11:~ # pdftohtml -help
pdftohtml version 0.36 http://pdftohtml.sourceforge.net/, based on Xpdf version 2.02
Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
Copyright 1996-2003 Glyph & Cog, LLC
Usage: pdftohtml [options] <PDF-file> [<html-file> <xml-file>]
-f <int> : first page to convert
-l <int> : last page to convert
-q : don't print any messages or errors
-h : print usage information
-help : print usage information
-p : exchange .pdf links by .html
-c : generate complex document
-i : ignore images
-noframes : generate no frames
-stdout : use standard output
-zoom <fp> : zoom the pdf document (default 1.5)
-xml : output for XML post-processing
-hidden : output hidden text
-nomerge : do not merge paragraphs
-enc <string> : output text encoding name
-dev <string> : output device name for Ghostscript (png16m, jpeg etc)
-v : print copyright and version info
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
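Several of these options can be combined in one call. The following sketch uses only flags from the listing above (-f, -l, -noframes, -upw) to convert a page range into a single frameless HTML file, with an optional user password. convert_range and the PDFTOHTML override variable are names invented for this example.

```shell
#!/bin/sh
# PDFTOHTML is a hypothetical override variable for this sketch;
# by default it points at the real pdftohtml binary.
PDFTOHTML="${PDFTOHTML:-pdftohtml}"

# convert_range (hypothetical helper):
#   convert_range <pdf> <html> <first-page> <last-page> [user-password]
convert_range() {
    pdf="$1"; html="$2"; first="$3"; last="$4"; user_pw="${5:-}"
    # Rebuild the argument list from options shown in the help output.
    set -- -f "$first" -l "$last" -noframes
    if [ -n "$user_pw" ]; then
        set -- "$@" -upw "$user_pw"
    fi
    "$PDFTOHTML" "$@" "$pdf" "$html"
}

# Usage: convert_range demo1.pdf demo2.html 2 5
```

Keep in mind that a password passed on the command line may be visible to other users in the process list, so this is best avoided on shared machines.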
I just want to mention that it is generally unwise to use non-markup formats like PDF as the source for markup-language output like XML or HTML (or LaTeX, …).
PDF does not contain the necessary meta information, so you always get a “flat” version of what you originally had in the master document. Whatever algorithms pdftohtml uses, it can only guess at structural information such as headers. You can compare this to running an OCR program over a scanned text document.
However, it does this job very well. But for my personal taste, it is often more useful to use the clipboard to get text out of a PDF file 😉
More useful is the extraction of pictures, which is indeed one of the most advanced features.
Another important feature is that (HTML) links are preserved too.
pdftohtml also tries to produce an outline of the document. It is able to scan for headers, and it even does a good job of rendering scientific documents that contain a numbered header hierarchy.
—————-
I think such information is essential when introducing a tool like pdftohtml, and it is a little weak (or at least not very careful in the preparation of the article) to just show a few installer screenshots and leave the user alone. A short link to another how-to explaining “one-click install”, plus a concrete real-world example with screenshots, would have been more useful to the reader.
The author should not title the article “How to convert PDF files to HTML or XML files …” when he/she just explains common installation procedures and copy/pastes the console output.
I hope “just publishing” is not the main motto of the author “admin”.
Sorry about this comment, but I really hope the writer will take some of this into consideration for future journalism tasks 😉 More and more often I see articles like this cluttering the forums and how-to sites, and I personally think this is not the way to teach people Linux or anything else.
Nevertheless, greets
P.S.: Installing pdftohtml in Ubuntu: it is part of the “poppler-utils” package.
I do agree that it’s not the best way to convert PDFs, but with a huge file, using the clipboard doesn’t help either… does it?
I write for all levels of audience: not just experts, but also beginners and end users.
Anyway, thanks for your comments. I will keep in mind the valid points you have raised.
Thanks
Thanks.
If you wonder why I need it tagged: it is to be able to get a nice rendering on my (old) Windows Mobile PDA.
I converted a 70-page OpenOffice document to PDF, then ran pdftohtml on it. I got some errors:
Error: Embedded TrueType font is missing a required table (‘fpgm’)
Error: Embedded TrueType font is missing a required table (‘prep’)
The index is off because of the trailing ……. leading up to the page number. Maybe it is because of those TrueType font errors.
Justified lines are also a little bit off, maybe again due to the TrueType font errors.
One very good improvement would be to put NEXT and PREVIOUS buttons at the bottom of each page, get rid of the left navigation “panel”, and replace it with an “index” button.
Also, add a switch to produce a single-page HTML document.
As I wrote above, I am impressed, mainly because it is a 0.36 version.
Looking forward to the 1.0 version. It will be a small revolution for the HTML world.
Michel-Andre