How to convert PDF files to HTML or XML files in openSUSE

On: August 18, 2008

Tagged: encrypted files, linux, opensource, opensuse, package selection, pdf file, pdftohtml, png images, repositories

Converting a PDF file to HTML or XML is easy with a small, useful utility called pdftohtml. pdftohtml is an Xpdf-based tool that converts PDF files to HTML or XML format. It also supports encrypted files and handles images in the PDF file by converting them to PNG image files.

Install PDFTOHTML

pdftohtml can be installed on openSUSE 11.0, openSUSE 10.3 and openSUSE 10.2 using the 1-click install feature.

To install PDFTOHTML,

click for openSUSE 11.0

http://software.opensuse.org/ymp/home:lrupp/openSUSE_11.0/pdftohtml.ymp

click for openSUSE 10.3

click for openSUSE 10.2

NOTE: Click here to enable the 1-click install feature in openSUSE 10.2

This should download the YaST metapackage file (.ymp) for the 1-click install of pdftohtml and open it automatically in the YaST package manager. Click Next in the window showing the repository selection, then click Next in the window showing the package selection. This adds the required repositories and installs pdftohtml with its dependencies. Click Finish when the installation completes.

This installs the pdftohtml binary under /usr/bin/:

opensuse11:~ # which pdftohtml
/usr/bin/pdftohtml
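
The same check can be scripted before running a batch job. The following is only a minimal sketch for a POSIX-style shell; the function name require_pdftohtml is a hypothetical helper and not part of pdftohtml.

```shell
#!/bin/sh
# Hypothetical helper (not part of pdftohtml): succeed only if a
# pdftohtml command can be found by the shell.
require_pdftohtml() {
    if command -v pdftohtml >/dev/null 2>&1; then
        echo "pdftohtml found at: $(command -v pdftohtml)"
    else
        echo "pdftohtml not found; install it first" >&2
        return 1
    fi
}
```

Calling require_pdftohtml at the top of a script lets it stop early with a clear message instead of failing on the first conversion.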

To convert a PDF file the syntax is

pdftohtml <source.pdf> <dest.html>

where source.pdf is the PDF file to be converted and dest.html is the HTML file to be written.

For instance,

opensuse11:~ # pdftohtml -i demo1.pdf demo2.html
Page-1
Page-2
Page-3
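
When there are many files, the basic syntax can be wrapped in a loop. This is a rough sketch rather than a tested recipe: the function name convert_dir is hypothetical, and it assumes each HTML file should be named after its PDF and written next to it.

```shell
#!/bin/sh
# Hypothetical helper: convert every *.pdf in a directory to HTML,
# naming each output <name>.html next to <name>.pdf.
convert_dir() {
    dir=${1:-.}                          # default to the current directory
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue        # glob matched nothing
        html="${pdf%.pdf}.html"          # swap the .pdf suffix for .html
        pdftohtml "$pdf" "$html" || echo "failed: $pdf" >&2
    done
}
```

For example, convert_dir ~/Documents would run pdftohtml once per PDF in that directory and report any file that could not be converted.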

To convert to an XML file

opensuse11:~ # pdftohtml -xml demo1.pdf demo2.xml
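
The XML output is meant for post-processing, so a quick sanity check is to count the pages it contains. The sketch below assumes each page appears in the XML as an opening <page ...> tag; that is an assumption about the output format which this article does not confirm, and count_xml_pages is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count pages in a file written by "pdftohtml -xml".
# ASSUMPTION: every page is represented by an opening <page ...> tag.
count_xml_pages() {
    # -o prints each match on its own line, so wc -l counts the matches
    grep -o '<page[ >]' "$1" | wc -l | tr -d ' '
}
```

Running count_xml_pages demo2.xml and comparing the result with the page count of the PDF gives a simple check that no pages were dropped.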

To ignore images in the PDF file

opensuse11:~ # pdftohtml -i demo1.pdf demo2.html

For more options,

opensuse11:~ # pdftohtml

or

opensuse11:~ # pdftohtml -help
pdftohtml version 0.36 http://pdftohtml.sourceforge.net/, based on Xpdf version 2.02
Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
Copyright 1996-2003 Glyph & Cog, LLC

Usage: pdftohtml [options] <PDF-file> [<html-file> <xml-file>]
-f <int>          : first page to convert
-l <int>          : last page to convert
-q                : don’t print any messages or errors
-h                : print usage information
-help             : print usage information
-p                : exchange .pdf links by .html
-c                : generate complex document
-i                : ignore images
-noframes         : generate no frames
-stdout           : use standard output
-zoom <fp>        : zoom the pdf document (default 1.5)
-xml              : output for XML post-processing
-hidden           : output hidden text
-nomerge          : do not merge paragraphs
-enc <string>     : output text encoding name
-dev <string>     : output device name for Ghostscript (png16m, jpeg etc)
-v                : print copyright and version info
-opw <string>     : owner password (for encrypted files)
-upw <string>     : user password (for encrypted files)
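
Several of these options can be combined in one call. As an illustrative sketch only, the hypothetical helper convert_range below uses -f and -l to select a page range, -noframes to avoid frames, -q to suppress messages, and optionally -upw for an encrypted file. All flags are taken from the help output above; the function name and argument order are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: convert pages FIRST..LAST of a PDF to HTML without
# frames, optionally passing a user password for an encrypted file.
# Usage: convert_range FIRST LAST source.pdf dest.html [user-password]
convert_range() {
    if [ "$#" -lt 4 ]; then
        echo "usage: convert_range FIRST LAST source.pdf dest.html [user-password]" >&2
        return 2
    fi
    first=$1 last=$2 src=$3 dest=$4 upw=${5:-}
    if [ -n "$upw" ]; then
        pdftohtml -q -noframes -f "$first" -l "$last" -upw "$upw" "$src" "$dest"
    else
        pdftohtml -q -noframes -f "$first" -l "$last" "$src" "$dest"
    fi
}
```

For instance, convert_range 2 5 demo1.pdf demo2.html would convert only pages 2 to 5 of demo1.pdf.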


7 Comments

Hi there,

just want to mention that it is generally extremely unwise to use non-markup formats like PDF for creating markup language output like XML or HTML (or LaTeX, …).

PDF does not contain the necessary meta information, so you always get a “flat” output of what you originally had in the master document. Whatever algorithms pdftohtml uses, it can only “guess” at structural information such as headers. You can compare this to running an OCR program over a scanned text document.

However, it does this job very well. But for my personal taste it is often more useful to use the clipboard to get text out of a PDF file 😉

More useful is the extraction of pictures, which is indeed one of the most advanced features.

Another important feature is that (HTML) links are preserved too.

pdftohtml also tries to produce an outline of the document. It is able to scan for headers and even does a good job when it comes to rendering scientific documents containing a numbered header hierarchy.

—————-

I think such information is essential when introducing a tool like pdftohtml, and it is a little bit weak, or at least not very careful in the preparation of the article, to just show a few installer screenshots and leave the user alone. A short link to another howto explaining “one-click-install” and, instead, a concrete real-world sample with screenshots would have been more useful to the reader.

The author should not title the article “How to convert PDF files to HTML or XML files …” when he/she just explains common installation procedures and copy/pastes the console output.

I hope “just publishing” is not the main motto of the author “admin”.

Sorry about this comment, but I really hope the writer can take some of this into consideration for further journalism tasks 😉 More and more often I see articles like this cluttering the forums and howto sites, and I personally think this is not the way to teach people Linux or any other knowledge.

Nevertheless Greets

P.S.: Installing pdftohtml in ubuntu: it is part of the “poppler-utils” package.

Axel

16 years ago

How to convert PDF files to HTML or XML files in openSUSE : HowtoMatrix

16 years ago

Author

@Axel: Thanks very much for your comment. I did think about showing screens of the converted file, but the outputs are files and files only, and there aren’t features that can be represented pictorially.

I do agree that it’s not the best way to convert PDFs, but with a huge file, using the clipboard doesn’t help either… does it?

I look at all levels of audience: not just experts but also beginners and end users.

Anyway, thanks for your comments; I will remember the valid points you have raised.

Thanks

admin

16 years ago

Now all we need is war2pdf, war2html, war2txt, war2ps, etc., for KDE/Konqueror’s web archive format. (That format replaces saving HTML files the way IE6, maybe later versions, and Firefox iirc do it: the HTML is saved as one file and a sub-directory is created for the image files needed to render the page. The Konq web archive format instead compresses the HTML text and saves the image files all in one compressed archive file.)

biff

16 years ago

I believe PDF is a vector-based binary format. Is there a tool to convert it directly to SVG (preferably with embedded images if possible)? It would be even better if the HTML markup contained drawings as embedded SVG.
Thanks.

RK

16 years ago

I have recently been looking for a solution to convert non-tagged PDFs into tagged ones. The only reasonable thing I found was to first convert them to some format OpenOffice can read and then use OpenOffice to export the tagged PDF file. But that also is not so obvious. I’ll try with this tool.
If you wonder why I need it tagged: it is to be able to have a nice rendering on my (old) Windows Mobile PDA.

Olivier

16 years ago

I am impressed. It is almost there.

I converted a 70-page OpenOffice document to PDF, then I used pdftohtml. I got some errors:
Error: Embedded TrueType font is missing a required table (‘fpgm’)
Error: Embedded TrueType font is missing a required table (‘prep’)

The index is off because of the trailing ……. before the page number. Maybe it is because of those TrueType fonts.

The justified lines are also a little bit off, maybe again due to the TrueType font error.

One very good improvement would be to put NEXT and PREVIOUS buttons at the bottom of the page, get rid of the left navigation “panel”, and replace it with an “index” button.

Also, add a switch to make a single-page HTML document.

As I wrote above, I am impressed mainly because it is a 0.36 version.

Looking forward to the 1.0 version. It will be a small revolution for the HTML world.

Michel-Andre

Michel-André

16 years ago


SUSE & openSUSE - Tips,Tricks, Tutorials,How Tos and Troubleshooting suse linux
