Aug 18 2008
 


Converting a PDF file to HTML or XML is easy with a small, useful utility called PDFTOHTML. PDFTOHTML is an Xpdf-based tool that converts PDF files to HTML or XML format. It also supports encrypted files, and it handles images in the PDF file by converting them to PNG image files.


Install PDFTOHTML

PDFTOHTML can be installed on openSUSE 11.0, openSUSE 10.3, and openSUSE 10.2 using the 1-click install feature.

To install PDFTOHTML,

click for openSUSE 11.0

http://software.opensuse.org/ymp/home:lrupp/openSUSE_11.0/pdftohtml.ymp

click for openSUSE 10.3

click for openSUSE 10.2

NOTE: Click here to enable the 1-click install feature in openSUSE 10.2

This should download the YaST metapackage file (.ymp) for the 1-click install of PDFTOHTML and open it automatically in the YaST package manager. Click Next on the window showing the repository selection, then click Next again on the window showing the package selection. This adds the required repositories and installs pdftohtml along with its dependencies. Click Finish when the installation completes.

This installs the pdftohtml binary under /usr/bin/:

opensuse11:~ # which pdftohtml
/usr/bin/pdftohtml

To convert a PDF file, the syntax is

pdftohtml <source.pdf> <dest.html>

where source.pdf is the PDF file to be converted and dest.html is the resulting HTML file.

For instance (the -i option used here skips the images),

opensuse11:~ # pdftohtml -i demo1.pdf demo2.html
Page-1
Page-2
Page-3
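
To convert several files in one go, the same command can be wrapped in a loop. The following is a minimal sketch, not part of pdftohtml itself: html_name_for and convert_all are hypothetical helper names, and it assumes every PDF in the current directory should be converted to an HTML file with the same base name.

```shell
# Derive the output name by swapping the .pdf suffix for .html.
# (html_name_for is a hypothetical helper, not a pdftohtml feature.)
html_name_for() {
    printf '%s\n' "${1%.pdf}.html"
}

# Convert every PDF in the current directory.
# (convert_all is also a hypothetical name chosen for this sketch.)
convert_all() {
    for pdf in ./*.pdf; do
        # If no PDF matches, the pattern stays literal; skip it.
        [ -e "$pdf" ] || continue
        pdftohtml "$pdf" "$(html_name_for "$pdf")"
    done
}
```

After defining the functions, running convert_all in a directory of PDF files calls pdftohtml once per file.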

To convert to an XML file

opensuse11:~ # pdftohtml -xml demo1.pdf demo2.xml
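
In a script that produces both HTML and XML output, the -xml switch can be picked from the destination file name. This is only a sketch under that assumption; mode_flag_for and convert_auto are hypothetical helpers, not pdftohtml features.

```shell
# Print "-xml" when the destination ends in .xml, nothing otherwise.
# (mode_flag_for is a hypothetical helper for this sketch.)
mode_flag_for() {
    case $1 in
        *.xml) printf '%s\n' "-xml" ;;
        *) ;;
    esac
}

# Convert $1 to $2, adding -xml only when the destination calls for it.
convert_auto() {
    flag=$(mode_flag_for "$2")
    # ${flag:+"$flag"} expands to nothing when flag is empty.
    pdftohtml ${flag:+"$flag"} "$1" "$2"
}
```

With these defined, convert_auto demo1.pdf demo2.xml runs the XML conversion shown above, while convert_auto demo1.pdf demo2.html runs the plain HTML one.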

To ignore images in the PDF file

opensuse11:~ # pdftohtml -i demo1.pdf demo2.html

For more options,

opensuse11:~ # pdftohtml

or

opensuse11:~ # pdftohtml -help
pdftohtml version 0.36 http://pdftohtml.sourceforge.net/, based on Xpdf version 2.02
Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
Copyright 1996-2003 Glyph & Cog, LLC

Usage: pdftohtml [options] <PDF-file> [<html-file> <xml-file>]
-f <int>          : first page to convert
-l <int>          : last page to convert
-q                : don't print any messages or errors
-h                : print usage information
-help             : print usage information
-p                : exchange .pdf links by .html
-c                : generate complex document
-i                : ignore images
-noframes         : generate no frames
-stdout           : use standard output
-zoom <fp>        : zoom the pdf document (default 1.5)
-xml              : output for XML post-processing
-hidden           : output hidden text
-nomerge          : do not merge paragraphs
-enc <string>     : output text encoding name
-dev <string>     : output device name for Ghostscript (png16m, jpeg etc)
-v                : print copyright and version info
-opw <string>     : owner password (for encrypted files)
-upw <string>     : user password (for encrypted files)
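
Several of the options listed above can be combined. As a rough sketch, the hypothetical helper below converts only a range of pages (-f and -l), writes a single frameless HTML file (-noframes), and suppresses messages (-q). The DRY_RUN variable is an assumption of this sketch, added so the command line can be inspected without running it.

```shell
# Convert pages FIRST..LAST of SRC into DEST.
# (convert_range and DRY_RUN are hypothetical names for this sketch.)
convert_range() {
    first=$1
    last=$2
    src=$3
    dest=$4
    # Rebuild the positional parameters as the full command line.
    set -- pdftohtml -q -noframes -f "$first" -l "$last" "$src" "$dest"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Only show what would be run.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

For example, convert_range 2 3 demo1.pdf demo2.html would convert only pages 2 and 3 of demo1.pdf; setting DRY_RUN=1 beforehand prints the command instead of executing it.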