Convert pdf to text with xpdf

2011-01-12  Source: original  Category: Java

Using xpdf to extract text from PDF documents that contain Chinese text.

package ch7.xpdf;

import java.io.*;

public class Pdf2Text {
        // The PDF file to convert
        private File pdffile;
        // Directory that contains the converter. The trailing separator is
        // required because getCmd() appends the converter name directly
        private String CONVERTOR_STORED_PATH = "c:\\xpdf\\";
        // Name of the converter executable; the default is pdftotext
        private String CONVERTOR_NAME = "pdftotext";
        // Constructor; the argument is the path of the PDF file
        public Pdf2Text(String pdffile) throws IOException {
                this(new File(pdffile));
        }
        // Constructor; the argument is the PDF file as a File object
        public Pdf2Text(File pdffile) throws IOException {
                this.pdffile = pdffile;
        }
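        // Convert the PDF to a text file next to the source, replacing the
        // .pdf extension with .txt (never write over the PDF itself)
        public void toTextFile() throws IOException {
                String path = pdffile.getPath();
                if (path.toLowerCase().endsWith(".pdf"))
                        path = path.substring(0, path.length() - 4);
                toTextFile(new File(path + ".txt"), true);
        }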
        // Convert the PDF to a text file at the given target path,
        // keeping the layout of the PDF by default
        public void toTextFile(String targetfile) throws IOException {
                toTextFile(new File(targetfile), true);
        }
        // Convert the PDF to a text file. Parameter 1 is the target file path;
        // parameter 2 is true to keep the layout of the PDF
        public void toTextFile(String targetfile, boolean isLayout)
                        throws IOException {
                toTextFile(new File(targetfile), isLayout);
        }

        // Convert the PDF to a text file; the argument is the target file
        public void toTextFile(File targetfile) throws IOException {
                toTextFile(targetfile, true);
        }
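        // Convert the PDF to a text file. Parameter 1 is the target file;
        // parameter 2 is true to keep the layout of the PDF
        public void toTextFile(File targetfile, boolean isLayout)
                        throws IOException {
                String[] cmd = getCmd(targetfile, isLayout);
                Process p = Runtime.getRuntime().exec(cmd);
                // Wait for the converter to exit so the text file is complete
                try {
                        p.waitFor();
                } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new IOException("Interrupted while waiting for " + CONVERTOR_NAME);
                }
        }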
        //   Get the path to PDF Converter
        public String getCONVERTOR_STORED_PATH() {
                return CONVERTOR_STORED_PATH;
        }
        //   Set the path to PDF Converter
        public void setCONVERTOR_STORED_PATH(String path) {
                if (!path.trim().endsWith("\\"))
                        path = path.trim() + "\\";
                this.CONVERTOR_STORED_PATH = path;
        }
        // Build the command line for the converter
        private String[] getCmd(File targetfile, boolean isLayout) {
                // Full path of the converter executable
                String command = CONVERTOR_STORED_PATH + CONVERTOR_NAME;
                // Absolute path of the source PDF file
                String source_absolutePath = pdffile.getAbsolutePath();
                // Absolute path of the output text file
                String target_absolutePath = targetfile.getAbsolutePath();
                // Keep the original physical layout
                String layout = "-layout";
                // Output encoding; GBK is assumed to come from xpdf's
                // Simplified Chinese language support package
                String encoding = "-enc";
                String character = "GBK";
                // Do not print any messages or errors
                String mistake = "-q";
                // Do not insert page breaks between pages
                String nopagebrk = "-nopgbrk";
                // If isLayout is false, leave the -layout option out entirely;
                // an empty string would still be passed on as an argument
                if (!isLayout)
                        return new String[] { command, encoding, character, mistake,
                                        nopagebrk, source_absolutePath, target_absolutePath };
                return new String[] { command, layout, encoding, character, mistake,
                                nopagebrk, source_absolutePath, target_absolutePath };
        }
}
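
The class above builds an argument array for pdftotext and launches it as an external process. The same idea can be sketched with ProcessBuilder, under a few assumptions: the class name PdfToTextRunner and the methods buildCommand and run are hypothetical and not part of the original code, and the pdftotext executable is assumed to be installed at whatever path is passed in. The sketch adds -layout only when it is wanted and returns the converter's exit code to the caller.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original Pdf2Text class.
final class PdfToTextRunner {

    // Builds the pdftotext argument list. Optional flags are added only when
    // needed, so no empty-string argument is ever passed to the converter.
    static List<String> buildCommand(File converter, File pdf, File target,
                                     boolean keepLayout, String encoding) {
        List<String> cmd = new ArrayList<>();
        cmd.add(converter.getPath());
        if (keepLayout) {
            cmd.add("-layout");      // keep the original physical layout
        }
        cmd.add("-enc");             // output text encoding
        cmd.add(encoding);
        cmd.add("-q");               // suppress messages and errors
        cmd.add("-nopgbrk");         // no page-break characters between pages
        cmd.add(pdf.getAbsolutePath());
        cmd.add(target.getAbsolutePath());
        return cmd;
    }

    // Starts the converter, blocks until it exits, and returns its exit code.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.inheritIO();              // avoid blocking on unread process output
        return pb.start().waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; the file names here are placeholders.
        List<String> cmd = buildCommand(new File("pdftotext"),
                new File("test.pdf"), new File("test.txt"), false, "UTF-8");
        System.out.println(String.join(" ", cmd));
    }
}
```

A caller would pass the result of buildCommand to run and treat a non-zero exit code as a failed conversion.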

package ch7.xpdf;

import java.io.*;

public class Pdf2TextTest {
        public static void main(String[] args) {
                try {
                        // Path of the input PDF file
                        Pdf2Text p2t = new Pdf2Text("c:\\test.pdf");
                        // Set the directory that contains the converter
                        p2t.setCONVERTOR_STORED_PATH("c:\\xpdftest\\xpdf");
                        // Convert, writing the text file to this location
                        p2t.toTextFile("c:\\test.txt");
                } catch (Exception e) {
                        e.printStackTrace();
                }
        }
}
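
Because the converter is asked to write GBK, the resulting text file has to be decoded as GBK when it is read back into Java; relying on the platform default charset can garble the Chinese characters. A minimal sketch of that step follows. The class name ConvertedTextReader and its read method are hypothetical, and the Java runtime is assumed to include the GBK charset.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of the original code.
final class ConvertedTextReader {

    // Reads the whole text file and decodes it with the named charset,
    // which should match the value passed to pdftotext via -enc.
    static String read(Path textFile, String encodingName) throws IOException {
        Charset charset = Charset.forName(encodingName);
        return new String(Files.readAllBytes(textFile), charset);
    }

    public static void main(String[] args) throws IOException {
        // Round trip through a temporary file that stands in for converter output.
        Path tmp = Files.createTempFile("pdf2text", ".txt");
        try {
            String sample = "\u4e2d\u6587 PDF";   // two Chinese characters plus ASCII
            Files.write(tmp, sample.getBytes(Charset.forName("GBK")));
            System.out.println(read(tmp, "GBK").equals(sample));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

With the test class above, the call would be read(Paths.get("c:\\test.txt"), "GBK") once the conversion has finished.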