NAME Catmandu::Importer::PDF - Catmandu importer to extract data from one pdf SYNOPSIS #From the command line #Export pdf information, and text $ catmandu convert PDF --file input.pdf to YAML #In a script use Catmandu::Sane; use Catmandu::Importer::PDF; my $importer = Catmandu::Importer::PDF->new( file => "/tmp/input.pdf" ); $importer->each(sub{ my $pdf = $_[0]; #.. }); EXAMPLE OUTPUT IN YAML document: author: ~ creation_date: 1207274644 creator: PDFplus keywords: ~ metadata: ~ modification_date: 1421574847 producer: "Nobody at all" subject: ~ title: "Hello there" version: PDF-1.6 pages: - label: Cover Page height: 878 width: 595 text: "Hello world" INSTALLATION In order to install this package you need the following system packages installed Centos Requires Centos 7 at minimum. Centos 6 only has poppler-glib 0.12. * perl-devel * make * gcc * gcc-c++ * libyaml-devel * libyaml * poppler-glib ( >= 0.16 ) * poppler-glib-devel ( >= 0.16 ) * gobject-introspection-devel Ubuntu Requires Ubuntu 14 at minimum. * libpoppler-glib8 * libpoppler-glib-dev * gobject-introspection * libgirepository1.0-dev NOTES * Catmandu::Importer::PDF returns one record, containing both document information, and page text * Catmandu::Importer::PDFPages returns multiple records, each for each page * Catmandu::Importer::PDFInfo returns one record, containing document information KNOWN ISSUES * Due to a bug in older versions of poppler-glib (bug #94173), the creation_date and modification_date can be returned in local time, instead of utc. This module tries to fix that. * Some versions of Poppler add form feeds and newlines to a text line, while others don't. AUTHORS Nicolas Franck SEE ALSO Catmandu::Importer::PDFInfo, Catmandu::Importer::PDFPages, Catmandu, Catmandu::Importer , Poppler