scan2seed
In the context of the PROMISE project (https://www.kbr.be/fr/projets/projet-promise/), we needed a way to extract URLs from PDF files. This project implements a method to do so.
How to use this project?
The TL;DR version
- Copy OCRed PDF files into a ./data directory inside this repository;
- Run the make extract command;
- Open the file created in the ./output directory.
The complete version
This project aims to be as easy to use as possible. To achieve that, it relies on a few assumptions:
- The tool is developed with Linux as the main target. It should work on other operating systems, but you might need to tweak some things (especially on Windows);
- The requirements for this tool are:
  - pdftotext;
  - perl, and cpanm to install the dependencies of the Perl script;
  - make;
  - find;
- To use the Perl script, you'll need to install its dependencies. I've created a special file for that, so you can run the following command (in the project's directory):

      cpanm --installdeps .

  If you don't use cpanm, you'll need to install the required Perl modules (IO::All and Regexp::Common) the way you usually do on your system.
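Before installing anything, it can help to check which of the tools listed above are already available. The snippet below is a minimal sketch of such a check, not part of this project: the missing_tools function is a hypothetical helper written for illustration, and it only tests whether each command is on the PATH, not which version is installed.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with this project):
# print the name of every given tool that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools named in the requirements list of this README.
missing=$(missing_tools pdftotext perl cpanm make find)

if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
else
    echo "All required tools are available."
fi
```

If cpanm is the only tool reported as missing, you can still proceed by installing IO::All and Regexp::Common with whatever Perl module installer you normally use.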
If those requirements are satisfied, you'll be able to follow this workflow (all commands are typed in a shell whose current directory is this project's directory):
- run make init to create the required directories (./data, ./data/text and ./output);
- copy your OCRed PDF files into the ./data directory; you can use subdirectories if that is easier for you. Don't place any PDF files in the ./data/text directory, as it is ignored during the conversion of PDF files to text;
- run make pdftotext to create a text version of each PDF file found in the ./data directory;
- run make extract to extract the URLs from the text files created previously;
- in the ./output directory, you'll find a file named links-xxx.txt, where xxx is a timestamp. That file contains the extracted links.
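For readers who want a rough idea of what the conversion and extraction steps involve, here is a simplified shell sketch. It is not the project's implementation: the real extraction is done by a Perl script using IO::All and Regexp::Common, whereas this sketch swaps in a plain grep pattern. The function names, the URL pattern and the timestamp format in the usage comment are all assumptions made for illustration.

```shell
#!/bin/sh
# Illustrative sketch only; the project's Makefile and Perl script may differ.

# Convert every PDF under $1 into a text file in $2, skipping $2 itself.
# Assumption: PDF base names are unique, since the output lands in one flat directory.
convert_pdfs() {
    src=$1
    dst=$2
    mkdir -p "$dst"
    find "$src" -path "$dst" -prune -o -type f -name '*.pdf' -print |
    while IFS= read -r pdf; do
        pdftotext "$pdf" "$dst/$(basename "$pdf" .pdf).txt"
    done
}

# Print each distinct http(s) URL found in the text files under $1.
# The pattern is a deliberately simple approximation, not Regexp::Common's URI matcher.
extract_urls() {
    find "$1" -type f -name '*.txt' \
        -exec grep -hoE 'https?://[^[:space:]<>"]+' {} + |
        LC_ALL=C sort -u
}

# Possible usage (the timestamp format is an assumption):
#   convert_pdfs ./data ./data/text
#   extract_urls ./data/text > "./output/links-$(date +%Y%m%d%H%M%S).txt"
```

A simple pattern like this one tends to keep trailing punctuation and to miss URLs broken across lines by OCR, which is one reason a dedicated URL matcher is a better fit for real data.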