# scan2seed

As part of the PROMISE project
(https://www.kbr.be/fr/projets/projet-promise/), we needed a way to
extract URLs from PDF files. This project implements a method to do it.

# How to use this project?

## The TL;DR version

Copy your PDF files into the *./data* directory, then run
`make extract`.

## The complete version

This project aims to be as easy to use as possible. To achieve that,
it relies on a few assumptions:

1. The tool is developed with Linux as its main target. It should
   work on other operating systems, but you might need to tweak a few
   things (especially on Windows);
2. The requirements for this tool are:
   - `pdftotext`;
   - `perl`;
     - and `cpanm`, to install the dependencies of the Perl script;
   - `make`;
   - `find`;
3. To use the Perl script, you'll need to install its dependencies.
   I've created a special file for that, so you can run the following
   command (in the project's directory): `cpanm --installdeps .`.
   If you don't use `cpanm`, you'll need to install the required Perl
   modules (`IO::All` and `Regexp::Common`) the way you usually do it
   on your system.
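
The command-line requirements listed above can be checked before going further. The snippet below is a minimal sketch and is not part of the project: `check_tools` is a hypothetical helper name, and it only verifies that each command is found on the `PATH`, not that its version is suitable.

```shell
# Print the name of every given tool that cannot be found on PATH.
check_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tool names taken from the requirements list above.
missing=$(check_tools pdftotext perl cpanm make find)

if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
else
    echo "All required tools are available."
fi
```

If a tool is reported as missing, install it with your system's package manager before running any `make` target.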

If those requirements are satisfied, you'll be able to follow this
workflow (all commands are typed in a shell whose current directory is
this project's directory):
1. run `make init` to create the required directories (*./data*,
   *./data/text* and *./output*);
2. copy your OCR-processed PDF files into the *./data* directory; you
   can use subdirectories if that is easier for you. Don't place any
   PDF files in the *./data/text* directory, as it is ignored during
   the conversion of PDF files to text files;
3. run `make pdftotext` to create a text version of each PDF file
   found in the *./data* directory;
4. run `make extract` to perform the extraction of URLs from the text
   files created previously;
5. you'll find a file named `links-xxx.txt`, where *xxx* is a
   timestamp. That file contains the extracted links.
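
To give an idea of what steps 3 and 4 involve, here is a rough sketch in plain shell. It is not the project's implementation: the real extraction is done by a Perl script using `Regexp::Common`, whereas this sketch swaps in a simple `grep` pattern. The function names `list_pdfs` and `extract_urls` are hypothetical, and the sketch assumes a `grep` that supports `-o` and `-h` (GNU and BSD versions both do).

```shell
# List every PDF under the given directory (no trailing slash),
# skipping its text/ subdirectory, as described in step 2.
list_pdfs() {
    find "$1" -path "$1/text" -prune -o -type f -name '*.pdf' -print
}

# Print each distinct http(s) URL found in the given text files.
# The pattern is deliberately naive: it stops at whitespace, quotes and
# angle brackets, so trailing punctuation may be kept in a match.
extract_urls() {
    grep -Eoh 'https?://[^[:space:]<>"]+' "$@" | sort -u
}
```

Each path printed by `list_pdfs ./data` could then be converted with `pdftotext file.pdf file.txt`, and the resulting text files passed to `extract_urls`. In practice the `make` targets take care of all this, so the sketch is only useful for understanding or debugging the workflow.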