# scan2seed

In the context of the PROMISE project (https://www.kbr.be/fr/projets/projet-promise/), we needed a way to extract URLs from PDF files. This project implements a method to do it.

# How to use this project?

## The TL;DR version

1. Copy OCR-processed PDF files into a *./data* directory inside this repository;
2. Run the `make extract` command;
3. Open the file created in the *./output* directory.

## The complete version

This project aims to be as easy to use as possible, but to achieve that, it is built on some assumptions:

1. The tool is developed with Linux as the main target. It should work on other operating systems, but you might need to tweak some things (especially on Windows);
2. The requirements for this tool are:
    - `pdftotext`;
    - `perl`;
        - and `cpanm` to install the dependencies of the Perl script;
    - `make`;
    - `find`;
3. To use the Perl script, you'll need to install its dependencies. I've created a special file for that, so you can run the following command (in the project's directory): `cpanm --installdeps .`. If you don't use `cpanm`, you'll need to install the required Perl modules (`IO::All` and `Regexp::Common`) the way you usually do on your system.

If those requirements are satisfied, you'll be able to follow this workflow (all commands are typed in a shell whose current directory is this project):

1. run `make init` to create the required directories (*./data*, *./data/text* and *./output*);
2. copy your OCR-processed PDF files into the *./data* directory; you can use subdirectories if that is easier for you. Don't place any PDF files in the *./data/text* directory, as it is ignored during the conversion of PDF files to text files;
3. run `make pdftotext` to create a text version of each PDF file found in the *./data* directory;
4. run `make extract` to extract the URLs from the text files created in the previous step;
5. in the *./output* directory, you'll find a file named `links-xxx.txt`, where *xxx* is a timestamp.
   That file contains the extracted links.
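
To give an idea of what the workflow above does, here is a rough shell sketch of the two main steps. It is not the project's actual Makefile or Perl script: the function names `convert_pdfs` and `extract_urls`, the URL pattern and the timestamp format in the usage comment are all assumptions made for illustration, and the real script relies on `Regexp::Common` rather than `grep`. The sketch also assumes a `grep` that supports `-o` and `-h` (GNU and BSD `grep` both do).

```shell
#!/bin/sh
# Illustrative sketch only: hypothetical helpers, not the project's real code.

# Convert every PDF found under the data directory to a text file,
# skipping the text output directory itself.
# Note: PDFs with the same base name in different subdirectories
# would overwrite each other in this simplified version.
convert_pdfs() {
    data_dir=$1
    mkdir -p "$data_dir/text"
    find "$data_dir" -path "$data_dir/text" -prune -o -type f -name '*.pdf' -print |
    while IFS= read -r pdf; do
        name=$(basename "$pdf" .pdf)
        pdftotext "$pdf" "$data_dir/text/$name.txt"
    done
}

# Print the unique http(s) URLs found in the text files of a directory.
# The pattern is a simple approximation: it stops at whitespace, quotes
# and angle brackets, so trailing punctuation may be kept in a match.
extract_urls() {
    text_dir=$1
    find "$text_dir" -type f -name '*.txt' \
        -exec grep -ohE 'https?://[^[:space:]<>"]+' {} + |
    LC_ALL=C sort -u
}

# Possible usage (the timestamp format is an assumption):
#   convert_pdfs ./data
#   extract_urls ./data/text > "./output/links-$(date +%Y%m%d%H%M%S).txt"
```

A plain regular expression like this one is easy to read but less precise than a dedicated URL pattern, so expect some manual cleanup of the resulting list.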