
# scan2seed

In the context of the PROMISE project (https://www.kbr.be/fr/projets/projet-promise/), we needed a way to extract URLs from PDF files. This project implements a method to do so.

## How to use this project?

### The TL;DR version

Run `make extract` after copying your PDF files into the `./data` directory.

### The complete version

This project aims to be as easy to use as possible, but to achieve that, it is built on some assumptions:

  1. The tool is developed with Linux as the main target. It should work on other operating systems too, but you might need to tweak some things (especially on Windows);
  2. The requirements for this tool are:
     - pdftotext;
     - perl;
       - and cpanm, to install the dependencies of the Perl script;
     - make;
     - find;
  3. In order to use the Perl script, you'll need to install its dependencies. I've created a file for that, so you can run the following command (in the project's directory): `cpanm --installdeps .`. If you don't use cpanm, you'll need to install the required Perl modules (IO::All and Regexp::Common) the way you usually do on your system.
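Before going further, it can help to confirm that these tools are on your `PATH`. A minimal check (tool names taken from the list above):

```shell
# Report which of the required tools are available on this machine
for tool in pdftotext perl cpanm make find; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```

Any line reporting `MISSING` points to a package you still need to install.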

If those requirements are satisfied, you can follow this workflow (all commands are run in a shell whose current directory is the project root):

  1. run `make init` to create the required directories (`./data`, `./data/text` and `./output`);
  2. copy your OCRed PDF files into the `./data` directory; you can use subdirectories if that is easier for you. Don't place any PDF files in the `./data/text` directory, as it is ignored during the PDF-to-text conversion;
  3. run `make pdftotext` to create the text version of each PDF file found in the `./data` directory;
  4. run `make extract` to extract the URLs from the text files created previously;
  5. you'll find a file named `links-xxx.txt`, where `xxx` is a timestamp, in the `./output` directory. That file contains the extracted links.
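To illustrate what the last two steps produce, here is a rough stand-in using plain shell tools. Note that the project's actual extraction is done by its Perl script (with Regexp::Common's URI patterns), so the grep pattern below is a much cruder approximation, and `report.txt` is an invented example file:

```shell
# Illustration only: approximate the extraction step with grep instead of
# the project's Perl script.
mkdir -p data/text output

# Stand-in for a pdftotext result (the real step 3 converts your PDFs)
printf 'Source: https://www.kbr.be/fr/projets/projet-promise/ (PROMISE)\n' > data/text/report.txt

# Roughly what step 4 does: collect every http(s) URL from the text files
grep -rhoE 'https?://[^[:space:]<>"]+' data/text | sort -u > "output/links-$(date +%s).txt"
```

After running this, the timestamped file in `./output` contains the one URL found in the sample text file.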