# scan2seed

In the context of the PROMISE project
(https://www.kbr.be/fr/projets/projet-promise/), we needed a way to
extract URLs from PDF files. This project implements a method to do so.

# How to use this project?

## The TL;DR version

Run `make extract` after copying your PDF files into the *./data*
directory.

## The complete version

This project aims to be as easy to use as possible. To achieve that, it
relies on a few assumptions:

1. The tool is developed with Linux as its main target. It should
   work on other operating systems, but you might need to tweak some
   things (especially on Windows);
2. The requirements for this tool are:
   - `pdftotext`;
   - `perl`, and `cpanm` to install the dependencies of the Perl script;
   - `make`;
   - `find`;
3. To use the Perl script, you'll need to install its dependencies.
   I've created a dedicated file for that, so you can run the
   following command in the project's directory:
   `cpanm --installdeps .`. If you don't use `cpanm`, you'll need to
   install the required Perl modules (`IO::All` and `Regexp::Common`)
   the way you usually do on your system.

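As a quick sanity check, the following sketch reports which of the tools
listed above are missing from your `PATH`. It is not part of this
project: the `missing_tools` function name is made up for illustration,
and it only checks that each command exists, not which version is
installed.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with this project.
# Print the name of every listed tool that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools named in the requirements list.
missing=$(missing_tools pdftotext perl cpanm make find)

if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing"
else
    echo "All required tools were found."
fi
```

If `cpanm` is the only tool reported as missing, you can still proceed by
installing `IO::All` and `Regexp::Common` with another method.
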
If those requirements are satisfied, you can follow this workflow (all
commands are typed in a shell whose current directory is this
project's directory):

1. run `make init` to create the required directories (*./data*,
   *./data/text* and *./output*);
2. copy your OCR-processed PDF files into the *./data* directory; you
   can use subdirectories if that is easier for you. Don't place any
   PDF files in the *./data/text* directory, as it is ignored during
   the conversion of PDFs to text files;
3. run `make pdftotext` to create a text version of each PDF file
   found in the *./data* directory;
4. run `make extract` to extract the URLs from the text files created
   in the previous step;
5. you'll find a file named `links-xxx.txt`, where *xxx* is a
   timestamp. That file contains the extracted links.
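
Two small checks can be useful around this workflow: previewing which PDF
files will be picked up before the conversion, and locating the newest
links file afterwards. The sketch below is only an illustration under a
few assumptions: the function names `list_input_pdfs` and
`latest_links_file` are hypothetical, the project's Makefile may select
files differently (for example, this sketch only matches a lowercase
`.pdf` extension), and the links file is assumed to be written to
*./output*.

```shell
#!/bin/sh
# Hypothetical helpers; neither is part of this project's Makefile.

# List the PDF files under a data directory, skipping its "text"
# subdirectory. Pass the directory without a trailing slash.
list_input_pdfs() {
    find "$1" -path "$1/text" -prune -o -type f -name '*.pdf' -print
}

# Print the most recently modified links-*.txt file in a directory,
# or nothing if there is none.
latest_links_file() {
    ls -t "$1"/links-*.txt 2>/dev/null | head -n 1
}

# Before the conversion: preview the input files.
if [ -d ./data ]; then
    list_input_pdfs ./data
fi

# After the extraction: show the newest links file.
# Assumption: the links file is written to ./output.
newest=$(latest_links_file ./output)
if [ -n "$newest" ]; then
    echo "Latest links file: $newest"
fi
```

Picking the file by modification time avoids relying on the exact format
of the timestamp in the file name, which is not specified here.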