Updating the README with some documentation

Emmanuel Di Pretoro 2022-10-14 11:15:14 +02:00
parent e4da338f4c
commit 0b6ed3a2c8

@@ -4,3 +4,46 @@ In the context of the PROMISE project
(https://www.kbr.be/fr/projets/projet-promise/), we needed a way to
extract URLs from PDF files. This project implements a method to do so.
# How to use this project
## The TL;DR version
Run `make extract` after copying your PDF files into the *./data*
directory.
## The complete version
This project aims to be as easy to use as possible, but to achieve
that it relies on some assumptions:
1. The tool is developed with Linux as the main target. It should
work on other operating systems, but you might need to tweak some
things (especially on Windows);
2. The requirements for this tool are:
- `pdftotext`;
- `perl`, and `cpanm` to install the dependencies of the Perl script;
- `make`;
- `find`;
3. To use the Perl script, you'll need to install its dependencies.
The project includes a file that lists them, so you can run the
following command in the project's directory:
`cpanm --installdeps .`. If you don't use `cpanm`, install the
required Perl modules (`IO::All` and `Regexp::Common`) the way you
usually do on your system.
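
Before going further, it can help to confirm that the tools listed above are actually installed. The following is a minimal sketch, not part of the project: `check_tools` is a hypothetical helper name, and it only checks that each command is on the `PATH`, not that its version is suitable.

```shell
# Report which of the listed commands are available on PATH.
# check_tools is a hypothetical helper, not something this project ships.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing is missing.
    [ "$missing" -eq 0 ]
}

# The tool names come from the requirements list above.
check_tools pdftotext perl cpanm make find \
    || echo "Install the missing tools before running make."
```

If `cpanm` is reported as missing, the other targets may still work once the two Perl modules are installed by other means.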
If those requirements are satisfied, you can follow this workflow
(all commands are typed in a shell whose current directory is this
project's directory):
1. run `make init` to create the required directories (*./data*,
*./data/text* and *./output*);
2. copy your OCRed PDF files into the *./data* directory; you can use
subdirectories if that is easier for you. Don't place any PDF files
in the *./data/text* directory, as that directory is ignored during
the conversion of PDF files to text files;
3. run `make pdftotext` to create a text version of each PDF file
found in the *./data* directory;
4. run `make extract` to perform the extraction of URLs from the text
files created previously;
5. look for a file named `links-xxx.txt`, where *xxx* is a
timestamp. That file contains the extracted links.
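
To make the workflow above more concrete, here is a rough shell sketch of two of its ideas: selecting PDF files while skipping *./data/text*, and pulling URLs out of text files. It is an illustration under assumptions, not the project's implementation. The real extraction is done by the Perl script, whereas this sketch swaps in a plain `grep` pattern, and `list_pdfs` and `extract_urls` are hypothetical names. It also assumes a `grep` that supports `-o` and `-h` (GNU and BSD versions do).

```shell
# List PDF files under a data directory, skipping its "text" subdirectory.
# Hypothetical helper; the project's Makefile may select files differently.
list_pdfs() {
    find "$1" -path "$1/text" -prune -o -type f -name '*.pdf' -print
}

# Print every http(s) URL found in the given text files, one per line,
# without duplicates. A grep-based stand-in for the Perl script: it may
# keep trailing punctuation and it ignores other URL schemes.
extract_urls() {
    grep -ohE 'https?://[^[:space:]<>"]+' "$@" | sort -u
}
```

For example, `list_pdfs ./data` would show which files are candidates for conversion, and `extract_urls ./data/text/*.txt` would print the links found in the converted files, assuming the converted files end in `.txt`.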