Updating the README with some documentation

2022-10-14 11:15:14 +02:00 · 2022-10-14 11:15:14 +02:00 · 0b6ed3a2c8
commit 0b6ed3a2c8
parent e4da338f4c
1 changed files with 43 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -4,3 +4,46 @@ In the context of the PROMISE project
 (https://www.kbr.be/fr/projets/projet-promise/), we needed a way to
 extract URLs from PDF files. This project implement a method to do it.

+# How to use this project?
+
+## The TL;DR version
+
+Run `make extract` after copying your PDF files into the *./data*
+directory.
+
+## The complete version
+
+This project aims to be as easy to use as possible. To achieve that,
+it relies on a few assumptions:
+
+1. The tool is developed with Linux as the main target. It should
+   work on other operating systems, but you might need to tweak some
+   things (especially on Windows);
+2. The requirements for this tool are: 
+   - `pdftotext`;
+   - `perl`;
+     - and `cpanm` to install the dependencies of the Perl script
+   - `make`;
+   - `find`;
+3. To use the Perl script, you'll need to install its dependencies.
+   A dependency file is provided for that, so you can run the
+   following command in the project's directory:
+   `cpanm --installdeps .`. If you don't use `cpanm`, install the
+   required Perl modules (`IO::All` and `Regexp::Common`) the way you
+   usually do on your system.
+
+If those requirements are satisfied, you'll be able to follow this
+workflow (all commands are run in a shell whose current directory is
+this project's directory):
+1. run `make init` to create the required directories (*./data*,
+   *./data/text* and *./output*);
+2. copy your OCRed PDF files into the *./data* directory; you can use
+   subdirectories if that is easier for you. Don't place any PDF files
+   in the *./data/text* directory, as it is ignored during the
+   PDF conversion to text files;
+3. run `make pdftotext` to create the text version of each PDF file
+   found in the *./data* directory;
+4. run `make extract` to perform the extraction of URLs from the text
+   files created previously;
+5. you'll find a file named `links-xxx.txt`, where *xxx* is a
+   timestamp. That file contains the extracted links.
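
As an illustration of steps 4 and 5 above, here is a minimal shell sketch of what extracting URLs from the converted text files could look like. It is not the project's Perl script (which relies on `IO::All` and `Regexp::Common`). It swaps in `grep` with a deliberately simple pattern, and the function name `extract_urls`, the timestamp format, and the choice of output directory are assumptions made for this example only.

```shell
#!/bin/sh
# Hypothetical helper, NOT the project's actual extraction script.
# Usage: extract_urls TEXT_DIR OUTPUT_DIR
# Collects http(s) URLs from every .txt file under TEXT_DIR and writes
# them, sorted and de-duplicated, to OUTPUT_DIR/links-<timestamp>.txt.
extract_urls() {
    text_dir=$1
    out_dir=$2
    # Assumed timestamp format; the README only says "a timestamp".
    out_file="$out_dir/links-$(date +%Y%m%d%H%M%S).txt"

    mkdir -p "$out_dir"

    # -E: extended regex, -o: print only the matched text,
    # -h: do not prefix matches with the file name.
    # The pattern is a rough approximation of a URL, not a full parser.
    find "$text_dir" -type f -name '*.txt' \
        -exec grep -Eoh 'https?://[^[:space:]<>"]+' {} + \
        | sort -u > "$out_file"

    # Print the path of the generated file so callers can use it.
    printf '%s\n' "$out_file"
}
```

Assuming the directory layout created by `make init`, it could be called as `extract_urls ./data/text ./output`. Because the pattern stops only at whitespace, angle brackets, or double quotes, trailing punctuation such as a closing parenthesis or a full stop may end up attached to a link, which a dedicated URL pattern would handle more carefully.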