From 0b6ed3a2c805768dddbfde36ba7b0d4d4003cfd0 Mon Sep 17 00:00:00 2001
From: Emmanuel Di Pretoro <emmanuel.dipretoro@kikirpa.be>
Date: Fri, 14 Oct 2022 11:15:14 +0200
Subject: [PATCH] Updating the README with some documentation

---
 README.md | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)
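
Note: the requirements listed in the README text below can be checked
before running the workflow. The following is only a minimal sketch in
POSIX shell, not part of this patch or of the project. The
`missing_tools` helper is a hypothetical name introduced here, and the
tool list is copied from the README.

```shell
#!/bin/sh
# Hypothetical helper (not part of the project): print the names of
# the given tools that cannot be found on PATH, separated by spaces.
missing_tools() {
    result=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            result="$result $tool"
        fi
    done
    # Strip the leading space before printing.
    echo "${result# }"
}

# Tool list taken from the README requirements.
missing=$(missing_tools pdftotext perl cpanm make find)
if [ -n "$missing" ]; then
    echo "Missing tools: $missing" >&2
else
    echo "All required tools found."
fi
```

If nothing is reported missing, `cpanm --installdeps .` and the `make`
targets described below should be runnable.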

diff --git a/README.md b/README.md
index 9afefa2..fd549eb 100644
--- a/README.md
+++ b/README.md
@@ -4,3 +4,46 @@ In the context of the PROMISE project
 (https://www.kbr.be/fr/projets/projet-promise/), we needed a way to
 extract URLs from PDF files. This project implement a method to do it.
 
+# How to use this project?
+
+## The TL;DR version
+
+Run `make extract` after copying your PDF files into the *./data*
+directory.
+
+## The complete version
+
+This project aims to be as easy as possible to use. To achieve that,
+it is built on some assumptions:
+
+1. The tool is developed with Linux as the main target. It should
+   work on other operating systems, but you might need to tweak some
+   things (especially on Windows);
+2. The requirements for this tool are:
+   - `pdftotext`;
+   - `perl`;
+     - and `cpanm` to install the dependencies of the Perl script
+   - `make`;
+   - `find`;
+3. To use the Perl script, you'll need to install its dependencies.
+   The project includes a file listing them, so you can run the
+   following command in the project's directory:
+   `cpanm --installdeps .`. If you don't use `cpanm`, install the
+   required Perl modules (`IO::All` and `Regexp::Common`) the way you
+   usually do on your system.
+
+If those requirements are satisfied, you'll be able to follow this
+workflow (all commands are run in a shell whose current directory is
+this project's directory):
+1. run `make init` to create the required directories (*./data*,
+   *./data/text* and *./output*);
+2. copy your OCR-processed PDF files into the *./data* directory; you
+   can use subdirectories if that is easier for you. Don't place any
+   PDF files in the *./data/text* directory, as it is ignored during
+   the PDF conversion to text files;
+3. run `make pdftotext` to create the text version of each PDF file
+   found in the *./data* directory;
+4. run `make extract` to perform the extraction of URLs from the text
+   files created previously;
+5. you'll find a file named `links-xxx.txt`, where *xxx* is a
+   timestamp. That file contains the extracted links.