From 0b6ed3a2c805768dddbfde36ba7b0d4d4003cfd0 Mon Sep 17 00:00:00 2001
From: Emmanuel Di Pretoro
Date: Fri, 14 Oct 2022 11:15:14 +0200
Subject: [PATCH] Updating the README with some documentation

---
 README.md | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/README.md b/README.md
index 9afefa2..fd549eb 100644
--- a/README.md
+++ b/README.md
@@ -4,3 +4,46 @@ In the context of the PROMISE project
 (https://www.kbr.be/fr/projets/projet-promise/), we needed a way to
 extract URLs from PDF files. This project implement a method to do it.
 
+# How to use this project?
+
+## The TL;DR version
+
+Run `make extract` after copying your PDF files into the *./data*
+directory.
+
+## The complete version
+
+This project aims to be as easy to use as possible, but to achieve
+that, it is built on some assumptions:
+
+1. The tool is developed with Linux as the main target. It should
+   work on other operating systems, but you might need to
+   tweak some things (especially on Windows);
+2. The requirements for this tool are:
+   - `pdftotext`;
+   - `perl`;
+     - and `cpanm` to install the dependencies of the Perl script;
+   - `make`;
+   - `find`;
+3. To use the Perl script, you'll need to install its dependencies.
+   I've created a special file for that, so you can run the
+   following command (in the project's directory):
+   `cpanm --installdeps .`. If you don't use `cpanm`, you'll need to
+   install the required Perl modules (`IO::All` and `Regexp::Common`)
+   the way you usually do on your system.
+
+If those requirements are satisfied, you'll be able to follow this
+workflow (all commands are typed in a shell whose current directory
+is this project's directory):
+1. run `make init` to create the required directories (*./data*,
+   *./data/text* and *./output*);
+2. copy your OCR-processed PDF files into the *./data* directory; you
+   can use subdirectories if that is easier for you. Don't place any
+   PDF files in the *./data/text* directory, as that directory is
+   ignored during the PDF-to-text conversion;
+3. run `make pdftotext` to create a text version of each PDF file
+   found in the *./data* directory;
+4. run `make extract` to extract the URLs from the text files
+   created previously;
+5. you'll find a file named `links-xxx.txt`, where *xxx* is a
+   timestamp. That file contains the extracted links.
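
The requirements listed in the README text can be checked before running any `make` target. The snippet below is a minimal sketch and is not part of this project: `missing_tools` is a hypothetical helper name, and the only assumption is a POSIX shell. It reports which of the listed commands cannot be found on the `PATH`.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with this project): print the name
# of every given command that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        # `command -v` is the POSIX way to test whether a command exists.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The tools named in the README's list of requirements.
missing=$(missing_tools pdftotext perl cpanm make find)

if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
else
    printf 'All required tools were found.\n'
fi
```

If nothing is reported as missing, the workflow described in the README (`make init`, `make pdftotext`, `make extract`) should at least have the commands it needs. The Perl modules still have to be installed separately.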