Updating the README with some documentation
parent e4da338f4c
commit 0b6ed3a2c8

README.md (43 lines changed)

In the context of the PROMISE project
(https://www.kbr.be/fr/projets/projet-promise/), we needed a way to
extract URLs from PDF files. This project implements a method to do so.

# How to use this project?

## The TL;DR version

Run `make extract` after copying your PDF files into the *./data*
directory.

## The complete version

This project aims to be as easy to use as possible. To achieve that,
it is built on some assumptions:

1. The tool is developed with Linux as the main target. It should be
   able to work on other operating systems, but you might need to
   tweak some things (especially on Windows);
2. The requirements for this tool are:
   - `pdftotext`;
   - `perl`;
     - and `cpanm` to install the dependencies of the Perl script;
   - `make`;
   - `find`;
3. In order to use the Perl script, you'll need to install its
   dependencies. I've created a dedicated file for that, so you can
   run the following command (in the project's directory): `cpanm
   --installdeps .`. If you don't use `cpanm`, you'll need to install
   the required Perl modules (`IO::All` and `Regexp::Common`) the way
   you usually do it on your system.

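
As a quick sanity check for the requirements listed above, a small
shell sketch like the following could report which tools are missing.
The function name `missing_tools` is hypothetical and not part of this
project; it only relies on the POSIX `command -v` builtin.

```shell
#!/bin/sh
# Hypothetical helper (not part of this project): print the name of each
# given command that cannot be found on the PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Check the tools named in the requirements list.
missing=$(missing_tools pdftotext perl cpanm make find)
if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
fi
```

If nothing is printed, all five tools were found on the `PATH`. This
does not check that the Perl modules themselves are installed.
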

If those requirements are satisfied, you'll be able to follow this
workflow (all commands are typed in a shell whose current directory is
this project's directory):
1. run `make init` to create the required directories (*./data*,
   *./data/text* and *./output*);
2. copy your OCRed PDF files into the *./data* directory; you can use
   subdirectories if that is easier for you. Don't place any PDF files
   in the *./data/text* directory, as it is ignored during the
   conversion of PDF files to text files;
3. run `make pdftotext` to create a text version of each PDF file
   found in the *./data* directory;
4. run `make extract` to extract the URLs from the text files
   created previously;
5. you'll find a file named `links-xxx.txt`, where *xxx* is a
   timestamp. That file contains the extracted links.
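
To illustrate what the conversion step roughly amounts to, here is a
sketch that lists the PDF files under a directory while skipping its
*text* subdirectory, then converts each one with `pdftotext`. This is
not the project's Makefile recipe: the function names `list_pdfs` and
`convert_pdfs` and the naming of the output files are assumptions made
for illustration.

```shell
#!/bin/sh
# Hypothetical sketch, not the project's Makefile recipe.

# List every PDF below the given directory, ignoring its text/ subdirectory.
list_pdfs() {
    find "$1" -path "$1/text" -prune -o -type f -name '*.pdf' -print
}

# Convert each PDF to a text file in <dir>/text. Assumed naming: the PDF's
# base name with a .txt extension, so same-named PDFs located in different
# subdirectories would overwrite each other in this sketch.
convert_pdfs() {
    mkdir -p "$1/text"
    list_pdfs "$1" | while IFS= read -r pdf; do
        pdftotext "$pdf" "$1/text/$(basename "$pdf" .pdf).txt"
    done
}
```

Under these assumptions, `convert_pdfs ./data` would play a role
similar to `make pdftotext`. Note that the `-name '*.pdf'` test is
case-sensitive, so files ending in `.PDF` are not picked up here.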
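
The extraction itself is done by the project's Perl script, which
depends on `IO::All` and `Regexp::Common`. Purely as an illustration of
the idea, and not as a replacement for that script, a rough shell
approximation with `grep` might look like this. The function name
`extract_urls` is hypothetical, the regular expression is a
deliberately simple assumption that will miss or over-match some URLs,
and `-o` is a widely available but non-POSIX `grep` option.

```shell
#!/bin/sh
# Hypothetical sketch: print each http(s) URL found on standard input,
# one per line, sorted and with duplicates removed.
extract_urls() {
    grep -Eo 'https?://[^[:space:]<>")]+' | LC_ALL=C sort -u
}
```

For example, `extract_urls < some-file.txt` prints the URLs found in
one text file, where `some-file.txt` stands for any file produced by
the conversion step.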