Adding the text extraction of the PDF files found in data

Emmanuel Di Pretoro 2018-08-24 11:20:54 +02:00
parent 641f85f17c
commit 583c4aac0b
2 changed files with 8 additions and 0 deletions


@@ -1,2 +1,6 @@
help:
	@cat doc/help.txt
pdftotext:
	@find ./data -iname '*.pdf' -execdir pdftotext {} \;
	@find ./data -not \( -path ./data/text -prune \) -iname '*.txt' -exec mv {} './data/text/' ';'


@@ -1,2 +1,6 @@
With this command, you can easily manage the extraction of
URLs from books scanned by KBR.
Here is the list of commands and what they do:
* make pdftotext: this command extracts a text version of each PDF file
and moves the resulting text files to the data/text/ directory
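
For readers who want to try the two steps outside of make, here is a minimal shell sketch of what the pdftotext target appears to do. The function names extract_text and collect_text are hypothetical and not part of the commit. The mkdir -p guard is an added assumption: the Makefile recipe seems to rely on data/text/ already existing. The sketch also assumes a find that supports -iname and -execdir (GNU or BSD) and that pdftotext is on the PATH.

```shell
#!/bin/sh
# Hypothetical helpers mirroring the two recipe lines of the pdftotext target.
# Neither function name appears in the commit; both are for illustration only.

# Step 1: run pdftotext next to each PDF, so book.pdf yields book.txt
# in the same directory. Requires pdftotext to be installed.
extract_text() {
    data_dir=${1:-./data}
    find "$data_dir" -type f -iname '*.pdf' -execdir pdftotext {} \;
}

# Step 2: move every .txt file found outside "$data_dir/text" into it.
collect_text() {
    data_dir=${1:-./data}
    # Assumption: create the target first, so mv always sees a directory.
    mkdir -p "$data_dir/text"
    # Prune the target directory so files already collected are left alone.
    find "$data_dir" -path "$data_dir/text" -prune -o \
        -type f -iname '*.txt' -exec mv {} "$data_dir/text/" \;
}
```

Calling extract_text ./data followed by collect_text ./data should roughly reproduce make pdftotext. Note that two PDFs with the same base name in different subdirectories would end up overwriting each other in data/text/, in this sketch as in the original recipe.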