Adding the text extraction of the PDF files found in data

Emmanuel Di Pretoro 2018-08-24 11:20:54 +02:00
parent 641f85f17c
commit 583c4aac0b
2 changed files with 8 additions and 0 deletions


@@ -1,2 +1,6 @@
help:
	@cat doc/help.txt
pdftotext:
	@find ./data -iname '*.pdf' -execdir pdftotext {} \;
	@find ./data -not \( -path ./data/text -prune \) -iname '*.txt' -exec mv {} './data/text/' ';'


@@ -1,2 +1,6 @@
With this command, you'll be able to easily manage the extraction of
URLs from books scanned by KBR.
Here is the list of commands and what they do:
* make pdftotext: this command extracts a text version of each PDF file
  found in data/ and moves the resulting .txt files to the data/text/ directory
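
For readers who want to try the same two steps outside of make, here is a minimal shell sketch of what the pdftotext target appears to do. It is not part of the commit. The function name extract_texts and the PDFTOTEXT override variable are hypothetical names introduced for illustration, and the mkdir -p line is an addition: the Makefile target as committed seems to assume that data/text/ already exists.

```shell
#!/bin/sh
# Sketch of the `make pdftotext` target as a standalone shell function.
# Assumptions (not in the commit): the function name, the PDFTOTEXT
# override variable, and the mkdir -p step.
extract_texts() {
    data_dir="${1:-./data}"
    data_dir="${data_dir%/}"          # drop a trailing slash so -path matches
    text_dir="$data_dir/text"

    # Make sure the destination exists before moving files into it.
    mkdir -p "$text_dir"

    # Step 1: convert every PDF in place. Run from the file's own directory,
    # pdftotext writes foo.txt next to foo.pdf.
    find "$data_dir" -iname '*.pdf' -execdir "${PDFTOTEXT:-pdftotext}" {} \;

    # Step 2: move the generated .txt files into data/text/, pruning that
    # directory so files already there are not moved onto themselves.
    find "$data_dir" -not \( -path "$text_dir" -prune \) -iname '*.txt' \
        -exec mv {} "$text_dir/" \;
}
```

Calling `extract_texts ./data` from the repository root should mirror the make target. Note that, as in the Makefile, two PDFs with the same base name in different subdirectories end up competing for one file name in data/text/, and the later move overwrites the earlier one.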