Adding the text extraction of the PDF files found in data

2018-08-24 11:20:54 +02:00
commit 583c4aac0b
parent 641f85f17c
2 changed files with 8 additions and 0 deletions
--- a/4
+++ b/4
@@ -1,2 +1,6 @@
 help:
 	@cat doc/help.txt
+
+pdftotext:
+	@find ./data -iname '*.pdf' -execdir pdftotext {} \;
+	@find ./data -not \( -path ./data/text -prune \) -iname '*.txt' -exec mv {} './data/text/' ';'
--- a/doc/help.txt
+++ b/doc/help.txt
@@ -1,2 +1,6 @@
 With this command, you'll be able to manage easily the extraction of
 URLs from books scanned by KBR
+
+Here is the list of commands and what they do:
+* make pdftotext: extracts a text version of each PDF file under data/
+                  and moves the resulting files to the data/text/ directory
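
The second recipe line of the pdftotext target (collecting the generated .txt files into data/text) can be sketched as a standalone shell function. This is a minimal sketch, not the commit's code: the function name collect_text is hypothetical, and the mkdir -p call is an added assumption, since the recipe as committed appears to expect data/text to exist already.

```shell
#!/bin/sh
# Hypothetical helper (not part of the commit): gather every .txt file
# found under "$1" into "$1/text". Pass the directory without a trailing
# slash so the -path test below matches what find prints.
collect_text() {
    data_dir=$1
    # Assumption: create the destination first; mv fails if it is missing.
    mkdir -p "$data_dir/text"
    # -prune keeps find out of the destination directory, so files that
    # were already collected are never moved onto themselves.
    find "$data_dir" -path "$data_dir/text" -prune -o \
        -type f -iname '*.txt' -exec mv {} "$data_dir/text/" \;
}
```

Calling collect_text ./data after the pdftotext step should mirror what the target does. Files that share a base name in different subdirectories would overwrite each other in data/text, which is also true of the recipe in the diff.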