Add text extraction for the PDF files found in data

2018-08-24 11:20:54 +02:00
parent 641f85f17c
commit 583c4aac0b
2 changed files with 8 additions and 0 deletions
@@ -1,2 +1,6 @@
 help:
 	@cat doc/help.txt
+
+pdftotext:
+	@find ./data -iname '*.pdf' -execdir pdftotext {} \;
+	@find ./data -not \( -path ./data/text -prune \) -iname '*.txt' -exec mv {} './data/text/' ';'
@@ -1,2 +1,6 @@
 With this command, you'll be able to manage easily the extraction of
 URLs from books scanned by KBR
+
+Here is the list of commands and what they do:
+* make pdftotext: this command extracts a text version of each PDF file
+                  and moves the resulting files to the data/text/ directory
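
For readers who want to try the two steps of the pdftotext target outside of make, here is a minimal shell sketch of the same logic. The DATA_DIR variable and the function names extract_text and collect_text are illustrative assumptions and do not appear in the commit. The sketch also creates the destination directory first, which the Makefile rule appears to assume already exists.

```sh
#!/bin/sh
# Sketch of the pdftotext target as plain shell functions.
# DATA_DIR, extract_text and collect_text are hypothetical names,
# not part of the commit.
DATA_DIR="${DATA_DIR:-./data}"

extract_text() {
    # Run pdftotext from each PDF's own directory, so the .txt file
    # is written next to the PDF it was extracted from.
    find "$DATA_DIR" -iname '*.pdf' -execdir pdftotext {} \;
}

collect_text() {
    # Make sure the destination exists before moving anything into it.
    mkdir -p "$DATA_DIR/text"
    # Skip the destination directory itself, then move every other
    # .txt file found under DATA_DIR into it.
    find "$DATA_DIR" -path "$DATA_DIR/text" -prune -o \
        -type f -iname '*.txt' -exec mv {} "$DATA_DIR/text/" \;
}
```

Calling extract_text and then collect_text should give roughly the same result as make pdftotext. Note that, as in the Makefile rule, any .txt file under data is moved, not only those produced by pdftotext, and files with the same name in different subdirectories will overwrite each other in data/text/.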