What are the similar tools?

Commercial solutions

Open source tools

Grobid: parsing and structuring documents: https://github.com/kermitt2/grobid

S2OCR project:

Document structure extraction: http://kraken.re/

Open source libraries to use

pytesseract
https://github.com/jbarlow83/OCRmyPDF
https://gitlab.gnome.org/World/OpenPaperwork/pyocr
https://github.com/chrismattmann/tika-python (document parser, Python wrapper for Apache Tika: https://tika.apache.org/)
Tabula for table extraction
excalibur
camelot
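
As an illustration of how one of the libraries listed above is typically called, here is a minimal sketch using pytesseract. It assumes that the Tesseract binary, the `pytesseract` and `Pillow` packages, and the French language data (`fra`) are installed. The function name `ocr_image` and the file name in the usage note are hypothetical and do not come from the OCR-Xtract code base.

```python
from pathlib import Path


def ocr_image(image_path, lang="fra"):
    """Return the text Tesseract recognizes in a single image file."""
    # Imported lazily so this module can be loaded even when the
    # optional OCR dependencies are not installed.
    from PIL import Image
    import pytesseract

    # Assumption: the Tesseract language data for `lang` is installed.
    with Image.open(Path(image_path)) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

A call such as `text = ocr_image("scan.png")` (hypothetical file name) would return the recognized text as a string. The other libraries in the list cover different needs: OCRmyPDF adds a text layer to scanned PDFs, and Tabula, excalibur, and camelot target table extraction.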

Create ground truth documents: https://doc-creator.labri.fr/
Acquire document structure: https://www-intuidoc.irisa.fr/docread-generateur-automatique-de-systemes-de-reconnaissance-de-documents-structures/

OCR benchmark: https://github.com/clovaai/deep-text-recognition-benchmark

Research projects at INRIA

Orpailleur, LORIA, or READ team https://members.loria.fr/ABelaid/
LaBRI team https://www.labri.fr/perso/domenger/
Intuidoc team https://www-intuidoc.irisa.fr/ <- this team has worked extensively on the subject and offers many solutions

Our PDF API: https://github.com/Rob192/pdf_api
