- 1 year ago
Training Tesseract-OCR 4.0 LSTM on Windows 7 / Windows 10
Training the Tesseract-OCR 4.0 LSTM engine on Windows 7 or Windows 10 is possible, but it requires several manual steps. The following step-by-step guide walks through them:
Install Prerequisites:
- Install Git for Windows: https://gitforwindows.org/
- Install Python 3 from: https://www.python.org/downloads/ (Python 2.7 reached end of life in January 2020, and current releases of the libraries installed in the next step no longer support it).
- Add Python and Git to the system PATH during installation.
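Before moving on, it can help to confirm that the tools are actually reachable from the Command Prompt. The following is a minimal sketch, assuming the executables are named `git`, `python`, and `pip`; the helper name `find_missing_tools` is made up for this example.

```python
import shutil

# Tool names are assumptions based on the prerequisites listed above.
REQUIRED_TOOLS = ["git", "python", "pip"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All prerequisite tools were found on PATH.")
```

If a tool is reported as missing, reopening the Command Prompt after installation is usually the first thing to try, since PATH changes only apply to new sessions.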
Install Required Libraries:
Open a Command Prompt and run the following commands:
pip install pillow
pip install pytesseract
pip install scikit-image
Clone the Tesseract Git Repository:
Open a Command Prompt and run the following commands:
git clone https://github.com/tesseract-ocr/tesseract.git
cd tesseract
Download Language Data:
- Download the necessary language training data files for the languages you want to train from: https://github.com/tesseract-ocr/tessdata
- Copy the downloaded .traineddata files to the "tesseract\tessdata" directory.
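The copy step can also be scripted. This is a rough sketch using only the Python standard library; the function name `copy_traineddata` and the example paths in the comment are hypothetical and should be adapted to where the files were actually downloaded.

```python
import shutil
from pathlib import Path


def copy_traineddata(download_dir, tessdata_dir):
    """Copy every *.traineddata file from download_dir into tessdata_dir.

    Returns the names of the copied files, sorted alphabetically.
    """
    src = Path(download_dir)
    dst = Path(tessdata_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create tessdata if it is missing
    copied = []
    for path in sorted(src.glob("*.traineddata")):
        shutil.copy2(path, dst / path.name)
        copied.append(path.name)
    return copied


# Hypothetical usage (adjust both paths to your machine):
# copy_traineddata(r"C:\Users\me\Downloads", r"tesseract\tessdata")
```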
Create Training Data:
- Create training images for each font and character you want to train.
- Name the training images in the format "fontname_char.png" (e.g., "arial_a.png").
- Put the training images in a directory (e.g., "training_data").
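A small check of the naming convention can catch misnamed files before training starts. The sketch below assumes the "fontname_char.png" pattern described above and splits on the last underscore, so font names that themselves contain underscores still parse; both function names are invented for this example.

```python
from pathlib import Path


def parse_training_image_name(filename):
    """Split 'fontname_char.png' into (fontname, char)."""
    stem = Path(filename).stem
    # rpartition splits on the LAST underscore in the stem.
    font, sep, char = stem.rpartition("_")
    if not sep or not font or not char:
        raise ValueError(f"expected 'fontname_char.png', got {filename!r}")
    return font, char


def group_by_font(directory):
    """Map each font name to the list of characters found in `directory`."""
    groups = {}
    for path in sorted(Path(directory).glob("*.png")):
        font, char = parse_training_image_name(path.name)
        groups.setdefault(font, []).append(char)
    return groups
```

Calling `group_by_font("training_data")` gives a quick overview of which characters are covered for each font.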
Generate Box Files:
- Run the following command in the Command Prompt. It renders training text with the fonts found in --fonts_dir and, because of --linedata_only, writes only the line data needed for LSTM training to --output_dir:
python tesstrain.py --fonts_dir path/to/your/fonts --lang eng --linedata_only --langdata_dir ./langdata --tessdata_dir ./tessdata --output_dir ./training_data
Generate LSTM Training Data:
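- Run the following command in the Command Prompt to train the LSTM network on the generated data. Note that lstmtraining is a compiled executable from Tesseract's training tools, not a Python script, and that the Command Prompt needs double quotes (not single quotes) around the network specification:
lstmtraining --traineddata ./tessdata/eng.traineddata --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lh0.05 O1c111]" --model_output ./output_base --train_listfile ./training_data/all.training_files.txt --max_iterations 400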
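The network specification contains spaces, and the Windows Command Prompt does not treat single quotes as quoting characters, so the argument is easy to split by accident. One way around this is to launch the trainer from Python with the arguments passed as a list, which bypasses shell quoting entirely. The sketch below is only an illustration: it assumes the trainer is invoked as `lstmtraining` on PATH (change the `program` argument if your setup differs), reuses the flag values from the command above, and the function names are made up.

```python
import subprocess

# Network specification copied from the training command above.
NET_SPEC = "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lh0.05 O1c111]"


def build_training_command(
    program="lstmtraining",  # assumption: the trainer is on PATH under this name
    traineddata="./tessdata/eng.traineddata",
    model_output="./output_base",
    train_listfile="./training_data/all.training_files.txt",
    max_iterations=400,
    net_spec=NET_SPEC,
):
    """Return the training command as an argument list (no shell quoting)."""
    return [
        program,
        "--traineddata", traineddata,
        "--net_spec", net_spec,  # stays one argument despite the spaces
        "--model_output", model_output,
        "--train_listfile", train_listfile,
        "--max_iterations", str(max_iterations),
    ]


def run_training(command):
    """Run the command; check=True raises CalledProcessError on failure."""
    return subprocess.run(command, check=True)
```

Typical use would be `run_training(build_training_command())` from the directory that contains the tessdata and training_data folders.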
Fine-Tune LSTM Training Data (Optional):
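- If you want to fine-tune the model further, continue training from the saved checkpoint. Because --max_iterations counts the total number of iterations, including those already stored in the checkpoint, set it higher than in the previous run (for example, 800):
lstmtraining --traineddata ./tessdata/eng.traineddata --continue_from ./output_base_checkpoint --model_output ./output_base --train_listfile ./training_data/all.training_files.txt --max_iterations 800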
Generate Tesseract OCR Model:
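- Run the following command in the Command Prompt to convert the training checkpoint into a .traineddata model file:
lstmtraining --stop_training --continue_from ./output_base_checkpoint --traineddata ./tessdata/eng.traineddata --model_output ./output_base.traineddata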
Replace the Existing Language Model:
- Replace the existing .traineddata file for the language you trained (e.g., eng.traineddata) with the one you generated in the previous step.
Test the Trained Model:
- Test the trained model by running Tesseract-ocr on your test images.
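To make the test more than a visual check, the recognized text can be compared with a known transcription. The sketch below computes a character error rate from the Levenshtein edit distance; it relies on the `pytesseract` and `pillow` packages installed earlier, and the function names as well as the file name and expected text in the usage comment are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def character_error_rate(recognized, expected):
    """Edit distance divided by the length of the expected text."""
    if not expected:
        raise ValueError("expected text must not be empty")
    return edit_distance(recognized, expected) / len(expected)


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image through pytesseract."""
    # Imported here so the metric functions work without the OCR stack.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)


# Hypothetical usage:
# text = ocr_image("test_image.png").strip()
# print(character_error_rate(text, "the expected transcription"))
```

Running the same comparison with the original and the newly trained .traineddata file gives a rough indication of whether training improved recognition on your images.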
Please note that training Tesseract-ocr can be a complex and time-consuming process. It is recommended to have some experience with OCR and Tesseract before attempting to train your own models. Additionally, the quality of the trained model will heavily depend on the quality and quantity of the training data used.