Training Tesseract-ocr 4.0 LSTM on Windows 7 / Windows 10

Training the Tesseract-ocr 4.0 LSTM engine on Windows 7 or Windows 10 is possible, but it takes several manual steps. Here is a step-by-step guide:

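Install Prerequisites:
- Install Git for Windows: https://gitforwindows.org/
- Install Python 3 (not Python 2.7; the tesstrain.py script used below is written for Python 3) from: https://www.python.org/downloads/
- Install Tesseract 4 together with its training tools (lstmtraining, combine_tessdata, text2image). These are compiled programs, not Python packages. The Windows installers published by UB Mannheim are one option; check that lstmtraining.exe is present after installation.
- Add Python, Git, and the Tesseract install directory to the system PATH.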

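Install Required Libraries:
Open a Command Prompt and run the following commands. These Python packages help with preparing images and testing the result; the training itself is done by the Tesseract command-line tools.

pip install pillow
pip install pytesseract
pip install scikit-image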
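
Before moving on, it can help to confirm that the command-line tools are reachable from the Command Prompt. The following is a minimal sketch, not part of Tesseract: the helper name `find_missing_tools` and the list of tool names are assumptions based on a typical Tesseract 4 installation, so adjust the list to your setup.

```python
import shutil

# Executables the later steps rely on. The names are an assumption based on
# a typical Tesseract 4 install with training tools; edit as needed.
REQUIRED_TOOLS = [
    "git",
    "python",
    "tesseract",
    "text2image",
    "lstmtraining",
    "combine_tessdata",
]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If anything is reported as missing, fix the PATH entry before continuing, since the later steps fail with a "not recognized as an internal or external command" error otherwise.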

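Clone the Tesseract Git Repository:
Open a Command Prompt and run the following commands. The last command checks out a 4.x release tag so that the training scripts match Tesseract 4:

git clone https://github.com/tesseract-ocr/tesseract.git
cd tesseract
git checkout 4.1.1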

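Download Language Data:
- Download the .traineddata files for the languages you want to train from: https://github.com/tesseract-ocr/tessdata_best (LSTM training needs these float models; the integer models in https://github.com/tesseract-ocr/tessdata cannot be used to continue training).
- Copy the downloaded .traineddata files to the "tesseract\tessdata" directory.
- Download the matching language folder (e.g., "eng") and the shared top-level files (such as radical-stroke.txt) from https://github.com/tesseract-ocr/langdata_lstm into a "langdata" directory inside the repository. The --langdata_dir option used below points at it.
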
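Create Training Data:
- Decide which fonts and characters you want to train, and collect the font files in one directory. The LSTM engine is trained on images of whole text lines, and tesstrain.py renders those line images itself from the fonts and the training text in langdata.
- tesstrain.py does not read hand-made images. If you prepare your own sample images, name them consistently, e.g. "fontname_char.png" ("arial_a.png"), and keep them for testing the finished model.
- Put those images in a directory of their own (e.g., "test_images"), separate from the training output.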

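Generate Box Files:

Run the following command from the repository root to render the training text in each font and generate the box files and .lstmf line data. In the 4.1 source tree the script is located in src/training. It writes the .lstmf files, a list of them, and a starter traineddata file under the output directory:

python src/training/tesstrain.py --fonts_dir path/to/your/fonts --lang eng --linedata_only --langdata_dir ./langdata --tessdata_dir ./tessdata --output_dir ./training_data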
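
A quick sanity check after this step is to confirm that every file named in the generated training list actually exists, since lstmtraining needs all of them. The sketch below assumes the list file is plain text with one .lstmf path per line; `missing_lstmf_files` is a hypothetical helper name, and the path you pass in is whatever list file your run produced.

```python
from pathlib import Path


def missing_lstmf_files(listfile):
    """Return the entries of a training list file that are not existing files.

    Assumes one path per line; blank lines are ignored.
    """
    lines = Path(listfile).read_text(encoding="utf-8").splitlines()
    entries = [line.strip() for line in lines if line.strip()]
    return [entry for entry in entries if not Path(entry).is_file()]
```

An empty result means all listed files are present. An empty list file, or a list with missing entries, usually points to a font or langdata problem in the generation step.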

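Train the LSTM Model:

Run the following command in the Command Prompt to train the LSTM network. lstmtraining is a compiled program, not a Python script, so it is called directly. The network specification is wrapped in double quotes because the Command Prompt does not treat single quotes as quoting characters. The "Lh0.05" layer in the original specification is not a valid layer definition; the specification below follows the example in the Tesseract training documentation. tesstrain.py writes the starter traineddata to eng/eng.traineddata and the list file to eng.training_files.txt inside the output directory:

lstmtraining --traineddata ./training_data/eng/eng.traineddata --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]" --model_output ./output_base --train_listfile ./training_data/eng.training_files.txt --max_iterations 400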
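
Because the --net_spec value contains spaces and brackets, quoting it correctly in the Command Prompt is a common stumbling block. One way around this is to start the tool from Python with an argument list, which bypasses shell quoting entirely. This is a sketch under assumptions: the helper names `build_lstmtraining_args` and `run_training` are made up for illustration, the flags are the ones used in the command above, and the lstmtraining executable is assumed to be on PATH.

```python
import subprocess


def build_lstmtraining_args(traineddata, net_spec, model_output,
                            train_listfile, max_iterations):
    """Build the argument list for lstmtraining.

    Each value is a separate list element, so the spaces inside
    net_spec need no shell quoting.
    """
    return [
        "lstmtraining",
        "--traineddata", str(traineddata),
        "--net_spec", net_spec,
        "--model_output", str(model_output),
        "--train_listfile", str(train_listfile),
        "--max_iterations", str(max_iterations),
    ]


def run_training(args):
    """Run lstmtraining and raise CalledProcessError if it exits with an error."""
    return subprocess.run(args, check=True)
```

Pass the same values as in the command above to `build_lstmtraining_args`, then hand the result to `run_training`.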

Fine-Tune LSTM Training Data (Optional):

If you want to train the model further, continue from the checkpoint that the previous run wrote. --max_iterations counts the total number of iterations, so it has to be higher than the 400 already completed; otherwise the run stops immediately:

lstmtraining --traineddata ./training_data/eng/eng.traineddata --continue_from ./output_base_checkpoint --model_output ./output_base --train_listfile ./training_data/eng.training_files.txt --max_iterations 800

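Generate Tesseract OCR Model:
- Run the following command in the Command Prompt to convert the training checkpoint into a .traineddata file. (combine_tessdata -e extracts components from an existing .traineddata file, so it cannot produce the model from a checkpoint.)
```
lstmtraining --stop_training --continue_from ./output_base_checkpoint --traineddata ./training_data/eng/eng.traineddata --model_output ./eng.traineddata
```
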
Replace the Existing Language Model:
- Back up the existing .traineddata file for the language you trained (e.g., eng.traineddata), then replace it with the one you generated in the previous step.

Test the Trained Model:
- Test the trained model by running Tesseract-ocr on test images that were not part of the training data, and compare the output with the expected text.
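
To turn the test into a number, a common measure is the character error rate: the edit distance between the OCR output and the expected text, divided by the length of the expected text. The sketch below is plain Python with no Tesseract dependency. The function names are chosen for illustration, and the OCR text is assumed to come from wherever you run the model, for example `pytesseract.image_to_string`.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_output):
    """Edit distance divided by the length of the ground-truth text."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)
```

Comparing this value for the stock model and the retrained model on the same test images shows whether the training helped.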

Please note that training Tesseract-ocr can be a complex and time-consuming process. It is recommended to have some experience with OCR and Tesseract before attempting to train your own models. Additionally, the quality of the trained model will heavily depend on the quality and quantity of the training data used.