The purpose of this work is to diacritize unmarked Arabic text using several techniques. Significant success has been achieved using a Bidirectional Long Short-Term Memory (BLSTM) deep neural network, an Encoder-Decoder model, and a post-processing technique. The Tashkeela Corpus, which is freely available, has been sorted by text category, namely Classical Arabic and Modern Standard Arabic.
A standard for the input and output text has also been laid out by a group of researchers from Jordan University, who compared several different models.
The reason for setting such a standard is to make comparisons between models easier, because different research groups apply different metrics for calculating the error rate. The results presented in this paper are based on this same standard of input, output, and error metrics. An input text compliant with this standard can be easily prepared using a collection of open-source methods provided by the same group of researchers. The best model available to date, based on a deep neural network, learns from a set of 50,000 lines of marked Arabic text, including both Classical and Modern Standard Arabic.
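Training pairs for this task are typically derived from the marked corpus itself: the diacritized line is the target, and the same line with its diacritics stripped is the input. A minimal sketch of that stripping step is shown below; the set of tashkeel code points listed here is an assumption for illustration, and the researchers' own open-source cleaning methods may cover additional marks.

```python
# Arabic diacritic (tashkeel) code points: fathatan through sukun,
# plus the dagger alif. This set is illustrative, not exhaustive.
TASHKEEL = {
    "\u064B",  # fathatan
    "\u064C",  # dammatan
    "\u064D",  # kasratan
    "\u064E",  # fatha
    "\u064F",  # damma
    "\u0650",  # kasra
    "\u0651",  # shadda
    "\u0652",  # sukun
    "\u0670",  # dagger alif
}

def strip_tashkeel(text: str) -> str:
    """Remove diacritic marks, leaving only the bare Arabic letters."""
    return "".join(ch for ch in text if ch not in TASHKEEL)

marked = "كَتَبَ"  # "kataba", fully marked with fatha
print(strip_tashkeel(marked))  # كتب
```

Pairing each marked line with its stripped counterpart yields the aligned input/output sequences the encoder-decoder models consume.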
We implemented an Encoder-Decoder simple LSTM with sequence length 4, slightly simplified, and trained it on 3,615 sentences. This small-sample model achieved 36% test accuracy, which increased to 79% after post-processing. We also implemented an Encoder-Decoder BLSTM with sequence length 5 and trained it on the same dataset. This model achieved 40% test accuracy, which also increased to 79% after post-processing. Thus, the BLSTM with sequence length 5 outperforms the simple LSTM with sequence length 4.
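The accuracy comparison above depends on a shared error metric. One common choice in this literature is the diacritic error rate (DER): the fraction of character positions whose predicted diacritic label differs from the reference. The helper below is a hypothetical sketch of that idea; the standard's exact definition (for instance, whether case-ending positions are counted) is the one fixed by the Jordan University researchers.

```python
def diacritic_error_rate(gold, pred):
    """Fraction of aligned positions where the predicted diacritic
    label differs from the reference label (lower is better)."""
    if len(gold) != len(pred):
        raise ValueError("sequences must be aligned position-for-position")
    errors = sum(g != p for g, p in zip(gold, pred))
    return errors / len(gold)

# One of four positions is wrong, so DER = 0.25.
gold = ["fatha", "fatha", "fatha", "sukun"]
pred = ["fatha", "damma", "fatha", "sukun"]
print(diacritic_error_rate(gold, pred))  # 0.25
```

Reporting both raw accuracy and a per-position rate like this keeps results from different sequence lengths directly comparable.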
A web-based front end will be developed alongside the back-end implementation, giving students of the Arabic language a wider range of books on which to practice their reading skills.