Research Journal of Recent Sciences _________________________________________________ ISSN 2277-2502 Vol. 3(11), 98-102, November (2014) Res.J.Recent Sci. International Science Congress Association        
98
 
Offline Urdu Numeral Recognition Using Non-Negative Matrix Factorization Shahab Uddin, Muhammad Sarim, Abdul Basit Shaikh and Sheikh Kashif Raffat Department of Computer Science, Federal Urdu University of Arts, Sciences and Technology, Karachi, PAKISTAN Available online at: 
www.isca.in
, 
www.isca.me
Received 1st December 2013, revised  17th March 2014, accepted 25th July 2014Abstract  By the rapid change and advancement in technology a need for processing and preserving many texts had been felt. These texts are either in hard copies or in handwritten form. Hand-written numerals, written in various languages and scripts, are an integral part of these texts.  Several efforts have been made to recognize numerals and a variety of Optical Character Recognition (OCR) systems have been successfully implemented and marketed. Urdu numerals, as opposed to English numerals, are different due to their style and format of writing. Various methods have been proposed but majority of them only address computer typed numerals in different forms and sizes. Therefore we need to develop new and enhance existing handwritten Urdu numerals recognition systems due to their wide scale use and application in many fields. This research addresses the problem of handwritten offline numerals. A novel approach of Non-negative Matrix Factorization (NMF) for Urdu handwritten character recognition has been proposed in this research. Keywords: Urdu, handwritten, numeral, recognition, optical character recognition, offline, NMF, OCR. IntroductionFor the purpose of recognition, numerals can be classified into two classes. i.e. On-line and off-line. The word on-line suggests that the writing and recognition are carried out simultaneously. While in the case of off-line recognition, a digital image is presented to the system. Moreover, we use off-line recognition for printed and handwritten numerals recognition.  On-line recognition has an added advantage of time coordinate that is not available in case of off-line recognition. For a specific font type, printed numerals have only one style whereas styles and sizes vary in case of handwritten numerals for the same writer at different instances and between different writers. Furthermore, if we compare Urdu numerals to English numerals we find that Urdu numerals are written from right to left whereas English numerals are written from left to right. Handwritten numerals may look similar but they are different. It is also hard for the recognition system to spot the dissimilarity. Additionally, the length and width of the numerals can also be different. Moreover, same numerals can be written differently in various forms. In an automatic recognition system, the selection of feature extraction method might be the most important step for achieving high recognition accuracy. In multivariate analysis and linear algebra, Non-negative matrix factorization (NMF) is a matrix decomposition and dimension reduction technique based on low-rank approximation which makes use of a range of algorithms.  Two such algorithms are based on multiplicative update rule and alternating least-squares algorithm. NMF reduces the number of features along with the constraint that the features will have to be nonnegative. If  is a matrix then NMF of  gives two factor matrix (not unique) namely  and . Where  may be called weight matrix and may be called basis matrix. Mathematically,  



	
\n


(1) Where  is\r matrix and  is a positive integer such that  
\r



\r
	





(2) Thus NMF returns a non-negative matrices  of size and  of size \r which are  approximate non-negative factors of  that minimize the root mean square residual where:  














(3) Where  columns of  show transformations of matrix's data while the  rows of  show the coefficients of the linear combinations of \r data in  which had resulted in the form of transformed data in . Since   rank ( ) thus the product  gives condensed approximation of  . NMF is similar to Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) but the constraint with NMF is that it has non-negative factors. NMF had been used by many researchers in many fields for both the feature extraction and classification. Related Work: Sagheer et al. prepared the Urdu datasets of isolated digits, numeral strings, alphabets, dates, isolated letters and special characters. Normalized images of 36 x 36 pixels were used to extract 32-direction gradient maps. The image is then divided into 9 x 9 blocks by extracting 4 x 4 feature values from each block. Gaussian filtering was used to down sample the directions and blocks to get the feature of 400 dimensions. Support Vector Machine (SVM) with the Radial Basis kernel Function (RBF) was used as a classifier. 
Research Journal of Recent Sciences _____________________________________________________________ ISSN 2277-2502Vol. 3(11), 98-102, November (2014)                   Res.J.Recent Sci International Science Congress Association            
99
Das et al. exploited the 72 shadow features and 16 centroid features of normalized 32 x 32 pixel bounded boxed images.  The image is divided into octants. Considering the normalized lengths (i.e. the length of each projection divided by the length of maximum possible length of projection on respective side) of projections on three sides of each such octant, 24 shadow features are extracted from each window of the digit image. Then three such overlapping windows are considered that make the number of shadow features equal to 72. Furthermore, the x and y coordinates of centroids of images in the octant window are added to the feature set. Finally, multi-layer perception was used as a classifier.  Razzak et al. proposed fuzzy rule base, Hidden Markov Model (HMM) and Hybrid approaches for Urdu and Arabic numerals in unconstrained environment. For Persian numerals, Harifi and Aghagolzadeh used asymmetrical segmentation patterns and shadow coding feature extraction and multi-layer perceptron (MLP) for classification. For Kannada, Tamil, Telugu and Malayalam, the Indian scripts based languages, Rajashekararadhya and Ranjan presented centroid (zone and image based) distance for the extraction of features. Stuti Asthana et al. conducted numeral recognition of Urdu, English, Tamil, Devnagri and Telugu scripts through multilayer feed-forward back-propagation algorithm. Shuwair Sardar and Abdul Wahab had developed a system which was tested on 1050 individual urdu characters and liguatures and tested it on both online and offline characters. They have claimed an accuracy overall accuracy of 97.12%, 97.09% accuracy in extracting the lines of text and 98.86% accuracy in primary and secondary stroke extraction. Pal and Sarkar took advantage of water-reservoir features and binary-tree classifier for classification of 3050 characters and numerals and achieved 97.8% accuracy. According to Akram and Hussain, converted segments of text to a word sequence. Here, space, colon, semi colon etc were assumed to be word separators. As it is the case with several scripts whiles other scripts do not have clear cut boundaries for the words. In the latter case, linguistic knowledge, lexicon and machine-learningbased approaches can be utilized for recognition. Alaei et al.10 utilized IFHCDB, an isolated Farsi and Arabic handwritten character data set, for character recognition. The data set consists a total of 52020 handwritten character samples out of which 36682 samples were considered for training and the remaining 15338 characters were used for testing. Various feature such as line-fitting information, intersection/junction/ endpoint, under-sampled bitmap, shadow, directional chain code and gradient were used. Moreover, SVM, Nearest Neighbour (NN) and k-Nearest Neighbours (k-NN) were utilized for the purpose of classification. Finally, results from the combinations of all of the above mentioned classifiers and feature sets were calculated. In conclusion, the gradient features in combination with SVM as a classifier showed the best results of 96.91% accuracy in recognition rate. Mozaffari et al.11 proposed a new method for isolated handwritten Farsi/Arabic characters and numerals recognition using fractal codes. Fractal codes represent affine transformations. Each fractal code contained six parameters, such as corresponding domain coordinates for each range block, brightness offset and an affine transformation, which were used as inputs for a multilayer perceptron neural network for learning and identifying an input. This method was robust to scale and frame size changes. Farsi characters (32 in number) were categorized to eight different classes. Each class comprised of structurally similar characters. According to experimental results, classification rates of 91.37% and 87.26% were obtained for digits and characters respectively. Mowlaei et al.12 used discrete Wavelet transform to produce Wavelet  coefficients. These coefficients were used for classification. Haar wavelet was used for feature extraction. The features so extracted were given to Neural Network as an input. Eight classes of structurally similar characters were created. Recognition of 92.33% and 91.81% was attained for handwritten Farsi characters and numerals  respectively. Mozaffari et al.13 compared the method of fractal codes and wavelet  transform. Though the wavelet transform proved to be 25 times faster than the fractal codes but there wasn't any much difference in the recognition rates. Husain et al.14 exploited various structural features of ligatures. Approximately 50000 words were extracted from these ligatures. The recognition rate of base ligatures was 93% and 98% for base ligatures, secondary strokes respectively.  Data Collection : First of all a two-page form was developed to get the input of Urdu numerals from a variety of people. A specimen of the form can be seen in figure-1 and figure-2. As it can be seen that the first page of the form was divided in to three portion and three rectangular boxes were drawn. The first page had some Urdu pre-printed words to take some personal information and guide the user to input the numerals. While the second page contained just two boxes.  The first and the medium size box showed both the English and their equivalent Urdu numerals to serve as an illustration and guide for the person filling the form. This was very essential in our case since it has been observed that most of the Urdu, Sindhi, Arabic and Persian and some other languages' numerals are almost the same and the remaining digits are similar in shapes and thus causing the confusion for bi-lingual or tri-lingual people at several occasions to distinguish between respective languages' numerals. The second box from top was for the input of National Identity Number to get some random numerals but few  people filled the small squares within it as it was either not applicable to them due to their age or they didn't felt secure to disclose such kind of information publicly. The third and the bottom most boxes were divided into ten columns and each column had a printed Urdu numeral typed within its top most boxes. Each column contained five (05) 
Research Journal of Recent Sciences 
______
Vol. 3(11), 98-102, November (2014)
 
 
International Science Congress Association
empty boxes beneath it to take five inputs from our writers. The 
two boxes of second page were equa
l in size and were similar to 
the third box of first page with the exception that no numerals 
were written on the top of both of the boxes. The writers were 
allowed to freely input their own numerals. Basically the data 
was collected from three groups of p
eople based on their 
gender, age, education and mother tongue. Input was taken from 
approximately 800 people related to the above mentioned three 
categories. After that the forms were scanned using a high 
resolution scanner to get png images of the 2
Nearly 1600 pages were scanned to serve as an input for off
Urdu numeral recognition process.  Figure-1 
Form Page 1
Pre-Processing: 
For our purposes proper pre
forms was the most crucial and time taking process. The 
numerals from the first page of the form were used for training 
whereas the second page numerals were used for testing. It 
involved following steps: Remov
ing noise 
easily removable, l
ocating the rectangular boxes, 
noise - 
large sized and often confused with the numerals, 
i
dentifying the numerals and isolating each numerals image 
from the forms, p
adding the numerals with appropriat
to preserve their orientations, n
ormalizing the numerals Images 
to 175 x 175 image size, i
dentifying, naming the first page 
numerals only and saving the numerals. Data Preparation: After the pre-
processing, we had two data 
sets of images. The fi
rst dataset was from the first page's 
numerals images each of size 175 x 175 pixels. Each such image 
had already been properly named according to its numeral. i.e. 
each image's naming convention is designed in such a way that 
it bears the numeral number in 
its first place. Then its position 
______
________________________________________
_______________
     
 
International Science Congress Association
 
empty boxes beneath it to take five inputs from our writers. The 
l in size and were similar to 
the third box of first page with the exception that no numerals 
were written on the top of both of the boxes. The writers were 
allowed to freely input their own numerals. Basically the data 
eople based on their 
gender, age, education and mother tongue. Input was taken from 
approximately 800 people related to the above mentioned three 
categories. After that the forms were scanned using a high 
resolution scanner to get png images of the 2
-page forms. 
Nearly 1600 pages were scanned to serve as an input for off
-line 
Form Page 1
 
For our purposes proper pre
-processing of the 
forms was the most crucial and time taking process. The 
numerals from the first page of the form were used for training 
whereas the second page numerals were used for testing. It 
ing noise 
- small sized and 
ocating the rectangular boxes, 
removing 
large sized and often confused with the numerals, 
dentifying the numerals and isolating each numerals image 
adding the numerals with appropriat
e margins 
ormalizing the numerals Images 
dentifying, naming the first page 
processing, we had two data 
rst dataset was from the first page's 
numerals images each of size 175 x 175 pixels. Each such image 
had already been properly named according to its numeral. i.e. 
each image's naming convention is designed in such a way that 
its first place. Then its position 
starting from the top-
left to the bottom
the box number and finally the form number to which each 
image belonged to. Each naming part separated by a dash, as 
shown in table-
1. This naming scheme w
had been used to measure the performance of our method. 
Figure
Form Page 2
Table
Naming System of Numerals Saved from First Page
7 39
 
Numeral Number 
Position
 
The second dataset was from the second page's numerals images 
each of size 175 x 175 pixels also. No such naming convention 
had been used for this dataset as in the case of the first page 
rather an arbitrary naming convention was used. Training set 
was pre
pared from the first dataset and Testing set from the 
second data set. Each image of size 175x175 pixels was 
vectored and converted into column image. Thus each image 
amounted to 30625 values in a column. Let us say that 
images were taken from th
respectively. As a result, we had the training and testing 
matrices of sizes 30625 x 

had used the label of  
for training set and 
Feature Extraction: 
Now the technique o
applied on matrix 
to reduce the matrix dimension. As a result, 
we get two matrices namely

the weight matrix while 
stands for the basis matrix of 
its decomposition. 
_______________
ISSN 2277-2502
            Res.J.Recent Sci
           
100
 
left to the bottom
-right of each box. Then 
the box number and finally the form number to which each 
image belonged to. Each naming part separated by a dash, as 
1. This naming scheme w
as necessary since it 
had been used to measure the performance of our method. 
   
Figure
-2 
Form Page 2
  
Table
-1 
Naming System of Numerals Saved from First Page
 
 
1 1258 
Position
Box Number Form Number 
The second dataset was from the second page's numerals images 
each of size 175 x 175 pixels also. No such naming convention 
had been used for this dataset as in the case of the first page 
rather an arbitrary naming convention was used. Training set 
pared from the first dataset and Testing set from the 
second data set. Each image of size 175x175 pixels was 
vectored and converted into column image. Thus each image 
amounted to 30625 values in a column. Let us say that 
 and 
images were taken from th
e training and testing sets 
respectively. As a result, we had the training and testing 

 and 30625 x  respectively. We 
for training set and 
 for testing set.  
Now the technique o
f NMF is to be 
to reduce the matrix dimension. As a result, 

 and . Whereas  stands for 
stands for the basis matrix of 
 after 
Research Journal of Recent Sciences _____________________________________________________________ ISSN 2277-2502Vol. 3(11), 98-102, November (2014)                   Res.J.Recent Sci International Science Congress Association            
101
 



	
\n






(4) Where  columns of  show transformations of matrix 's data while the  rows of  show the coefficients of the linear combinations of  data in  which have resulted in the form of transformed data in .  The rank of approximation to be obtained is termed by. It also specifies the desired level of decomposition and the number of non-negative factors.  This value is adjusted according to the required level of accuracy and satisfaction. The higher the value of, the more close the low rank approximation. But this value does not exceed or even gets closer to the rank of A If 's value is set very close to the rank of  then NMF is of no use here. In this case, the purpose of such decomposition dies.  has to be less than the rank of  to provide the condensed approximation of  and the product  provides such condensation.  We had applied the function of NMF onto the matrix  to extract, the weight matrix and  , the basis matrix using multiplicative update rule and alternating least-squares algorithm. A large number of experiments were conducted using different sample sizes and different  values for the proper approximation of the matrix  and finding the most suitable�s value. It was found by trial and error that the most suitable value for  is 25 and it should be used for further investigations. After calculating the low approximation weight matrix  and basis  matrix of the matrix , we had used the weight matrix  to extract the basis matrix  of matrix  by dividing the matrix  by the matrix . This is done through left division of  by  such that:  








(5) Results obtained at   are summarized in the table-2 Table-2 Recognition Rates of Different Sized DatasetsTesting Data Set Size Training Data Set Size 
* 100 300 500 
100 67% 83% 86% 
300 73.67% 80.67% 84% 
500 73.8% 80.4% 85.8% 
We first trained our algorithm on 100 and tested the results on 100, 300 and 500 images respective. Then the same procedure was repeated for 300 and 500 images which produced the table-2. Classification: We had used L-Norm on the decomposed matrices   and  to classify the images. Those converted, reduced and approximated images in  which had the minimum difference with images in  were matched and classified according to their naming convention used in the table.1. This table mentioned the naming system for saving the training images in   matrix. i.e. the matrix   (and thus the matrix  ) only contained the pre-classified images according to the numerals to which those images belonged to. Figure-3 showed the results, in each of these figures the first image is from the database in the matrix. While the second adjoining figure besides it is the matched result from the second matrix . The variation covered by the algorithm can be evidently seen from these figures.  Figure-3 Results Conclusion Several techniques15 including neuro-cognitive and probabilistic pattern recognition techniques16 have been used for handwritten numeral recognition up till now but NMF is used for the first time for Urdu handwritten characters. Most of the prevalent techniques show good recognition rate but these techniques are not computationally efficient. NMF, on the other hand, allows us to do efficient computations due to its ability to reduce the matrix to a lower dimension approximation. Recognition rate of around 86% has been achieved through this technique.  References 1.Sagheer M.W., He C.L., Nobile N. and Suen C.Y., Holistic urdu handwritten word recognition using support vector machine, Int. Conf. on Pattern Recognition 
Research Journal of Recent Sciences _____________________________________________________________ ISSN 2277-2502Vol. 3(11), 98-102, November (2014)                   Res.J.Recent Sci International Science Congress Association            
102
(ICPR), 1900 �1903 (2010)2.Das N, Mollah A.F., Saha S. and Haque S.S., Handwritten arabic numeral recognition using a multi layer perceptron, National Conf. on Recent Trends in Inf. Sys., 200-203 (2006)3.Razzak M.I., Hussain S.A., Sher M. and Khan Z.S., Combining offline and online preprocessing for online urdu character recognition, Int. MultiConf. of Engineers and Computer Scientists, , 18�20 (2009)4.Harifi A. and Aghagolzadeh A., A new pattern for handwritten persian/arabic digit recognition, Int. Conf. on Info. Tech. (ICIT2004), Istanbul, Turkey, (2004)5.Rajashekararadhya S.V. and Ranjan P.V., Efficient zone based feature extraction algorithm for handwritten numeral recognition of four popular south indian scripts, J. of Theoretical and Applied Info. Tech., 4(12), 1171�1181 (2008)6.Asthana S., Haneef F. and Bhujade R.K. , Handwritten multiscript numeral recognition using artificial neural networks, Int. J. of Soft Comput. & Engin., 1(1), 1�5 (2011)7.Sardar S. and Wahab A., Optical character recognition system for Urdu, Int. Conf. Info. and Emerg. Tech. (ICIET), 14-16 (2010)8.Pal U. and Sarkar A., Recognition of Printed Urdu Script, th Int. Conf. on Doc. Anal. and Recog. (ICDAR), , (2003)9.Akram M. and Hussain S., Word segmentation for urdu ocr system, th Workshop on Asian Language Resources, Beijing, China, 88-94 (2010)10.Alaei A., Pal U. and Nagabhushan P., A comparative study of persian/arabic handwritten character recognition, Int. Conf. on Frontiers in Hand-writing Recognition, (2012)11.Mozaffari S., Faez K. and Kanan H.R., Recognition of isolated handwritten farsi/arabic alphanumeric using fractal codes, Image Analysis and Interpre-tation, 6th IEEE Southwest Symposium, 104-108 (2004)12.Mowlaei A., Faez K. and Haghighat A.T., Feature extraction with wavelet trans-form for recognition of isolated handwritten farsi/arabic characters and numerals, 14th Int. Conf.Digital Signal Processing, , 923-926 (2002)13.Mozaffari S., Faez K. and Kanan H.R., Feature comparison between fractal codes and wavelet transform in handwritten alphanumeric recognition using svm classifier, 17th Int. Conf.Pattern Recognition, , 331-334 (2004)14.Husain S.A., Sajjad A. and Anwar F., Online urdu character recognition system, Conf. on Machine Vision Applications (IAPR MVA), Tokyo, Japan, 16-18 (2007)  
15.
Sharif M., Shah J.H. Mohsin S. and Raza M., Sub-holistic hidden markov model for face recognition, Res. J. of Rec. Sci., 2(5), 10�14 (2013)
16.
KhanY.D., Ahmad F. and Khan S.A., A Survey on use of Neuro-Cognitive and Probablistic Paradigms in Patterm Recognition, Res. J. of Recent Sci., 2(4), 74-79 (2013)
 
 
<end>