Introduction

The whiteboard is one of the most commonly used communication tools in education, presentations and meetings. It provides a large writing space on which the user can write in an unconstrained manner. This makes video capture of the whiteboard a popular input mechanism for handwriting recognition, compared to using a touch-sensitive SMART Board or a special pen. This research aims to implement a handwritten mathematical content recognizer for classroom videos. We focus on extracting text from the classroom video and on recognizing the characters and the structure of the mathematical content.

System Overview

The system is implemented in four stages: text extraction, character segmentation, character recognition and structure analysis. We have already completed the text extraction and character segmentation stages. We are currently implementing the character recognition stage, for which we intend to use an audio-video based approach that uses speech information to resolve ambiguities in the video-based character recognition. Finally, we will need to implement the structure analysis stage, which is specific to the field of mathematical content recognition.

Input

The input is a video of the whiteboard recorded in a Georgia Tech classroom by the Distance Learning and Professional Education (DLPE) division and the Digital Media Lab (DML) of Georgia Tech. The resolution of the video is 720 by 480 pixels and the video is in NTSC format, i.e. 29.97 frames/sec. The video was recorded for our research on handwritten equation recognition, so most of the content consists of mathematical equations, with a few titles and comments as text.

The main challenge in implementing a video-based recognition system for a whiteboard is that the pen trajectory cannot be obtained at every instant, because the user may occlude what they are writing. It is therefore necessary to use an incremental offline recognition approach, as implemented in this paper.

Text Extraction

The whiteboard images (frames) or the video can be divided into 3 regions:
1. Background Whiteboard
2. Text Region
3. Foreground Objects e.g. the user

First, the color image is converted into a gray-scale image. Now, the whiteboard appears to be light gray and the text regions have black strokes or dark gray strokes (the color markers are usually dark shades of green, blue or red). The foreground objects can be in different shades of gray.

The gray-scale image is divided into blocks of 20x20 pixels. To perform text extraction, the following features are calculated for each block:

1. Number of edge pixels in the block (using a Canny edge detector)
2. Average pixel intensity of the block
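The two block features can be sketched as follows. This is a minimal illustration, not the implementation used in this work: to stay self-contained it replaces the Canny detector with a plain gradient-magnitude threshold, and the names edge_map, block_features and grad_thresh, as well as the threshold value, are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Gray = List[List[int]]  # rows of 0-255 intensities


def edge_map(gray: Gray, grad_thresh: int = 60) -> List[List[bool]]:
    """Mark pixels whose gradient magnitude is large.

    Stand-in for the Canny detector: central differences and one threshold.
    Border pixels are never marked.
    """
    h, w = len(gray), len(gray[0])
    edges = [[False] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]
            gy = gray[y + 1][x] - gray[y - 1][x]
            edges[y][x] = abs(gx) + abs(gy) >= grad_thresh
    return edges


def block_features(gray: Gray, block: int = 20, grad_thresh: int = 60
                   ) -> Dict[Tuple[int, int], Tuple[int, float]]:
    """Return {(block_row, block_col): (edge_pixel_count, mean_intensity)}."""
    h, w = len(gray), len(gray[0])
    edges = edge_map(gray, grad_thresh)
    feats = {}
    for by in range(0, h, block):
        for bx in range(0, w, block):
            ys = range(by, min(by + block, h))  # clip blocks at the border
            xs = range(bx, min(bx + block, w))
            n = len(ys) * len(xs)
            edge_count = sum(edges[y][x] for y in ys for x in xs)
            mean = sum(gray[y][x] for y in ys for x in xs) / n
            feats[(by // block, bx // block)] = (edge_count, mean)
    return feats
```

A block that contains a dark stroke on the light whiteboard gets a non-zero edge count and a slightly reduced mean, while an empty whiteboard block has no edge pixels.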

Using features 1 and 2, the text regions of every frame are extracted. These regions also include some portions of the foreground objects in that frame. To get the best results, we therefore combine the text extracted from past and future frames with that of the current frame, so that noise due to the foreground or background can be removed.
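One way to picture this combination is a per-block vote over a window of frames: written text stays in place, whereas the instructor moves. The decision rule, the thresholds min_edges and min_mean, and the voting scheme below are all assumptions made for illustration; the text above does not specify how the frames are combined.

```python
from typing import Dict, List, Set, Tuple

Block = Tuple[int, int]                    # (block_row, block_col)
Features = Dict[Block, Tuple[int, float]]  # (edge_count, mean_intensity)


def text_blocks(feats: Features, min_edges: int = 10,
                min_mean: float = 120.0) -> Set[Block]:
    """Blocks with enough edge pixels on a bright background count as text."""
    return {b for b, (edges, mean) in feats.items()
            if edges >= min_edges and mean >= min_mean}


def temporal_vote(masks: List[Set[Block]], min_votes: int) -> Set[Block]:
    """Keep a block only if it is text in at least min_votes of the frames."""
    counts: Dict[Block, int] = {}
    for mask in masks:
        for b in mask:
            counts[b] = counts.get(b, 0) + 1
    return {b for b, c in counts.items() if c >= min_votes}
```

Here masks would hold the text blocks of the past, current and future frames in the window, so a block covered by a moving foreground object in only a few frames is voted out.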

Character Segmentation

The text-extracted image needs to be binarized, i.e. converted into a black and white image, to make the segmentation step easier. We binarize such that the text pixels are white. Some noise pixels remain, due to the background (the whiteboard) and also the pen.

We have tried different binarization techniques. Niblack's Binarization is one of the simplest techniques; it calculates the threshold as a linear combination of the mean and standard deviation of the pixel intensities of the frame. Another technique is Non-Linear Niblack's Binarization, which is computationally intensive because its local threshold computation is more complex. The non-linear technique works well for scenes that contain many objects and a lot of detail. It is overkill for our video, which is binarized quite well by the simple Niblack's Binarization technique. The results of Niblack's Binarization and Non-Linear Niblack's Binarization are shown below.

 Before Binarization 

 After Niblack's Binarization 

 After Non-Linear Niblack's Binarization 
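The simple Niblack threshold described above, T = mean + k * std, can be sketched in a few lines. The weight k = -0.2 is an assumed value chosen for dark text on a light board, and the function names are illustrative. The statistics are taken over the whole frame, as in the description above; Niblack's method is more commonly evaluated over a local window around each pixel, which the same niblack_threshold function supports if it is given the window's pixels instead.

```python
from statistics import fmean, pstdev
from typing import List


def niblack_threshold(values: List[int], k: float = -0.2) -> float:
    """Threshold T = mean + k * std of the given intensities."""
    return fmean(values) + k * pstdev(values)


def binarize(gray: List[List[int]], k: float = -0.2) -> List[List[int]]:
    """Return a 0/1 image in which dark (text) pixels become 1 (white)."""
    flat = [v for row in gray for v in row]
    t = niblack_threshold(flat, k)
    return [[1 if v < t else 0 for v in row] for row in gray]
```

The output follows the convention used here that text pixels are white (1) and everything else is black (0).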

Text segmentation is done by assuming that characters do not touch, so that every character is a connected component (CC). A group of touching white pixels is taken to be a character provided that the number of pixels exceeds a threshold. The threshold can be a local threshold or a single value for a frame of the video. Very large connected components (the edge of the whiteboard or the instructor) and very small ones (noise) are discarded at the end of this step.
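The segmentation step can be sketched as a breadth-first flood fill over the binarized image followed by a size filter. The use of 8-connectivity and the parameter names min_size and max_size are assumptions for this example; the size limits would have to be tuned to the video.

```python
from collections import deque
from typing import List, Tuple

Pixel = Tuple[int, int]  # (row, col)


def connected_components(binary: List[List[int]], min_size: int,
                         max_size: int) -> List[List[Pixel]]:
    """Return 8-connected groups of 1-pixels with min_size <= size <= max_size."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if not binary[r][c] or seen[r][c]:
                continue
            # Start a new component and grow it with a breadth-first search.
            seen[r][c] = True
            queue, comp = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Discard noise (too small) and board edges or people (too large).
            if min_size <= len(comp) <= max_size:
                components.append(comp)
    return components
```

Each returned component is a list of pixel coordinates, from which a bounding box per character can be derived.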

The results of the text extraction and text segmentation stages can be seen below.

 Text Extraction and Character Segmentation 

Skew Correction

From the vertical density histogram of the text (as shown in the figure below), the region corresponding to the descenders is temporarily clipped off. Then the lowest remaining white pixel (white represents text in our application) is found for every vertical scan line. The line of best fit through these points is the lower baseline. The fit is refined iteratively: the points near the current line of best fit are selected and the line is recomputed for these points. This process is repeated 5 times in our case.
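The iterative baseline fit can be sketched as below. The sketch assumes that the descender region has already been clipped, that image rows grow downward (so the lowest pixel has the largest row index), and that "nearby" means within tol pixels of the current line; tol and all function names are illustrative choices, not values from this work.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) with y growing downward


def fit_line(points: List[Point]) -> Tuple[float, float]:
    """Least-squares fit y = slope * x + intercept."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx


def lowest_text_pixels(binary: List[List[int]]) -> List[Point]:
    """For each column, the (x, y) of the lowest white pixel, if any."""
    h, w = len(binary), len(binary[0])
    pts = []
    for x in range(w):
        for y in range(h - 1, -1, -1):
            if binary[y][x]:
                pts.append((float(x), float(y)))
                break
    return pts


def estimate_baseline(points: List[Point], tol: float = 2.0,
                      iterations: int = 5) -> Tuple[float, float]:
    """Refit the line several times, each time keeping only nearby points."""
    slope, intercept = fit_line(points)
    for _ in range(iterations):
        near = [(x, y) for x, y in points
                if abs(y - (slope * x + intercept)) <= tol]
        if len(near) < 2:
            break
        slope, intercept = fit_line(near)
    return slope, intercept


def skew_angle(slope: float) -> float:
    """Angle alpha, in radians, between the baseline and the horizontal axis."""
    return math.atan(slope)
```

A typical call chain would be lowest_text_pixels on the clipped binary image, then estimate_baseline on the resulting points, then skew_angle on the fitted slope.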

After baseline estimation, the angle of skew (alpha) between the baseline and the horizontal axis is calculated. Skew correction is then done by rotating every point (x, y) by the angle -alpha to get the new point (x', y'). The rotation matrix is given in the equation below.
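For reference, a rotation by -alpha can be written in the standard form below. This is a reconstruction from the description above, assuming alpha is measured from the horizontal axis towards the baseline in the same coordinate system in which the points (x, y) are expressed.

```latex
\begin{pmatrix} x' \\ y' \end{pmatrix}
=
\begin{pmatrix}
\cos\alpha & \sin\alpha \\
-\sin\alpha & \cos\alpha
\end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
```

As a check, the baseline direction (1, tan alpha) maps to (1 / cos alpha, 0), so the rotated baseline is horizontal.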

The results of skew correction by baseline estimation are shown below.

 Skew Correction 

Character Recognition

Our current approach to the character recognition problem uses both audio and video to recognize the mathematical content written on the whiteboard. When there is ambiguity in the recognition of a character written on the whiteboard, we will perform Hidden Markov Model (HMM) based speech recognition over a short window of speech to resolve it. We are working on a speech recognizer that recognizes mathematical words, i.e. numbers, letters, math operators and Greek symbols, from a dictionary of a few hundred words. We are also exploring alternative techniques such as word-spotting to overcome some of the problems that the HMM-based speech recognizer has when the grammar of the input does not match that of the recognizer.

The bigger challenge is to formulate a probabilistic technique in which the location of the word in the speech segment, together with the words before and after it, helps to synchronize the audio and video and thereby enables the integration of speech information with the video. Conflicts between the video and speech recognizers would be resolved by using a combination of their confidence measures. If the errors in recognizing a symbol written on the whiteboard are independent of the errors in speech recognition of the same symbol, using both audio and video should increase the recognition accuracy.
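One simple way to combine the two confidence measures, shown here only as an illustration, is to multiply the per-symbol confidences of the two recognizers and renormalize, which corresponds to treating their errors as independent. The product rule, the floor value for symbols that one recognizer did not propose, and the function names are assumptions; the combination rule has not been fixed in this work.

```python
from typing import Dict


def fuse(video: Dict[str, float], audio: Dict[str, float],
         floor: float = 1e-3) -> Dict[str, float]:
    """Combine two per-symbol confidence tables by product and renormalize.

    A symbol missing from one table gets a small floor value, so a single
    recognizer cannot rule a symbol out completely.
    """
    symbols = set(video) | set(audio)
    scores = {s: video.get(s, floor) * audio.get(s, floor) for s in symbols}
    total = sum(scores.values())
    return {s: v / total for s, v in scores.items()}


def best_symbol(video: Dict[str, float], audio: Dict[str, float]) -> str:
    """Symbol with the highest fused confidence."""
    fused = fuse(video, audio)
    return max(fused, key=fused.get)
```

For example, if the video recognizer slightly prefers "a" over "alpha" (0.5 against 0.45) while the speech recognizer clearly hears "alpha" (0.7 against 0.2), the fused decision is "alpha".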

Future Work

On successful completion of the audio-video based character recognition stage, we will move on to the structure analysis stage, which makes the problem more challenging than regular text recognition. This analysis of the structure of equations would make use of the logical and spatial relationships between recognized symbols and, once again, we intend to use speech information to resolve ambiguity. The audio-video based approach would also be useful for training the system for new users, as errors in the recognition of writing can be corrected by the speech and errors in the recognition of speech can be corrected by the writing.

Skew correction was found to work well for reasonable skew angles. In case the skew is very high, say about 45 degrees, there might be problems. Also, in the case of equations, skew should not be confused with superscripts. Differentiating between subscripts, superscripts and skew in very complicated equations is a very interesting problem.

We are also interested in implementing a complete system that would detect the parts of the classroom video with mathematical content, recognize that content and also recognize the related speech, making it much easier to search for information in classroom videos. This research could be extended to a system that separates the contents of the whiteboard into math, text, tables, graphs and other diagrams.

References

[1] M. Wienecke, G. A. Fink and G. Sagerer. Towards automatic video-based whiteboard reading. In Proc. Int. Conf. on Document Analysis and Recognition, 2003.

[2] A. W. Senior and A. J. Robinson. An off-line cursive handwriting recognition system. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(3):309–321, 1998.

[3] P. Slavik and V. Govindaraju. Equivalence of Different Methods for Slant and Skew Corrections in Word Recognition Applications. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(3):323–326, 2001.

[4] W. Niblack. An Introduction to Digital Image Processing. Prentice Hall, 1986.