Saturday, August 30, 2014

Introduction to Semi-supervised Learning for Computational Linguistics

Creating sufficient labeled data can be very time-consuming. Obtaining the output sequences is not difficult: English texts are available in great quantity. What is time-consuming is pairing those texts with inputs, that is, obtaining speech recordings aligned with their transcriptions.

Subsequent work in computational linguistics led to the development of alternative algorithms for semisupervised learning, the Yarowsky algorithm being a prominent example. These algorithms were developed specifically for the sorts of problems that arise frequently in computational linguistics: problems in which there is a linguistically correct answer, and large amounts of unlabeled data, but very little labeled data. Unlike in the example of acoustic modeling, classic unsupervised learning is inappropriate, because not just any way of assigning classes will do. The learning method is largely unsupervised, because most of the data is unlabeled, but the labeled data is indispensable, because it provides the only characterization of the linguistically correct classes.

The algorithms just mentioned turn out to be very similar to an older learning method known as self-training, which was unfamiliar in computational linguistics at the time. For this reason, it is more accurate to say that they were rediscovered, rather than invented, by computational linguists. Until very recently, most of the prior work on semisupervised learning was little known even among researchers in machine learning. One goal of the present volume is to make the prior and the more recent work on semisupervised learning more accessible to computational linguists.

Shortly after the rediscovery of self-training in computational linguistics, a method called co-training was invented by Blum and Mitchell, machine-learning researchers working on text classification. Self-training and co-training have become popular and are widely employed in computational linguistics; together they account for all but a fraction of the work on semisupervised learning in the field. We will discuss them in the next chapter. In the remainder of this chapter, we give a broader perspective on semisupervised learning and lay out the plan of the rest of the book.
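
To make the self-training idea concrete, the following is a minimal sketch in Python. It is a generic illustration under stated assumptions (a logistic regression base classifier from scikit-learn, a fixed confidence threshold, the illustrative function name self_train), not a faithful rendering of Yarowsky's algorithm or of Blum and Mitchell's co-training.

    # Minimal self-training sketch: train on the labeled seed, label the
    # unlabeled examples the classifier is most confident about, add them
    # to the training set, and repeat. A generic illustration only.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def self_train(X_labeled, y_labeled, X_unlabeled,
                   threshold=0.95, max_rounds=10):
        X_l, y_l, X_u = X_labeled, y_labeled, X_unlabeled
        clf = LogisticRegression(max_iter=1000)
        for _ in range(max_rounds):
            clf.fit(X_l, y_l)                       # train on current labeled set
            if len(X_u) == 0:
                break
            probs = clf.predict_proba(X_u)          # score the unlabeled pool
            confident = probs.max(axis=1) >= threshold
            if not confident.any():                 # nothing the model is sure about
                break
            new_y = clf.classes_[probs[confident].argmax(axis=1)]
            X_l = np.vstack([X_l, X_u[confident]])  # grow the labeled set
            y_l = np.concatenate([y_l, new_y])
            X_u = X_u[~confident]                   # shrink the unlabeled pool
        return clf

The confidence threshold is the central design choice in such a loop: set it too low and early mistakes propagate into the training set; set it too high and the unlabeled data is never used. Yarowsky's algorithm adds problem-specific constraints, such as one sense per discourse, on top of a loop of this general shape.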

Motivation for Semi-supervised Learning


For most learning tasks of interest, it is easy to obtain samples of unlabeled data. For many language learning tasks, for example, the World Wide Web can be seen as a large collection of unlabeled data. By contrast, in most cases, the only practical way to obtain labeled data is to have subject-matter experts manually annotate the data, an expensive and time-consuming process.

The great advantage of unsupervised learning, such as clustering, is that it requires no labeled training data. The disadvantage has already been mentioned: under the best of circumstances, one might hope that the learner would recover the correct clusters, but hardly that it could correctly label the clusters. In many cases, even the correct clusters are too much to hope for. To say it another way, unsupervised learning methods rarely perform well if evaluated by the same yardstick used for supervised learners. If we expect a clustering algorithm to predict the labels in a labeled test set, without the advantage of labeled training data, we are sure to be disappointed.

The advantage of supervised learning algorithms is that they do well at the harder task: predicting the true labels for test data. The disadvantage is that they only do well if they are given enough labeled training data, but producing sufficient quantities of labeled data can be very expensive in manual effort. The aim of semisupervised learning is to have our cake and eat it, too. Semisupervised learners take as input unlabeled data and a limited source of label information, and, if successful, achieve performance comparable to that of supervised learners at significantly reduced cost in manual production of training data.

We intentionally used the vague phrase “a limited source of label information.” One source of label information is obviously labeled data, but there are alternatives. We will consider at least the following sources of label information:
  • labeled data
  • a seed classifier (see the sketch after this list)
  • limiting the possible labels for instances without determining a unique label
  • constraining pairs of instances to have the same, but unknown, label (co-training)
  • intrinsic label definitions
  • a budget for labeling instances selected by the learner (active learning)
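
A seed classifier, the second source above, can be as simple as a few hand-written rules that label only the instances they are sure about and abstain elsewhere. The following sketch is hypothetical: the words and senses are illustrative inventions, not Yarowsky's actual seeds. A classifier of this kind can stand in for labeled data as the starting point of the self-training loop sketched earlier.

    # Hypothetical seed classifier for a word sense disambiguation task:
    # a couple of hand-written rules; None means "abstain, no label information".
    def seed_classify(context_words):
        context = set(context_words)
        if context & {"river", "shore", "water"}:
            return "bank/GEOGRAPHY"
        if context & {"loan", "deposit", "interest"}:
            return "bank/FINANCE"
        return None

    # The seed labels a few unambiguous contexts and abstains otherwise.
    print(seed_classify(["walked", "along", "the", "river"]))    # bank/GEOGRAPHY
    print(seed_classify(["opened", "a", "savings", "deposit"]))  # bank/FINANCE
    print(seed_classify(["went", "to", "the", "bank"]))          # None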

One of the grand aims of computational linguistics is unsupervised learning of natural language. From a psychological perspective, it is widely accepted that explicit instruction plays little part in human language learning, and from a technological perspective, a completely autonomous system is more useful than one that requires manual guidance. Yet, in contradiction to the characterization sometimes given of the goal of unsupervised learning, the goal of unsupervised language learning is not the recovery of arbitrary “interesting” structure, but rather the acquisition of the correct target language. On the face of it, learning a target classification – much less an entire natural language – without labeled data hardly seems possible.

Semisupervised learning may provide the beginning of an account. If a kernel of labeled data can be acquired through unsupervised learning, semisupervised learning might be used to extend it to a complete solution. Something along these lines appears to characterize human language acquisition: in the psycholinguistic literature, bootstrapping refers to the process by which an initial kernel of language is acquired by explicit instruction, in the form, for example, of naming an object while drawing a child’s attention to it. The processes by which that kernel is extended to the entirety of the language are thought to be different; distributional regularities of linguistic forms, rather than direct connections to the physical world, seem to play a large role. Semisupervised learning methods provide possible characterizations of the process of extending the initial kernel.
