Mining Formulaic Sequences from a Spoken Japanese Corpus Based on Consolidated Contextualized N-gram Analyses and Its Verification with Key Phrases in Japanese Language Textbooks
Abstract: In recent years, the rapidly expanding e-learning environment has greatly benefited language education, making high-quality language learning available from anywhere in the world. However, even in the latest systems for Japanese language education, the content of learning materials is still selected by human professionals, without reference to statistical data on how frequently native speakers actually use each expression. Compared with the dramatic recent changes in the learning environment of the digital age, content selection remains conservative. We believe that automatically creating richer, more practical learning content from real data is important for future education systems. In this study, we extracted sequence patterns as formulaic sequences from a huge spoken Japanese corpus using consolidated contextualized n-gram analyses. We then compared the frequently occurring formulaic sequences with 334 key phrases from Japanese language textbooks to verify how well the two match. The results indicate that most key phrases in Japanese language textbooks can be extracted inductively as high-frequency formulaic sequences by our method, without manual work by human experts.
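To make the general idea concrete, the following is a minimal sketch of frequency-based n-gram extraction followed by a coverage check against a list of key phrases. It is not the consolidated contextualized n-gram analysis used in this study: it only counts plain n-grams over utterances that are assumed to be tokenized already. The function names (`count_ngrams`, `frequent_sequences`, `coverage`), the length range, the frequency threshold, and the toy data are all hypothetical choices made for illustration.

```python
from collections import Counter


def count_ngrams(utterances, n_min=2, n_max=4):
    """Count every n-gram of length n_min..n_max in pre-tokenized utterances."""
    counts = Counter()
    for tokens in utterances:
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def frequent_sequences(counts, min_count=2):
    """Keep n-grams seen at least min_count times (threshold is an assumption)."""
    return {ngram for ngram, c in counts.items() if c >= min_count}


def coverage(key_phrases, sequences):
    """Return the share of key phrases found among the sequences, plus the matches."""
    matched = [p for p in key_phrases if tuple(p) in sequences]
    rate = len(matched) / len(key_phrases) if key_phrases else 0.0
    return rate, matched


# Toy data, invented for illustration only (not taken from the study's corpus).
utterances = [
    ["よろしく", "お願い", "し", "ます"],
    ["よろしく", "お願い", "し", "ます"],
    ["ありがとう", "ござい", "ます"],
    ["ありがとう", "ござい", "まし", "た"],
]
key_phrases = [
    ["よろしく", "お願い", "し", "ます"],
    ["ありがとう", "ござい"],
    ["お", "元気", "です", "か"],
]

sequences = frequent_sequences(count_ngrams(utterances), min_count=2)
rate, matched = coverage(key_phrases, sequences)
print(f"{len(matched)} of {len(key_phrases)} key phrases matched ({rate:.0%})")
```

In this sketch a key phrase counts as matched only if it appears verbatim as a frequent n-gram. A realistic pipeline would also need a morphological analyzer for tokenization and some way of merging overlapping or nested n-grams, both of which are outside the scope of this illustration.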