Date: Friday, March 26, 2021, 2:00-3:30 PM
Preparing 10-K Report for Empirical Studies Using Sequence Labeling Models: A Proposed Framework
10-K reports contain lengthy descriptions of companies’ financial activities during a given fiscal year. A sizeable number of studies leverage the text in 10-K reports, and many of them adopt keyword-based (i.e., dictionary) approaches to extract the information they need. Despite the popularity of keyword-based approaches, certain important text-based measures require more structured and well-formatted text. One such example is readability. The Fog index, a readability measure, is a weighted combination of average sentence length and the proportion of complex words. To compute readability reliably, the input documents need to be well-formatted text: headlines, tables, and dangling text chunks add noise and potentially bias the results. To address this issue, we propose two natural language processing (NLP) tasks, 10-K segmentation and prettifying, to prepare 10-K reports for reliable and reproducible empirical results. Our preliminary designs are based on sequence labeling models that can generate high-quality outcomes. This talk will also review existing sequence labeling models and present partial results of the proposed NLP tasks.
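As a rough illustration of why formatting matters for readability, the sketch below computes the standard Gunning Fog index, 0.4 × (average words per sentence + percentage of complex words). It is not the speakers' implementation: the function names `count_syllables` and `fog_index`, the vowel-group syllable heuristic, and the punctuation-based sentence splitter are all assumptions made for illustration.

```python
import re


def count_syllables(word: str) -> int:
    """Crude syllable estimate (assumption): count runs of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fog_index(text: str) -> float:
    """Gunning Fog index: 0.4 * (words per sentence + percent complex words)."""
    # Naive sentence split on terminal punctuation (assumption, not a real tokenizer).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Only alphabetic tokens count as words, so bare numbers are ignored.
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    # A word is treated as complex when the heuristic finds three or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_length = len(words) / len(sentences)
    pct_complex = 100.0 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_length + pct_complex)


clean = "The cat sat. The dog ran."
# The same text followed by a dangling table header, as might survive in a raw filing.
noisy = clean + " Total Revenue Net Income 2020 2019"

print(fog_index(clean))
print(fog_index(noisy))
```

Under these assumptions, the dangling table header is counted as one extra sentence made up largely of words the heuristic marks as complex, so the score for the noisy text comes out higher than for the clean text. This is the kind of bias that segmentation and prettifying are meant to remove.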