Utilizing Pre-trained Language Models and Large Language Models for 10-K Items Segmentation
Event start date: 2026-05-29
Published: 2026-05-29
Updated: 2026-05-29

Source: Journal of Information Systems (Forthcoming)

Authors: Hsin-Min Lu, Yu-Tai Chien, Huan-Hsun Yen, Yen-Hsiu Chen

URL: https://doi.org/10.2308/ISYS-2025-005

Resources:
* Software Tools: https://github.com/hsinmin/itemseg
* Dataset: https://www.im.ntu.edu.tw/~lu/data/itemseg/itemseg10kdata.7z


Abstract:
Extracting specific items from 10-K filings is challenging because of variations in document formats and item presentation. This study aims to improve traditional rule-based approaches by introducing and comparing two advanced item segmentation methods: (1) GPT4ItemSeg, employing a novel line-ID-based prompting mechanism to utilize a large language model, ChatGPT-4o, for item segmentation, and (2) BERT4ItemSeg, combining a pre-trained language model, BERT, with a Bi-LSTM model in a hierarchical structure to overcome context window constraints. Trained and evaluated on 3,737 annotated 10-K reports, BERT4ItemSeg achieves a macro-F1 of 0.9825, surpassing GPT4ItemSeg (0.9567), conditional random field (0.9818), and rule-based methods (0.9048) for core items (1, 1A, 3, and 7). These approaches enhance item segmentation performance, improving text analytics in accounting and finance. BERT4ItemSeg offers satisfactory item segmentation performance, whereas GPT4ItemSeg can easily adapt to regulatory changes. Together, they provide an extensible framework for 10-K item segmentation that supports reliable and reproducible results.
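
The line-ID-based prompting idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the itemseg repository under Resources for that); the function names `build_line_id_prompt` and `segments_from_starts`, the prompt wording, the bracketed ID format, and the JSON reply format are all assumptions made for illustration. The model reply is a hard-coded stand-in rather than a real API call.

```python
import json


def build_line_id_prompt(lines, items=("1", "1A", "3", "7")):
    """Prefix every line with a numeric ID and ask for each item's first line ID.

    Hypothetical prompt format, not the one used in the paper.
    """
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(lines))
    return (
        "Each line of the 10-K excerpt below starts with a line ID in brackets.\n"
        f"Return a JSON object mapping each of the items {', '.join(items)} "
        "to the ID of the line where that item begins, or null if it is absent.\n\n"
        + numbered
    )


def segments_from_starts(lines, starts):
    """Turn {item: start_line_id} into {item: text}.

    Each item runs from its start line up to the next item's start line
    (or to the end of the document for the last item).
    """
    present = sorted((idx, item) for item, idx in starts.items() if idx is not None)
    segments = {}
    for pos, (idx, item) in enumerate(present):
        end = present[pos + 1][0] if pos + 1 < len(present) else len(lines)
        segments[item] = "\n".join(lines[idx:end])
    return segments


# Toy document; real 10-K filings are far longer and messier.
lines = [
    "Item 1. Business",
    "We make widgets.",
    "Item 1A. Risk Factors",
    "Demand may fall.",
    "Item 3. Legal Proceedings",
    "None.",
]
prompt = build_line_id_prompt(lines)

# Stand-in for a model reply; a real call to an LLM API would go here.
reply = '{"1": 0, "1A": 2, "3": 4, "7": null}'
segments = segments_from_starts(lines, json.loads(reply))
print(segments["1A"])
```

Asking the model to return only line IDs, rather than the item text itself, keeps the reply short and lets the original text be recovered verbatim from the source lines.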