|
Activity day: 2026-05-29
Published At: 2026-05-29
2026-05-29 updated
Utilizing Pre-trained Language Models and Large Language Models for 10-K Items Segmentation
Source: Journal of Information Systems (Forthcoming)
Authors: Hsin-Min Lu, Yu-Tai Chien, Huan-Hsun Yen, Yen-Hsiu Chen
URL: https://doi.org/10.2308/ISYS-2025-005
Resources:
* Software Tools: https://github.com/hsinmin/itemseg
* Dataset: https://www.im.ntu.edu.tw/~lu/data/itemseg/itemseg10kdata.7z
Abstract: Extracting specific items from 10-K filings is challenging because of variations in document formats and item presentation. This study aims to improve traditional rule-based approaches by introducing and comparing two advanced item segmentation methods: (1) GPT4ItemSeg, employing a novel line-ID-based prompting mechanism to utilize a large language model, ChatGPT-4o, for item segmentation, and (2) BERT4ItemSeg, combining a pre-trained language model, BERT, with a Bi-LSTM model in a hierarchical structure to overcome context window constraints. Trained and evaluated on 3,737 annotated 10-K reports, BERT4ItemSeg achieves a macro-F1 of 0.9825, surpassing GPT4ItemSeg (0.9567), conditional random field (0.9818), and rule-based methods (0.9048) for core items (1, 1A, 3, and 7). These approaches enhance item segmentation performance, improving text analytics in accounting and finance. BERT4ItemSeg offers satisfactory item segmentation performance, whereas GPT4ItemSeg can easily adapt to regulatory changes. Together, they provide an extensible framework for 10-K item segmentation that supports reliable and reproducible results. |
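The abstract names a line-ID-based prompting mechanism without spelling it out. The sketch below illustrates one way such a scheme could work, and it is not the authors' implementation: each line of a filing is tagged with an ID, the model is asked to return only the ID where each item starts, and those IDs are expanded into per-line labels. The prompt wording, the `Item <name>: <line ID>` reply format, and the helper names (`number_lines`, `build_prompt`, `parse_response`, `label_lines`) are all assumptions made for illustration, and a mocked reply stands in for a real model call.

```python
import re

def number_lines(lines):
    """Prefix each line with a bracketed ID (hypothetical format) so a model can cite lines by number."""
    return [f"[{i}] {text}" for i, text in enumerate(lines)]

def build_prompt(lines):
    # Hypothetical prompt wording; the paper's actual prompt is not reproduced here.
    header = (
        "Each line of the 10-K excerpt below starts with a line ID in brackets.\n"
        "For every item heading, answer with one line of the form "
        "'Item <name>: <line ID>'.\n\n"
    )
    return header + "\n".join(number_lines(lines))

def parse_response(response):
    """Extract (item, start line ID) pairs from the assumed reply format."""
    pattern = re.compile(r"Item\s+(\d+[A-Z]?)\s*:\s*(\d+)", re.IGNORECASE)
    return [(m.group(1).upper(), int(m.group(2))) for m in pattern.finditer(response)]

def label_lines(n_lines, starts):
    """Label each line with the item that most recently started; 'O' marks lines before any item."""
    labels = ["O"] * n_lines
    ordered = sorted(starts, key=lambda s: s[1])
    for idx, (item, start) in enumerate(ordered):
        end = ordered[idx + 1][1] if idx + 1 < len(ordered) else n_lines
        for i in range(max(start, 0), min(end, n_lines)):
            labels[i] = item
    return labels

# Toy excerpt (invented for this example, not taken from the dataset).
doc = [
    "ACME CORP ANNUAL REPORT",
    "Item 1. Business",
    "We make widgets.",
    "Item 1A. Risk Factors",
    "Demand may fall.",
]
mock_response = "Item 1: 1\nItem 1A: 3"  # stand-in for a model reply; no API is called
labels = label_lines(len(doc), parse_response(mock_response))
print(labels)  # ['O', '1', '1', '1A', '1A']
```

A plausible appeal of this design is that the model returns a handful of line IDs instead of reproducing long passages of the filing, which keeps replies short and makes the item boundaries easy to map back onto the original text.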