National Taiwan University, Information Management

Announcements

Utilizing Pre-trained Language Models and Large Language Models for 10-K Items Segmentation

Activity day: 2026-05-29

Published at: 2026-05-29

Updated: 2026-06-04

Source: Journal of Information Systems (Forthcoming)

Authors: Hsin-Min Lu, Yu-Tai Chien, Huan-Hsun Yen, Yen-Hsiu Chen

URL: https://doi.org/10.2308/ISYS-2025-005

Resources:
* Software Tools: https://github.com/hsinmin/itemseg
* Dataset: https://www.im.ntu.edu.tw/~lu/data/itemseg/itemseg10kdata.7z

Abstract:
Extracting specific items from 10-K filings is challenging because of variations in document formats and item presentation. This study aims to improve traditional rule-based approaches by introducing and comparing two advanced item segmentation methods: (1) GPT4ItemSeg, employing a novel line-ID-based prompting mechanism to utilize a large language model, ChatGPT-4o, for item segmentation, and (2) BERT4ItemSeg, combining a pre-trained language model, BERT, with a Bi-LSTM model in a hierarchical structure to overcome context window constraints. Trained and evaluated on 3,737 annotated 10-K reports, BERT4ItemSeg achieves a macro-F1 of 0.9825, surpassing GPT4ItemSeg (0.9567), conditional random field (0.9818), and rule-based methods (0.9048) for core items (1, 1A, 3, and 7). These approaches enhance item segmentation performance, improving text analytics in accounting and finance. BERT4ItemSeg offers satisfactory item segmentation performance, whereas GPT4ItemSeg can easily adapt to regulatory changes. Together, they provide an extensible framework for 10-K item segmentation that supports reliable and reproducible results.
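The abstract mentions a line-ID-based prompting mechanism but does not spell it out here. The following is a minimal sketch of one way such a scheme could work, not the authors' implementation and not the API of the itemseg tool: each line of a filing is prefixed with an ID, a language model is asked to return only the ID of the first line of each item, and the document is then cut at those IDs. The function names, the prompt wording, the JSON reply format, and the toy filing are all assumptions made for illustration, and the model call is replaced by a hard-coded mock reply.

```python
import json


def number_lines(text):
    """Split text into lines and build a copy where each line carries an ID."""
    lines = text.splitlines()
    numbered = "\n".join(f"[{i}] {line}" for i, line in enumerate(lines))
    return lines, numbered


def build_prompt(numbered_text, items=("1", "1A", "3", "7")):
    """Assemble a prompt asking for start line IDs only (hypothetical wording)."""
    wanted = ", ".join(items)
    return (
        "Each line of the 10-K excerpt below starts with a line ID in brackets.\n"
        f"For each of Items {wanted} that is present, return the ID of its first line\n"
        'as a JSON object, for example {"1": 12, "1A": 40}.\n\n'
        + numbered_text
    )


def segment_by_line_ids(lines, starts):
    """Cut the document at the given start IDs.

    starts maps an item label to the ID of its first line; each item is
    assumed to run until the next reported start (or the end of the text).
    """
    ordered = sorted(starts.items(), key=lambda pair: pair[1])
    segments = {}
    for idx, (item, start) in enumerate(ordered):
        end = ordered[idx + 1][1] if idx + 1 < len(ordered) else len(lines)
        segments[item] = "\n".join(lines[start:end])
    return segments


# Toy filing text (invented for this example).
sample = (
    "PART I\n"
    "Item 1. Business\n"
    "We make widgets.\n"
    "Item 1A. Risk Factors\n"
    "Demand may fall.\n"
    "Item 3. Legal Proceedings\n"
    "None."
)

lines, numbered = number_lines(sample)
prompt = build_prompt(numbered)

# Stand-in for a model reply; a real system would send `prompt` to an LLM.
mock_reply = '{"1": 1, "1A": 3, "3": 5}'
segments = segment_by_line_ids(lines, json.loads(mock_reply))
```

Under these assumptions the model only has to emit a handful of integers rather than reproduce long passages of the filing, which keeps the reply short and makes the cut points easy to validate against the original lines.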

Department and Graduate Institute of Information Management, National Taiwan University