Name: tw-finance-159M
Creator: Huang Liang Hsun
License: https://choosealicense.com/licenses/cc-by-nc-sa-4.0/


You need to agree to share your contact information to access this dataset

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Dataset Card for tw-finance-159M

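This dataset collects Traditional Chinese text from Taiwanese finance, business, and industry news and feature articles. It totals about 159M (159 million) tokens and can serve as supplementary pretraining data for Traditional Chinese models in the Taiwanese financial and economic context.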

Dataset Details

Dataset Description

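The data comes from publicly available Traditional Chinese finance and industry reporting, covering:

Real estate and social housing policy
Industry developments (technology, food service, agriculture, energy, and others)
Personal finance and consumer issues
Business activities and corporate partnerships

Each sample uses text as the main body, accompanied by metadata such as token_count, word_count, url, and updated_at, to support later cleaning, deduplication, and time tracking.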
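The deduplication and time tracking mentioned above could look roughly like the following. This is a minimal sketch, not the curator's pipeline: it assumes url identifies an article and that updated_at begins with an ISO date (YYYY-MM-DD), as in the sample record on this card. The helper names dedupe_by_url and updated_since are hypothetical.

```python
from datetime import date


def dedupe_by_url(records):
    """Keep one record per URL, preferring the most recent `updated_at`."""
    latest = {}
    for rec in records:
        seen = latest.get(rec["url"])
        # ISO dates (YYYY-MM-DD) sort correctly when compared as strings.
        if seen is None or rec["updated_at"] > seen["updated_at"]:
            latest[rec["url"]] = rec
    return list(latest.values())


def updated_since(records, cutoff: date):
    """Return records whose `updated_at` date is on or after `cutoff`."""
    # Slicing to 10 characters tolerates a trailing time component (assumption).
    return [r for r in records
            if date.fromisoformat(r["updated_at"][:10]) >= cutoff]


# Toy records for illustration only; they are not taken from the dataset.
records = [
    {"text": "a", "url": "https://example.com/1", "updated_at": "2024-01-05"},
    {"text": "a (revised)", "url": "https://example.com/1", "updated_at": "2024-03-01"},
    {"text": "b", "url": "https://example.com/2", "updated_at": "2023-12-31"},
]
unique = dedupe_by_url(records)
recent = updated_since(unique, date(2024, 1, 1))
```

Deduplicating on url alone will miss syndicated copies of the same article under different URLs; hashing a normalized form of text is a common complement.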

Curated by: Huang Liang Hsun
Language(s) (NLP): Traditional Chinese
License: cc-by-nc-sa-4.0

Dataset Sources

Repository: lianghsun/tw-finance-159M
Paper: TBA
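Because the repository is gated, loading it programmatically requires an authenticated session whose account has accepted the access conditions. A minimal sketch using the Hugging Face datasets library follows; the split name "train" is an assumption that this card does not confirm, and load_tw_finance is a hypothetical helper name.

```python
REPO_ID = "lianghsun/tw-finance-159M"  # repository name as given on this card


def load_tw_finance(split: str = "train", streaming: bool = True):
    """Load the gated dataset.

    `split="train"` is an assumption; check the repository for actual splits.
    """
    # Imported lazily so this module can be read without `datasets` installed.
    from datasets import load_dataset

    # token=True reuses the credential saved by `huggingface-cli login`.
    # streaming=True iterates records without downloading every file first.
    return load_dataset(REPO_ID, split=split, streaming=streaming, token=True)


# Example (needs network access and an accepted access request):
#   first = next(iter(load_tw_finance()))
#   print(first["url"], first["token_count"])
```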

Uses

Direct Use

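Strengthen Traditional Chinese models' coverage of Taiwanese finance and economics topics, industry terminology, and policy issues.
Serve as a data source for downstream tasks such as finance chatbots, industry summarization, and news classification.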

Out-of-Scope Use

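Must not be used as investment advice; the data consists of news reports, which are not financial analysis.
Not suitable as a source of real-time information (stock prices, tax rates, policy terms); the data reflects conditions at the time of reporting.
Before commercial use, assess the copyright and licensing status of the original reports.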

Dataset Structure

{
  "text": "(sample Traditional Chinese finance news article)...",
  "token_count": 355,
  "word_count": 397,
  "url": "https://example.com/...",
  "updated_at": "2024-11-04"
}
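A record with this shape can be parsed into a typed object so that missing fields surface early. The sketch below assumes records are available as JSON strings, one object each; the on-disk format of the repository is not stated on this card. Article and parse_article are hypothetical names.

```python
import json
from dataclasses import dataclass
from datetime import date

REQUIRED_FIELDS = {"text", "token_count", "word_count", "url", "updated_at"}


@dataclass(frozen=True)
class Article:
    text: str
    token_count: int
    word_count: int
    url: str
    updated_at: date


def parse_article(line: str) -> Article:
    """Parse one JSON record and check the fields listed on this card."""
    raw = json.loads(line)
    missing = REQUIRED_FIELDS - raw.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return Article(
        text=raw["text"],
        token_count=int(raw["token_count"]),
        word_count=int(raw["word_count"]),
        url=raw["url"],
        # Assumes the value starts with an ISO date, as in the sample record.
        updated_at=date.fromisoformat(raw["updated_at"][:10]),
    )


sample = ('{"text": "...", "token_count": 355, "word_count": 397, '
          '"url": "https://example.com/...", "updated_at": "2024-11-04"}')
article = parse_article(sample)
```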

Dataset Creation

Curation Rationale

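In general-purpose Traditional Chinese corpora, the share of finance text varies, and many technical terms and policy institutions are highly localized (for example, social housing, the National Health Insurance supplementary premium, and ETF dividend distributions). This dataset concentrates collection on this domain to strengthen model coverage of the Taiwanese financial context.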

Source Data

Data Collection and Processing

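Collect publicly available Traditional Chinese finance and industry reports.
Remove layout noise and advertisement blocks while preserving paragraph structure.
Compute character counts and token counts, and attach the original URL and the updated_at timestamp.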
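The counting and metadata step can be sketched as below. This is illustrative only: the card does not name the tokenizer, so char_tokenize is a stand-in that treats each non-whitespace character as a token, and mapping the character count to the word_count field is an assumption. Removing ad blocks is site-specific and is not shown.

```python
import re
from typing import Callable, List


def char_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): one token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def build_record(raw_text: str, url: str, updated_at: str,
                 tokenize: Callable[[str], List[str]] = char_tokenize) -> dict:
    """Normalize paragraphs and attach the metadata fields named on this card."""
    # Trim each line and drop empty ones, keeping one paragraph per line.
    paragraphs = [p.strip() for p in raw_text.splitlines()]
    text = "\n".join(p for p in paragraphs if p)
    return {
        "text": text,
        "token_count": len(tokenize(text)),
        # Character count excluding whitespace; storing it as `word_count`
        # is an assumption about how the field is defined.
        "word_count": len(re.sub(r"\s+", "", text)),
        "url": url,
        "updated_at": updated_at,
    }


record = build_record("  台股今日收漲。\n\n\n外資買超。 ",
                      "https://example.com/a", "2024-11-04")
```

Passing a real tokenizer's tokenize function as the tokenize argument swaps out the stand-in without changing the rest of the record.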

Who are the source data producers?

The original reports were written by their respective authors and published openly on the web.

Annotations

Annotation process

No manual annotation.

Who are the annotators?

No annotators.

Personal and Sensitive Information

The data comes from public news reports; the personal and organization names mentioned are already public information.

Bias, Risks, and Limitations

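Media editorial topic selection carries its own preferences; coverage comes mainly from mainstream financial media and may not represent every aspect of every industry.
Report content reflects market and policy conditions at a specific point in time and becomes outdated.
Some samples are press releases or advertorials; consider filtering them out before downstream use.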
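One simple way to screen out press releases and advertorials is a marker-phrase heuristic. The sketch below is an assumption-laden starting point: the phrases in PROMO_MARKERS are hypothetical examples, not labels known to appear in this dataset, and should be tuned against real samples.

```python
# Hypothetical marker phrases; not taken from the dataset or its pipeline.
PROMO_MARKERS = ("新聞稿", "廣編", "業配", "廣告企劃")


def looks_promotional(text: str, markers=PROMO_MARKERS,
                      edge_chars: int = 200) -> bool:
    """Flag a sample when a marker phrase appears near its start or end."""
    # Disclosure labels usually sit in the headline area or the footer.
    edges = text[:edge_chars] + text[-edge_chars:]
    return any(marker in edges for marker in markers)


# Toy samples for illustration only.
samples = [
    {"text": "（廣編特輯）全新信用卡回饋方案登場。"},
    {"text": "央行宣布利率維持不變，市場反應平穩。"},
]
kept = [s for s in samples if not looks_promotional(s["text"])]
```

A keyword screen like this trades recall for simplicity; a small classifier trained on hand-labeled samples would likely catch unlabeled advertorials better.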

Recommendations

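Mix this dataset with other general-purpose Traditional Chinese corpora during training and control its proportion. For downstream finance chatbots, combine it with RAG to retrieve the latest announcements from the competent authorities before answering.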
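Controlling the mixing proportion can be sketched as probabilistic sampling between two streams. This is a minimal illustration using only the standard library; the 0.1 ratio is a hypothetical value, since the card does not recommend a specific proportion, and mix_streams is a hypothetical helper.

```python
import random
from itertools import islice
from typing import Iterable, Iterator


def mix_streams(general: Iterable[str], finance: Iterable[str],
                finance_ratio: float, seed: int = 0) -> Iterator[str]:
    """Yield samples, drawing from `finance` with probability `finance_ratio`.

    Stops as soon as the stream chosen for a draw is exhausted.
    """
    if not 0.0 <= finance_ratio <= 1.0:
        raise ValueError("finance_ratio must be in [0, 1]")
    rng = random.Random(seed)  # seeded for reproducible mixing
    general_it, finance_it = iter(general), iter(finance)
    while True:
        source = finance_it if rng.random() < finance_ratio else general_it
        try:
            yield next(source)
        except StopIteration:
            return


# Toy streams standing in for a general corpus and this dataset.
general = (f"general-{i}" for i in range(10_000))
finance = (f"finance-{i}" for i in range(10_000))
batch = list(islice(mix_streams(general, finance, finance_ratio=0.1), 1000))
finance_share = sum(s.startswith("finance") for s in batch) / len(batch)
```

Sampling per example controls the share of documents, not tokens; if document lengths differ a lot between corpora, weighting by token_count gives a closer match to the intended token ratio.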

Citation

@misc{tw_finance_159m,
  title        = {tw-finance-159M: Traditional Chinese Finance and Industry News Corpus from Taiwan (159M tokens)},
  author       = {Huang, Liang Hsun},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/datasets/lianghsun/tw-finance-159M}}
}

Dataset Card Authors

Huang Liang Hsun

Dataset Card Contact

Huang Liang Hsun

Total file size: 528 MB

Models trained or fine-tuned on lianghsun/tw-finance-159M

QuantFactory/Llama-3.2-Taiwan-3B-GGUF

Text Generation • 4B • Updated Jan 4, 2025 • 262 • 4

itlwas/Llama-3.2-Taiwan-3B-Q4_K_M-GGUF

Text Generation • 4B • Updated Jan 15, 2025 • 8