Pipeline #1 demo: how the Drugs Vol 86 Issue 6 journal-TOC JC was built

2026-05-26

Department of Pharmacy journal-TOC JC demo: Drugs Vol 86 Issue 6 (2026)

Pipeline #1 for the live demo at the 6/3 CMU Department of Pharmacy talk.

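This demo places two journals side by side to contrast fetch modes: Drugs (Springer, paywalled, crossed with an existing subscriber session) and JFDA (Taiwan FDA, OA, fetched directly with pure server-side urllib). This is the concrete contrast behind the Part B theme, "fetch is a hard limit for cloud LLMs": only a paywall requires a wall-crossing tool; OA does not.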

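drugs-2026-05-26.md, in the same directory, is the raw mega-md produced by automatically extracting this issue's 13 articles (1.5 MB, containing 13 full-text bodies plus abstracts and figure captions). Open it directly to inspect what the deterministic pipeline's output looks like.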

Goal

Show pharmacy students the concrete dividing line behind "sandbox LLM ≠ agentic pipeline":

Take the 13 review articles on a publisher's website and, with no manual copy-paste, automatically produce in 10 minutes an LLM-ready mega-md, a GPT reading guide, and a correctly filed corpus entry.

Build process (4 steps)

Bundler setup (one-time onboarding, ~15 min)
- Add a drugs entry to journals.json: browser, toc_url, selectors, article-card structure.
- Add parseDrugs_TOC to extractor.js: card = article.app-card-open; the DOI comes from the /article/10.1007/s40265-... href, the section from span.c-meta__type, and OA status from whether the .c-meta__item text contains "Open access".
- Use the existing subscriber session in Chrome Beta (already logged in to Springer Link); no separate manual download is needed.
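
The href-to-DOI step above can be sketched in Python. This is only an illustration: the actual parser lives in extractor.js, and the function name `doi_from_href`, the regular expression, and the sample href are all hypothetical.

```python
import re
from typing import Optional

# Assumed href shape, based on the description above: /article/<DOI>.
# The pattern stops at a query string, fragment, or whitespace.
_DOI_IN_HREF = re.compile(r"/article/(10\.\d{4,9}/[^?#\s]+)")


def doi_from_href(href: str) -> Optional[str]:
    """Return the DOI embedded in an article-card href, or None if absent."""
    match = _DOI_IN_HREF.search(href)
    return match.group(1) if match else None


# Hypothetical href: the suffix is a placeholder, not a real article.
print(doi_from_href("https://link.springer.com/article/10.1007/s40265-000-00000-0"))
```

Returning `None` rather than raising lets a caller count cards without a DOI and leave the pass/fail decision to the quality gate.
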
Run the bundler (once a month; this is the demo command)
```
python3 journal-toc/scripts/journal_bundler/bundler.py drugs
```
- Fetch the 13 article cards from the TOC
- For each article, issue a sync XHR to its Springer Link full-text page and use a selector union (h1.c-article-title / div.c-article-body.main-content) to extract title / abstract / body as markdown
- Assemble the drugs-2026-05-26.md mega-md (1.5 MB, 13 full texts plus abstracts and figure captions)
Quality gate (deterministic 7-criteria self-check)
```
python3 journal-toc/scripts/quality-check.py --journal drugs drugs-2026-05-26.md
```
- count_integrity: h2 count = unique DOI count = declared article count
- extraction_success_rate: abstract and body extraction must both be ≥ expected
- section_coverage: every article carries an article-type label
- file_size, frontmatter, peer_compare: sanity checks against the declared journal size
- This issue: 7/7 PASS
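
As an illustration of the count_integrity criterion, here is a minimal sketch. It does not reproduce quality-check.py: the mega-md layout it assumes (one `## ` heading per article and one `DOI:` line per article) and the function signature are hypothetical.

```python
import re

# Assumption: each article records its own DOI on a line of the form "DOI: <doi>".
DOI_LINE = re.compile(r"^DOI:\s*(\S+)", re.MULTILINE)


def count_integrity(mega_md: str, declared_count: int) -> bool:
    """True when h2 count == unique DOI count == declared article count."""
    # Assumption: each article starts with a level-2 markdown heading.
    h2_count = sum(1 for line in mega_md.splitlines() if line.startswith("## "))
    unique_dois = set(DOI_LINE.findall(mega_md))
    return h2_count == len(unique_dois) == declared_count


# Tiny made-up sample with placeholder DOIs.
sample = """## Article A
DOI: 10.1000/example-a

## Article B
DOI: 10.1000/example-b
"""
print(count_integrity(sample, declared_count=2))
```

Comparing three independently derived counts catches both a dropped article (fewer headings) and a duplicated one (fewer unique DOIs than headings).
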
Feed the LLM for a reading guide + apply to corpus
- Push the mega-md to the clipboard in one step → paste into GPT-Pro / Gemini Pro
- GPT returns a draft reading guide from a fixed prompt
- The agent takes the draft → writes it to agent-share/journal/Drugs/86/6/index.md (with frontmatter + publish gate)
- launchctl kickstart deploy → CF Pages publishes /notes/journal/drugs-86-6/

The same pattern already covers 9 journals

journal	platform	cadence	status
AIM	acpjournals.org	weekly Tue	live cron pending
NEJM	nejm.org	weekly Thu	live cron
JAMA	jamanetwork.com	weekly Tue	live manual
Nature	nature.com	weekly Thu	live cron
Science	science.org	weekly Thu/Fri	live manual
Lancet	thelancet.com (Elsevier RDF)	weekly Sat	live cron
BMJ	bmj.com	weekly Fri	live cron
JASN	journals.lww.com/jasn	monthly	live manual
Drugs	link.springer.com	monthly	new 2026-05-26 (this demo)

Adding a new journal takes about 30-60 min: one entry in journals.json plus one parser in extractor.js (the DOM structure is stable, so you only adapt the selectors from an existing example).

5-minute live demo flow on the day of the talk

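Time	Action	Expected reaction
0:00	On the projector: "I am now going to get a GPT reading guide for all 13 articles in the current issue of Drugs, fully automatically"	Students see the path from publisher web page → publish-ready note
0:30	Open a terminal and run `bundler.py drugs`	Progress bar runs through 13 articles, ~3 min
3:30	Run `quality-check.py --journal drugs ...`	7/7 PASS
4:00	mega-md → pbcopy → paste into the GPT-Pro web UI	GPT starts responding
4:30	Wait for GPT's reading guide → paste it back → apply to corpus from the terminal → kickstart deploy	Students watch the CF Pages build log
5:00	Reload `/notes/journal/drugs-86-6/` in the browser	The whole class sees the freshly generated public page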

Key takeaways for pharmacy students

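It relies on deterministic pipeline discipline, not on LLM intelligence. Every step is a pure function (fetch / parse / quality check); only "writing the reading guide" is the LLM agentic cell.
Adding a journal ≠ rebuilding everything. The bundler engine is shared; only the selectors change.
All content lives on your own disk + git. If Springer / Cloudflare / OpenAI went under tomorrow, the pipeline would stay intact.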

OA comparison: JFDA Vol 33 Issue 4 (2025)

The demo journal suggested by the professor's office (2026-05-28). jfda-2025-12-15.md is the mega-md for this issue's 13 articles, all OA (TOC + DOI + section + abstract + authors, fetched entirely server-side).

Key difference from Drugs: paywall vs OA

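	Drugs (Springer)	JFDA (Taiwan FDA, bepress)
Access	Subscription paywall	OA (CC BY-NC-ND), full text public
Wall-crossing tool	Needs an existing subscriber session (reuses logged-in cookies)	None: pure `urllib` server-side fetch
DOI source	Parsed from the article card href	Constructed deterministically from the bepress article id (`10.38212/2224-6614.<id>`)
abstract / authors	sync XHR into the full-text page	Read directly from landing page meta tags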
bundler cmd	`bundler.py drugs`	`bundler.py jfda --toc-only`

Why this contrast hits the talk's main theme

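Part B argues that "fetch is a hard limit for cloud LLMs": a cloud LLM (ChatGPT / Gemini) cannot reach even the full text behind a paywall, which is a structural limitation; crossing it takes a local agent plus your own logged-in session.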

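Drugs demonstrates "crossing the wall": a paywalled journal, where the local agent reuses the subscriber session already in your browser, something a cloud LLM cannot do.
JFDA demonstrates "no wall to cross": for an OA journal, one line of urllib gets the TOC + abstracts + full-text PDF links with no session at all, because OA never had a wall.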
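
The session-free OA path can be sketched as follows. This is a minimal illustration under stated assumptions, not the bundler's code: `MetaCollector`, `parse_meta`, and `fetch_meta` are hypothetical names, and the meta tag names in the sample (`description`, `bepress_citation_author`) are assumptions about what a landing page exposes.

```python
from html.parser import HTMLParser
from typing import Dict, List
from urllib.request import urlopen


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs; repeated names accumulate."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: Dict[str, List[str]] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content:
            self.meta.setdefault(name, []).append(content)


def parse_meta(html: str) -> Dict[str, List[str]]:
    """Parse an HTML string and return its meta name -> contents mapping."""
    collector = MetaCollector()
    collector.feed(html)
    return collector.meta


def fetch_meta(url: str) -> Dict[str, List[str]]:
    """Plain server-side fetch: no cookies, no browser session."""
    with urlopen(url, timeout=30) as resp:
        return parse_meta(resp.read().decode("utf-8", errors="replace"))


# Made-up landing page snippet; the tag names are assumptions.
sample_html = """<html><head>
<meta name="description" content="Example abstract text.">
<meta name="bepress_citation_author" content="Author One">
<meta name="bepress_citation_author" content="Author Two">
</head><body></body></html>"""
print(parse_meta(sample_html))
```

Keeping `parse_meta` separate from `fetch_meta` means the parsing can be exercised on a saved HTML string without touching the network.
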

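With the two side by side, students see it at a glance: the fetch limit appears only at a paywall, so the first step in judging whether a source can be fetched automatically is to check whether it is OA or paywalled. The engine is the same bundler; the only difference is whether it needs to be fed a session.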

How JFDA was added to the bundler (server-side parser)

Unlike the browser-session path used for Drugs, JFDA goes through the server-side HTML branch of parse_toc (the counterpart of the server-side RSS branch used for Science / Lancet):

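Add jfda to journals.json: toc_format: html, toc_parser: jfda, expected_oa_pct: 100.
Add parse_jfda_html to bundler.py: it parses the <h2 id> sections and viewcontent.cgi?article=<id> PDF links on the bepress home page, constructs each DOI deterministically, and fetches abstract + authors from each article's landing page meta tags.
The whole run has no browser-session dependency and works on any machine (OA has no login binding).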
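
The deterministic DOI construction described above can be sketched like this. It is an illustration, not the body of parse_jfda_html: the function name `jfda_dois`, the assumption that article ids are numeric, and the sample HTML with its placeholder ids are all hypothetical; only the `10.38212/2224-6614.<id>` pattern and the `viewcontent.cgi?article=<id>` link shape come from the text.

```python
import re
from typing import List

# DOI pattern from the text: 10.38212/2224-6614.<id>
JFDA_DOI_PREFIX = "10.38212/2224-6614."
# Assumption: the bepress article id in the PDF link is numeric.
ARTICLE_ID_RE = re.compile(r"viewcontent\.cgi\?article=(\d+)")


def jfda_dois(toc_html: str) -> List[str]:
    """Build DOIs from article ids found in PDF links, in page order, deduplicated."""
    dois: List[str] = []
    for article_id in ARTICLE_ID_RE.findall(toc_html):
        doi = JFDA_DOI_PREFIX + article_id
        if doi not in dois:
            dois.append(doi)
    return dois


# Made-up TOC fragment; the ids 1001 and 1002 are placeholders.
sample_toc = """
<h2 id="review">Review Article</h2>
<a href="/cgi/viewcontent.cgi?article=1001&amp;context=journal">PDF</a>
<h2 id="original">Original Article</h2>
<a href="/cgi/viewcontent.cgi?article=1002&amp;context=journal">PDF</a>
<a href="/cgi/viewcontent.cgi?article=1002&amp;context=journal">PDF</a>
"""
print(jfda_dois(sample_toc))
```

Because the DOI is computed from the id rather than scraped, it needs no extra request per article and cannot drift from the PDF link it was derived from.
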

Measured on this issue: 13 articles, 13/13 OA, 13/13 abstracts, about 12 seconds, purely server-side.