Getting Started with Document Parsing in Dolphin and a Streamlit Implementation

JS2IIU

7 months ago

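Hello, this is JS2IIU.
This post shows how to extract text, tables, and figures from images and PDFs with Dolphin, and covers the steps and implementation points for trying it out quickly in Streamlit. Thanks for joining me again this time.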

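Overview
An Overview of Dolphin
Internal Workflow
Output Format Example (Conceptual JSON Schema)
Code Walkthrough (Key Functions and Flow)
Runtime Tuning and Caveats
Sample Output (Excerpt)
Prerequisites and Setup
Components of the Streamlit Demo (see streamlit_app.py)
1. Sample Code for streamlit_app.py
2. Screenshot
Hands-On: Running It Locally
Key Points in streamlit_app.py (Sample Excerpts)
1. Notes on Code Changes
Practical Improvements
Common Problems and Fixes
Reference Links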

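Overview

This page briefly introduces Dolphin, a library presented at ACL 2025, then walks hands-on through running it locally and through a working example built on the streamlit_app.py shown in the sample code. By the end you will have uploaded an image or PDF, produced JSON, Markdown, and figure output, and downloaded the results.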

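An Overview of Dolphin

Dolphin is a multimodal model for high-accuracy parsing of document images, built on the idea of "Document Image Parsing via Heterogeneous Anchor Prompting". It has two main characteristics.

Two-stage approach (analyze → parse): It first looks at the whole page to detect the reading order and the regions (paragraphs, tables, figures, formulas, and so on), then parses each region individually with a prompt suited to it. This raises accuracy on fine details without losing page-level context.
Heterogeneous anchor prompts: Each element type gets a different task instruction (for example, "Parse the table" for tables and "Read text" for body text), so the same model produces output tailored to each task.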
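
As a rough illustration of how a per-element prompt could be selected, here is a minimal sketch. The two prompt strings are the ones quoted in this article; the function name `prompt_for` and the label names (`p`, `tab`, `fig`, taken from the JSON example later in this post) are assumptions for illustration, not Dolphin's actual code.

```python
from typing import Optional

# Prompt strings quoted in this article; the lookup table itself is hypothetical.
TABLE_PROMPT = "Parse the table in the image."
TEXT_PROMPT = "Read text in the image."


def prompt_for(label: str) -> Optional[str]:
    """Return the task prompt for an element label, or None for figures."""
    if label == "fig":
        # Figures are saved as images, so no recognition prompt is needed.
        return None
    if label == "tab":
        return TABLE_PROMPT
    # Every other label is treated as plain text in this sketch.
    return TEXT_PROMPT
```

Keeping the mapping in one small function makes it easy to add a prompt for another element type later without touching the batching code.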

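Main output formats

JSON: structured data containing a per-page list of elements (bbox, label, text, reading_order, and so on)
Markdown: body text is emitted as Markdown and figures are embedded by relative path, so the output can be used for documentation as-is
Extracted figures: figure elements are saved under markdown/figures/ as PNG or similar files and referenced from the Markdown

The next section describes the internal workflow in a little more technical detail.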

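Internal Workflow

Page-level analysis (layout / reading order)
- The whole page is passed to the model, which generates a sequence of regions (bbox and label) in natural reading order. The layout information is usually expressed as text and handed to the element-processing stage.
Element cropping and preprocessing
- The resulting bboxes are mapped into the coordinate system of the padded image, followed by boundary adjustment and margin trimming (crop_margin). prepare_image pads the page to a square so that model input sizes are uniform.
Per-element prompt generation and parallel parsing
- The prompt is chosen by label (for example, "Parse the table in the image." or "Read text in the image."). Text and table elements are batched and inferred together with a call such as model.chat(prompts_list, crops_list, max_batch_size=...). Figures are saved as images and may not need parsing at all.
Aggregation and format conversion
- Element results are sorted by reading order and saved as a JSON file. For PDFs, save_combined_pdf_results merges all pages and generates the JSON and Markdown.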
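
The coordinate mapping in the cropping step can be pictured with a small sketch. It assumes the page is centered on a square canvas whose side equals the longer page edge; the real prepare_image may pad differently, and `square_padding` and `to_padded` are hypothetical helper names.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def square_padding(width: int, height: int) -> Tuple[int, int, int]:
    """Return (side, left, top) for centering a page on a square canvas."""
    side = max(width, height)
    left = (side - width) // 2
    top = (side - height) // 2
    return side, left, top


def to_padded(box: Box, width: int, height: int) -> Box:
    """Shift a bbox from page coordinates into padded-canvas coordinates."""
    side, left, top = square_padding(width, height)

    def clamp(value: int) -> int:
        # Keep coordinates inside the canvas (a simple boundary adjustment).
        return max(0, min(side, value))

    x1, y1, x2, y2 = box
    return (clamp(x1 + left), clamp(y1 + top), clamp(x2 + left), clamp(y2 + top))
```

For an 800x1000 page the canvas is 1000x1000 with 100 pixels of padding on the left, so a box starting at x=10 starts at x=110 on the canvas.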

Output Format Example (Conceptual JSON Schema)

Below is a simplified example of the generated JSON (the actual output follows the project's own schema).

JSON

{
    "source_file": "page_1.png",
    "page_number": 1,
    "elements": [
        {"label": "p", "bbox": [10,20,400,200], "text": "This is part of the body text", "reading_order": 0},
        {"label": "fig", "bbox": [410,20,800,500], "figure_path": "markdown/figures/page_1_figure_000.png", "reading_order": 1},
        {"label": "tab", "bbox": [10,210,800,400], "text": "|col1|col2|\n|---|---|", "reading_order": 2}
    ]
}

In the Markdown conversion, figures are embedded in the form ![Figure](figures/xxx.png). The JSON-to-Markdown conversion is handled by utils.markdown_utils.MarkdownConverter in the project (utils/markdown_utils.py on the master branch of bytedance/Dolphin).
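
To make the JSON-to-Markdown idea concrete, here is a deliberately simplified sketch that works on the example schema above. It is not MarkdownConverter; `elements_to_markdown` is a hypothetical function that only handles the three labels from the example.

```python
def elements_to_markdown(page: dict) -> str:
    """Render one page dict (example schema above) as Markdown text."""
    parts = []
    # Emit elements in reading order, whatever order they arrive in.
    for element in sorted(page["elements"], key=lambda e: e["reading_order"]):
        if element["label"] == "fig":
            # The Markdown file lives in markdown/, so link to figures/<name>.
            file_name = element["figure_path"].rsplit("/", 1)[-1]
            parts.append(f"![Figure](figures/{file_name})")
        else:
            # Text and table elements already carry Markdown-ready text.
            parts.append(element["text"])
    return "\n\n".join(parts)
```

Sorting by reading_order inside the converter means the caller does not have to guarantee element order.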

Code Walkthrough (Key Functions and Flow)

This section briefly describes the key functions in demo_page.py and utils/utils.py. See those files for the full implementation.

process_document(document_path, model, save_dir, max_batch_size)
- Determines whether the input is an image or a PDF. A PDF is converted to per-page images with convert_pdf_to_images, and process_single_image is called for each page; a single image goes straight to process_single_image.
process_single_image(image, model, save_dir, image_name, max_batch_size, save_individual=True)
- Calls model.chat("Parse the reading order of this document.", image) at the page level and receives the layout string
- Pads the image with prepare_image and passes it to process_elements to parse the elements
- Saves the result with save_outputs (JSON + Markdown); saving is skipped when save_individual=False
process_elements(layout_results, padded_image, dims, model, max_batch_size, save_dir=None, image_name=None)
- Converts the layout text into a list of bboxes and labels with parse_layout_string
- Saves figure elements with save_figure_to_local
- Batches text and table elements, runs inference with model.chat(prompts_list, crops_list, max_batch_size=...), and merges the results into recognition_results
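
The batching and merging described for process_elements can be sketched as follows. This is an illustration under assumptions, not the repository code: `recognize` is a hypothetical stand-in for the model call, and each element is assumed to be a dict with `label`, `reading_order`, and `crop` keys.

```python
from typing import Callable, Dict, List

TABLE_PROMPT = "Parse the table in the image."
TEXT_PROMPT = "Read text in the image."


def recognize_in_batches(
    elements: List[Dict],
    recognize: Callable[[List[str], List[object]], List[str]],
    max_batch_size: int,
) -> List[Dict]:
    """Run text/table elements through `recognize` in batches, then merge."""
    # Figures skip recognition and are carried over unchanged.
    results = [dict(e) for e in elements if e["label"] == "fig"]
    pending = [e for e in elements if e["label"] != "fig"]

    for start in range(0, len(pending), max_batch_size):
        batch = pending[start:start + max_batch_size]
        prompts = [TABLE_PROMPT if e["label"] == "tab" else TEXT_PROMPT for e in batch]
        crops = [e["crop"] for e in batch]
        texts = recognize(prompts, crops)  # one call per batch
        for element, text in zip(batch, texts):
            results.append({**element, "text": text})

    # Restore the page's reading order after merging.
    return sorted(results, key=lambda e: e["reading_order"])
```

With three text or table elements and max_batch_size=2, `recognize` is called twice (batches of 2 and 1), which is the trade-off between call count and memory that the tuning section discusses.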

Helper functions

prepare_image: converts the image from PIL to OpenCV format and pads it to a square (returns padded_image and ImageDimensions)
save_outputs: converts recognition_results to JSON and Markdown and saves them
save_combined_pdf_results: combines the results of a multi-page PDF and writes JSON and Markdown

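Runtime Tuning and Caveats

max_batch_size: the batch size used when parsing elements. Larger values reduce the number of inference calls but increase memory use. Start with something in the range of 2 to 8.
Device: a GPU speeds things up. Without CUDA the model runs in CPU mode, which is slower.
Large PDFs: for PDFs with many pages, split the file beforehand or process it page by page and save intermediate results.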
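
One way to realize the "process page by page and save intermediate results" advice is a small checkpointing loop. The sketch below is an assumption-laden illustration: `process_page` is a hypothetical callable standing in for whatever parses a single page, and the one-JSON-file-per-page layout is chosen only for this example.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List


def process_pages_with_checkpoints(
    page_ids: Iterable[str],
    process_page: Callable[[str], dict],
    out_dir: str,
) -> List[str]:
    """Process pages one at a time, skipping pages already saved to disk."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    processed = []
    for page_id in page_ids:
        target = out / f"{page_id}.json"
        if target.exists():
            # A previous run already finished this page, so resume past it.
            continue
        result = process_page(page_id)
        target.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
        processed.append(page_id)
    return processed
```

If a long run dies on page 180 of 200, rerunning the same call only processes the pages that have no saved result yet.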

Sample Output (Excerpt)

A Markdown excerpt:

Markdown

### page_1

This is part of the body text

![Figure](figures/page_1_figure_000.png)

|col1|col2|
|---|---|
|1|2|

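See the schema example in the earlier section for a JSON excerpt.

Prerequisites and Setup

This guide assumes the following.

The repository is cloned, and config/Dolphin.yaml and the model checkpoints are at the expected paths
A Python virtual environment (venv or similar) is available

Recommended installation steps: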

Bash

# Create and activate a virtual environment (example: venv)
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install streamlit

Notes:

Using a GPU requires a CUDA driver and a matching PyTorch build. CPU execution also works but is slower.
Parsing does not run without the model checkpoints. Check checkpoints/.
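
Before launching anything, the two file-system prerequisites can be checked with a few lines of standard-library code. This is a minimal sketch: `missing_prerequisites` is a hypothetical helper, and it assumes the default locations named in this article (config/Dolphin.yaml and checkpoints/ under the repository root).

```python
from pathlib import Path
from typing import List


def missing_prerequisites(repo_root: str) -> List[str]:
    """Return human-readable problems; an empty list means ready to run."""
    root = Path(repo_root)
    problems = []
    if not (root / "config" / "Dolphin.yaml").is_file():
        problems.append("config/Dolphin.yaml not found")
    checkpoints = root / "checkpoints"
    # An empty checkpoints/ directory is as unusable as a missing one.
    if not checkpoints.is_dir() or not any(checkpoints.iterdir()):
        problems.append("checkpoints/ is missing or empty")
    return problems
```

Calling `missing_prerequisites(".")` from the repository root and printing the result gives a quick answer before the slower model load is attempted.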

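Components of the Streamlit Demo (see streamlit_app.py)

streamlit_app.py has the following responsibilities.

UI: config_path, max_batch_size, and device are set in the sidebar (the sample collects device but does not pass it on to the model)
File upload: the image or PDF is saved to a temporary file
Model loading: load_model(config_path) caches the DOLPHIN instance
Parsing call: process_document(input_path, model, save_dir, max_batch_size) is called
Result display: JSON (recognition_json/), Markdown (markdown/), and figures (markdown/figures/) are shown and can be downloaded as a ZIP

Key implementation points:

The model is heavy, so @st.cache_resource is used to avoid reloading it.
The output directory layout is prepared by setup_output_dirs(save_dir).
For image display in Streamlit, use_container_width=True helps avoid layout breakage (use_column_width is deprecated).

Sample Code for streamlit_app.py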

Python

"""Simple Streamlit front-end for Dolphin demo_page processing.

Supports: file upload (image/pdf) -> run Dolphin processing -> show JSON/markdown/figures -> download ZIP
"""

import json
import os
import shutil
import sys
import tempfile
from pathlib import Path

import streamlit as st
from omegaconf import OmegaConf

# Import Dolphin model and processing helper
from chat import DOLPHIN
from demo_page import process_document
from utils.utils import setup_output_dirs


st.set_page_config(page_title="Dolphin Streamlit Demo", layout="wide")


@st.cache_resource
def load_model(config_path: str):
    """Load and cache the DOLPHIN model instance."""
    cfg = OmegaConf.load(config_path)
    model = DOLPHIN(cfg)
    return model


def zip_output_dir(output_dir: str) -> str:
    """Create a zip archive of the output directory and return its path."""
    base_name = tempfile.NamedTemporaryFile(delete=False, prefix="dolphin_out_").name
    # shutil.make_archive will append extension
    archive_path = shutil.make_archive(base_name, 'zip', output_dir)
    return archive_path


def main():
    st.title("Dolphin: File Parsing Demo")

    st.sidebar.header("Settings")
    default_config = "config/Dolphin.yaml"
    config_path = st.sidebar.text_input("config path", default_config)
    max_batch_size = st.sidebar.number_input("max_batch_size", min_value=1, max_value=32, value=4)
    # NOTE: this value is collected but never passed to the model in this sample.
    device = st.sidebar.selectbox("device", ["cpu", "gpu"])

    uploaded = st.file_uploader("Upload an image or PDF", type=["png", "jpg", "jpeg", "pdf"])

    if uploaded is None:
        st.info("Adjust the settings in the left sidebar and upload a file. Samples are in demo/page_imgs.")
        return

    # Save the uploaded file into a fresh temporary directory
    suffix = Path(uploaded.name).suffix.lower()
    tmp_dir = tempfile.mkdtemp(prefix="dolphin_streamlit_")
    tmp_path = os.path.join(tmp_dir, uploaded.name)
    with open(tmp_path, "wb") as f:
        f.write(uploaded.getbuffer())

    st.sidebar.write(f"Uploaded file: {uploaded.name}")
    # st.image cannot render PDFs, so only preview image uploads
    if suffix != ".pdf":
        st.image(tmp_path, caption="Uploaded image", use_container_width=True)

    if st.button("Start processing"):
        try:
            with st.spinner("Loading the model... (the first load takes a while)"):
                model = load_model(config_path)

            # Prepare output directory for this run
            out_dir = os.path.join(tmp_dir, "outputs")
            os.makedirs(out_dir, exist_ok=True)
            setup_output_dirs(out_dir)

            with st.spinner("Parsing with Dolphin..."):
                json_path, results = process_document(tmp_path, model, out_dir, max_batch_size)

            st.success("Processing finished")

            # Show JSON result (if available)
            if json_path and os.path.exists(json_path):
                try:
                    with open(json_path, "r", encoding="utf-8") as f:
                        data = json.load(f)
                    st.subheader("Recognition result (JSON)")
                    st.json(data)
                except Exception:
                    st.write("Processing succeeded, but the JSON output could not be loaded.")
            else:
                # Fall back to results object
                st.subheader("Recognition result (object)")
                st.write(results)

            # Show markdown if exists
            md_dir = os.path.join(out_dir, "markdown")
            if os.path.isdir(md_dir):
                md_files = list(Path(md_dir).glob("*.md"))
                if md_files:
                    st.subheader("Generated Markdown")
                    for md_file in md_files:
                        st.markdown(f"### {md_file.name}")
                        try:
                            text = md_file.read_text(encoding="utf-8")
                            st.markdown(text)
                        except Exception:
                            st.write(f"Failed to read {md_file.name}.")

            # Show figures if present
            fig_dir = os.path.join(out_dir, "markdown", "figures")
            if os.path.isdir(fig_dir):
                figs = list(Path(fig_dir).iterdir())
                if figs:
                    st.subheader("Extracted figures")
                    cols = st.columns(3)
                    for i, fig in enumerate(figs):
                        try:
                            cols[i % 3].image(str(fig), caption=fig.name)
                        except Exception:
                            cols[i % 3].write(fig.name)

            # Provide zip download
            archive = zip_output_dir(out_dir)
            if archive and os.path.exists(archive):
                with open(archive, "rb") as f:
                    st.download_button(label="Download results as ZIP", data=f, file_name=os.path.basename(archive))

        except Exception as e:
            st.error(f"An error occurred during processing: {str(e)}")


if __name__ == "__main__":
    main()

Screenshot

Hands-On: Running It Locally

Set up the virtual environment and install the dependencies (see the commands above).
Check the environment variables and the path to config/Dolphin.yaml.
Start the Streamlit app:

Bash

streamlit run streamlit_app.py

Upload an image or PDF in the UI that opens in the browser, then press the button that starts processing.

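Processing flow (internal):

The uploaded file is saved to a temporary directory
The model is loaded, or fetched from the cache
process_document is called to parse the image or PDF and generate recognition_json/ and markdown/
The outputs are shown on screen and made available as a single ZIP download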
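
The last step of this flow boils down to finding what was written under the output directory. A minimal sketch, assuming the directory names used in this article (recognition_json/, markdown/, markdown/figures/); `collect_outputs` is a hypothetical helper and not part of the sample app.

```python
from pathlib import Path
from typing import Dict, List


def collect_outputs(out_dir: str) -> Dict[str, List[str]]:
    """List generated file names per output kind; missing dirs give []."""
    root = Path(out_dir)

    def names(subdir: str, pattern: str) -> List[str]:
        directory = root / subdir
        if not directory.is_dir():
            return []
        # Sort so the display order is stable across runs.
        return sorted(p.name for p in directory.glob(pattern) if p.is_file())

    return {
        "json": names("recognition_json", "*.json"),
        "markdown": names("markdown", "*.md"),
        "figures": names("markdown/figures", "*"),
    }
```

Returning plain file names keeps the display code simple, and an empty list for a missing directory lets the UI skip that section without extra checks.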

Key Points in streamlit_app.py (Sample Excerpts)

Model loading (cached):

Python

@st.cache_resource
def load_model(config_path: str):
    cfg = OmegaConf.load(config_path)
    model = DOLPHIN(cfg)
    return model

This function performs the heavy initial load only once and shares the model across subsequent operations.

Example parsing call (simplified):

Python

out_dir = os.path.join(tmp_dir, "outputs")
setup_output_dirs(out_dir)
json_path, results = process_document(tmp_path, model, out_dir, max_batch_size)

Displaying the results:

JSON is pretty-printed with st.json().
Markdown is rendered with st.markdown() (read from markdown/*.md).
Extracted figures are shown with st.image() (use_container_width=True is recommended).

Notes on Code Changes

Streamlit parameter names can change between versions. Older samples that use use_column_width trigger a warning, so replace it with use_container_width=True.

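Practical Improvements

The following improvements are useful in production.

Asynchronous processing: run long jobs in a worker or in the background and show progress by polling.
Batch processing of multiple files: add a mode that takes an input directory and processes files in parallel, batch by batch.
Web deployment: build a Docker image, use a GPU-capable runtime, and add memory monitoring.
UI improvements: log display during processing, per-page previews, and a button for choosing sample images.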
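
The background-plus-polling idea from the first item can be sketched with the standard library alone. This is one possible shape, not a Streamlit-specific recipe: `wait_with_polling` and its `on_tick` callback are hypothetical names, and wiring the callback to a UI element is left out.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


def wait_with_polling(
    future: Future,
    interval: float = 0.5,
    on_tick: Optional[Callable[[int], None]] = None,
):
    """Poll a background job until it finishes, reporting each tick."""
    ticks = 0
    while not future.done():
        ticks += 1
        if on_tick is not None:
            on_tick(ticks)  # e.g. update a progress message
        time.sleep(interval)
    # result() re-raises any exception thrown inside the worker.
    return future.result()


def slow_job(value: int) -> int:
    """Stand-in for a long-running parse call."""
    time.sleep(0.05)
    return value * 2


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as executor:
        job = executor.submit(slow_job, 21)
        print(wait_with_polling(job, interval=0.01))
```

A single-worker executor keeps only one heavy job running at a time, which matters when the job holds a large model in memory.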

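Common Problems and Fixes

Model checkpoint not found: check the location of checkpoints/ and the paths in config/Dolphin.yaml.
Out of memory on large PDFs or high-resolution images: limit the number of pages or add a preprocessing step that lowers the resolution beforehand.
Streamlit warning (use_column_width): replace st.image(..., use_column_width=...) in streamlit_app.py with use_container_width=True.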
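
For the resolution-lowering preprocess, the core is just computing a target size that preserves the aspect ratio. The sketch below shows only that arithmetic; `downscaled_size` is a hypothetical helper, the 2000-pixel limit in the usage note is an arbitrary example value, and the actual resize would be done with whatever image library the project already uses.

```python
from typing import Tuple


def downscaled_size(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Return a size whose longer edge is at most max_side, same aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        # Already small enough, so leave the image untouched.
        return width, height
    scale = max_side / longest
    # Never return a zero-sized dimension for extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 4000x3000 scan with max_side=2000 comes out as 2000x1500, a quarter of the original pixel count.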

Reference Links

bytedance/Dolphin (GitHub): https://github.com/bytedance/Dolphin
Dolphin paper (ACL 2025): arXiv:2505.14059
Streamlit documentation: https://docs.streamlit.io/

Thank you for reading to the end.
Comments, impressions, and questions are very welcome in the comment section.