happyou.info blog

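This is the blog of happyou.info, a news aggregation site. It collects website updates from companies, organizations, and groups of every kind, in Japan and abroad. Run by 岡本将吾. Twitter: @happyou_info_ja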

Recent table parsers

I'm still not giving up on scraping tables.

Table parsers I've tried recently

table-transformer

GitHub - microsoft/table-transformer: Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.

Open source from Microsoft. Transformer-based. PDFs have to be rendered to images first. Recognition gets shaky when a single page contains multiple tables. Performance still has a way to go.
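A minimal sketch of the "render the PDF to an image first" step. It assumes the Hugging Face `transformers` port of the model and the `microsoft/table-transformer-detection` checkpoint, with PyMuPDF and Pillow doing the rendering. None of these choices come from the notes above. `detect_tables` and `pixel_box_to_pdf_points` are hypothetical helper names, and the coordinate conversion assumes an unrotated page.

```python
def pixel_box_to_pdf_points(box, dpi):
    """Convert an (x0, y0, x1, y1) box from rendered-image pixels to PDF points.

    PDF user space has 72 points per inch, so the factor is 72 / dpi.
    Assumes an unrotated page whose origin matches the rendered image.
    """
    return tuple(v * 72.0 / dpi for v in box)


def detect_tables(pdf_path, page_index=0, dpi=150, threshold=0.9):
    """Render one PDF page to an image and return (score, box) per detected table."""
    # Heavy dependencies are imported lazily so the helper above works on its own.
    import fitz  # PyMuPDF
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, TableTransformerForObjectDetection

    checkpoint = "microsoft/table-transformer-detection"  # assumed checkpoint
    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = TableTransformerForObjectDetection.from_pretrained(checkpoint)

    # Step 1: PDF page -> RGB image, because the model only sees pixels.
    with fitz.open(pdf_path) as doc:
        pix = doc[page_index].get_pixmap(dpi=dpi)
        image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)

    # Step 2: run detection and rescale the boxes to the image size.
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    result = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]

    # Step 3: one entry per detection, with boxes mapped back to PDF points.
    return [
        (float(score), pixel_box_to_pdf_points(box.tolist(), dpi))
        for score, box in zip(result["scores"], result["boxes"])
    ]
```

Calling `detect_tables("report.pdf")` (a placeholder file name) on a page with several tables and counting the returned boxes is a quick way to see the multi-table problem for yourself.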

PyMuPDF

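It recently added support for table extraction. Page - PyMuPDF 1.23.8 documentation

There are two detection algorithms, lines and text. lines detects tables about as well as fs2. text is supposed to discover “virtual” columns, but its performance still has a way to go.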
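A minimal sketch for comparing the two strategies on the same file, assuming PyMuPDF 1.23 or later, where `Page.find_tables()` accepts a `strategy` argument. `extract_tables` and `normalize_rows` are hypothetical helper names made up for this example.

```python
def normalize_rows(rows):
    """Replace None cells with "" and strip surrounding whitespace."""
    return [[(cell or "").strip() for cell in row] for row in rows]


def extract_tables(pdf_path, strategy="lines"):
    """Return (page_number, bbox, rows) for every table PyMuPDF finds.

    strategy="lines" relies on drawn ruling lines; strategy="text" infers
    column boundaries from text positions.
    """
    import fitz  # PyMuPDF, imported lazily so normalize_rows works without it

    results = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            finder = page.find_tables(strategy=strategy)
            for table in finder.tables:
                rows = normalize_rows(table.extract())
                results.append((page.number, tuple(table.bbox), rows))
    return results
```

Running `extract_tables(path, "lines")` and `extract_tables(path, "text")` on the same PDF and diffing the row counts shows quickly where the text strategy splits or merges columns.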

unstructured.io

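github.com

It seems to be pulling together all sorts of components at a tremendous pace, with the aim of supporting everything. I'm evaluating its table support right now.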

nougat

GitHub - facebookresearch/nougat: Implementation of Nougat Neural Optical Understanding for Academic Documents

Not evaluated yet. It does look promising, though.

Everyone has the same idea and runs into the same problems. There is no solution.

www.youtube.com
https://www.reddit.com/r/LocalLLaMA/comments/1854d06/table_extraction_from_pdf/