Studying NLP libraries while contributing to GiNZA

Environment setup

I skimmed the GiNZA site (GiNZA - Japanese NLP Library, an open-source Japanese NLP library based on Universal Dependencies) and, as a first step, tried $ pip install -U ginza ja_ginza, which failed with the following error.

  raise VersionConflict(dist, req).with_context(dependent_req)
       pkg_resources.VersionConflict: (setuptools 49.2.1 (/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages), Requirement.parse('setuptools>=58.0'))
       [end of output]
   
   note: This error originates from a subprocess, and is likely not a problem with pip.
 error: subprocess-exited-with-error
 
 × Getting requirements to build wheel did not run successfully.
 │ exit code: 1
 ╰─> See above for output.

It looked like a setuptools version problem, so I ran $ pip install setuptools --upgrade and then retried $ pip install -U ginza ja_ginza, but got the same error.

 $ pip install setuptools --upgrade
 Defaulting to user installation because normal site-packages is not writeable
 Requirement already satisfied: setuptools in /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages (49.2.1)
 Collecting setuptools
   Using cached setuptools-62.1.0-py3-none-any.whl (1.1 MB)
 Installing collected packages: setuptools
 Successfully installed setuptools-62.1.0
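
One possible reading of the log above (an unverified guess): pip fell back to a user installation, so setuptools 62.1.0 went into the user site-packages while the 49.2.1 copy in the Xcode site-packages stayed in place and was still the one being picked up. A small standard-library check like the following shows which setuptools version the running interpreter resolves and where its user site-packages directory is. `describe_setuptools` is a helper name made up for this sketch.

```python
import importlib.metadata
import site
import sys


def describe_setuptools():
    """Report which setuptools the running interpreter resolves (hypothetical helper)."""
    try:
        version = importlib.metadata.version("setuptools")
    except importlib.metadata.PackageNotFoundError:
        version = None  # setuptools is not visible to this interpreter
    return {
        "python": sys.executable,
        "setuptools_version": version,
        "user_site": site.getusersitepackages(),
    }


info = describe_setuptools()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the reported version is still older than the `setuptools>=58.0` requirement in the error, the upgrade did not land where this interpreter looks first.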

With no other option, I downloaded setuptools from https://pypi.org/project/setuptools/#files and installed it manually. That seemed to work.

 $ sudo python3 setup.py install
 Installed /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools-62.1.0-py3.8.egg
 Processing dependencies for setuptools==62.1.0
 Finished processing dependencies for setuptools==62.1.0
 
 /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages

Then I deleted the setuptools that had been installed there originally (run from the site-packages directory shown below).

/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages
$ sudo rm -r setuptools

After this, $ pip install -U ginza ja_ginza succeeds.

Commits

I set up the environment following https://github.com/megagonlabs/ginza#development-environment, read through the code, and committed fixes for the places that caught my eye.

Following https://github.com/megagonlabs/ginza#run-tests, I ran the tests and made a commit to raise the coverage.

Digging into `split_mode`

According to the existing command-line tests, split_mode can be set to A, B, or C, and the morphological analysis comes out as follows for each.

("A", "機能性食品", ["機能", "性", "食品"]),
("B", "機能性食品", ["機能性", "食品"]),
("C", "機能性食品", ["機能性食品"]),
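
The three expectations can be read as three segmentations of the same surface string, from finest (A) to coarsest (C). The sketch below uses only the tuples from the test (no GiNZA call is made) and checks two properties that hold for this example: every mode's tokens concatenate back to the input, and each coarser mode's token boundaries are a subset of the finer mode's. Whether the second property holds for every word is an assumption that is not checked here, and `boundaries` is a helper name introduced for illustration.

```python
from itertools import accumulate

CASES = [
    ("A", "機能性食品", ["機能", "性", "食品"]),
    ("B", "機能性食品", ["機能性", "食品"]),
    ("C", "機能性食品", ["機能性食品"]),
]


def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    return set(accumulate(len(token) for token in tokens))


by_mode = {mode: tokens for mode, _text, tokens in CASES}

# Every segmentation must reproduce the original text when joined.
for mode, text, tokens in CASES:
    assert "".join(tokens) == text

# Coarser modes only remove boundaries; they never introduce new ones.
assert boundaries(by_mode["C"]) <= boundaries(by_mode["B"]) <= boundaries(by_mode["A"])

print({mode: sorted(boundaries(tokens)) for mode, tokens in by_mode.items()})
# {'A': [2, 3, 5], 'B': [3, 5], 'C': [5]}
```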

Following the code further, there are two places where split_mode is set.

In the first, when output_format is "2" or "mecab", split_mode is converted to a SudachiPy SplitMode value.

https://github.com/megagonlabs/ginza/blob/d60d3d4bf4af0c5879357a5f95b36f72ea8eb317/ginza/analyzer.py#L67-L68
https://github.com/megagonlabs/ginza/blob/d60d3d4bf4af0c5879357a5f95b36f72ea8eb317/ginza/analyzer.py#L23-L28

def try_sudachi_import(split_mode: str):
    """SudachiPy is required for Japanese support, so check for it.
    It it's not available blow up and explain how to fix it.
    split_mode should be one of these values: "A", "B", "C", None->"A"."""
    try:
        from sudachipy import dictionary, tokenizer

        split_mode = {
            None: tokenizer.Tokenizer.SplitMode.A,
            "A": tokenizer.Tokenizer.SplitMode.A,
            "B": tokenizer.Tokenizer.SplitMode.B,
            "C": tokenizer.Tokenizer.SplitMode.C,
        }[split_mode]
        tok = dictionary.Dictionary().create(mode=split_mode)
        return tok
...
def set_nlp(self) -> None:
        if self.output_format in ["2", "mecab"]:
            nlp = try_sudachi_import(self.split_mode)
...
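
The dictionary lookup in `try_sudachi_import` doubles as validation: `None` falls back to mode A, and the lookup itself raises `KeyError` for any string other than "A", "B", or "C" (how the elided `except` clause treats that error is not shown in the excerpt). Below is a minimal sketch of the pattern with a stand-in `Enum`, so it runs without SudachiPy. `SplitMode` and `resolve_split_mode` here are illustrative names, not the library's objects.

```python
from enum import Enum
from typing import Optional


class SplitMode(Enum):
    """Stand-in for SudachiPy's SplitMode, for illustration only."""
    A = "A"
    B = "B"
    C = "C"


def resolve_split_mode(split_mode: Optional[str]) -> SplitMode:
    """Map a CLI-style string to a mode; None falls back to A, as in the quoted code."""
    return {
        None: SplitMode.A,
        "A": SplitMode.A,
        "B": SplitMode.B,
        "C": SplitMode.C,
    }[split_mode]


print(resolve_split_mode(None))  # SplitMode.A
print(resolve_split_mode("C"))   # SplitMode.C
try:
    resolve_split_mode("D")
except KeyError as error:
    print("unsupported split_mode:", error)
```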

Following the SudachiPy implementation further, the value is passed in as mode and branches as follows.

https://github.com/WorksApplications/SudachiPy/blob/6fb25bd20e206fc7e7e452b2dafdb2576dcf3a69/sudachipy/tokenizer.pyx#L127
https://github.com/WorksApplications/SudachiPy/blob/6fb25bd20e206fc7e7e452b2dafdb2576dcf3a69/sudachipy/tokenizer.pyx#L172-L180

def tokenize(self, text: str, mode=None, logger=None) -> MorphemeList:
        """ tokenize a text.

        In default tokenize text with SplitMode.C

        Args:
            text: input text
            mode: split mode
       ...
       """
...
def _split_path(self, path: List[LatticeNode], mode: SplitMode) -> List[LatticeNode]:
        if mode == self.SplitMode.C:
            return path
        new_path = []
        for node in path:
            if mode is self.SplitMode.A:
                wids = node.get_word_info().a_unit_split
            else:
                wids = node.get_word_info().b_unit_split
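
The excerpt stops before the part that uses `wids`, so the following is only a toy model of the branching, not SudachiPy's implementation. It assumes each node carries precomputed A-unit and B-unit splits (here as surface strings rather than word IDs), with an empty list meaning the node is not split further. `ToyNode`, `split_path`, and the sample data are all made up to reproduce the three expectations from the GiNZA test.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyNode:
    """Hypothetical stand-in for a lattice node with precomputed splits."""
    surface: str
    a_unit_split: List[str] = field(default_factory=list)
    b_unit_split: List[str] = field(default_factory=list)


def split_path(path: List[ToyNode], mode: str) -> List[str]:
    """Return the surfaces of the path under mode "A", "B", or "C"."""
    if mode == "C":
        # Mode C keeps the longest units, so the path is returned unchanged.
        return [node.surface for node in path]
    surfaces = []
    for node in path:
        parts = node.a_unit_split if mode == "A" else node.b_unit_split
        # Assumption: an empty split list means the node is already atomic.
        surfaces.extend(parts if parts else [node.surface])
    return surfaces


path = [ToyNode("機能性食品", ["機能", "性", "食品"], ["機能性", "食品"])]
for mode in ("A", "B", "C"):
    print(mode, split_path(path, mode))
```

As in the quoted `_split_path`, C is handled first as an early return, and anything that is not A falls through to the B-unit split.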

LatticeNode#get_word_info returns a WordInfo, and the chain continues from there, but I could not trace it in detail.
I got as far as confirming that the WordInfo constructor takes a_unit_split and b_unit_split.

https://github.com/WorksApplications/SudachiPy/blob/6fb25bd20e206fc7e7e452b2dafdb2576dcf3a69/sudachipy/latticenode.pyx#L79-L84

...
cdef class LatticeNode:
...
    def get_word_info(self) -> WordInfo:
        if not self._is_defined:
            return UNK
        if self.extra_word_info:
            return self.extra_word_info
        return self.lexicon.get_word_info(self.word_id)

In the second, after some spaCy model has been loaded, set_split_mode(nlp, self.split_mode) is called if self.split_mode is set.

https://github.com/megagonlabs/ginza/blob/d60d3d4bf4af0c5879357a5f95b36f72ea8eb317/ginza/analyzer.py#L69-L86

...
def set_nlp(self) -> None:
...
        else:
            # Work-around for pickle error. Need to share model data.
            if self.model_name_or_path:
                nlp = spacy.load(self.model_name_or_path)
            else:
                try:
                    nlp = spacy.load("ja_ginza_electra")
                except IOError as e:
                    try:
                        nlp = spacy.load("ja_ginza")
                    except IOError as e:
                        raise OSError("E050", 'You need to install "ja-ginza" or "ja-ginza-electra" by executing `pip install ja-ginza` or `pip install ja-ginza-electra`.')

            if self.disable_sentencizer:
                nlp.add_pipe("disable_sentencizer", before="parser")

            if self.split_mode:
                set_split_mode(nlp, self.split_mode)
...
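
The nested `try` blocks amount to "use the first model that loads": prefer `ja_ginza_electra`, fall back to `ja_ginza`, and raise if neither is installed. A flat version of that idea could look like the sketch below. To keep it runnable without spaCy, the loader is passed in as a parameter. `load_first_available` and `fake_load` are hypothetical names, and in real use the loader would presumably be `spacy.load`.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def load_first_available(names: Iterable[str], loader: Callable[[str], T]) -> T:
    """Return loader(name) for the first name that does not raise OSError."""
    tried = []
    for name in names:
        try:
            return loader(name)
        except OSError:  # IOError is an alias of OSError in Python 3
            tried.append(name)
    raise OSError(f"none of these models could be loaded: {tried}")


def fake_load(name: str) -> str:
    """Pretend that only ja_ginza is installed (illustration only)."""
    if name != "ja_ginza":
        raise OSError(f"model not found: {name}")
    return f"<pipeline {name}>"


print(load_first_available(["ja_ginza_electra", "ja_ginza"], fake_load))
# <pipeline ja_ginza>
```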

According to https://spacy.io/api/language#factory, registering a factory with @Language.factory is what lets the component be added via Language.add_pipe, or something along those lines.

https://github.com/megagonlabs/ginza/blob/d60d3d4bf4af0c5879357a5f95b36f72ea8eb317/ginza/__init__.py#L48-L54
https://github.com/megagonlabs/ginza/blob/d60d3d4bf4af0c5879357a5f95b36f72ea8eb317/ginza/__init__.py#L111-L114

@Language.factory(
    "compound_splitter",
    requires=[],
    assigns=[],
    retokenizes=True,
    default_config={"split_mode": None},
)
...
def set_split_mode(nlp: Language, mode: str):
    if nlp.has_pipe("compound_splitter"):
        splitter = nlp.get_pipe("compound_splitter")
        splitter.split_mode = mode
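
`set_split_mode` is a guarded attribute update: if the pipeline has a component registered under the name "compound_splitter", fetch it and overwrite its `split_mode`, otherwise do nothing. The sketch below mimics that with a tiny stand-in pipeline so it runs without spaCy. `ToyPipeline` and `ToySplitter` are invented for illustration; only the method names `has_pipe` and `get_pipe` mirror the calls in the quoted code, and the `add_pipe` here is a simplification that differs from spaCy's.

```python
from typing import Dict, Optional


class ToySplitter:
    """Hypothetical component that only stores its split mode."""

    def __init__(self, split_mode: Optional[str] = None):
        self.split_mode = split_mode


class ToyPipeline:
    """Minimal stand-in for a pipeline holding named components."""

    def __init__(self):
        self._pipes: Dict[str, object] = {}

    def add_pipe(self, name: str, component: object) -> None:
        self._pipes[name] = component

    def has_pipe(self, name: str) -> bool:
        return name in self._pipes

    def get_pipe(self, name: str) -> object:
        return self._pipes[name]


def set_split_mode(nlp: ToyPipeline, mode: str) -> None:
    """Same shape as the quoted function: a silent no-op without the component."""
    if nlp.has_pipe("compound_splitter"):
        nlp.get_pipe("compound_splitter").split_mode = mode


with_splitter = ToyPipeline()
with_splitter.add_pipe("compound_splitter", ToySplitter())
set_split_mode(with_splitter, "B")
print(with_splitter.get_pipe("compound_splitter").split_mode)  # B

without_splitter = ToyPipeline()
set_split_mode(without_splitter, "B")  # no error, nothing changes
```

One consequence of this shape: for a pipeline without that component, passing a split mode has no effect and produces no warning.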

compound_splitter has its own file (ginza/compound_splitter.py on the develop branch of megagonlabs/ginza), so it is probably a custom component added to the pipeline, but I don't understand spaCy pipelines well yet, so I'll study them and come back to this.

For now I have a feel for how an NLP library is implemented, so that's it for this post.

