2022-05-14

Learning about NLP libraries while contributing to GiNZA

Environment setup

I skimmed the page "GiNZA - Japanese NLP Library | Universal Dependenciesに基づくオープンソース日本語NLPライブラリ" and, as a first step, tried $ pip install -U ginza ja_ginza, which failed with the following error.

  raise VersionConflict(dist, req).with_context(dependent_req)
       pkg_resources.VersionConflict: (setuptools 49.2.1 (/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages), Requirement.parse('setuptools>=58.0'))
       [end of output]
   
   note: This error originates from a subprocess, and is likely not a problem with pip.
 error: subprocess-exited-with-error
 
 × Getting requirements to build wheel did not run successfully.
 │ exit code: 1
 ╰─> See above for output.

It looked like a setuptools version problem, so I ran $ pip install setuptools --upgrade and then retried $ pip install -U ginza ja_ginza, but got the same error.

 $ pip install setuptools --upgrade
 Defaulting to user installation because normal site-packages is not writeable
 Requirement already satisfied: setuptools in /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages (49.2.1)
 Collecting setuptools
   Using cached setuptools-62.1.0-py3-none-any.whl (1.1 MB)
 Installing collected packages: setuptools
 Successfully installed setuptools-62.1.0

Left with no other option, I downloaded setuptools from https://pypi.org/project/setuptools/#files and installed it manually. That looked fine.

 $ sudo python3 setup.py install
 Installed /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools-62.1.0-py3.8.egg
 Processing dependencies for setuptools==62.1.0
 Finished processing dependencies for setuptools==62.1.0
 
 /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages

Then I deleted the setuptools that had been installed originally.

/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages
$ sudo rm -r setuptools

After this, $ pip install -U ginza ja_ginza succeeds.
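
The upgrade log above says "Defaulting to user installation", while the error still reports setuptools 49.2.1 under Xcode's site-packages, so it can help to check which version the current interpreter actually resolves. The following is a minimal sketch of such a check using only the standard library. The helper names (version_tuple, meets_minimum, report) are made up for this note, and the version parsing is deliberately simplified to dotted numeric versions.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Tuple


def version_tuple(text: str) -> Tuple[int, ...]:
    """Turn '49.2.1' into (49, 2, 1); stops at the first non-numeric part."""
    parts = []
    for part in text.split("."):
        if not part.isdigit():
            break  # simplified: ignore suffixes such as 'post1' or 'rc1'
        parts.append(int(part))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two dotted versions numerically, not as strings."""
    return version_tuple(installed) >= version_tuple(minimum)


def report(package: str = "setuptools", minimum: str = "58.0") -> str:
    """Describe the version of `package` that this interpreter resolves."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return f"{package} is not installed for this interpreter"
    status = "OK" if meets_minimum(installed, minimum) else "too old"
    return f"{package} {installed} (need >= {minimum}): {status}"
```

Running print(report()) with the same python3 that pip uses shows whether the old or the new setuptools is the one being picked up.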

Commits

I set up the development environment following https://github.com/megagonlabs/ginza#development-environment, read through the code, and committed fixes for the things that caught my eye.

I ran the tests following https://github.com/megagonlabs/ginza#run-tests and made a commit that raises the coverage.

Digging into `split_mode`

According to the existing command-line tests, split_mode can be set to A, B, or C, and the morphological analysis for each comes out as follows.

("A", "機能性食品", ["機能", "性", "食品"]),
("B", "機能性食品", ["機能性", "食品"]),
("C", "機能性食品", ["機能性食品"]),

Following the code further, split_mode is set in two places.

In the first place, when output_format is mecab (or "2", per the code below), the value is mapped to SudachiPy's SplitMode.

https://github.com/megagonlabs/ginza/blob/d60d3d4bf4af0c5879357a5f95b36f72ea8eb317/ginza/analyzer.py#L67-L68
https://github.com/megagonlabs/ginza/blob/d60d3d4bf4af0c5879357a5f95b36f72ea8eb317/ginza/analyzer.py#L23-L28

def try_sudachi_import(split_mode: str):
    """SudachiPy is required for Japanese support, so check for it.
    It it's not available blow up and explain how to fix it.
    split_mode should be one of these values: "A", "B", "C", None->"A"."""
    try:
        from sudachipy import dictionary, tokenizer

        split_mode = {
            None: tokenizer.Tokenizer.SplitMode.A,
            "A": tokenizer.Tokenizer.SplitMode.A,
            "B": tokenizer.Tokenizer.SplitMode.B,
            "C": tokenizer.Tokenizer.SplitMode.C,
        }[split_mode]
        tok = dictionary.Dictionary().create(mode=split_mode)
        return tok
...
def set_nlp(self) -> None:
        if self.output_format in ["2", "mecab"]:
            nlp = try_sudachi_import(self.split_mode)
...

Tracing further into the SudachiPy implementation, the value is passed in as mode and branched on as follows.

https://github.com/WorksApplications/SudachiPy/blob/6fb25bd20e206fc7e7e452b2dafdb2576dcf3a69/sudachipy/tokenizer.pyx#L127
https://github.com/WorksApplications/SudachiPy/blob/6fb25bd20e206fc7e7e452b2dafdb2576dcf3a69/sudachipy/tokenizer.pyx#L172-L180

def tokenize(self, text: str, mode=None, logger=None) -> MorphemeList:
        """ tokenize a text.

        In default tokenize text with SplitMode.C

        Args:
            text: input text
            mode: split mode
       ...
       """
...
def _split_path(self, path: List[LatticeNode], mode: SplitMode) -> List[LatticeNode]:
        if mode == self.SplitMode.C:
            return path
        new_path = []
        for node in path:
            if mode is self.SplitMode.A:
                wids = node.get_word_info().a_unit_split
            else:
                wids = node.get_word_info().b_unit_split

LatticeNode#get_word_info returns a WordInfo, and so on, but I could not follow it in detail.
I did confirm that the WordInfo constructor takes a_unit_split and b_unit_split.

https://github.com/WorksApplications/SudachiPy/blob/6fb25bd20e206fc7e7e452b2dafdb2576dcf3a69/sudachipy/latticenode.pyx#L79-L84

...
cdef class LatticeNode:
...
    def get_word_info(self) -> WordInfo:
        if not self._is_defined:
            return UNK
        if self.extra_word_info:
            return self.extra_word_info
        return self.lexicon.get_word_info(self.word_id)
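
To convince myself how the A/B/C branch could produce the three results from the command-line tests, here is a toy model in plain Python. It is not SudachiPy code. ToyWordInfo, LEXICON, the word ids, and the rule "keep the word as-is when the split list has at most one entry" are all assumptions made for illustration. The only parts taken from the source are the field names a_unit_split and b_unit_split and the expected outputs for 機能性食品.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyWordInfo:
    """Hypothetical stand-in for WordInfo: a surface plus split word ids."""
    surface: str
    a_unit_split: List[int] = field(default_factory=list)
    b_unit_split: List[int] = field(default_factory=list)


# Hypothetical lexicon: word id 0 is the longest (C) unit.
LEXICON: Dict[int, ToyWordInfo] = {
    0: ToyWordInfo("機能性食品", a_unit_split=[1, 2, 3], b_unit_split=[4, 3]),
    1: ToyWordInfo("機能"),
    2: ToyWordInfo("性"),
    3: ToyWordInfo("食品"),
    4: ToyWordInfo("機能性"),
}


def split_path(path: List[int], mode: str) -> List[str]:
    """Expand each word id in `path` according to split mode A, B, or C."""
    if mode == "C":
        # C is the longest unit, so the path is returned unchanged.
        return [LEXICON[wid].surface for wid in path]
    surfaces: List[str] = []
    for wid in path:
        info = LEXICON[wid]
        wids = info.a_unit_split if mode == "A" else info.b_unit_split
        if len(wids) <= 1:
            # Assumption: nothing to split, keep the word itself.
            surfaces.append(info.surface)
        else:
            surfaces.extend(LEXICON[w].surface for w in wids)
    return surfaces
```

With this toy lexicon, split_path([0], "A"), split_path([0], "B"), and split_path([0], "C") give the same three segmentations as the test cases listed earlier.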

In the second place, after some spaCy model has been loaded, set_split_mode(nlp, self.split_mode) is called.

https://github.com/megagonlabs/ginza/blob/d60d3d4bf4af0c5879357a5f95b36f72ea8eb317/ginza/analyzer.py#L69-L86

...
def set_nlp(self) -> None:
...
        else:
            # Work-around for pickle error. Need to share model data.
            if self.model_name_or_path:
                nlp = spacy.load(self.model_name_or_path)
            else:
                try:
                    nlp = spacy.load("ja_ginza_electra")
                except IOError as e:
                    try:
                        nlp = spacy.load("ja_ginza")
                    except IOError as e:
                        raise OSError("E050", 'You need to install "ja-ginza" or "ja-ginza-electra" by executing `pip install ja-ginza` or `pip install ja-ginza-electra`.')

            if self.disable_sentencizer:
                nlp.add_pipe("disable_sentencizer", before="parser")

            if self.split_mode:
                set_split_mode(nlp, self.split_mode)
...

https://spacy.io/api/language#factory explains, roughly, that registering with @Language.factory is what lets you use Language.add_pipe, and so on.

https://github.com/megagonlabs/ginza/blob/d60d3d4bf4af0c5879357a5f95b36f72ea8eb317/ginza/__init__.py#L48-L54
https://github.com/megagonlabs/ginza/blob/d60d3d4bf4af0c5879357a5f95b36f72ea8eb317/ginza/__init__.py#L111-L114

@Language.factory(
    "compound_splitter",
    requires=[],
    assigns=[],
    retokenizes=True,
    default_config={"split_mode": None},
)
...
def set_split_mode(nlp: Language, mode: str):
    if nlp.has_pipe("compound_splitter"):
        splitter = nlp.get_pipe("compound_splitter")
        splitter.split_mode = mode

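As for compound_splitter, there is a file for it (ginza/compound_splitter.py at develop · megagonlabs/ginza · GitHub), so it probably adds a custom function to the pipeline, but I do not understand Pipelines well yet, so I will study them and come back.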
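
As a mental model of the has_pipe / get_pipe flow in set_split_mode, here is a toy pipeline in plain Python. It is not spaCy. ToyPipeline, ToySplitter, and make_splitter are hypothetical names, and the class only mimics the general shape of a factory decorator plus add_pipe, has_pipe, and get_pipe. The real behaviour of Language should be checked against the spaCy documentation.

```python
from typing import Callable, Dict, List, Optional, Tuple


class ToySplitter:
    """Hypothetical component holding a mutable split_mode attribute."""

    def __init__(self, split_mode: Optional[str] = None) -> None:
        self.split_mode = split_mode

    def __call__(self, doc: List[str]) -> List[str]:
        # A real component would retokenize; this one only tags the doc.
        return doc + [f"<split_mode={self.split_mode}>"]


class ToyPipeline:
    """Toy model: named factories create components that run in order."""

    factories: Dict[str, Callable[..., Callable]] = {}

    def __init__(self) -> None:
        self._pipes: List[Tuple[str, Callable]] = []

    @classmethod
    def factory(cls, name: str) -> Callable:
        def register(func: Callable) -> Callable:
            cls.factories[name] = func  # remember how to build the component
            return func
        return register

    def add_pipe(self, name: str, **config) -> Callable:
        component = self.factories[name](**config)
        self._pipes.append((name, component))
        return component

    def has_pipe(self, name: str) -> bool:
        return any(pipe_name == name for pipe_name, _ in self._pipes)

    def get_pipe(self, name: str) -> Callable:
        for pipe_name, component in self._pipes:
            if pipe_name == name:
                return component
        raise KeyError(name)

    def __call__(self, doc: List[str]) -> List[str]:
        for _, component in self._pipes:
            doc = component(doc)
        return doc


@ToyPipeline.factory("compound_splitter")
def make_splitter(split_mode: Optional[str] = None) -> ToySplitter:
    return ToySplitter(split_mode)


def set_split_mode(nlp: ToyPipeline, mode: str) -> None:
    """Same shape as the GiNZA helper: a no-op when the pipe is absent."""
    if nlp.has_pipe("compound_splitter"):
        nlp.get_pipe("compound_splitter").split_mode = mode
```

The point of the sketch is that set_split_mode does not rebuild anything: it looks up an already-constructed component by name and mutates one attribute, and silently does nothing if the loaded model has no such component.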

For now I have a feel for how an NLP library is implemented, so that is it for today.

2022-04-05

Raised my TOEIC score from 675 to 825

English

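One year and three months ago I took the test after doing three commercial practice tests and scored 675 (305/370).
This time I decided to prepare seriously. I studied about half a day (roughly 3 hours) at a time for about a month and a half.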

For listening, I worked through 「極めろ! リスニング解答力 TOEIC® L&R TEST」.
It is 692 pages long, so it took perseverance.

極めろ! リスニング解答力 TOEIC® L&R TEST

For Reading Part 5, I worked through 「TOEIC L&Rテスト文法問題でる1000問」.
I am not sure whether I really needed to solve all 1000 questions.

TOEIC L&Rテスト文法問題でる1000問

Author: TEX加藤
アスク

Parts 6 and 7 seemed to require getting used to reading English, so I read 「英文解釈教室」.
It is a good book that covers most of the grammar.

英文解釈教室〈新装版〉

Author: 伊藤和夫
研究社

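To get used to reading long passages, I read 「英文詳説世界史 WORLD HISTORY for High School」.
There is a Japanese edition of the textbook, so I read the English and then skimmed the Japanese. It doubles as a review of world history, so it kills two birds with one stone.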

英文詳説世界史 WORLD HISTORY for High School

山川出版社

The English edition of the SRE book is also available online, so I read it alongside the Japanese edition.

sre.google

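To finish, I worked through the commercial practice tests in 「新メガ模試1200問」.
Counting 5 points per question, my scores were 750 (390/360) -> 745 (425/320) -> 810 (425/385) -> 790 (400/390) -> 845 (420/425) -> 800 (415/385), so I did better than expected on the real test.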

新メガ模試1200問 TOEIC® L&R テスト VOL.2

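I finished the reading section with time to spare and felt confident, but the score did not improve that much.
I think much of the listening gain came simply from getting used to the question formats and tendencies.
I am considering TOEFL as motivation for studying writing and speaking, but I am not sure it is worth 30,000 yen...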

2022-04-01

I joined Hatena Co., Ltd.

Events

I joined Hatena Co., Ltd.
Now in year 2.5.

株式会社はてなに入社しました - hitode909の日記

2022-03-27

Finding where Jest is run in create-react-app

React

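While trying to sort out the relationship between Testing Library and Jest in a project created with create-react-app, I got the mistaken idea that Testing Library was a test runner alongside Jest, AVA, Chai, and so on. I then tried to understand testing-library/jest-dom as a tool for migrating from Testing Library, as a test runner, to Jest.
I think the main cause of the misunderstanding was that jest did not appear in the dependencies or scripts of package.json.
Still, with "Library" in its name, Testing Library surely could not be a test runner, so I decided to look into it a bit more.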

create-react-app が裏で何をやっているか理解する - Qiita
This mostly settled it.
After npm run eject, jest shows up in dependencies!

However, npm run test still runs react-scripts test, so Jest itself is not clearly visible yet.
So I looked through create-react-app/packages/react-scripts.
/bin/react-scripts.js#L27 has the list of package.json scripts.
/scripts/test.js#L129 runs Jest with jest.run(argv);.

So Testing Library sits alongside Enzyme as a library for testing the DOM structure, and jest-dom provides the matchers needed when using Testing Library with Jest.

2022-03-13

Implementing Chapter 3 of Database Design and Implementation in Deno

TypeScript Deno

The content of "Database Design and Implementation" is covered in this article, so I will skip it; this is just a note that I have started implementing it in Deno.
tarovel4842.hatenablog.com

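I have never written Java or C++, so I wanted to implement it in JavaScript, the language I know best, and figured I might as well use Deno.
For now, the outline of Chapter 3 is implemented in this commit.
I do not know whether it is correct, but at this level of completeness any mistakes should surface later on.
The C++ implementation has been an extremely useful reference for implementing this in another language, more than a single star can repay.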

I need to work with Buffer, but I had never looked into it deeply, so I got started with an article on Node and also read articles on Deno.
はじめてのNode.js：Node.js内でバイナリデータを扱うための「Buffer」クラス | OSDN Magazine
Buffers in Deno | The JS runtimes
Buffer.from: Deno’s equivalent of Node.js | The JS runtimes

At first I planned to use std/io/buffer, but I need to write at specified offsets and have not yet checked carefully whether it supports that, so for now I am using std/node/buffer.
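
Writing at a given offset is the operation that drove the choice above. As a rough illustration, written in Python rather than Deno just to keep it short and self-contained, a fixed-size page with offset-based reads and writes might look like this. The Page class, its method names, the big-endian 4-byte integers, and the length-prefixed strings are all assumptions for this sketch, not the code in my commit.

```python
import struct


class Page:
    """Hypothetical fixed-size block with offset-based reads and writes."""

    INT_FORMAT = ">i"  # assumption: big-endian signed 32-bit integers
    INT_SIZE = struct.calcsize(INT_FORMAT)

    def __init__(self, block_size: int) -> None:
        self._buf = bytearray(block_size)

    def set_int(self, offset: int, value: int) -> None:
        # pack_into raises struct.error if the write would leave the buffer.
        struct.pack_into(self.INT_FORMAT, self._buf, offset, value)

    def get_int(self, offset: int) -> int:
        return struct.unpack_from(self.INT_FORMAT, self._buf, offset)[0]

    def set_bytes(self, offset: int, data: bytes) -> None:
        start = offset + self.INT_SIZE
        if start + len(data) > len(self._buf):
            raise ValueError("data does not fit in the page")
        self.set_int(offset, len(data))  # length prefix
        self._buf[start:start + len(data)] = data

    def get_bytes(self, offset: int) -> bytes:
        length = self.get_int(offset)
        start = offset + self.INT_SIZE
        return bytes(self._buf[start:start + length])

    def set_string(self, offset: int, text: str) -> None:
        self.set_bytes(offset, text.encode("utf-8"))

    def get_string(self, offset: int) -> str:
        return self.get_bytes(offset).decode("utf-8")
```

For example, after page = Page(64), calling page.set_int(8, 42) and page.set_string(16, "abc") leaves both values readable at their own offsets without touching the rest of the block.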

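There is also a place where I need to check whether a file exists. There did not seem to be a built-in for that, and in std/fs the whole module was unstable and exists was deprecated, so I am using std/node/fs for this too.
I just noticed that, as a result, I had been using fs.mkdir rather than Deno.mkdir.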

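As you can see, documentation and other information are more plentiful for std/node, which makes it easier, so I am not actively using the built-ins; in that respect I might as well use Node. Still, having no node_modules and the strong feeling of building from scratch already make Deno a pleasant development experience.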

2022-01-25

Notes on the breaking change that occurred in Axios v0.25.0

OSS activities

Axios v0.25.0 was released on January 18, 2022.

One of the breaking changes is "Adding error handling when missing url", which throws an Error if the url in the Request Config passed as an argument is falsy.
This was introduced to make bugs and error messages clearer, but an issue has been opened pointing out that the url is sometimes intentionally left falsy, for example when relying on the baseURL of the Request Config or when sending requests to GraphQL.
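
To show why a falsy-url check collides with the baseURL use case, here is a small sketch of the logic in Python. It is not Axios code. The function names, the base_url key, the error messages, and the second variant (one possible reading of the "allow empty strings" proposal) are all hypothetical.

```python
from typing import Mapping, Optional


def join_url(base_url: Optional[str], url: Optional[str]) -> str:
    """Combine a base URL and a relative URL; either part may be empty."""
    if not base_url:
        return url or ""
    if not url:
        return base_url
    return base_url.rstrip("/") + "/" + url.lstrip("/")


def resolve_strict(config: Mapping[str, Optional[str]]) -> str:
    """Reject any falsy url, in the spirit of the breaking change."""
    if not config.get("url"):
        raise ValueError("config url is missing or empty")
    return join_url(config.get("base_url"), config.get("url"))


def resolve_allowing_empty(config: Mapping[str, Optional[str]]) -> str:
    """Hypothetical alternative: fail only when there is nothing to call."""
    url = config.get("url")
    base_url = config.get("base_url")
    if not url and not base_url:
        raise ValueError("neither url nor base_url is set")
    return join_url(base_url, url)
```

With a config such as {"base_url": "https://example.com/graphql", "url": ""}, the strict version raises even though the request target is fully determined by the base URL, which is the situation the issue describes.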

Some users say that they need the aborted event handler added in the same release in order to upgrade to Node 14, but cannot upgrade Axios because of this change. As the implementer of this breaking change, I think this should be addressed quickly.

The issue proposes the following fixes: "Allow empty strings", "Create a new option", and "Change the location of the error handling".
The maintainer, jasonsaayman, replied: "It would be nice to keep everyone happy, so I would like to give this some more thought come up with a solid solution to this, I will revert back after the weekend with my ideas."

PS: January 27, 2022.
The PR that reverts the change has been merged and will be released soon.
https://github.com/axios/axios/issues/4407#issuecomment-1022894805

PS: February 18, 2022.
The fix has been released.
Release v0.26.0 · axios/axios · GitHub

2022-01-25

Points to note about the breaking change introduced in Axios v0.25.0

OSS activities

English version:
Notes on the breaking change that occurred in Axios v0.25.0 - 経験は何よりも饒舌

Axios v0.25.0 was released on January 18, 2022.

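One of its breaking changes is "Adding error handling when missing url".
With this change, an Error is thrown when the url in the Request Config passed as an argument is falsy.
It was introduced to make bugs and error messages clearer, but an issue has been opened pointing out that the url is sometimes intentionally left falsy, for example when relying on the baseURL of the Request Config or when sending requests to GraphQL.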

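One comment says that the aborted event handler added in the same release is needed in order to upgrade to Node 14, but that Axios cannot be upgraded because of this change. As the implementer of this breaking change, I also think this should be addressed quickly.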

The issue proposes three fixes: "allow empty strings", "create a new option", and "change the location of the error handling". The maintainer, @jasonsaayman, has replied:
"It would be nice to keep everyone happy, so I would like to give this some more thought come up with a solid solution to this, I will revert back after the weekend with my ideas."
That is where things stand.

Update, 2022/1/27
The PR that reverts the change has been merged and will be released soon.
https://github.com/axios/axios/issues/4407#issuecomment-1022894805

Update, 2022/2/18
The fix has been released.
Release v0.26.0 · axios/axios · GitHub
