mecab_text_cleaner package

mecab_text_cleaner.to_ascii_clean(text: str, reading_type: Literal['orth', 'pron', 'kana'] = 'pron', add_atype: bool = True, add_blank_between_words: bool = True, tagger: fugashi.Tagger = <fugashi.fugashi.Tagger object>, remove_multiple_spaces: bool = True) → str

Convert text to its reading, then to ASCII.

Parameters:

text (str) – The text to convert.
reading_type (Literal["orth", "pron", "kana"], optional) – Reading type, by default “pron”. “pron” is the pronunciation (発音形), “orth” is the orthography (書字形), and “kana” is the kana (仮名) form of the orthography.
add_atype (bool, optional) – Whether to consider aType (アクセント型) and add “]” to the reading, by default True
add_blank_between_words (bool, optional) – Whether to add a blank between words, by default True
when_unknown (Literal["passthrough", "*", "unidecode"] | Callable[[str], str], optional) – What to do when the reading is unknown (“補助記号” and “一般”), by default “passthrough”. “passthrough” passes the original text through, “*” emits “*”, “unidecode” uses unidecode, and a callable is called with the original text.
tagger (fugashi.Tagger, optional) – The tagger to use, by default fugashi.Tagger()
remove_multiple_spaces (bool, optional) – Whether to remove multiple spaces created by unidecode, by default True

Returns:

The ascii-cleaned text

Return type:

str

Raises:

ImportError – When unidecode is not installed

Examples

>>> from mecab_text_cleaner import to_ascii_clean
>>> to_ascii_clean("     空、雲。\n雨！（")
'so]ra, ku]mo. \na]me!('
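
The remove_multiple_spaces option can be pictured with a short sketch. The helper below is hypothetical (collapse_spaces is not part of the package) and assumes that "removing multiple spaces" means collapsing each run of two or more spaces into one while leaving newlines alone; the library's actual implementation may differ.

```python
import re


def collapse_spaces(text: str) -> str:
    """Hypothetical helper: collapse runs of two or more spaces into one.

    Only the space character is matched, so newlines are left untouched.
    """
    return re.sub(r" {2,}", " ", text)


# Transliteration of full-width punctuation can leave doubled spaces behind.
print(repr(collapse_spaces("so]ra,  ku]mo.   \na]me!(")))
# prints 'so]ra, ku]mo. \na]me!('
```

Matching only the space character, rather than all whitespace with \s, keeps the line breaks of the input intact.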

mecab_text_cleaner.to_reading(text: str, reading_type: Literal['orth', 'pron', 'kana'] = 'pron', add_atype: bool = True, add_blank_between_words: bool = True, when_unknown: Literal['passthrough', '*', 'unidecode'] | Callable[[str], str] = 'passthrough', tagger: fugashi.Tagger = <fugashi.fugashi.Tagger object>) → str

Convert text to its reading. Note that MeCab interprets spaces as word boundaries, so spaces in the input are removed. Line breaks (\n only) are restored afterwards.
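
One way to picture the space and line-break handling is the sketch below. It is an assumption about the general approach, not the package's code: convert_preserving_newlines and its convert_line argument are hypothetical names, and str.upper merely stands in for the real tagger-based conversion.

```python
from typing import Callable


def convert_preserving_newlines(text: str, convert_line: Callable[[str], str]) -> str:
    """Hypothetical sketch: convert each line separately, then rejoin with "\n"."""
    # Split on "\n" only; other line separators are not treated specially here.
    lines = text.split("\n")
    # Drop spaces before conversion, since they would not survive tagging anyway.
    converted = [convert_line(line.replace(" ", "")) for line in lines]
    # Restore the original line breaks.
    return "\n".join(converted)


# str.upper is a stand-in converter for illustration only.
print(repr(convert_preserving_newlines("  a b\nc d ", str.upper)))
# prints 'AB\nCD'
```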

Parameters:

text (str) – The text to convert.
reading_type (Literal["orth", "pron", "kana"], optional) – Reading type, by default “pron”. “pron” is the pronunciation (発音形), “orth” is the orthography (書字形), and “kana” is the kana (仮名) form of the orthography.
add_atype (bool, optional) – Whether to consider aType (アクセント型) and add “]” to the reading, by default True
add_blank_between_words (bool, optional) – Whether to add a blank between words, by default True
when_unknown (Literal["passthrough", "*", "unidecode"] | Callable[[str], str], optional) – What to do when the reading is unknown (“補助記号” and “一般”), by default “passthrough”. “passthrough” passes the original text through, “*” emits “*”, “unidecode” uses unidecode, and a callable is called with the original text.
tagger (fugashi.Tagger, optional) – The tagger to use, by default fugashi.Tagger()

Returns:

The reading

Return type:

str

Raises:

ImportError – If when_unknown="unidecode" and unidecode is not installed

Examples

>>> from mecab_text_cleaner import to_reading
>>> to_reading("     空、雲。\n雨！（")
'ソ]ラ、 ク]モ。\nア]メ！（'
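
The when_unknown options described above can be sketched as a small dispatch function. This is an illustration of the documented behaviour under stated assumptions, not the package's implementation: resolve_unknown and the WhenUnknown alias are hypothetical names, and the ValueError for unrecognised options is an addition of this sketch.

```python
from typing import Callable, Union

# Hypothetical alias mirroring the documented type of when_unknown.
WhenUnknown = Union[str, Callable[[str], str]]


def resolve_unknown(surface: str, when_unknown: WhenUnknown = "passthrough") -> str:
    """Hypothetical sketch: decide what to emit for a token with no known reading."""
    if callable(when_unknown):
        # A user-supplied callable receives the original text.
        return when_unknown(surface)
    if when_unknown == "passthrough":
        return surface
    if when_unknown == "*":
        return "*"
    if when_unknown == "unidecode":
        try:
            # Imported lazily so the dependency stays optional.
            from unidecode import unidecode
        except ImportError as exc:
            raise ImportError("unidecode is required for when_unknown='unidecode'") from exc
        return unidecode(surface)
    # Assumption of this sketch: reject anything else explicitly.
    raise ValueError(f"unsupported when_unknown: {when_unknown!r}")


print(resolve_unknown("！"))                  # passthrough keeps the original text
print(resolve_unknown("！", "*"))             # emits "*"
print(resolve_unknown("！", lambda s: "!"))   # callable result is used as-is
```

Importing unidecode only inside the "unidecode" branch matches the documented behaviour that ImportError is raised only when that option is selected.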


Submodules

mecab_text_cleaner.cli module