mecab_text_cleaner package
- mecab_text_cleaner.to_ascii_clean(text: str, reading_type: Literal['orth', 'pron', 'kana'] = 'pron', add_atype: bool = True, add_blank_between_words: bool = True, tagger: fugashi.Tagger = <fugashi.fugashi.Tagger object>, remove_multiple_spaces: bool = True) -> str
Convert text to reading, then to ascii.
- Parameters:
text (str) – The text to convert.
reading_type (Literal["orth", "pron", "kana"], optional) – Reading type, by default “pron”. “pron” is the pronunciation (発音形), “orth” is the orthography (書字形), and “kana” is the kana (仮名) form of the orthography.
add_atype (bool, optional) – Whether to consider aType (アクセント型) and add “]” to the reading, by default True
add_blank_between_words (bool, optional) – Whether to add a blank between words, by default True
when_unknown (Literal["passthrough", "*", "unidecode"] or Callable[[str], str], optional) – What to do when the reading is unknown (“補助記号” and “一般”), by default “passthrough”. “passthrough” passes the original text through, “*” substitutes “*”, “unidecode” uses unidecode, and a callable is called with the original text.
tagger (fugashi.Tagger, optional) – The tagger to use, by default fugashi.Tagger()
remove_multiple_spaces (bool, optional) – Whether to remove multiple spaces created by unidecode, by default True
- Returns:
The ascii-cleaned text
- Return type:
str
- Raises:
ImportError – When unidecode is not installed
Examples
>>> from mecab_text_cleaner import to_ascii_clean
>>> to_ascii_clean(" 空、雲。\n雨!(")
'so]ra, ku]mo. \na]me!('
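The Raises entry above notes that an ImportError occurs when unidecode is not installed. The following is a minimal sketch of guarding against that case, assuming the error may surface either when the package is imported or when the function is called. The helper name `ascii_or_original` is hypothetical and not part of mecab_text_cleaner.

```python
def ascii_or_original(text: str) -> str:
    """Return the ASCII reading of `text`, or `text` unchanged if a dependency is missing.

    Hypothetical helper for illustration; not part of mecab_text_cleaner.
    """
    try:
        # Imported lazily so that a missing package is handled the same way
        # as a missing unidecode installation.
        from mecab_text_cleaner import to_ascii_clean

        return to_ascii_clean(text, reading_type="pron", add_atype=True)
    except ImportError:
        # Documented failure mode: unidecode (or the package itself) is absent.
        return text
```

Falling back to the unchanged input is only one possible policy; a pipeline that requires ASCII output would more likely let the ImportError propagate.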
- mecab_text_cleaner.to_reading(text: str, reading_type: Literal['orth', 'pron', 'kana'] = 'pron', add_atype: bool = True, add_blank_between_words: bool = True, when_unknown: Literal['passthrough', '*', 'unidecode'] | Callable[[str], str] = 'passthrough', tagger: fugashi.Tagger = <fugashi.fugashi.Tagger object>) -> str
Convert text to reading. Note that MeCab interprets spaces as word boundaries, so spaces in the input are removed. Line breaks (\n only) are restored afterwards.
- Parameters:
text (str) – The text to convert.
reading_type (Literal["orth", "pron", "kana"], optional) – Reading type, by default “pron”. “pron” is the pronunciation (発音形), “orth” is the orthography (書字形), and “kana” is the kana (仮名) form of the orthography.
add_atype (bool, optional) – Whether to consider aType (アクセント型) and add “]” to the reading, by default True
add_blank_between_words (bool, optional) – Whether to add a blank between words, by default True
when_unknown (Literal["passthrough", "*", "unidecode"] or Callable[[str], str], optional) – What to do when the reading is unknown (“補助記号” and “一般”), by default “passthrough”. “passthrough” passes the original text through, “*” substitutes “*”, “unidecode” uses unidecode, and a callable is called with the original text.
tagger (fugashi.Tagger, optional) – The tagger to use, by default fugashi.Tagger()
- Returns:
The reading
- Return type:
str
- Raises:
ImportError – When when_unknown="unidecode" and unidecode is not installed
Examples
>>> from mecab_text_cleaner import to_reading
>>> to_reading(" 空、雲。\n雨!(")
'ソ]ラ、 ク]モ。\nア]メ!('
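The when_unknown parameter also accepts a callable, which is called with the original text. The sketch below shows one way to supply such a handler. The names `mark_unknown` and `reading_with_marks` are hypothetical, and exactly which span of text the handler receives is an assumption, since the documentation above says only "the original text".

```python
def mark_unknown(original: str) -> str:
    """Wrap text that has no known reading in angle brackets (hypothetical handler)."""
    return f"<{original}>"


def reading_with_marks(text: str) -> str:
    """Call to_reading with the custom handler.

    Requires mecab_text_cleaner to be installed; the import is done lazily
    so that this module can be loaded without the package.
    """
    from mecab_text_cleaner import to_reading

    # Assumption: the handler is invoked for text whose reading is unknown.
    return to_reading(text, when_unknown=mark_unknown)
```

A callable is useful when none of the three built-in strategies fits, for example to make unresolved text easy to find in the output.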