mecab_text_cleaner package

mecab_text_cleaner.to_ascii_clean(text: str, reading_type: Literal['orth', 'pron', 'kana'] = 'pron', add_atype: bool = True, add_blank_between_words: bool = True, tagger: fugashi.Tagger = <fugashi.fugashi.Tagger object>, remove_multiple_spaces: bool = True) → str

Convert text to its reading, then to ASCII.

Parameters:
  • text (str) – The text to convert.

  • reading_type (Literal["orth", "pron", "kana"], optional) – Reading type, by default “pron”. “pron” is the pronunciation (発音形), “orth” is the orthography (書字形), and “kana” is the kana (仮名) form of the orthography.

  • add_atype (bool, optional) – Whether to consider aType (アクセント型) and add “]” to the reading, by default True

  • add_blank_between_words (bool, optional) – Whether to add a blank between words, by default True

  • when_unknown (Literal["passthrough", "*", "unidecode"] | Callable[[str], str], optional) – What to do when the reading is unknown (“補助記号” and “一般”), by default “passthrough”. “passthrough” passes the original text through, “*” outputs “*”, “unidecode” transliterates with unidecode, and a callable is called with the original text.

  • tagger (fugashi.Tagger, optional) – The tagger to use, by default fugashi.Tagger()

  • remove_multiple_spaces (bool, optional) – Whether to remove multiple spaces created by unidecode, by default True

Returns:

The ASCII-cleaned text

Return type:

str

Raises:

ImportError – When unidecode is not installed

Examples

>>> from mecab_text_cleaner import to_ascii_clean
>>> to_ascii_clean("     空、雲。\n雨!(")
'so]ra, ku]mo. \na]me!('
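
The effect of remove_multiple_spaces can be pictured with a small stand-alone sketch. The helper name collapse_spaces and the regular expression are assumptions made for illustration; they are not taken from the package and may differ from what it actually does.

```python
import re


def collapse_spaces(text: str) -> str:
    """Hypothetical helper: replace each run of two or more spaces with one space."""
    # Only the space character is matched, so "\n" line breaks are left untouched.
    return re.sub(r" {2,}", " ", text)


print(collapse_spaces("so]ra,  ku]mo.   \na]me!("))
```

Matching spaces only, rather than all whitespace, keeps line breaks in the output intact.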
mecab_text_cleaner.to_reading(text: str, reading_type: Literal['orth', 'pron', 'kana'] = 'pron', add_atype: bool = True, add_blank_between_words: bool = True, when_unknown: Literal['passthrough', '*', 'unidecode'] | Callable[[str], str] = 'passthrough', tagger: fugashi.Tagger = <fugashi.fugashi.Tagger object>) → str

Convert text to its reading. Note that MeCab interprets spaces as word boundaries, so spaces in the input are removed. Line breaks (\n only) are restored afterwards.
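
One way to picture the line handling described above is to convert each line separately and rejoin the results. This is only a sketch: convert_keeping_lines and drop_spaces are hypothetical names, and drop_spaces merely stands in for the tagging step, which is not reproduced here.

```python
from typing import Callable


def convert_keeping_lines(text: str, convert_line: Callable[[str], str]) -> str:
    """Hypothetical helper: convert each line on its own, then rejoin the lines."""
    # Split on "\n" only; other line separators such as "\r" get no special handling.
    lines = text.split("\n")
    return "\n".join(convert_line(line) for line in lines)


def drop_spaces(line: str) -> str:
    # Stand-in for the tagger step (an assumption): it only mimics the loss of spaces.
    return line.replace(" ", "")


print(convert_keeping_lines("     空、雲。\n雨!(", drop_spaces))
```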

Parameters:
  • text (str) – The text to convert.

  • reading_type (Literal["orth", "pron", "kana"], optional) – Reading type, by default “pron”. “pron” is the pronunciation (発音形), “orth” is the orthography (書字形), and “kana” is the kana (仮名) form of the orthography.

  • add_atype (bool, optional) – Whether to consider aType (アクセント型) and add “]” to the reading, by default True

  • add_blank_between_words (bool, optional) – Whether to add a blank between words, by default True

  • when_unknown (Literal["passthrough", "*", "unidecode"] | Callable[[str], str], optional) – What to do when the reading is unknown (“補助記号” and “一般”), by default “passthrough”. “passthrough” passes the original text through, “*” outputs “*”, “unidecode” transliterates with unidecode, and a callable is called with the original text.

  • tagger (fugashi.Tagger, optional) – The tagger to use, by default fugashi.Tagger()

Returns:

The reading

Return type:

str

Raises:

ImportError – When when_unknown="unidecode" and unidecode is not installed

Examples

>>> from mecab_text_cleaner import to_reading
>>> to_reading("     空、雲。\n雨!(")
'ソ]ラ、 ク]モ。\nア]メ!('
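
The when_unknown options listed under Parameters can be sketched as a small dispatch function. The name resolve_unknown, its signature, and the ValueError for unrecognised options are assumptions for illustration, not the package's own code. The unidecode import is deferred so that ImportError can only occur when that option is selected, which matches the Raises entry above.

```python
from typing import Callable, Literal, Union

WhenUnknown = Union[Literal["passthrough", "*", "unidecode"], Callable[[str], str]]


def resolve_unknown(surface: str, when_unknown: WhenUnknown = "passthrough") -> str:
    """Hypothetical helper: pick the output for a token whose reading is unknown."""
    if callable(when_unknown):
        # A callable receives the original text and returns the replacement.
        return when_unknown(surface)
    if when_unknown == "passthrough":
        return surface
    if when_unknown == "*":
        return "*"
    if when_unknown == "unidecode":
        # Imported lazily: raises ImportError only if unidecode is requested but missing.
        from unidecode import unidecode
        return unidecode(surface)
    # Rejecting other values is a choice made for this sketch.
    raise ValueError(f"unsupported when_unknown value: {when_unknown!r}")


print(resolve_unknown("!("))
print(resolve_unknown("!(", "*"))
print(resolve_unknown("!(", lambda s: f"<{s}>"))
```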

Submodules

mecab_text_cleaner.cli module