A regular expression for extracting Japanese from text: a failing pattern and a working pattern

8/21/2014 - 3:21 PM

require 'pp'

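# Even with the /x flag, whitespace inside [...] stays literal, so the spaces
# and newlines used to lay out this class are themselves members of it.
Japanese = %r/[
  \p{Hiragana}
  \p{InKatakana}
  \p{Han}
  \p{InCJKSymbolsAndPunctuation}
  \p{InCJKUnifiedIdeographs}
]{2,}/x # `+` changed to `{2,}` so that a lone ASCII space no longer matches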

str = <<EOF
text in English
日本語のテキスト
にほんご　の　テキスト
EOF

pp str.scan(Japanese)
# => ["\n日本語のテキスト\nにほんご　の　テキスト\n"]

regex_take_01.rb

require 'pp'

Japanese = %r/[
  \p{Hiragana}
  \p{InKatakana}
  \p{Han}
  \p{InCJKSymbolsAndPunctuation}
  \p{InCJKUnifiedIdeographs}
]+/x # first take: with `+`, each lone ASCII space matches on its own

str = <<EOF
text in English
日本語のテキスト
にほんご　の　テキスト
EOF

pp str.scan(Japanese)
# => [" ", " ", "\n日本語のテキスト\nにほんご　の　テキスト\n"]

pp str.scan(Japanese).reject { |x| x.include?(' ') }
# => ["\n日本語のテキスト\nにほんご　の　テキスト\n"]
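Both results above still contain "\n", and the `+` version also picks up single ASCII spaces. The cause is that Ruby keeps whitespace inside a bracketed character class literal even under the /x flag, so the spaces and newlines used to indent the class become part of it. Below is a minimal sketch of one way around this: the same five properties, written on a single line with no /x flag. The names `JAPANESE_RUN` and `japanese_runs` are hypothetical and do not appear in the original snippet.

```ruby
# Hypothetical variant, not part of the original snippet.
# The class is written on one line, so no literal space or newline
# ends up inside the brackets.
JAPANESE_RUN = /[\p{Hiragana}\p{InKatakana}\p{Han}\p{InCJKSymbolsAndPunctuation}\p{InCJKUnifiedIdeographs}]+/

# Returns every run of Japanese characters found in text.
def japanese_runs(text)
  text.scan(JAPANESE_RUN)
end

# Same sample as above; \u3000 is the full-width (ideographic) space.
sample = "text in English\n日本語のテキスト\nにほんご\u3000の\u3000テキスト\n"
p japanese_runs(sample)
# => ["日本語のテキスト", "にほんご　の　テキスト"]
```

The full-width space still matches because U+3000 belongs to the CJK Symbols and Punctuation block. Dropping `\p{InCJKSymbolsAndPunctuation}` would split the second line into three runs, at the cost of also excluding punctuation such as 、 and 。.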
