hub / github.com/fxsjy/jieba / tokenize

Method tokenize

jieba/__init__.py:476–507

Tokenizes a sentence and yields tuples of (word, start, end), where start and end are character offsets into the input and end is exclusive.

Parameters:
- unicode_sentence: the str (unicode) to be segmented.
- mode: "default" or "search"; "search" gives finer segmentation.
- HMM: whether to use the Hidden Markov Model.
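The shape of the yielded tuples can be illustrated without jieba installed. The following is a minimal sketch, assuming the words have already been produced by a segmenter; `tokenize_default` and the sample word list are hypothetical names chosen for illustration, not part of jieba. It shows that the offsets are cumulative character positions, so slicing the input with them recovers each word.

```python
def tokenize_default(words):
    """Yield (word, start, end) for an already-segmented word sequence.

    Hypothetical helper: it mirrors the offset bookkeeping of the
    default mode, but takes the segmenter output as an argument.
    """
    start = 0
    for w in words:
        width = len(w)
        yield (w, start, start + width)
        start += width


# Assumed segmentation, picked by hand for illustration only.
words = ["我", "来到", "北京"]
sentence = "".join(words)
tokens = list(tokenize_default(words))

# Each (start, end) pair slices the original word back out of the sentence.
for word, start, end in tokens:
    assert sentence[start:end] == word
```

Because the offsets count characters rather than bytes, they line up with ordinary string slicing on the input text.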

(self, unicode_sentence, mode="default", HMM=True)
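The parameter name signals that the method expects text rather than bytes, and the source raises ValueError otherwise. Below is a minimal sketch of a caller-side guard, assuming Python 3 (where text is `str`) and UTF-8 encoded input; `ensure_text` is a hypothetical helper, not a jieba function.

```python
def ensure_text(sentence, encoding="utf-8"):
    """Return sentence as str, decoding bytes with the assumed encoding."""
    if isinstance(sentence, bytes):
        return sentence.decode(encoding)
    if isinstance(sentence, str):
        return sentence
    raise TypeError("expected str or bytes, got %s" % type(sentence).__name__)


raw = "北京".encode("utf-8")  # bytes, as might be read from a file or socket
text = ensure_text(raw)       # str, suitable for passing on to a tokenizer
```

Decoding at the boundary keeps the encoding assumption in one place instead of scattering it across call sites.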

Source

def tokenize(self, unicode_sentence, mode="default", HMM=True):
    """
    Tokenize a sentence and yields tuples of (word, start, end)

    Parameter:
    - sentence: the str(unicode) to be segmented.
    - mode: "default" or "search", "search" is for finer segmentation.
    - HMM: whether to use the Hidden Markov Model.
    """
    if not isinstance(unicode_sentence, text_type):
        raise ValueError("jieba: the input parameter should be unicode.")
    start = 0
    if mode == 'default':
        for w in self.cut(unicode_sentence, HMM=HMM):
            width = len(w)
            yield (w, start, start + width)
            start += width
    else:
        for w in self.cut(unicode_sentence, HMM=HMM):
            width = len(w)
            if len(w) > 2:
                for i in xrange(len(w) - 1):
                    gram2 = w[i:i + 2]
                    if self.FREQ.get(gram2):
                        yield (gram2, start + i, start + i + 2)
            if len(w) > 3:
                for i in xrange(len(w) - 2):
                    gram3 = w[i:i + 3]
                    if self.FREQ.get(gram3):
                        yield (gram3, start + i, start + i + 3)
            yield (w, start, start + width)
            start += width
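The "search" branch can be exercised in isolation. What follows is a sketch under stated assumptions rather than the library's code path: `tokenize_search` is a hypothetical standalone function, `words` stands in for the output of `self.cut`, and `freq` is a made-up dictionary standing in for `self.FREQ`. Python 3 `range` replaces `xrange`.

```python
def tokenize_search(words, freq):
    """Yield (word, start, end) tuples following the "search" branch.

    Hypothetical helper: `words` replaces the segmenter output and
    `freq` replaces the dictionary lookup used in the source.
    """
    start = 0
    for w in words:
        width = len(w)
        if width > 2:
            # Emit every 2-character window with a truthy frequency.
            for i in range(width - 1):
                gram2 = w[i:i + 2]
                if freq.get(gram2):
                    yield (gram2, start + i, start + i + 2)
        if width > 3:
            # Emit every 3-character window with a truthy frequency.
            for i in range(width - 2):
                gram3 = w[i:i + 3]
                if freq.get(gram3):
                    yield (gram3, start + i, start + i + 3)
        # The full word always follows its sub-grams.
        yield (w, start, start + width)
        start += width


# Made-up data: ASCII stand-ins keep the offsets easy to check by eye.
freq = {"ab": 5, "cd": 2, "bcd": 1, "bc": 0}
tokens = list(tokenize_search(["abcd", "ef"], freq))
```

Two details are visible in this sketch. Sub-grams are yielded before the word that contains them, so offsets in "search" mode are not monotonically increasing. And because the check is on the truthiness of the lookup rather than on membership, an entry with frequency 0 (here "bc") is skipped.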

Callers (6)

__call__ · Method · 0.80
cuttest · Function · 0.80
cuttest · Function · 0.80
testTokenize · Method · 0.80
testTokenize_NOHMM · Method · 0.80
demo.py · File · 0.80

Calls (1)

cut · Method · 0.95

Tested by (4)

cuttest · Function · 0.64
cuttest · Function · 0.64
testTokenize · Method · 0.64
testTokenize_NOHMM · Method · 0.64