CKIP Classic NLP Tools

Introduction

A Linux Python wrapper for CKIP classic tools — CKIP Word Segmentation and CKIP Parser.

Attention

Please use CKIPNLP for structured data types and pipeline drivers.

Attention

For Python 2 users, please use PyCkip 0.4.2 instead.

Contributors

Requirements

Note that one should have CKIPWS/CKIPParser for this project:

Installation

Attention

  • Offline version: CKIPWS (Academic/Commercial License) and CKIPParser (Commercial License).

  • Online version: CKIPParser (Academic License).

Offline Version

Download CKIPWS and/or CKIPParser from the links above. Let <ckipws-linux-root> denote the folder containing CKIPWS, and <ckipparser-linux-root> the folder containing CKIPParser.

pip install --force-reinstall --upgrade ckip-classic \
   --install-option='--ws' \
   --install-option='--ws-dir=<ckipws-linux-root>' \
   --install-option='--parser' \
   --install-option='--parser-dir=<ckipparser-linux-root>'

Omit the ws or parser options if you do not have CKIPWS or CKIPParser, respectively.

Attention

Please use absolute paths.
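The offline install command can also be assembled programmatically so that the directories are always absolute. The following is a minimal sketch, not part of ckip-classic: the helper name build_pip_command is hypothetical, and it uses only the flags shown in the command above.

```python
import os
import sys


def build_pip_command(ws_dir=None, parser_dir=None):
    """Return the pip argument list for an offline install.

    Hypothetical helper; it only reuses the flags shown in this section.
    """
    cmd = [sys.executable, "-m", "pip", "install",
           "--force-reinstall", "--upgrade", "ckip-classic"]
    if ws_dir is not None:
        # Resolve to an absolute path, as the installer expects.
        cmd += ["--install-option=--ws",
                "--install-option=--ws-dir=" + os.path.abspath(ws_dir)]
    if parser_dir is not None:
        cmd += ["--install-option=--parser",
                "--install-option=--parser-dir=" + os.path.abspath(parser_dir)]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True). Leaving an argument as None omits the corresponding ws or parser options.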

Online Version

Register an account at http://parser.iis.sinica.edu.tw/v1/reg.exe

pip install --upgrade ckip-classic

Installation Options

Option                                 Detail                           Default Value
--[no-]ws                              Enable/disable CKIPWS.           False
--[no-]parser                          Enable/disable CKIPParser.       False
--ws-dir=<ws-dir>                      CKIPWS root directory.
--ws-lib-dir=<ws-lib-dir>              CKIPWS libraries directory.      <ws-dir>/lib
--ws-share-dir=<ws-share-dir>          CKIPWS share directory.          <ws-dir>
--parser-dir=<parser-dir>              CKIPParser root directory.
--parser-lib-dir=<parser-lib-dir>      CKIPParser libraries directory.  <parser-dir>/lib
--parser-share-dir=<parser-share-dir>  CKIPParser share directory.      <parser-dir>
--data2-dir=<data2-dir>                “Data2” directory.               <ws-share-dir>/Data2
--rule-dir=<rule-dir>                  “Rule” directory.                <parser-share-dir>/Rule
--rdb-dir=<rdb-dir>                    “RDB” directory.                 <parser-share-dir>/RDB
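The default values chain together: the share directories default to the root directories, and the Data2, Rule and RDB directories default to subfolders of the share directories. The sketch below illustrates that resolution order. It is an assumption-laden illustration rather than the installer's actual code: the function name resolve_install_dirs and the underscore-style keys are hypothetical.

```python
import posixpath


def resolve_install_dirs(ws_dir=None, parser_dir=None, **overrides):
    """Fill in directory options using the documented defaults.

    Keys mirror the option names with dashes replaced by underscores.
    Illustrative only; not part of ckip_classic.
    """
    dirs = {"ws_dir": ws_dir, "parser_dir": parser_dir}
    dirs.update(overrides)

    def default(key, base_key, *suffix):
        # Derive a default only when the option is unset and its base is known.
        if dirs.get(key) is None and dirs.get(base_key) is not None:
            dirs[key] = posixpath.join(dirs[base_key], *suffix)

    default("ws_lib_dir", "ws_dir", "lib")
    default("ws_share_dir", "ws_dir")
    default("parser_lib_dir", "parser_dir", "lib")
    default("parser_share_dir", "parser_dir")
    # These depend on the share directories resolved just above.
    default("data2_dir", "ws_share_dir", "Data2")
    default("rule_dir", "parser_share_dir", "Rule")
    default("rdb_dir", "parser_share_dir", "RDB")
    return dirs
```

For example, overriding only ws_share_dir also moves the default Data2 location, because data2_dir is derived from the share directory rather than from the root.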

Usage

See https://ckip-classic.readthedocs.io/ for API details.

CKIPWS

CKIP Word Segmentation offline driver.

import ckip_classic.ws
print(ckip_classic.__name__, ckip_classic.__version__)

ws = ckip_classic.ws.CkipWs(logger=False)
print(ws('中文字喔'))
for line in ws.apply_list(['中文字喔', '啊哈哈哈']):
    print(line)

ws.apply_file(ifile='sample/sample.txt', ofile='output/sample.tag', uwfile='output/sample.uw')
with open('output/sample.tag') as fin:
    print(fin.read())
with open('output/sample.uw') as fin:
    print(fin.read())

CKIPParser

CKIP Parser offline driver.

import ckip_classic.parser
print(ckip_classic.__name__, ckip_classic.__version__)

ps = ckip_classic.parser.CkipParser(logger=False)
print(ps('中文字喔'))
for line in ps.apply_list(['中文字喔', '啊哈哈哈']):
    print(line)

ps.apply_file(ifile='sample/sample.txt', ofile='output/sample.tree')
with open('output/sample.tree') as fin:
    print(fin.read())

CKIPParserClient

CKIP Parser online client.

import ckip_classic.client
print(ckip_classic.__name__, ckip_classic.__version__)

ps = ckip_classic.client.CkipParserClient(username='USERNAME', password='PASSWORD')
print(ps('中文字(Na) 耶(T) ,(COMMACATEGORY)'))
for line in ps.apply_list(['中文字(Na) 耶(T) ,(COMMACATEGORY)', '啊(I) 哈(D) 哈(D) 哈(D) 。(PERIODCATEGORY)']):
    print(line)
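The online client takes text that is already segmented and tagged, with each token written as word(POS). A small sketch for splitting such a line into pairs is shown here. It assumes, based only on the example strings above, that tokens are separated by whitespace and end with a parenthesized tag; the function name parse_tagged is hypothetical and not part of ckip_classic.

```python
def parse_tagged(line):
    """Split 'word(POS) word(POS) ...' into (word, POS) pairs."""
    pairs = []
    for token in line.split():  # split() also treats full-width spaces as whitespace
        # Partition on the last '(' so that a word which is itself '(' still works.
        word, sep, tag = token.rpartition("(")
        if sep and tag.endswith(")"):
            pairs.append((word, tag[:-1]))
        else:
            pairs.append((token, None))  # no recognizable tag on this token
    return pairs
```

Tokens without a trailing tag are kept with None as the tag rather than dropped, so malformed input stays visible to the caller.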

FAQ

Danger

Due to the C code implementation, both CkipWs and CkipParser can only be instantiated once.


Warning

CKIPParser fails if the input text contains special characters such as ()+-:|. One may replace these characters with their full-width counterparts:

text = (
    text
    .replace('(', '（')
    .replace(')', '）')
    .replace('+', '＋')
    .replace('-', '－')
    .replace(':', '：')
    .replace('|', '｜')
)
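The same substitution can be done in a single pass with str.translate. This sketch assumes that the full-width forms of the six characters are the intended replacements; the helper name sanitize is hypothetical.

```python
# Map each problematic ASCII character to its full-width counterpart.
_FULLWIDTH = str.maketrans({
    "(": "\uFF08",
    ")": "\uFF09",
    "+": "\uFF0B",
    "-": "\uFF0D",
    ":": "\uFF1A",
    "|": "\uFF5C",
})


def sanitize(text):
    """Return text with ()+-:| replaced by full-width forms in one pass."""
    return text.translate(_FULLWIDTH)
```

Calling sanitize on each sentence before passing it to the parser keeps the replacement table in one place.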

Tip

The installation fails with “fatal error: Python.h: No such file or directory”. What should I do?

Install the Python development package:

sudo apt-get install python3-dev

Tip

The CKIPWS throws “what(): locale::facet::_S_create_c_locale name not valid”. What should I do?

Install locale data.

apt-get install locales-all

Tip

The CKIPParser throws “ImportError: libCKIPParser.so: cannot open shared object file: No such file or directory”. What should I do?

Add the following line to ~/.bashrc:

export LD_LIBRARY_PATH=<ckipparser-linux-root>/lib:$LD_LIBRARY_PATH

License

GPL-3.0

Copyright (c) 2018-2020 CKIP Lab under the GPL-3.0 License.

ckip_classic package

Subpackages

ckip_classic.client package

class ckip_classic.client.CkipParserClient(*, username=None, password=None)

Bases: object

The CKIP sentence parsing client.

Parameters
  • username (str) – the username (defaults to the environment variable $CKIPPARSER_USERNAME).

  • password (str) – the password (defaults to the environment variable $CKIPPARSER_PASSWORD).

Note

One may register an account at http://parser.iis.sinica.edu.tw/v1/reg.exe
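The documented fallback to environment variables can be pictured as follows. This is a sketch of the lookup order only, not the client's actual code: the function name resolve_credentials is hypothetical, and raising ValueError when a credential is missing is a choice made for this example.

```python
import os


def resolve_credentials(username=None, password=None):
    """Return (username, password), falling back to the environment."""
    if username is None:
        username = os.environ.get("CKIPPARSER_USERNAME")
    if password is None:
        password = os.environ.get("CKIPPARSER_PASSWORD")
    if not username or not password:
        # Assumption of this sketch; the real client's behavior is not documented here.
        raise ValueError("CKIP parser credentials are not set")
    return username, password
```

Explicit arguments take precedence, so a script can override one value while still reading the other from the environment.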

apply(text)

Parse a sentence.

Parameters

text (str) – the input sentence.

Returns

str – the output sentence.

Hint

One may also call this method as __call__().

apply_list(ilist)

Parse a list of sentences.

Parameters

ilist (List[str]) – the list of input sentences.

Returns

List[str] – the list of output sentences.

Submodules

ckip_classic.ini module

ckip_classic.ini.create_ws_lex(*lex_list)

Generate CKIP word segmentation lexicon file.

Parameters

*lex_list (Tuple[str, str]) – the lexicon word and its POS-tag.

Returns

  • lex_file (str) – the name of the lexicon file.

  • f_lex (TextIO) – the file object.

Attention

Remember to close f_lex manually.
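Since the function returns an open file object, wrapping it in contextlib.closing is one way to make sure it gets closed. The sketch below uses a stand-in, fake_create_ws_lex, which is hypothetical and only mimics the documented return shape (file name, file object). The line layout it writes is a placeholder and not the real CKIPWS lexicon format.

```python
import contextlib
import os
import tempfile


def fake_create_ws_lex(*lex_list):
    """Stand-in returning (file name, open file object) like the documented API."""
    f_lex = tempfile.NamedTemporaryFile(
        "w+", suffix=".lex", delete=False, encoding="utf-8")
    for word, pos in lex_list:
        f_lex.write(f"{word}\t{pos}\n")  # placeholder layout, not the real format
    f_lex.flush()
    return f_lex.name, f_lex


lex_file, f_lex = fake_create_ws_lex(("中文字", "Na"), ("啊哈哈哈", "I"))
with contextlib.closing(f_lex):
    # Use lex_file here, for example as the lexicon path of a config.
    lex_exists_while_open = os.path.exists(lex_file)
# f_lex is closed once the with-block exits.
os.remove(lex_file)  # clean up the stand-in's temporary file
```

The same pattern applies to the f_ini objects returned by the config generators. Whether the real files persist after closing is not stated here, so the sketch keeps all use of the file inside the with-block.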

ckip_classic.ini.create_ws_ini(*, data2_dir=None, lex_file=None, new_style_format=False, show_category=True, sentence_max_word_num=80, **options)

Generate CKIP word segmentation config.

Parameters
  • data2_dir (str) – the path to the folder “Data2/”.

  • lex_file (str) – the path to the user-defined lexicon file.

  • new_style_format (bool) – split sentences by newline characters (“\n”) rather than by punctuation.

  • show_category (bool) – show part-of-speech tags.

  • sentence_max_word_num (int) – maximum number of words per sentence.

Returns

  • ini_file (str) – the name of the config file.

  • f_ini (TextIO) – the file object.

Attention

Remember to close f_ini manually.

ckip_classic.ini.create_parser_ini(*, ws_ini_file, rule_dir=None, rdb_dir=None, do_ws=True, do_parse=True, do_role=True, sentence_delim=',,;。!?', **options)

Generate CKIP parser config.

Parameters
  • rule_dir (str) – the path to “Rule/”.

  • rdb_dir (str) – the path to “RDB/”.

  • do_ws (bool) – do word segmentation.

  • do_parse (bool) – do parsing.

  • do_role (bool) – do role labeling.

  • sentence_delim (str) – the sentence delimiters.

Returns

  • ini_file (str) – the name of the config file.

  • f_ini (TextIO) – the file object.

Attention

Remember to close f_ini manually.

ckip_classic.parser module

class ckip_classic.parser.CkipParser(*, logger=False, ini_file=None, ws_ini_file=None, lex_list=None, **kwargs)

Bases: object

The CKIP sentence parsing driver.

Parameters
Other Parameters

Danger

Never instantiate more than one object of this class!

apply(text)

Parse a sentence.

Parameters

text (str) – the input sentence.

Returns

str – the output sentence.

Hint

One may also call this method as __call__().

apply_list(ilist)

Parse a list of sentences.

Parameters

ilist (List[str]) – the list of input sentences.

Returns

List[str] – the list of output sentences.

apply_file(ifile, ofile)

Parse a file.

Parameters
  • ifile (str) – the input file.

  • ofile (str) – the output file (will be overwritten).

ckip_classic.ws module

class ckip_classic.ws.CkipWs(*, logger=False, ini_file=None, lex_list=None, **kwargs)

Bases: object

The CKIP word segmentation driver.

Parameters
Other Parameters

**kwargs – the configuration options for CKIPWS, passed to ckip_classic.ini.create_ws_ini(); ignored if ini_file is set.

Danger

Never instantiate more than one object of this class!

apply(text)

Segment a sentence.

Parameters

text (str) – the input sentence.

Returns

str – the output sentence.

Hint

One may also call this method as __call__().

apply_list(ilist)

Segment a list of sentences.

Parameters

ilist (List[str]) – the list of input sentences.

Returns

List[str] – the list of output sentences.

apply_file(ifile, ofile, uwfile='')

Segment a file.

Parameters
  • ifile (str) – the input file.

  • ofile (str) – the output file (will be overwritten).

  • uwfile (str) – the unknown word file (will be overwritten).
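Because the output files are written to paths chosen by the caller, it can help to make sure their folders exist before calling apply_file. Whether the driver creates missing folders itself is not documented here, so the sketch below is a precaution only; the helper name ensure_parent_dirs is hypothetical.

```python
from pathlib import Path


def ensure_parent_dirs(*paths):
    """Create the parent directory of every non-empty path."""
    for p in paths:
        if p:  # uwfile defaults to '' and is skipped
            Path(p).parent.mkdir(parents=True, exist_ok=True)
```

For instance, ensure_parent_dirs('output/sample.tag', 'output/sample.uw') could be called right before ws.apply_file(...) with the same paths.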
