CKIP Classic NLP Tools
Introduction
A Linux Python wrapper for the CKIP classic tools: CKIP Word Segmentation and CKIP Parser.
Attention
Please use CKIPNLP for structured data types and pipeline drivers.
Attention
For Python 2 users, please use PyCkip 0.4.2 instead.
Contributors
Wei-Yun Ma at CKIP (Maintainer)
Requirements
This project requires CKIPWS and/or CKIPParser:
CKIP Word Segmentation Linux version 20190524+
CKIP Parser Linux version 20190725+
Academic License (Online Version)
Installation
Attention
Offline version: CKIPWS (Academic/Commercial License) and CKIPParser (Commercial License).
Online version: CKIPParser (Academic License).
Offline Version
Download CKIPWS and/or CKIPParser from the links above. Let <ckipws-linux-root> denote the folder containing CKIPWS, and <ckipparser-linux-root> the folder containing CKIPParser.
pip install --force-reinstall --upgrade ckip-classic \
--install-option='--ws' \
--install-option='--ws-dir=<ckipws-linux-root>' \
--install-option='--parser' \
--install-option='--parser-dir=<ckipparser-linux-root>'
Omit the ws options or the parser options if you do not have CKIPWS or CKIPParser, respectively.
Attention
Please use absolute paths.
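The install command above can be assembled programmatically so that relative folders are always converted to absolute paths. The following is only a sketch: the helper name build_install_command and the folder ./CKIPWS_Linux are hypothetical, and the helper merely prints the command rather than running it.

```python
import os
import shlex
import sys


def build_install_command(ws_dir=None, parser_dir=None):
    """Return the pip argument list for an offline install (hypothetical helper).

    Any directory that is given is converted to an absolute path first.
    """
    cmd = [sys.executable, "-m", "pip", "install",
           "--force-reinstall", "--upgrade", "ckip-classic"]
    if ws_dir is not None:
        cmd += ["--install-option=--ws",
                "--install-option=--ws-dir=" + os.path.abspath(ws_dir)]
    if parser_dir is not None:
        cmd += ["--install-option=--parser",
                "--install-option=--parser-dir=" + os.path.abspath(parser_dir)]
    return cmd


# Print the command for review; "./CKIPWS_Linux" is a placeholder folder name.
print(shlex.join(build_install_command(ws_dir="./CKIPWS_Linux")))
```

Leaving ws_dir or parser_dir as None drops the corresponding options, which matches omitting them for a tool that is not available.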
Online Version
Register an account at http://parser.iis.sinica.edu.tw/v1/reg.exe
pip install --upgrade ckip-classic
Installation Options
Option | Detail | Default Value |
---|---|---|
 | Enable/disable CKIPWS. | False |
 | Enable/disable CKIPParser. | False |
 | CKIPWS root directory. | |
 | CKIPWS libraries directory | |
 | CKIPWS share directory | |
 | CKIPParser root directory. | |
 | CKIPParser libraries directory | |
 | CKIPParser share directory | |
 | “Data2” directory | |
 | “Rule” directory | |
 | “RDB” directory | |
Usage
See https://ckip-classic.readthedocs.io/ for API details.
CKIPWS
CKIP Word Segmentation offline driver.
import ckip_classic.ws
print(ckip_classic.__name__, ckip_classic.__version__)
# Create the driver (create it only once; see the FAQ)
ws = ckip_classic.ws.CkipWs(logger=False)
# Segment a single sentence
print(ws('中文字喔'))
# Segment a list of sentences
for line in ws.apply_list(['中文字喔', '啊哈哈哈']):
    print(line)
# Segment a file, writing the results to ofile and uwfile
ws.apply_file(ifile='sample/sample.txt', ofile='output/sample.tag', uwfile='output/sample.uw')
with open('output/sample.tag') as fin:
    print(fin.read())
with open('output/sample.uw') as fin:
    print(fin.read())
CKIPParser
CKIP Parser offline driver.
import ckip_classic.parser
print(ckip_classic.__name__, ckip_classic.__version__)
# Create the driver (create it only once; see the FAQ)
ps = ckip_classic.parser.CkipParser(logger=False)
# Parse a single sentence
print(ps('中文字喔'))
# Parse a list of sentences
for line in ps.apply_list(['中文字喔', '啊哈哈哈']):
    print(line)
# Parse a file, writing the trees to ofile
ps.apply_file(ifile='sample/sample.txt', ofile='output/sample.tree')
with open('output/sample.tree') as fin:
    print(fin.read())
CKIPParserClient
CKIP Parser online client.
import ckip_classic.client
print(ckip_classic.__name__, ckip_classic.__version__)
# Replace USERNAME and PASSWORD with the credentials of the registered account
ps = ckip_classic.client.CkipParserClient(username='USERNAME', password='PASSWORD')
# Parse a single word-segmented, POS-tagged sentence
print(ps('中文字(Na) 耶(T) ,(COMMACATEGORY)'))
# Parse a list of sentences
for line in ps.apply_list(['中文字(Na) 耶(T) ,(COMMACATEGORY)', '啊(I) 哈(D) 哈(D) 哈(D) 。(PERIODCATEGORY)']):
    print(line)
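The client examples pass sentences that are already segmented and tagged in a word(POS) form. A minimal sketch of building such a string from (word, POS) pairs is shown below. The helper format_tagged is hypothetical and not part of ckip_classic, and the separator is assumed to be the single space that appears in the examples above.

```python
def format_tagged(pairs, sep=" "):
    """Join (word, POS) pairs into the 'word(POS)' form used in the examples.

    `sep` is an assumption taken from the sample strings; adjust it if the
    service expects a different separator.
    """
    return sep.join(f"{word}({pos})" for word, pos in pairs)


sentence = format_tagged([("中文字", "Na"), ("耶", "T")])
print(sentence)
```

The resulting string could then be passed to the client in the same way as the literal strings in the example.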
FAQ
Danger
Because of the underlying C implementation, CkipWs and CkipParser can each be instantiated only once.
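One way to respect this restriction is to route every construction through a small cache so that repeated requests return the same object. This is only a sketch: get_driver is a hypothetical helper, and FakeDriver is a stand-in used so the example runs without CKIPWS installed. In real code the factory would be ckip_classic.ws.CkipWs or ckip_classic.parser.CkipParser.

```python
import threading

_lock = threading.Lock()
_instances = {}


def get_driver(factory, **kwargs):
    """Return the single shared instance for `factory`, creating it on first use.

    Keyword arguments are only used for the first call; later calls return
    the cached object unchanged.
    """
    with _lock:
        if factory not in _instances:
            _instances[factory] = factory(**kwargs)
        return _instances[factory]


class FakeDriver:
    """Stand-in for a driver class; counts how often it is constructed."""
    created = 0

    def __init__(self, logger=False):
        type(self).created += 1


first = get_driver(FakeDriver, logger=False)
second = get_driver(FakeDriver)
print(first is second, FakeDriver.created)
```

The lock keeps two threads from constructing the driver at the same time; it does not make the driver itself thread-safe.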
Warning
CKIPParser fails if the input text contains special characters such as ()+-:| . One may replace these characters with their full-width counterparts:
text = (
    text
    .replace('(', '（')
    .replace(')', '）')
    .replace('+', '＋')
    .replace('-', '－')
    .replace(':', '：')
    .replace('|', '｜')
)
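The same substitution can be done in a single pass with a translation table. The sketch below assumes the intended replacements are the full-width forms of each character (written as Unicode escapes so they survive copy and paste); the function name sanitize is hypothetical.

```python
# Map each problematic ASCII character to its full-width form (assumed targets).
_FULLWIDTH = str.maketrans({
    "(": "\uff08",
    ")": "\uff09",
    "+": "\uff0b",
    "-": "\uff0d",
    ":": "\uff1a",
    "|": "\uff5c",
})


def sanitize(text):
    """Replace the special characters before passing text to the parser."""
    return text.translate(_FULLWIDTH)


print(sanitize("a(b)+c-d:e|f"))
```

A translation table touches each character once, which keeps the mapping in one place if more characters turn out to need replacing.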
Tip
The installation fails with “fatal error: Python.h: No such file or directory”. What should I do?
Install the Python development package:
sudo apt-get install python3-dev
Tip
CKIPWS throws “what(): locale::facet::_S_create_c_locale name not valid”. What should I do?
Install the locale data:
sudo apt-get install locales-all
Tip
CKIPParser throws “ImportError: libCKIPParser.so: cannot open shared object file: No such file or directory”. What should I do?
Add the following command to ~/.bashrc:
export LD_LIBRARY_PATH=<ckipparser-linux-root>/lib:$LD_LIBRARY_PATH
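Before importing the parser it may help to confirm that the library folder is actually on LD_LIBRARY_PATH in the current shell. The check below is a sketch only: in_library_path is a hypothetical helper and /opt/ckipparser/lib is a placeholder for <ckipparser-linux-root>/lib.

```python
import os


def in_library_path(lib_dir, env=None):
    """Return True if `lib_dir` is listed in LD_LIBRARY_PATH.

    `env` defaults to the current process environment.
    """
    env = os.environ if env is None else env
    entries = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    target = os.path.normpath(lib_dir)
    return any(os.path.normpath(p) == target for p in entries)


# Placeholder path; substitute <ckipparser-linux-root>/lib.
print(in_library_path("/opt/ckipparser/lib"))
```

The variable generally needs to be set before Python starts, so changing it inside a running interpreter is unlikely to fix the import error.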
ckip_classic package
Subpackages
ckip_classic.client package
class ckip_classic.client.CkipParserClient(*, username=None, password=None)
Bases: object
The CKIP sentence parsing client.
- Parameters
username (str) – the username (defaults to the environment variable $CKIPPARSER_USERNAME).
password (str) – the password (defaults to the environment variable $CKIPPARSER_PASSWORD).
Note
One may register an account at http://parser.iis.sinica.edu.tw/v1/reg.exe
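The documented fallback to environment variables can be pictured with the sketch below. It is an illustration of the described behavior, not the library's own code; resolve_credentials is a hypothetical helper.

```python
import os


def resolve_credentials(username=None, password=None, env=None):
    """Mirror the documented defaults: explicit arguments win, otherwise the
    values come from $CKIPPARSER_USERNAME and $CKIPPARSER_PASSWORD.
    """
    env = os.environ if env is None else env
    if username is None:
        username = env.get("CKIPPARSER_USERNAME")
    if password is None:
        password = env.get("CKIPPARSER_PASSWORD")
    return username, password


# Example with an explicit mapping instead of the real environment.
demo_env = {"CKIPPARSER_USERNAME": "alice", "CKIPPARSER_PASSWORD": "secret"}
print(resolve_credentials(env=demo_env))
```

Exporting the two variables in the shell keeps credentials out of source files, so the client can then be constructed without arguments.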
Submodules
ckip_classic.ini module
ckip_classic.ini.create_ws_lex(*lex_list)
Generate the CKIP word segmentation lexicon file.
- Parameters
*lex_list (Tuple[str, str]) – the lexicon word and its POS-tag.
- Returns
lex_file (str) – the name of the lexicon file.
f_lex (TextIO) – the file object.
Attention
Remember to close f_lex manually.
ckip_classic.ini.create_ws_ini(*, data2_dir=None, lex_file=None, new_style_format=False, show_category=True, sentence_max_word_num=80, **options)
Generate the CKIP word segmentation config.
- Parameters
data2_dir (str) – the path to the folder “Data2/”.
lex_file (str) – the path to the user-defined lexicon file.
new_style_format (bool) – split sentences at newline characters (“\n”) rather than at punctuation.
show_category (bool) – show part-of-speech tags.
sentence_max_word_num (int) – maximum number of words per sentence.
- Returns
ini_file (str) – the name of the config file.
f_ini (TextIO) – the file object.
Attention
Remember to close f_ini manually.
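Because these functions return an open file object together with its name, wrapping the file object in contextlib.closing is one way to make sure it gets closed. The sketch below uses a stand-in, fake_create_ini, that only imitates the documented return shape with a temporary file; how the real functions create their files is not assumed here.

```python
import contextlib
import os
import tempfile


def fake_create_ini():
    """Stand-in with the documented return shape: (file name, open file object)."""
    f_ini = tempfile.NamedTemporaryFile(mode="w+", suffix=".ini", encoding="utf-8")
    f_ini.write("[placeholder]\n")
    f_ini.flush()
    return f_ini.name, f_ini


ini_file, f_ini = fake_create_ini()
with contextlib.closing(f_ini):
    # Use ini_file here, for example by passing it to a driver as ini_file=...
    exists_while_open = os.path.exists(ini_file)

# The file object is closed as soon as the with block exits.
closed_after = f_ini.closed
print(exists_while_open, closed_after)
```

The same pattern applies to f_lex from create_ws_lex(); a try/finally block with an explicit close() call works equally well.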
ckip_classic.ini.create_parser_ini(*, ws_ini_file, rule_dir=None, rdb_dir=None, do_ws=True, do_parse=True, do_role=True, sentence_delim=',,;。!?', **options)
Generate the CKIP parser config.
- Parameters
rule_dir (str) – the path to “Rule/”.
rdb_dir (str) – the path to “RDB/”.
do_ws (bool) – do word segmentation.
do_parse (bool) – do parsing.
do_role (bool) – do role labeling.
sentence_delim (str) – the sentence delimiters.
- Returns
ini_file (str) – the name of the config file.
f_ini (TextIO) – the file object.
Attention
Remember to close f_ini manually.
ckip_classic.parser module
class ckip_classic.parser.CkipParser(*, logger=False, ini_file=None, ws_ini_file=None, lex_list=None, **kwargs)
Bases: object
The CKIP sentence parsing driver.
- Parameters
logger (bool) – enable logger.
lex_list (Iterable) – passed to ckip_classic.ini.create_ws_lex(); overrides lex_file for ckip_classic.ini.create_ws_ini().
ini_file (str) – the path to the INI file.
ws_ini_file (str) – the path to the INI file for CKIPWS.
- Other Parameters
** – the configs for CKIPParser, passed to ckip_classic.ini.create_parser_ini(); ignored if ini_file is set.
** – the configs for CKIPWS, passed to ckip_classic.ini.create_ws_ini(); ignored if ws_ini_file is set.
Danger
Never instantiate more than one object of this class!
apply(text)
Parse a sentence.
- Parameters
text (str) – the input sentence.
- Returns
str – the output sentence.
Hint
One may also call this method as __call__().
ckip_classic.ws module
class ckip_classic.ws.CkipWs(*, logger=False, ini_file=None, lex_list=None, **kwargs)
Bases: object
The CKIP word segmentation driver.
- Parameters
logger (bool) – enable logger.
lex_list (Iterable) – passed to ckip_classic.ini.create_ws_lex(); overrides lex_file for ckip_classic.ini.create_ws_ini().
ini_file (str) – the path to the INI file.
- Other Parameters
** – the configs for CKIPWS, passed to ckip_classic.ini.create_ws_ini(); ignored if ini_file is set.
Danger
Never instantiate more than one object of this class!
apply(text)
Segment a sentence.
- Parameters
text (str) – the input sentence.
- Returns
str – the output sentence.
Hint
One may also call this method as __call__().