
[Python] 29. [Scraping] The KoNLPy natural language processing package, installing JPype, extracting nouns and counting word frequency

밍글링글링 2017. 8. 16.


[01] The KoNLPy natural language processing package

- Official page: http://konlpy.org/ko/latest/
- NLP (Natural Language Processing) is a set of techniques for analyzing, extracting,
and understanding meaningful information in text.
. http://konlpy.org/ko/v0.4.3/start/

1. Installation

F:\201701_python\ws_python\scraping\crawler1>pip install konlpy
Collecting konlpy
  Downloading konlpy-0.4.4-py2.py3-none-any.whl (22.5MB)
    100% ■■■■■■■■■■■■■■■■■■■■ 22.5MB 37kB/s
Installing collected packages: konlpy
Successfully installed konlpy-0.4.4

[02] Installing JPype
- Official page: http://jpype.sourceforge.net/
- A package that lets Python start a JVM and use Java classes

1. Installation
http://www.lfd.uci.edu/~gohlke/pythonlibs/#jpype
Download JPype1-0.6.2-cp36-cp36m-win_amd64.whl (the wheel for 64-bit CPython 3.6 on Windows)

F:\201701_python\setup>pip install JPype1-0.6.2-cp36-cp36m-win_amd64.whl
Processing f:\201701_python\setup\jpype1-0.6.2-cp36-cp36m-win_amd64.whl
Installing collected packages: JPype1
Successfully installed JPype1-0.6.2

[03] Installing NumPy
- A Python package for numerical computation

F:\201701_python\ws_python\scraping\crawler1>pip install numpy
Collecting numpy
  Downloading numpy-1.12.1-cp36-none-win_amd64.whl (7.7MB)
    100% ■■■■■■■■■■■■■■■■■■■■ 7.7MB 114kB/s
Installing collected packages: numpy
Successfully installed numpy-1.12.1

[04] ImportError: DLL load failed: 지정된 모듈을 찾을 수 없습니다. (The specified module could not be found.)

- Install the Visual C++ Redistributable for Visual Studio 2015
  Go to https://www.microsoft.com/ko-kr/download/details.aspx?id=48145,
then download and install 'vc_redist.x64.exe'
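
When this error appears, it helps to know which of the packages installed above actually fails to load. The following is a minimal sketch, not part of the original post: the helper name check_imports is hypothetical, and it assumes the import names are konlpy, jpype (for JPype1), and numpy. It simply tries each import and records the error message, if any.

```python
import importlib


def check_imports(module_names):
    """Try to import each module and return {name: error message or None}."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None        # imported without error
        except ImportError as exc:      # includes "DLL load failed" and missing modules
            results[name] = str(exc)
    return results


if __name__ == '__main__':
    # Import names of the three packages installed above (assumed)
    for name, error in check_imports(['konlpy', 'jpype', 'numpy']).items():
        print(name, 'OK' if error is None else 'FAILED: ' + error)
```

If a module reports "DLL load failed", installing the redistributable above and running the check again shows whether the problem is resolved.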


2. Hands-on practice

▷ crawler1.Statistic.py
-------------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

import sys
from collections import Counter

# Note: in KoNLPy 0.5.0 and later the Twitter class was renamed to Okt.
from konlpy.tag import Twitter


# Count how often each noun occurs and return the most frequent ones
def get_tags(text, ntags=50):
    spliter = Twitter()
    nouns = spliter.nouns(text)  # extract nouns only
    count = Counter(nouns)
    return_list = []
    for tag, cnt in count.most_common(ntags):
        temp = {'tag': tag, 'count': cnt}  # Dictionary
        return_list.append(temp)
    return return_list


def main(argv):
    if len(argv) != 4:
        # Example invocations:
        #   statistic.py Hani_대통령.txt 20 Hani_대통령Res.txt
        #   statistic.py DongA_대통령.txt 20 DongA_대통령Res.txt
        print('python [module name] [text file name.txt] [number of words to output] [result file name.txt]')
        return
    input_file_name = argv[1]    # e.g. DongA_대통령.txt
    noun_count = int(argv[2])    # number of words to output, e.g. 20
    output_file_name = argv[3]   # result file name

    # Files are opened with the platform default encoding;
    # pass encoding='utf-8' to open() if the text was saved as UTF-8.
    with open(input_file_name, 'r') as input_file:
        text = input_file.read()  # read the source text

    tags = get_tags(text, noun_count)  # list of dictionaries

    with open(output_file_name, 'w') as output_file:  # result file
        for tag in tags:
            print(str(tag))  # Dictionary
            noun = tag['tag']
            count = tag['count']
            # One "word count" pair per line, as in the result files below
            output_file.write('{0} {1}\n'.format(noun, count))


if __name__ == '__main__':
    main(sys.argv)

-------------------------------------------------------------------------------------
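
The counting step inside get_tags does not depend on KoNLPy, so it can be tried on its own. The sketch below is an illustration only: the noun list is made-up sample data standing in for what Twitter().nouns(text) would return.

```python
from collections import Counter

# Hypothetical, already-extracted nouns standing in for Twitter().nouns(text)
nouns = ['이란', '대통령', '이란', '한국', '이란', '대통령']

count = Counter(nouns)

# most_common(n) returns the n most frequent items as (item, count) pairs,
# ordered from most to least frequent
top = [{'tag': tag, 'count': cnt} for tag, cnt in count.most_common(2)]
print(top)  # [{'tag': '이란', 'count': 3}, {'tag': '대통령', 'count': 2}]
```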

3. Result files

▷ /crawler1/DongA_대통령Res.txt
-------------------------------------------------------------------------------------
이란 22
대통령 11
한국 10
방문 6
것 6
검토 5
대사 5
경제 4
박 4
타 4
헤리 4
안 4
사업 4
북핵 3
주한 3
인터뷰 3
제재 3
기업 3
문제 3
관련 3

-------------------------------------------------------------------------------------

▷ /crawler1/Hani_대통령Res.txt
-------------------------------------------------------------------------------------
후보 41
일 35
것 34
이 27
대통령 22
전 22
교육 20
이전 15
안 13
를 13
검찰 13
고 12
보수 12
서울 11
수 11
년 10
등 10
체제 10
정당 10
대선 9

-------------------------------------------------------------------------------------
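
The Hani result contains several one-syllable tokens (일, 것, 이, 를, 고, ...) that carry little meaning on their own. One possible post-processing step, which is not part of the original script, is to drop very short tokens or an explicit stopword list before reporting. In the sketch below the helper name filter_tags and its parameters are hypothetical; the sample values are the first five rows of the Hani result above.

```python
def filter_tags(tags, stopwords=frozenset(), min_length=2):
    """Drop tags shorter than min_length characters or listed in stopwords."""
    return [
        t for t in tags
        if len(t['tag']) >= min_length and t['tag'] not in stopwords
    ]


# First five rows of the Hani result, in the format returned by get_tags
tags = [
    {'tag': '후보', 'count': 41},
    {'tag': '일', 'count': 35},
    {'tag': '것', 'count': 34},
    {'tag': '이', 'count': 27},
    {'tag': '대통령', 'count': 22},
]

filtered = filter_tags(tags)
print([t['tag'] for t in filtered])  # ['후보', '대통령']
```

A length filter is blunt: it would also remove meaningful one-syllable nouns such as the surname 박 in the DongA result, so an explicit stopword set may be the safer choice.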
