
[Python] 28. [Scraping] Crawling news search results from Hankyoreh, Naver News, and Dong-A Ilbo

밍글링글링 2017. 8. 16.


[01] Crawling the Hankyoreh news list

- Sort: newest first, scope: news, search term: 대통령 (president)

http://search.hani.co.kr/Search?command=query&keyword=%EB%8C%80%ED%86%B5%EB%A0%B9&sort=d&period=all&media=news

- Page 1: http://search.hani.co.kr/Search?command=query&keyword=%EB%8C%80%ED%86%B5%EB%A0%B9&media=news&sort=d&period=all&datefrom=2000.01.01&dateto=2017.04.25&pageseq=0

- Page 2: http://search.hani.co.kr/Search?command=query&keyword=%EB%8C%80%ED%86%B5%EB%A0%B9&media=news&sort=d&period=all&datefrom=2000.01.01&dateto=2017.04.25&pageseq=1

- Page 3: http://search.hani.co.kr/Search?command=query&keyword=%EB%8C%80%ED%86%B5%EB%A0%B9&media=news&sort=d&period=all&datefrom=2000.01.01&dateto=2017.04.25&pageseq=2
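
The three URLs above differ only in pageseq, which starts at 0. A minimal sketch of generating them, assuming the query parameters shown above are all the search page needs. build_hani_url and HANI_SEARCH are hypothetical names used only for this illustration:

```python
from urllib.parse import urlencode

HANI_SEARCH = 'http://search.hani.co.kr/Search'  # base URL taken from the examples above

def build_hani_url(keyword, until_date, page_index):
    """Return the search URL for one result page (page_index starts at 0)."""
    params = [
        ('command', 'query'),
        ('keyword', keyword),        # urlencode percent-encodes this as UTF-8
        ('media', 'news'),
        ('sort', 'd'),
        ('period', 'all'),
        ('datefrom', '2000.01.01'),
        ('dateto', until_date),
        ('pageseq', page_index),
    ]
    return HANI_SEARCH + '?' + urlencode(params)

# Print the URLs for the first three result pages
for page_index in range(3):
    print(build_hani_url('대통령', '2017.04.25', page_index))
```

Building the query from a list of pairs keeps the parameter order identical to the URLs above and avoids encoding the keyword by hand.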

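2. Searching for 대통령 (president)
- str(item.find_all(text=True)): find_all(text=True) returns the tag's text nodes as a list, and str() turns that list into a single string, which is why the output keeps the list brackets, quotes, and a literal \n

[Output]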

['주한미군의 지상전력 지휘부인 미8군사령부가 25일 서울 용산기지에서 경기 평택기지로 본격적인 이전을 시작했다. ',

'\n 미 8군사령부는 이날 오전 토머스 밴달 미8군사령관이 주관하는 월튼 워커(1889∼1950)
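
The brackets, quotes, and \n in the output above come from calling str() on a list. A minimal sketch comparing that with get_text(), which joins the text nodes into one plain string. The HTML snippet is made up for illustration, and 'html.parser' is used here only so the example runs without lxml:

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for an article page
html = '<div class="text">First sentence. <b>Second</b> sentence.</div>'
bs = BeautifulSoup(html, 'html.parser')
item = bs.select('div.text')[0]

# String form of a list of text nodes (newer bs4 versions also accept string=True)
as_list_repr = str(item.find_all(text=True))
# Text nodes joined into one string, without list punctuation
as_plain_text = item.get_text()

print(as_list_repr)
print(as_plain_text)
```

If the saved file is meant for later text analysis, get_text() may be the more convenient of the two, since no list punctuation has to be stripped afterwards.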

▷ crawler1.Hani.py
-------------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

import sys
import urllib.request
from urllib.parse import quote
from bs4 import BeautifulSoup

TARGET_URL_BEFORE_KEYWORD = 'http://search.hani.co.kr/Search?command=query&keyword='
TARGET_URL_BEFORE_UNTIL_DATE = '&media=news&sort=d&period=all&datefrom=2000.01.01&dateto='
TARGET_URL_REST = '&pageseq='

# Fetch the news list
def get_link_from_news_title(page_num, URL, output_file):
    for i in range(page_num): # page_num=3 -> pageseq 0 ~ 2
        URL_with_page_num = URL + str(i)
        source_code_from_URL = urllib.request.urlopen(URL_with_page_num)
        bs = BeautifulSoup(source_code_from_URL, 'lxml', from_encoding='utf-8')
        for item in bs.select('dt > a'):   # <dt><a>...</a>
            article_URL = item['href']    # URL linked from the title
            get_text(article_URL, output_file)

# Fetch the body of each article in the list
def get_text(URL, output_file):
    source_code_from_url = urllib.request.urlopen(URL)
    bs = BeautifulSoup(source_code_from_url, 'lxml', from_encoding='utf-8')
    content_of_article = bs.select('div.text') # <div class="text"...
    for item in content_of_article:
        string_item = str(item.find_all(text=True))
        output_file.write(string_item)


def main(argv):
    if len(argv) != 5:
        # e.g. python crawler1.Hani.py 대통령 1 2017.04.30 Hani_대통령.txt
        print("Usage: python [module name] [keyword] [number of pages] [latest article date] [output file.txt]")
        return

    keyword = argv[1]          # 대통령
    page_num = int(argv[2])    # 10
    until_date = argv[3]       # 2017.04.30
    output_file_name = argv[4] # Hani_대통령.txt
    target_URL = TARGET_URL_BEFORE_KEYWORD + quote(keyword) \
                 + TARGET_URL_BEFORE_UNTIL_DATE + until_date + TARGET_URL_REST
    # Write as UTF-8 so Korean text is saved correctly on any platform
    with open(output_file_name, 'w', encoding='utf-8') as output_file:
        get_link_from_news_title(page_num, target_URL, output_file)


if __name__ == '__main__':
    main(sys.argv)
    

-------------------------------------------------------------------------------------

[02] Crawling Naver News

- https://www.naver.com/

- The first article under the News link
http://news.naver.com/main/read.nhn?oid=008&sid1=100&aid=0003865685&mid=shm&mode=LSD&nh=20170430113132

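1. Reading a Naver News article

- bs.find_all('div', id='articleBodyContents'): finds the DIV tags whose id is 'articleBodyContents' and returns them as a list

- str(item.find_all(text=True)): collects the tag's text nodes into a list and converts that list to a string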
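
Because find_all() returns a list, indexing it with [0] raises IndexError when no tag matches, for example if the page layout changes. A minimal sketch of checking for that case first. The HTML snippet is made up for illustration, and 'html.parser' is used only so the example runs without lxml:

```python
from bs4 import BeautifulSoup

# Made-up snippet; the real page is assumed to wrap the article body in this id
html = '<div id="header">menu</div><div id="articleBodyContents">Body text</div>'
bs = BeautifulSoup(html, 'html.parser')

# find_all returns a list-like result that is empty when nothing matches
matches = bs.find_all('div', id='articleBodyContents')
if matches:
    print(matches[0].get_text())
else:
    print('article body not found')
```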

[Output] naver.txt

본문 내용 플레이어 플레이어오류를 우회하기 위한 함수 추가 서울연합뉴스 안홍석 기자 이달 말부터 근로자의날·석가탄신일·어린이날 등이 몰린 5월 초 사이에 200만명에 가까운 여객이 인천국제공항을 이용할 것으로 보인다.....
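
The output above contains no brackets, quotes, or Latin letters because the script below passes the text through a cleaning step before saving it. A minimal sketch of that step on made-up inputs, using the same two substitutions with the patterns written as raw strings:

```python
import re

def clean_text(text):
    """Remove ASCII letters, then common punctuation and symbols."""
    text = re.sub(r'[a-zA-Z]', '', text)
    return re.sub(r'[\{\}\[\]\/?.,;:|\)*~`!^\-_+<>@\#$%&\\\=\(\'\"]', '', text)

# Made-up inputs shaped like the string form of a list of text nodes
samples = ["[abc]대통령!", "['\\n200만명']"]
for sample in samples:
    print(clean_text(sample))
```

Note that the letter filter also removes the n left over from an escaped \n, and any English words in the article itself.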

▷ crawler1.naver.py
-------------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

import re
import urllib.request
from bs4 import BeautifulSoup

# Cleaning function
def clean_text(text):
    cleaned_text = re.sub(r'[a-zA-Z]', '', text) # remove a ~ z, A ~ Z
    # raw string avoids invalid-escape warnings in the pattern
    cleaned_text = re.sub(r'[\{\}\[\]\/?.,;:|\)*~`!^\-_+<>@\#$%&\\\=\(\'\"]',
                          '', cleaned_text) # remove special characters
    return cleaned_text

# Output file name
OUTPUT_FILE_NAME = 'naver.txt'
# URL to scrape
URL="http://news.naver.com/main/read.nhn?oid=008&sid1=100&aid=0003865685&mid=shm&mode=LSD&nh=20170430113132"

# Crawling function
def get_text(URL):
    source_code_from_URL = urllib.request.urlopen(URL)
    bs = BeautifulSoup(source_code_from_URL, 'lxml', from_encoding='utf-8')
    # <div id="articleBodyContents" style="-webkit-tap-highlight-color: rgba(0, 0, 0, 0);">
    item = bs.find_all('div', id='articleBodyContents')[0]
    text = str(item.find_all(text=True))
    return clean_text(text)


# Main function
def main():
    result_text = get_text(URL)
    # Write as UTF-8 so Korean text is saved correctly on any platform
    with open(OUTPUT_FILE_NAME, 'w', encoding='utf-8') as open_output_file:
        open_output_file.write(result_text)


if __name__ == '__main__':
    main()
    print('Finished')



-------------------------------------------------------------------------------------

[03] Crawling the Dong-A Ilbo news list

- http://www.donga.com

- Sort: newest first, scope: Dong-A Ilbo, search term: 대통령 (president)

http://news.donga.com/search?check_news=1&more=1&sorting=1&range=1&search_date=&query=%EB%8C%80%ED%86%B5%EB%A0%B9

- Page 1: http://news.donga.com/search?p=1&query=%EB%8C%80%ED%86%B5%EB%A0%B9&check_news=1&more=1&sorting=1&search_date=1&v1=&v2=&range=1

- Page 2: http://news.donga.com/search?p=16&query=%EB%8C%80%ED%86%B5%EB%A0%B9&check_news=1&more=1&sorting=1&search_date=1&v1=&v2=&range=1

- Page 3: http://news.donga.com/search?p=31&query=%EB%8C%80%ED%86%B5%EB%A0%B9&check_news=1&more=1&sorting=1&search_date=1&v1=&v2=&range=1
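
In these URLs p is not the page number but the 1-based position of the first article on the page: 1, 16, 31, so each page appears to hold 15 articles. A minimal sketch of that arithmetic, assuming the parameters shown above. build_donga_url, DONGA_SEARCH, and ARTICLES_PER_PAGE are hypothetical names used only for this illustration:

```python
from urllib.parse import urlencode

DONGA_SEARCH = 'http://news.donga.com/search'  # base URL taken from the examples above
ARTICLES_PER_PAGE = 15  # inferred from p=1, 16, 31

def build_donga_url(keyword, page_number):
    """Return the search URL for a 1-based result page number."""
    params = [
        ('p', 1 + (page_number - 1) * ARTICLES_PER_PAGE),  # index of first article
        ('query', keyword),      # urlencode percent-encodes this as UTF-8
        ('check_news', 1),
        ('more', 1),
        ('sorting', 1),
        ('search_date', 1),
        ('v1', ''),
        ('v2', ''),
        ('range', 1),
    ]
    return DONGA_SEARCH + '?' + urlencode(params)

# Print the URLs for result pages 1 to 3
for page_number in range(1, 4):
    print(build_donga_url('대통령', page_number))
```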

2. Searching for 대통령 (president)

[Output]

['\n', '한국 대통령 처음… 경협-북핵 논의', '\r\n“이란 로하니 대통령 한국 답방 검토”… 주한 이란대사 본보 인터뷰서 밝혀', '\n

박근혜 대통령이 핵협상 타결로 국제사회의 경제 제재가 해제된 뒤 호황을 맞고 있는 이란을 방문할 예정이다.

▷ crawler1.DongA.py
-------------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

import sys
import urllib.request
from urllib.parse import quote
from bs4 import BeautifulSoup

TARGET_URL_BEFORE_PAGE_NUM = "http://news.donga.com/search?p="
TARGET_URL_BEFORE_KEYWORD = '&query='
TARGET_URL_REST = '&check_news=1&more=1&sorting=1&search_date=1&v1=&v2=&range=3'

# News search list
def get_link_from_news_title(page_num, URL, output_file):
    for i in range(page_num):
        current_page_num = 1 + i*15 # each page holds 15 articles
        position = URL.index('=')   # first '=' is the one after 'p'
        URL_with_page_num = URL[: position+1] + str(current_page_num)  + URL[position+1 :]
        source_code_from_URL = urllib.request.urlopen(URL_with_page_num)
        bs = BeautifulSoup(source_code_from_URL, 'lxml', from_encoding='utf-8')
        # <p class="tit">: take the first <p> tag whose class is 'tit'
        title  = bs.find_all('p', 'tit')[0]
        title_link = title.select('a')           # extract the <a> tags

        # list-like (a plain list or a ResultSet, depending on the bs4 version)
        print(str(type(title_link)))

        article_URL = title_link[0]['href']  # link from the first <a> tag
        get_text(article_URL, output_file)


# Read the article body
def get_text(URL, output_file):
    source_code_from_url = urllib.request.urlopen(URL)
    bs = BeautifulSoup(source_code_from_url, 'lxml', from_encoding='utf-8')
    # <div class="article_txt" id="articleBody" style="font-size: 18px;">
    content_of_article = bs.select('div.article_txt')
    item = content_of_article[0]
    string_item = str(item.find_all(text=True))
    output_file.write(string_item)


# Main function
def main(argv):
    if len(argv) != 4:
        # e.g. python crawler1.DongA.py 대통령 10 DongA_대통령.txt
        print("Usage: python [module name] [keyword] [number of pages] [output file]")
        return
    keyword = argv[1]
    page_num = int(argv[2])
    output_file_name = argv[3]
    target_URL = TARGET_URL_BEFORE_PAGE_NUM + TARGET_URL_BEFORE_KEYWORD \
                 + quote(keyword) + TARGET_URL_REST
    # Write as UTF-8 so Korean text is saved correctly on any platform
    with open(output_file_name, 'w', encoding='utf-8') as output_file:
        get_link_from_news_title(page_num, target_URL, output_file)


if __name__ == '__main__':
    main(sys.argv)

-------------------------------------------------------------------------------------
