Text Analysis of “ABC News” Coverage

📒 Requirements

I'm writing my graduation thesis and need to study how “ABC News” has covered Xinjiang. A quick search turned up roughly 300 reports since 2008, so I decided to crawl them with Python and then work through them step by step: automatic translation, keyword extraction, word-frequency counting, sentiment analysis, and subjectivity analysis.

📕 Analysis


Searching through ABC News' built-in search bar, I noticed the results come back directly as JSON, which already contains each article's URL, title, category, source, publication date, and so on.
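Reconstructed from the fields the code below reads, each hit in the response looks roughly like this (the values are illustrative, not actual API output):

# One search hit, reconstructed from the fields used later in this post
item = {
    "title": "...",
    "link": "https://abcnews.go.com/...",
    "pubDate": "Mon, 14 Jul 2008 12:34:56 GMT",
    "dcRights": "...",
    "category": {"text": "..."},
    "dcType": ["..."],
}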

Paging to the next result page, the offset parameter in the URL grows by 10, i.e. each page returns 10 results, so the URL for any page can be constructed directly (see the snippet below).
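For example, page n (1-based) maps to offset (n - 1) * 10 in the query string:

# URL for page n of the search results (10 hits per page)
n = 3
url = ('https://abcnews.go.com/meta/api/search?q=xinjiang&limit=10'
       '&sort=date-asc&type=&section=&totalrecords=true&offset=' + str((n - 1) * 10))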

📗 Implementation

Content Crawling

Libraries used:

from newspaper import Article
from textblob import TextBlob
import requests
import json
import re
import os
import pymongo
import time
import datetime
import smtplib
from lxml import etree
import jieba
from nltk import FreqDist
from monkeylearn import MonkeyLearn

Based on this analysis, construct the search URL, then extract each article's URL and title from the returned JSON.

# Fill in the database info
database = "ABC_News"
collection = "ABCNews_title_url_deduplicate_by_url"

# Insert one search hit into MongoDB, deduplicated by URL
def pymongo_insert(item):
    myclient = pymongo.MongoClient("mongodb://root:password@localhost:27017/")  # fill in your credentials/host
    mydb = myclient[database]
    mycol = mydb[collection]
    each_title = item['title']
    each_url = item['link']
    mydict = {"01_title": each_title, "02_url": each_url}
    print(mydict)
    if mycol.find_one({"02_url": each_url}) is None:
        mycol.insert_one(mydict)
        print(f'{datetime.datetime.now():%Y-%m-%d %H:%M:%S} inserted 1 record')
    else:
        print(f'{datetime.datetime.now():%Y-%m-%d %H:%M:%S} record already exists')

# Fetch each page of results and walk the JSON
for i in range(1, 32):
    n = i - 1
    url_1 = 'https://abcnews.go.com/meta/api/search?q=xinjiang&limit=10&sort=date-asc&type=&section=&totalrecords' \
            '=true&offset=' + str(n * 10)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/72.0.3626.96 Safari/537.36",
    }
    response = requests.get(url_1, headers=headers)
    r = response.json()
    for item in r['item']:
        pymongo_insert(item)

Crawl results:

Next, add more fields: date, time, weekday, copyright, category, and type:

# Split a pubDate string into year/month/day/weekday/time components
def convert_to_date(date):
    YMD = re.search(r'\d+\s\S+\s\d+', date).group()  # e.g. "14 Jul 2008"
    week = re.search(r'\w+', date).group()  # leading weekday, e.g. "Mon"
    HMS = re.search(r'([01]?\d|2[0-3]):[0-5]?\d:[0-5]?\d', date).group()  # e.g. "12:34:56"
    sp = YMD.split()
    months = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04', 'May': '05', 'Jun': '06', 'Jul': '07',
              'Aug': '08', 'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}
    year = sp[2]
    month = months.get(sp[1], sp[1])
    day = sp[0]
    return year, month, day, week, HMS
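Assuming pubDate strings follow the RFC-1123 style that the regular expressions above imply, a quick check (the sample value is hypothetical):

print(convert_to_date('Mon, 14 Jul 2008 12:34:56 GMT'))
# -> ('2008', '07', '14', 'Mon', '12:34:56')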

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/72.0.3626.96 Safari/537.36",
}

myclient = pymongo.MongoClient("mongodb://root:password@localhost:27017/")  # fill in your credentials/host
mydb = myclient[database]
mycol = mydb[collection]

# Add date, time, weekday, copyright, category, and type fields
for i in range(1, 31):
    n = i - 1
    url_1 = 'https://abcnews.go.com/meta/api/search?q=xinjiang&limit=10&sort=date-asc&type=&section=&totalrecords' \
            '=true&offset=' + str(n * 10)
    print(url_1)
    response = requests.get(url_1, headers=headers)
    r = response.json()
    last = len(r['item'])
    for e in range(0, last):
        each_url = r['item'][e]['link']
        each_date = r['item'][e]['pubDate']
        each_dcRights = r['item'][e]['dcRights']
        each_category = r['item'][e]['category']['text']
        each_dcType = r['item'][e]['dcType']
        dcType = "/".join(each_dcType)
        year, month, day, week, HMS = convert_to_date(each_date)
        condition = {'02_url': each_url}  # match each article by its URL
        ABC = mycol.find_one(condition)
        ABC['03_year'] = year
        ABC['04_month'] = month
        ABC['05_day'] = day
        ABC['06_week'] = week
        ABC['07_HMS'] = HMS
        ABC['03_dcRights'] = each_dcRights
        ABC['03_category'] = each_category
        ABC['03_dcType'] = dcType
        result = mycol.update_one(condition, {'$set': ABC})

Result:

Next, fetch each article's body text; here newspaper extracts the main text automatically.

# Fetch the article body with newspaper
lost_url_list = []  # URLs that failed to download or parse

def get_text(each_url):
    try:
        article = Article(each_url)
        article.download()
        article.parse()
        return article.text
    except Exception:
        lost_url_list.append(each_url)  # on failure, record the URL
        return ''

for x in mycol.find():
    # Store the article body
    article = get_text(x['02_url'])
    condition = {'02_url': x['02_url']}
    ABC = mycol.find_one(condition)
    ABC['08_article_contents'] = article
    result = mycol.update_one(condition, {'$set': ABC})

Result:

Automatic Translation

Here I used the translation method bundled with TextBlob; note this step requires a network that can reach Google. It apparently called an unofficial Google Translate API, and the method has since been deprecated:

TextBlob.translate() and TextBlob.detect_language are deprecated. Use the official Google Translate API instead (#215).
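As the deprecation notice suggests, the official API is the way forward. A possible replacement, sketched with the official google-cloud-translate client (this assumes a configured Google Cloud project and credentials, and is not what this post originally used):

# Sketch only: requires `pip install google-cloud-translate` plus GCP credentials
from google.cloud import translate_v2 as translate

client = translate.Client()
res = client.translate('Some headline', source_language='en', target_language='zh-CN')
print(res['translatedText'])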

Translating the news titles:

# Auto-translate titles
results = mycol.find({"01_title_unreliable_translation": {"$exists": False}})
for r in results:
    if r['01_title'] != '':
        condition = {'02_url': r['02_url']}
        ABC = mycol.find_one(condition)
        translated = TextBlob(r['01_title']).translate(from_lang="en", to='zh-CN')
        ABC['01_title_unreliable_translation'] = str(translated)
        result = mycol.update_one(condition, {'$set': ABC})
    else:
        print('The title is empty: ' + r['02_url'])

Result:

Translating the article bodies:

# Auto-translate article bodies
results = mycol.find({"08_article_contents_unreliable_translation": {"$exists": False}})
for r in results:
    if r['08_article_contents'] != '':
        condition = {'02_url': r['02_url']}
        ABC = mycol.find_one(condition)
        translated = TextBlob(r['08_article_contents']).translate(from_lang="en", to='zh-CN')
        ABC['08_article_contents_unreliable_translation'] = str(translated)
        result = mycol.update_one(condition, {'$set': ABC})
    else:
        print('The article is empty: ' + r['02_url'])

Result:

Text Analysis

Extract keywords from the article bodies and compute their frequencies.

import string

# Punctuation-stripping helper (assumed here; the original does not show its definition)
def strip_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

# Stopword list used for filtering; populated elsewhere (see the sketch below)
dic = set()

# Compute word frequencies, sorted by count in descending order
def get_frequency(contents):
    contents = contents.lower()  # lowercase
    contents = strip_punctuation(contents)  # strip punctuation
    res = [w for w in contents.split() if w not in dic]  # drop stopwords
    fdist1 = FreqDist(res)  # word -> count mapping
    return fdist1.most_common()  # [('word1', n1), ('word2', n2), ...]
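The original does not show where the stopword list dic comes from; one plausible way to populate it, sketched here, is NLTK's English stopword corpus:

# A sketch, assuming NLTK stopwords are an acceptable filter list
# (run nltk.download('stopwords') once beforehand)
from nltk.corpus import stopwords
dic = set(stopwords.words('english'))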

# Store the top-10 keywords and their frequencies
for x in mycol.find():
    words_frequency = get_frequency(x['08_article_contents'])
    if len(words_frequency) > 9:  # skip articles with fewer than 10 keywords
        condition = {'02_url': x['02_url']}
        ABC = mycol.find_one(condition)
        for i in range(10):
            b = i + 1  # keyword rank 1..10
            a = str(i + 9).zfill(2)  # field-name prefixes 09..18
            ABC[a + '_keyword_' + str(b)] = words_frequency[i][0]
            ABC[a + '_keyword_freq_' + str(b)] = words_frequency[i][1]
        result = mycol.update_one(condition, {'$set': ABC})
    else:
        print('fewer than 10 keywords')
        for word, freq in words_frequency:
            print(word, freq)

Result:

Use the textblob.sentiments module to run sentiment analysis on the article bodies. textblob.sentiments ships two implementations: PatternAnalyzer (the default) and NaiveBayesAnalyzer, which is trained on a movie-review corpus.

A custom sentiment classifier can also be built with the textblob.classifiers module; see the TextBlob documentation for details.
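As a quick illustration of switching analyzers (a sketch, not from the original post; NaiveBayesAnalyzer needs NLTK's movie_reviews corpus downloaded first):

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

# Default PatternAnalyzer: Sentiment(polarity, subjectivity)
print(TextBlob("The report paints a grim picture.").sentiment)

# NaiveBayesAnalyzer: Sentiment(classification, p_pos, p_neg)
blob = TextBlob("The report paints a grim picture.", analyzer=NaiveBayesAnalyzer())
print(blob.sentiment)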

# TextBlob polarity and subjectivity analysis (article bodies)
for x in mycol.find():
    if x['08_article_contents'] != '':
        test = TextBlob(x['08_article_contents'])
        condition = {'02_url': x['02_url']}
        ABC = mycol.find_one(condition)
        ABC['20_polarity'] = test.sentiment.polarity
        ABC['21_subjectivity'] = test.sentiment.subjectivity
        result = mycol.update_one(condition, {'$set': ABC})
    else:
        print('The article is empty: ' + x['02_url'])

Result:

polarity: a float in the range [-1, 1], where 1 marks a fully positive statement and -1 a fully negative one.
subjectivity: the degree to which personal opinion and feeling, rather than factual information, color the text; a float in the range [0, 1].
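For instance, with the default PatternAnalyzer (an illustrative example; the values come from the pattern lexicon):

from textblob import TextBlob
print(TextBlob("I love this").sentiment)
# e.g. Sentiment(polarity=0.5, subjectivity=0.6)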

Also run MonkeyLearn's sentiment-analysis API over the article bodies, as a point of comparison.

# MonkeyLearn sentiment analysis
ml = MonkeyLearn('34088daed813186e7a00xxxxxxxxxxxxxxxxx')  # API key (redacted)
model_id = 'cl_pi3C7JiL'  # pre-built sentiment model
for x in mycol.find():
    data = [x['08_article_contents']]
    result = ml.classifiers.classify(model_id, data)
    if not result.body[0]['error']:
        condition = {'02_url': x['02_url']}
        ABC = mycol.find_one(condition)
        ABC['19_Sentiment'] = result.body[0]['classifications'][0]['tag_name']
        ABC['19_Sentiment_Confidence'] = result.body[0]['classifications'][0]['confidence']
        mycol.update_one(condition, {'$set': ABC})
    else:
        print('error: ' + x['02_url'])

Result:

📘 Summary

TextBlob is in fact built on NLTK and pattern, but it makes natural language processing (NLP) much simpler and is a good fit for getting started with NLP.

TextBlob's feature set is fairly rich: noun phrase extraction, part-of-speech tagging, sentiment analysis, Naive Bayes and decision tree classification, tokenization (splitting text into words and sentences), word and phrase frequencies, parsing, n-grams, word inflection (pluralization and singularization) and lemmatization, spelling correction, adding new models or languages through extensions, and WordNet integration.
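A few of these in one quick, illustrative sketch (noun-phrase extraction and POS tagging need the corpora installed by TextBlob's download_corpora script):

from textblob import TextBlob

b = TextBlob("TextBlob makes natural language processing simple and fun.")
print(b.noun_phrases)  # noun-phrase extraction
print(b.tags)          # part-of-speech tagging
print(b.words)         # tokenization into words
print(b.sentiment)     # sentiment analysis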

📙 References

TextBlob: Simplified Text Processing
Exploratory Data Analysis for Natural Language Processing