[Python Web Scraping: Bulk-Downloading Court Judgments from PKULaw (pkulaw)] Why Is Such a Seemingly Simple Task So Troublesome? Cracking Slider CAPTCHAs with Selenium | Parsing HTML with XPath | Text Analysis with Regular Expressions

Requirements

This is a legal research project: the task is to access the PKULaw case database through the official website of the Shenzhen Lawyers Association (szlawyers.com), download court judgments related to the Trademark Law, and extract structured data from them for further academic research.

So there are two main data requirements:

I. Judgment texts: the 2019 and the 2013 Trademark Law

1. To find judgments under the latest version of the Trademark Law, search for "商标法" and click "中华人民共和国商标法(2019修正)" (Trademark Law of the PRC, 2019 Amendment)

2. Go to Article 58 and click 司法案例 (judicial cases): there are 4,203 cases, 2,116 of them public; the year chart shows 2019 (335), 2020 (941), 2021 (703), 2022 (109)


3. Download all judgments from November 1, 2019 (the date the new Trademark Law took effect) up to May 1, 2022; these are the judgments under the current Trademark Law


4. For the 2013 version, search for "商标法" again and click "中华人民共和国商标法(2013修正)" (Trademark Law of the PRC, 2013 Amendment)

5. Go to Article 58 and click judicial cases: there are 4,204 cases, 2,072 of them public; the chart shows 2017 (452), 2018 (596), 2019 (523)


6. Download all judgments from May 1, 2017 to October 31, 2019; these are the judgments under the old Trademark Law

II. Structured fields extracted from the judgment texts.

The target fields fall into two groups: those that can be parsed directly from the HTML, and those that require further text analysis.

The fields extracted so far:

code                         PKULaw citation code
title                        judgment title
href                         link to the original text
case_reason                  cause of action
case_no                      case number
judges                       all trial judges
instrument_type              document type
completion_date              date the trial concluded
case_type                    case type
proceeding                   trial procedure
law_firms                    representing lawyers / law firms
keywords                     rights-and-liabilities keywords
legal_count                  number of legal-basis articles
legal                        legal basis
text                         full body text
court                        trial court
court_level                  court level
province                     court province
city                         court city
start_date                   date of acceptance
subject_1                    party 1
subject_1_rep                party 1's representative
subject_2                    party 2
subject_2_rep                party 2's representative
subject_text                 full text of the parties section
subject_1_dict               party-1 dictionary (multiple-party cases)
subject_2_dict               party-2 dictionary (multiple-party cases)
subject_1_list_true          all party-1 entries
subject_1_rep_list_true_set  all party-1 representatives
subject_2_list_true          all party-2 entries
subject_2_rep_list_true      all party-2 representatives
case_result                  full text of the ruling
eco_fee                      damages the defendant pays the plaintiff
eco_fee_beigao               paying party (defendant)
eco_fee_yuangao              receiving party (plaintiff)

A few more fields have not been extracted yet because the decision rules for them are still missing:

  • acceptance fee and which party bears it
  • ToB or ToC
  • whether the case is Internet-based

Obtaining the judgments

For the judgment texts themselves, the site offers TXT, HTML, and other download formats; to ease field parsing, I judged HTML the most convenient format to store. Two approaches come to mind:

  • Approach 1: crawl the URLs, then visit each URL and parse the data

  • Approach 2: download the HTML through official channels and parse the data from the local files

Approach 1: crawling

Cracking the slider CAPTCHA with Selenium to collect the URLs

I first tried a crawler. Analyzing the PKULaw pages (I log in through the lawyers association account, so the URL prefix is szlx; logging in through the official pkulaw site, it is www), I found that the site turns on anti-crawling past page 20: every subsequent page turn demands a slider CAPTCHA. So Selenium is needed to automate the login, paired with image recognition to crack the slider.

Taking the 2019 Trademark Law judgments as the example, the rough plan:

1. Simulate the login and enter the account and password

2. Simulate the clicks to reach the page to be downloaded

3. The first 20 pages turn without a slider; from page 21 on, crack the slider CAPTCHA (adapted from an article on CSDN)

4. On every page turn, grab the URLs of the 10 judgments on that page (these URLs are later visited one by one to parse and collect the data)

The complete code for this part, simulated login, slider cracking, and URL collection, follows. (Note: Selenium needs a chromedriver matching your Chrome version.)

# -*- coding: utf-8 -*-
import random
import six
import os,base64
import time, re
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from PIL import Image
import requests
import io
from io import BytesIO
import cv2


class qjh(object):

    def __init__(self):

        chrome_option = webdriver.ChromeOptions()
        self.driver = webdriver.Chrome(executable_path=r"/Users/fubo/Desktop/Git/Law_projects/chromedriver", chrome_options=chrome_option)
        self.driver.set_window_size(1440, 900)

        # Set the pages to crawl here
        self.allurl = list(range(1, 20))

    # Core routine: log in, navigate, page through results, collect URLs
    def core(self):
        url = "https://szlx.pkulaw.com/clink/pfnl/chl/937235cafaf2a66fbdfb/58_0_0_0.html"
        self.driver.get(url)

        # Simulate login: enter account and password
        zhanghao = WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="email-phone"]'))
        )
        zhanghao.send_keys('XXX')  # replace XXX with your own account
        mima = WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="password"]'))
        )
        mima.send_keys('XXX')  # replace XXX with your own password
        time.sleep(1)  # wait for the page to load
        denglu = WebDriverWait(self.driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, '//a[@class="btn-submit"]')))
        denglu.click()
        time.sleep(2)  # wait for the page to load

        # Enter the "法宝推荐" (PKULaw recommended) list; note the title
        # attribute embeds the live result count and may need updating
        tuijian = WebDriverWait(self.driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, '//a[@title="法宝推荐 (2095)"]')))
        tuijian.click()
        more = WebDriverWait(self.driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, '//a[i[@class="c-icon c-eye"]]')))
        more.click()

        # Page through the result list
        for i in self.allurl:
            # Type the page number into the jump box and click the jump button
            page_input = WebDriverWait(self.driver, 100).until(
                EC.presence_of_element_located((By.XPATH, '//*[@name="jumpToNum"]')))
            page_input.clear()  # send_keys appends, so clear the box first
            page_input.send_keys(i)

            button = WebDriverWait(self.driver, 100).until(
                EC.element_to_be_clickable((By.XPATH, '//a[@class="jumpBtn"]')))
            self.driver.execute_script("arguments[0].click();", button)
            time.sleep(4)  # wait for the page to load

            # -----------------
            # Slider cracking: only needed from page 21 onward; uncomment to use
            # cut_image_url, cut_location = self.get_image_url('//div[@class="cut_bg"]')
            # cut_image = self.mosaic_image(cut_image_url, cut_location)
            # cut_image.save("cut.jpg")
            #
            # self.driver.execute_script(
            #     "var x=document.getElementsByClassName('xy_img_bord')[0];"
            #     "x.style.display='block';"
            # )
            # cut_fullimage_url = self.get_move_url('//div[@class="xy_img_bord"]')
            # resq = base64.b64decode(cut_fullimage_url[22::])
            #
            # with open('2.jpg', 'wb') as f:
            #     f.write(resq)
            #
            # res = self.FindPic("cut.jpg", "2.jpg")
            # res = res[3][0]  # x coordinate of the best match (maxLoc)
            #
            # self.start_move(res)
            # time.sleep(2)  # wait for the verification to pass
            # -----------------

            # Collect the URLs of the 10 judgments listed on this page
            arts = []
            elems = self.driver.find_elements_by_xpath("//ul/li/div/div/h4/a[1]")
            for elem in elems:
                arts.append(elem.get_attribute("href"))
            print(arts)
            time.sleep(2)  # wait before the next page turn


    def FindPic(self, target, template):
        """
        Find the best template-match position inside an image.
        :param target: the background image to search in
        :param template: the image patch to locate
        :return: (min_val, max_val, min_loc, max_loc) from cv2.minMaxLoc
        """
        target_rgb = cv2.imread(target)
        target_gray = cv2.cvtColor(target_rgb, cv2.COLOR_BGR2GRAY)
        template_gray = cv2.imread(template, 0)  # read the template as grayscale
        res = cv2.matchTemplate(target_gray, template_gray, cv2.TM_CCOEFF_NORMED)
        value = cv2.minMaxLoc(res)
        return value

    def get_move_url(self, xpath):
        # Get the slider image URL from the element's inline style
        link = re.compile(
            'background-image: url\("(.*?)"\); ')  # must match the style string exactly, including spaces
        elements = self.driver.find_elements_by_xpath(xpath)
        image_url = None

        for element in elements:
            style = element.get_attribute("style")
            groups = link.search(style)
            url = groups[1]
            image_url = url
        return image_url


    def get_image_url(self, xpath):
        # Get the scrambled background image URL and each tile's background-position
        link = re.compile('background-image: url\("(.*?)"\); width: 30px; height: 100px; background-position: (.*?)px (.*?)px;')  # must match the style string exactly, including spaces
        elements = self.driver.find_elements_by_xpath(xpath)
        image_url = None
        location = list()
        for element in elements:

            style = element.get_attribute("style")
            groups = link.search(style)

            url = groups[1]
            x_pos = groups[2]
            y_pos = groups[3]

            location.append((int(x_pos), int(y_pos)))
            image_url = url
        return image_url, location

    # Reassemble the scrambled background image from its 30x100 tiles
    def mosaic_image(self, image_url, location):

        resq = base64.b64decode(image_url[22::])  # strip the "data:image/...;base64," prefix
        with open('1.jpg', 'wb') as f:
            f.write(resq)
        with open("1.jpg", 'rb') as f:
            imageBin = f.read()
        # BytesIO lets PIL read the bytes from memory as if from a file
        buf = io.BytesIO(imageBin)
        img = Image.open(buf)
        image_upper_lst = []
        image_down_lst = []
        count = 1
        # Tiles are scattered via background-position; map each one back to its
        # true spot: the first 10 tiles form the upper row, the rest the lower
        for pos in location:
            if pos[0] > 0:
                if count <= 10:
                    if pos[1] == 0:

                        image_upper_lst.append(img.crop((abs(pos[0]-300), 0, abs(pos[0]-300) + 30,100)))
                    else:
                        image_upper_lst.append(img.crop((abs(pos[0] - 300), 100, abs(pos[0] - 300) + 30, 200)))
                else:
                    if pos[1] == 0:

                        image_down_lst.append(img.crop((abs(pos[0]-300), 0, abs(pos[0]-300) + 30,100)))
                    else:
                        image_down_lst.append(img.crop((abs(pos[0] - 300), 100, abs(pos[0] - 300) + 30, 200)))
            elif pos[0]<=-300:

                if count <= 10:
                    if pos[1] == 0:

                        image_upper_lst.append(img.crop((abs(pos[0]+300), 0, abs(pos[0]+300) + 30,100)))
                    else:
                        image_upper_lst.append(img.crop((abs(pos[0] +300), 100, abs(pos[0]+ 300) + 30, 200)))
                else:
                    if pos[1] == 0:

                        image_down_lst.append(img.crop((abs(pos[0] + 300), 0, abs(pos[0] + 300) + 30, 100)))
                    else:
                        image_down_lst.append(img.crop((abs(pos[0] + 300), 100, abs(pos[0] + 300) + 30, 200)))
            else:
                if count <= 10:
                    if pos[1] == 0:

                        image_upper_lst.append(img.crop((abs(pos[0] ), 0, abs(pos[0] ) + 30, 100)))
                    else:
                        image_upper_lst.append(img.crop((abs(pos[0] ), 100, abs(pos[0] ) + 30, 200)))
                else:
                    if pos[1] == 0:

                        image_down_lst.append(img.crop((abs(pos[0] ), 0, abs(pos[0] ) + 30, 100)))
                    else:
                        image_down_lst.append(img.crop((abs(pos[0] ), 100, abs(pos[0] ) + 30, 200)))
            count+=1
        x_offset = 0
        # Create a 300x200 canvas; x_offset tracks the horizontal paste position
        new_img = Image.new("RGB", (300, 200))
        for img in image_upper_lst:
            new_img.paste(img, (x_offset, 0))
            x_offset += 30

        x_offset = 0
        for img in image_down_lst:
            new_img.paste(img, (x_offset, 100))
            x_offset += 30

        return new_img
    def start_move(self, distance):
        element = self.driver.find_element_by_xpath('//div[@class="handler handler_bg"]')
        # Press and hold the left mouse button on the slider handle
        ActionChains(self.driver).click_and_hold(element).perform()
        time.sleep(0.5)
        while distance > 0:
            if distance > 10:
                # Far from the gap: move in larger steps
                span = random.randint(10, 15)
            else:
                # Near the gap: slow down
                span = random.randint(2, 3)
            ActionChains(self.driver).move_by_offset(span, 0).perform()
            distance -= span
            time.sleep(random.randint(10, 50) / 100)

        ActionChains(self.driver).move_by_offset(distance, 1).perform()
        ActionChains(self.driver).release(on_element=element).perform()

if __name__ == "__main__":

    h = qjh()
    h.core()
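The script above only prints each page's batch of URLs. The parsing script in the next section reads its input from list1.txt, so a minimal bridging step (my assumption; the original saving code is not shown) would append each batch to that file:

# Hypothetical bridge: persist each page's URLs to list1.txt,
# the file the requests-based parser below reads back in.
def save_urls(urls, path="list1.txt"):
    with open(path, "a", encoding="utf-8") as f:
        for u in urls:
            f.write(u + "\n")

# e.g. inside the paging loop, right after `print(arts)`:
# save_urls(arts)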

Visiting the URLs with requests and parsing the HTML

But this approach, crawling the URLs and then visiting each one to parse the data, hit two snags:

1. The script above collected the URLs of the 2,000-plus public judgments under the 2019 Trademark Law, but for reasons not yet understood many of them are duplicates [unresolved]: after deduplication only about 1,400 remain. This is a serious problem, because without complete data the accuracy of the research is compromised. A deduplication stopgap is sketched below.
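Deduplicating while preserving crawl order is at least cheap (a minimal sketch; it does not explain where the duplicates come from):

# Deduplicate the collected URLs while keeping their original order.
with open("list1.txt", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]
unique_urls = list(dict.fromkeys(urls))  # dict keys preserve insertion order
print(len(urls), "->", len(unique_urls))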


2. I then visited each collected URL with requests plus the login cookies and parsed the data. In a trial run of several hundred records, most fields were extracted completely.



But before the tests were even finished, the account got banned… PKULaw's risk controls are clearly strict, and with this much data to fetch, the plain-requests route is not workable.
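In hindsight, pacing the requests might at least have delayed the ban (a sketch under the untested assumption that randomized delays and a shared session lower the risk; PKULaw's actual thresholds are unknown):

import random
import time
import requests

session = requests.Session()  # reuse the headers/cookies defined below

def polite_get(url, min_delay=3.0, max_delay=8.0):
    # A random pause before each request avoids a metronome-like access pattern
    time.sleep(random.uniform(min_delay, max_delay))
    resp = session.get(url, timeout=30)
    resp.encoding = 'utf-8'
    return resp.text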


The (incomplete) requests-based parsing code:

# -*- encoding:utf-8 -*-
import csv
import random
import requests
from lxml import etree
import re
from bs4 import BeautifulSoup
from pathlib import Path

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
cookies ={'Cookie':
'activeRegisterSource=; authormes=28f756df11b64b43ca9e7b75b4360ba071f17e640dad3fd063b0147764b4420a75d6e28a98d67008bdfb; xCloseNew=24; KC_ROOT_LOGIN=1; KC_ROOT_LOGIN_LEGACY=1; pkulaw_v6_sessionid=3onaddwlj3g4bec2dx04wmhh; Hm_lvt_6f2378a3818b877bcf6e084e77f97ae1=1658551068,1658565609,1658583356,1658624464; Hm_lvt_8266968662c086f34b2a3e2ae9014bf8=1658551068,1658565609,1658580776,1658624464; userislogincookie=true; Hm_up_8266968662c086f34b2a3e2ae9014bf8=%7B%22uid_%22%3A%7B%22value%22%3A%2207363b75-8f34-eb11-b390-00155d3c0709%22%2C%22scope%22%3A1%7D%2C%22ysx_yhqx_20220602%22%3A%7B%22value%22%3A%220%22%2C%22scope%22%3A1%7D%2C%22ysx_yhjs_20220602%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22ysx_hy_20220527%22%3A%7B%22value%22%3A%2206%22%2C%22scope%22%3A1%7D%7D; Hm_lpvt_6f2378a3818b877bcf6e084e77f97ae1=1658628608; Hm_lpvt_8266968662c086f34b2a3e2ae9014bf8=1658628608'
          }

url_list = []
def open_txt():
    with open("list1.txt", "r", encoding="utf-8") as f:
        for line in f.readlines():
            url_list.append(line.replace('\n', ''))

def send_request(url):
    resp = requests.get(url, cookies = cookies, headers=headers)
    resp.encoding = 'utf-8'
    return resp.text

def save(film_content):
    with open('const.csv', 'a+', newline='', encoding='gb18030') as f:
        writer = csv.writer(f)
        writer.writerow(film_content)
    print('File saved successfully')

open_txt()
items = []

# XPath parsing
for num_text, url in enumerate(url_list[:50]):
    # Reset all fields for this record
    title = ''; case_reason = ''; case_no = ''; judges = ''; instrument_type = ''
    court = ''; completion_date = ''; case_type = ''; proceeding = ''; law_firms = ''
    keywords = ''; court_level = ''; province = ''
    # Convert the response text into an lxml element tree
    e = etree.HTML(send_request(url))
    text = e.xpath("//div[@class='fulltext']//text()")
    text = " ".join(text).strip()
    if e.xpath("//div[@class='fields']/ul//li[1]//text()"):
        try:
            url_one = 'https://szlx.pkulaw.com/AssociationMore/pfnl/' + url.split('/')[-1].split('.')[0] + '/19_0.html'
            url_text = requests.get(url=url_one, headers=headers)

            obj3 = re.compile(r"本案法律依据 (.*?)<", re.S)
            legal_count = obj3.findall(url_text.text)[0]
            list11 = etree.HTML(url_text.text)
            url_text_title = list11.xpath('//*[@id="rightContent"]/div/div/div[2]/ul/li')
            legal = []
            for i in url_text_title:
                url_text_title1 = i.xpath('./div/div/h4/a//text()')
                legal.append(url_text_title1)
                # print(url_text_title1)
        except:
            url_text_title1 = ''
            legal_count = ''
            legal = ''

        # Parties
        text = e.xpath('//div[@id="divFullText"]//text()')
        wor = str(text)
        word = wor.split('审理经过')[0]
        try:
            if ('原告:' in word):
                obj1 = re.compile(r"原告:(.*?),", re.S)
                obj2 = re.compile(r"被告:(.*?),", re.S)
                subject_1 = obj1.findall(word)[0]
                subject_2 = obj2.findall(word)[0]
                obj66 = re.compile(r"原告:.*?,(.*?)。", re.S)
                obj65 = re.compile(r"被告:.*?,(.*?)。", re.S)
                yuangao_adress = obj66.findall(word)[0].replace("\\u3000", "")
                beigao_adress = obj65.findall(word)[0].replace("\\u3000", "")
            elif ('上诉人' in word):
                obj1 = re.compile(r"上诉人(.*?),", re.S)
                obj2 = re.compile(r"被上诉人(.*?),", re.S)
                subject_1 = obj1.findall(word)[0]
                subject_2 = obj2.findall(word)[0]
                obj66 = re.compile(r"上诉人.*?,(.*?)。", re.S)
                obj65 = re.compile(r"被上诉人.*?,(.*?)。", re.S)
                yuangao_adress = obj66.findall(word)[0].replace("\\u3000", "")
                beigao_adress = obj65.findall(word)[0].replace("\\u3000", "")
            elif ('再审申请人' in word):
                obj1 = re.compile(r"再审申请人:(.*?),", re.S)
                obj2 = re.compile(r"被申请人:(.*?),", re.S)
                subject_1 = obj1.findall(word)[0]
                subject_2 = obj2.findall(word)[0]
                obj66 = re.compile(r"再审申请人:.*?,(.*?)。", re.S)
                obj65 = re.compile(r"被申请人:.*?,(.*?)。", re.S)
                yuangao_adress = obj66.findall(word)[0].replace("\\u3000", "")
                beigao_adress = obj65.findall(word)[0].replace("\\u3000", "")
        except:
            subject_1 = ''
            subject_2 = ''
            yuangao_adress = ''
            beigao_adress = ''

        obj4 = re.compile(r"判决如下:(.*?)落款", re.S)
        result_1 = obj4.findall(word)
        case_result = ''.join(result_1).replace("'", "").replace("\\u3000", "").replace(',', '').strip()
        code = e.xpath('//*[@id="gridleft"]/div/div[1]/div[1]/span/a/text()')[0]
        box_list = str(e.xpath('//*[@id="gridleft"]/div/div[1]/div[3]//text()'))

        instrument_type = re.compile(r"文书类型(.*?)公开类型", re.S)
        instrument_type = ''.join(instrument_type.findall(box_list)).replace('\\r\\n ', '').replace('\\t', '').replace(
            ',', '').replace("'",
                             '').replace(
            ' ', '')
        # Determine how many metadata rows the info box has: if a
        # '审理法官:' (judges) row exists there are 11 rows, otherwise 10
        for i in range(2, 6):
            contant = e.xpath(f"//div[@class='fields']/ul//li[{i}]//text()")
            if "".join(contant).split()[0] == '审理法官:':
                num = 11
                break
            else:
                num = 10
        # Check whether a '代理律师/律所:' (lawyers/law firms) row exists
        for i in range(2, 10):
            contant = e.xpath(f"//div[@class='fields']/ul//li[{i}]//text()")
            if "".join(contant).split()[0] == '代理律师/律所:':
                flag = 1
                break
            else:
                flag = 0
        if flag == 0:
            num -= 1
        code = e.xpath("//div[@class='info']/span/a/text()")
        code = "".join(code)
        title = e.xpath("//div/h2/text()")
        title = "".join(title).strip()
        case_reason = e.xpath("//div[@class='fields']/ul/li[@class='row'][1]/div[@class='box']/a/text()")
        case_reason = "".join(case_reason)
        case_no = e.xpath("//div[@class='fields']/ul[1]//li//@title")[1]
        text = e.xpath("//div[@class='fulltext']/div//text()")
        text = " ".join(text).strip()
        # Walk the metadata rows and dispatch on the row label
        for i in range(2, num):
            contant = e.xpath(f"//div[@class='fields']/ul//li[{i}]//text()")
            data = "".join(contant).split()[0]
            if data == '审理法官:':
                judges = "".join(contant).split()[1::]
                judges = ",".join(judges)
            elif data == '文书类型:':
                instrument_type = "".join(contant).split()[1]
            elif data == '公开类型:':
                # note: this overwrites instrument_type with the public-type value
                instrument_type = "".join(contant).split()[1]
            elif data == '审理法院:':
                court = "".join(contant).split()[1]
                if ("最高" in court):
                    court_level = '最高'
                    province = '空'
                    city = '空'
                elif ("高级" in court):
                    court_level = '高级'
                    if ("省" in court):
                        province = court.split('省')[0]
                        if ("市" in court):
                            city = court.split("省")[1].split('市')[0]
                        else:
                            city = '空'
                    elif ("自治州" in court):
                        province = court.split('自治州')[0]
                        if ("市" in court):
                            city = court.split("自治州")[1].split('市')[0]
                    elif ("市" in court):
                        province = court.split('市')[0]
                        city = '空'
                    else:
                        city = '空'
                        province = '空'
                elif ("中级" in court):
                    court_level = '中级'
                    if ("省" in court):
                        province = court.split('省')[0]
                        if ("市" in court):
                            city = court.split("省")[1].split('市')[0]
                    elif ("自治州" in court):
                        province = court.split('自治州')[0]
                        if ("市" in court):
                            city = court.split("自治州")[1].split('市')[0]
                    elif ("市" in court):
                        province = court.split('市')[0]
                        city = '空'
                    else:
                        city = '空'
                        province = '空'
                elif ("知识产权" in court):
                    court_level = '专门'
                    city = '空'
                    province = '空'
                elif ("区" or "县" in court):
                    court_level = '基层'
                    if ("省" in court):
                        province = court.split('省')[0]
                        if ("市" in court):
                            city = court.split("省")[1].split('市')[0]
                    elif ("自治州" in court):
                        province = court.split('自治州')[0]
                        if ("市" in court):
                            city = court.split("自治州")[1].split('市')[0]
                    elif ("市" in court):
                        province = court.split('市')[0]
                        city = '空'
                    else:
                        city = '空'
                        province = '空'
                else:
                    court_level = ''
                    city = '空'
                    province = '空'
            elif data[0:5:] == '审结日期:':
                completion_date = data[5::]
            elif data == '案件类型:':
                case_type = "".join(contant).split()[1]
            elif data == '审理程序:':
                proceeding = "".join(contant).split()[1]
            elif data == '代理律师/律所:':
                law_firms = "".join(contant).split()[1::]
                law_firms = ",".join(law_firms).replace(',,', '').replace(',;', '')
            elif data == '权责关键词:':
                keywords = "".join(contant).split()[1::]
                keywords = ",".join(keywords)
        item = [num_text + 1, code, title, url, case_reason, case_no, judges, instrument_type, court, court_level,
                province, city, completion_date, case_type,
                proceeding, law_firms, keywords,instrument_type,subject_1,subject_2,case_result,legal_count,legal, text]
        print(f"第{num_text + 1}个爬取成功,titile是{title}")
        save(item)



Given these obstacles, two possible ways forward came to mind:

(1) [Uncertain whether feasible] For the account bans, buying an IP proxy pool and rotating IPs while crawling might work (untried; a sketch follows this list).

(2) [Feasible] Download the HTML files directly through official channels, then extract the fields programmatically with text analysis. This is what I try next.
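For the record, rotating proxies with requests would look roughly like this (a sketch under the assumption that a paid proxy pool is available; the addresses are placeholders, and I have not verified this against PKULaw's risk controls):

import random
import requests

# Hypothetical pool; real entries would come from a proxy provider
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
]

def get_with_proxy(url, headers, cookies):
    proxy = random.choice(PROXY_POOL)
    # requests routes both schemes through the chosen proxy
    return requests.get(url, headers=headers, cookies=cookies,
                        proxies={"http": proxy, "https": proxy}, timeout=30)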

Approach 2: official channels

I first contacted the PKULaw account manager through the lawyers association and got a download-enabled account. But it allows at most 200 downloads per day, and batch download handles only 10 documents at a time, so the 4,000-plus judgments would take 20-plus days and 400-plus clicks… I deeply dislike this kind of mindless manual clicking, even when the click count is not huge, so there were two ways out:

Option 1: ask the PKULaw manager whether a back-end bulk export is possible. A week has passed, with one or two follow-ups, and I am still waiting for a reply…


Option 2: if a back-end export is not possible, the drudgery has to be automated. My first thought was pyautogui for automated clicking (I once used it to download Wind data, which spared my hands), but since PKULaw is a web page, Selenium alone should do; pyautogui is more useful for automating non-web applications. A sketch of the idea follows.
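For reference, the pyautogui route would look roughly like this (a sketch under stated assumptions: a screenshot of the batch-download button saved as download_btn.png, and opencv installed so locateCenterOnScreen accepts a confidence value; delays would need tuning on the real page; this is not the method I actually ran):

import time
import pyautogui

# Hypothetical workflow: fire the batch download 20 times
# (200 documents per day / 10 per batch), finding the button by template image
for batch in range(20):
    try:
        pos = pyautogui.locateCenterOnScreen("download_btn.png", confidence=0.8)
    except pyautogui.ImageNotFoundException:
        pos = None
    if pos is None:
        break  # button not on screen; stop rather than click blindly
    pyautogui.click(pos)
    time.sleep(15)  # wait for the batch download to finish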


Even then, the 200-per-day cap means roughly 20 days of downloading, so it is worth finding out whether a few extra accounts can be arranged…

While waiting for feedback, I manually downloaded 250-odd HTML files and wrote the field-extraction code against them.


Parsing fields out of the HTML

With the HTML files local, the program needs to walk the folder, parse each file, and extract the target fields. This sounds simple but is fiddly in practice: the target text has little consistent structure, so the rules must be found by inspection and refined iteratively to cover the many variants. The method:

1. For each field, examine how it appears across judgments, collect the variants, and derive extraction rules from the patterns.

2. Extract fields from the HTML with XPath and encode the textual rules as regular expressions; tools such as 八爪鱼 (Octoparse) and online regex generators/testers help with writing the patterns. A worked example follows this list.

3. Clean up the extracted values with Python's data-processing tools.
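As one worked example, grounded in the acceptance-date rule used in the full code below, the date of acceptance is taken from the sentence pattern 本院于…受理后 (the sample sentence here is illustrative):

import re

sample = "本院于2020年3月12日立案受理后,依法适用普通程序审理。"

# Step 1: grab the span between 本院于 and 后;
# Step 2: pull the first date of the form YYYY年M月D日 out of that span
span = re.compile(r"(?<=本院于)(.+?)(?=后)", re.S).findall(sample)
dates = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日", re.S).findall(str(span))
start_date = dates[0] if dates else ''
print(start_date)  # 2020年3月12日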


The result is the field data listed earlier, which basically fulfills the data requirements.



The complete code:

# -*- encoding:utf-8 -*-
import csv
import os
import re
from lxml import etree
from itertools import zip_longest

# ————————————————————————————————
# Save one record (one CSV row)
def save(film_content):
    with open('../content/text17.csv', 'a+', newline='', encoding='gb18030') as f:
        writer = csv.writer(f)
        writer.writerow(film_content)
    print('File saved successfully')

# List the HTML files in the local folder and cd into it
def get_html():
    path = 'all_html'
    all_html = os.listdir(path)
    os.chdir(path)
    return all_html

# Alternative input (unused): read URLs line by line from a TXT file
# def open_txt():
#     with open("url.txt", "r", encoding="utf-8") as f:
#         for line in f.readlines():
#             url_list.append(line.replace('\n', ''))
# url_list = []

items = []
num_select = 10
all_html = get_html()

# ————————————————————————————————

# Iterate over every local HTML file
for num_text, html in enumerate(all_html):

    # Initialize every field used below
    title = ''
    case_reason = ''
    case_no = ''
    judges = ''
    judges_first = ''
    instrument_type = ''
    court = ''
    completion_date = ''
    case_type = ''
    proceeding = ''
    law_firms = ''
    keywords = ''
    court_level = ''
    province = ''
    subject_1 = ''
    subject_2 = ''
    subject_1_rep = ''
    subject_2_rep = ''
    legal_count = ''
    legal = ''
    subject_1_list_true = []
    subject_1_rep_list_true = []
    subject_2_list_true = []
    subject_2_rep_list_true = []
    city = ''
    word = ''
    wor = ''  # raw repr of the parties section (initialized to avoid NameError)
    subject_1_dict = {}
    subject_2_dict = {}
    subject_1_rep_list_true_set = []
    start_date = ''
    eco_fee = ''
    eco_fee_yuangao = ''
    eco_fee_beigao = ''

    # Parse the local HTML file into an lxml tree (must come before any xpath call)
    e = etree.parse(html, etree.HTMLParser())

    # Full body text: text
    text = e.xpath("//div[@class='fulltext']//text()")
    text = " ".join(text).strip()

    # PKULaw citation code: code
    code = e.xpath("//div[@class='logo']/div/text()")
    code = "".join(code)

    # Judgment title: title
    title = e.xpath("//div/h2/text()")
    title = "".join(title).strip()

    # Original URL: href
    url = e.xpath("//div[@class='link']/a/@href")
    url = "".join(url)

    # Cause of action: case_reason
    case_reason = e.xpath("//div[@class='fields']/ul/li[@class='row'][1]/div[@class='box']/a/text()")
    case_reason = "".join(case_reason)

    # Case number: case_no
    try:
        case_no = e.xpath("//div[@class='fields']/ul[1]//li//@title")[1]
    except Exception:
        continue

    # Presiding judge from the signature block: judges_first
    if ("二〇") in text:
        obj_9 = re.compile("(?<=判决如下)(.+?)(?=二〇)", re.S).findall(text)
    else:
        obj_9 = re.compile("(?<=判决如下)(.+?)(?=二○)", re.S).findall(text)
    if("审 判 长" ) in str(obj_9):
        if("人民陪审员" ) in str(obj_9):
            obj_10 = re.compile(r"(?<=审 判 长)(.+?)(?=人民陪审员)", re.S).findall(str(obj_9))
            judges_first = ''.join(obj_10).replace("\\u3000", '').replace(' ', '').replace("<>style=\"font-family:仿宋;font-size:15pt;-aw-import:spaces\">",'')
        else:
            obj_10 = re.compile(r"(?<=审 判 长)(.+?)(?=审 判 员)", re.S).findall(str(obj_9))
            judges_first = ''.join(obj_10).replace("\\u3000", '').replace(' ', '').replace("<>style=\"font-family:仿宋;font-size:15pt;-aw-import:spaces\">",'')
    elif("审判长" ) in str(obj_9):
        if ("人民陪审员") in str(obj_9):
            obj_10 = re.compile(r"(?<=审判长)(.+?)(?=人民陪审员)", re.S).findall(str(obj_9))
            judges_first = ''.join(obj_10).replace("\\u3000", '').replace(' ', '').replace("<>style=\"font-family:仿宋;font-size:15pt;-aw-import:spaces\">", '')
        else:
            obj_10 = re.compile(r"(?<=审判长)(.+?)(?=审判员)", re.S).findall(str(obj_9))
            judges_first = ''.join(obj_10).replace("\\u3000", '').replace(' ', '').replace("<>style=\"font-family:仿宋;font-size:15pt;-aw-import:spaces\">",'')
    elif("审 判 长" or "审判长") not in str(obj_9):
        if("审判员") in str(obj_9):
            obj_10 = re.compile(r"(?<=审判员 )(.+?)(?= '\])", re.S).findall(str(obj_9))
            judges_first = ''.join(obj_10).replace("\\u3000", '').replace(' ', '').replace("<>style=\"font-family:仿宋;font-size:15pt;-aw-import:spaces\">", '')
        elif("审 判 员") in str(obj_9):
            obj_10 = re.compile(r"(?<=审 判 员 )(.+?)(?= '\])", re.S).findall(str(obj_9))
            judges_first = ''.join(obj_10).replace("\\u3000", '').replace(' ', '').replace("<>style=\"font-family:仿宋;font-size:15pt;-aw-import:spaces\">", '')

    # Date of acceptance: start_date (from the sentence 本院于...受理后)
    obj11 = re.compile(r"(?<=本院于)(.+?)(?=后)", re.S)
    obj111 = str(obj11.findall(text))
    obj12 = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日",re.S)
    start_date_list = obj12.findall(obj111)
    if start_date_list:
        start_date = start_date_list[0]

    # Party and legal-basis fields, from the info box and the full text
    if e.xpath("//div[@class='fields']/ul//li[1]//text()"):

        # Legal basis: legal; number of cited articles: legal_count
        legal = e.xpath("//div[@class='correlation-info'][1]/ul/li/a/text()")
        legal = "".join(legal).replace('\r\n', '').replace('侵权责任法 ', '侵权责任法').replace(') ', ')').split()
        legal_count = len(legal)
        legal = ";".join(legal)

        # Parties: word holds the repr of the section before 审理经过
        text = e.xpath('//div[@id="divFullText"]//text()')
        wor = str(text)
        word = wor.split('审理经过')[0].strip()
        word = str(word)

        # Helper lists
        subject_1_list = []
        subject_1_rep_list = []
        subject_2_list = []
        subject_2_rep_list = []

        # Extract party 1 / party 2 with keyword-anchored regexes; the patterns
        # run against the str() repr of the node list, hence the \u3000 fragments
        if ('再审申请人' in word):
            obj1 = re.compile(r"再审申请人(.+?)(?=',)", re.S)
            subject_1_list = obj1.findall(word)
            if subject_1_list:
                for i in range(0,len(subject_1_list)):
                      subject_1_list_true.append('再审申请人' + str(subject_1_list[i]))
                subject_1 = subject_1_list_true[0]
            obj66 = re.compile(r"再审申请人.*?,(.*?)。", re.S)
            subject_1_rep_list= obj66.findall(word)
            if subject_1_rep_list:
                for i in range(0, len(subject_1_rep_list)):
                    subject_1_rep_list_true.append(subject_1_rep_list[i].replace(" \'\\u3000\\u3000", ''))
                subject_1_rep = str(subject_1_rep_list_true[0])
            obj2 = re.compile(r"被申请人(.+?)(?=',)", re.S)
            subject_2_list = obj2.findall(word)
            if subject_2_list:
                for i in range(0,len(subject_2_list)):
                    subject_2_list_true.append('被申请人' + str(subject_2_list[i]))
                subject_2 = subject_2_list_true[0]
            obj65 = re.compile(r"被申请人.*?,(.*?)。", re.S)
            subject_2_rep_list = obj65.findall(word)
            if subject_2_rep_list:
                for i in range(0, len(subject_2_rep_list)):
                    subject_2_rep_list_true.append(subject_2_rep_list[i].replace(" \'\\u3000\\u3000", ''))
                subject_2_rep = str(subject_2_rep_list_true[0])

        elif ('上诉人' in word):
            obj1 = re.compile(r"3000上诉人\((.+?)(?=', ')", re.S)
            subject_1_list = obj1.findall(word)
            if subject_1_list:
                for i in range(0,len(subject_1_list)):
                      subject_1_list_true.append('上诉人(' + subject_1_list[i])
                subject_1 = subject_1_list_true[0]
            obj66 = re.compile(r"3000上诉人\(.*?,(.*?)。", re.S)
            subject_1_rep_list= obj66.findall(word)
            # print(subject_1_rep_list)
            if subject_1_rep_list:
                for i in range(0, len(subject_1_rep_list)):
                    subject_1_rep_list_true.append(subject_1_rep_list[i].replace(" \'\\u3000\\u3000", ''))
                subject_1_rep = str(subject_1_rep_list_true[0])
            obj2 = re.compile(r"\\u3000被上诉人\((.+?)(?=',)", re.S)
            subject_2_list = obj2.findall(word)
            if subject_2_list:
                for i in range(0, len(subject_2_list)):
                    subject_2_list_true.append('被上诉人'  + subject_2_list[i])
                subject_2 = subject_2_list_true[0]
            obj65 = re.compile(r"被上诉人\(.*?,(.*?)。", re.S)
            subject_2_rep_list = obj65.findall(word)
            # print(subject_2_rep_list)
            if subject_2_rep_list:
                for i in range(0, len(subject_2_rep_list)):
                    subject_2_rep_list_true.append(subject_2_rep_list[i].replace(" \'\\u3000\\u3000", ''))
                subject_2_rep = str(subject_2_rep_list_true[0])

        elif ('原告:' in word):
            obj1 = re.compile(r"原告:(.+?)(?=',)", re.S)
            subject_1_list = obj1.findall(word)
            if subject_1_list:
                for i in range(0, len(subject_1_list)):
                      subject_1_list_true.append('原告:' + str(subject_1_list[i]))
                subject_1 = subject_1_list_true[0]
            obj66 = re.compile(r"原告:.*?,(.*?)。", re.S)
            # print(word)
            subject_1_rep_list = obj66.findall(word)
            # print(subject_1_rep_list)
            if subject_1_rep_list:
                for i in range(0, len(subject_1_rep_list)):
                    subject_1_rep_list_true.append(subject_1_rep_list[i].replace(" \'\\u3000\\u3000", ''))
                subject_1_rep = str(subject_1_rep_list_true[0])
            obj2 = re.compile(r"被告:(.+?)(?=',)", re.S)
            subject_2_list = obj2.findall(word)
            if subject_2_list:
                for i in range(0, len(subject_2_list)):
                    subject_2_list_true.append('被告:'  + str(subject_2_list[i]))
                subject_2 = subject_2_list_true[0]
            obj65 = re.compile(r"被告:.*?,(.*?)。", re.S)
            subject_2_rep_list = obj65.findall(word)
            if subject_2_rep_list:
                for i in range(0, len(subject_2_rep_list)):
                    subject_2_rep_list_true.append(subject_2_rep_list[i].replace(" \'\\u3000\\u3000", ''))
                subject_2_rep = str(subject_2_rep_list_true[0])

        # Deduplicate party-1 representatives while preserving order
        subject_1_rep_list_true_set = sorted(set(subject_1_rep_list_true), key=subject_1_rep_list_true.index)

        # Pair each party with its representative via zip_longest
        subject_1_dict = dict(zip_longest(subject_1_list_true, subject_1_rep_list_true_set))
        subject_2_dict = dict(zip_longest(subject_2_list_true, subject_2_rep_list_true))

        # Debug prints for inspecting the party fields
        # print("-------------------")
        # print(subject_1_dict)
        # print(subject_2_dict)
        # print("subject_1"+ subject_1)
        # print("subject_1_list:",str(subject_1_list_true))
        # print("subject_1_rep"+ subject_1_rep)
        # print("subject_1_rep_list",subject_1_rep_list_true_set)
        # print("subject_2"+ subject_2)
        # print("subject_2_list",subject_2_list_true)
        # print("subject_2_rep"+subject_2_rep)
        # print("subject_2_rep_list",subject_2_rep_list_true)

    # Full ruling text: case_result
    if ('落款' in wor):
        obj4 = re.compile("判决如下(.*?)落款", re.S)
        result_1 = obj4.findall("".join(text))
    else:
        obj4 = re.compile("判决如下(.*?)二〇", re.S)
        result_1 = obj4.findall("".join(text))
    case_result = ''.join(result_1).replace("'", "").replace("\\u3000", "").replace(',', '').strip()

    # Damages: amount, paying party, receiving party
    obj_1 = re.compile(r"(?<=[一|二|三|四|五|六|七|八|九|十]、)(?=被告)(.+?)(?=[于本判决|应于本判决])", re.S).findall(case_result)
    if obj_1:
        eco_fee_beigao =''.join(str(obj_1[0])).replace("被告", '')
    obj_2 = re.compile(r"(?<=赔偿原告)(.+?)(?=经济损失)",re.S).findall(case_result)
    eco_fee_yuangao=''.join(obj_2)
    obj_3 = re.compile(r"(?<=十日内赔偿原告)(.+?)(?=元)",re.S).findall(case_result)
    obj_4 = re.compile(r"\d+\.?\d*", re.S).findall(str(obj_3))
    eco_fee =''.join(obj_4)
    # Acceptance fee and who bears it (TODO; the block below is a first draft)
    # if("二审" in case_result):
    #     if("一审" in case_result):
    #         obj_5 = re.compile(r"(?<=[一审案件受理费|一审受理费])(.+?)(?=元)", re.S).findall(case_result)
    #         obj_6 = re.compile(r"\d+\.?\d*", re.S).findall(str(obj_3))
    #         shouli_1_fee = ''.join(obj_6)


    # Basic metadata from the info box
    try:
        for i in range(2, 12):
            contant = e.xpath(f"//div[@class='fields']/ul//li[{i}]//text()")
            data = "".join(contant).split()[0]
            if data == '审理法官:':
                judges = "".join(contant).split()[1::]
                judges = ",".join(judges)
            elif data == '文书类型:':
                instrument_type = "".join(contant).split()[1]
            elif data == '审理法院:':
                court = "".join(contant).split()[1]

                # Derive court_level, province, and city from the court name
                try:
                    if ("最高" in court):
                        court_level = '最高'
                        province = '空'
                        city = '空'
                    elif ("高级" in court):
                        court_level = '高级'
                        if ("省" in court):
                            province = court.split('省')[0] + '省'

                            if ("市" in court):

                                city = court.split("省")[1].split('市')[0] + '市'
                            else:
                                city = '空'
                        elif ("自治州" in court):
                            province = court.split('自治州')[0] + '自治州'

                            if ("市" in court):
                                city = court.split("自治州")[1].split('市')[0] + '市'
                        elif ("市" in court):
                            province = court.split('市')[0] + '市'
                            city = '空'
                        else:
                            city = '空'
                            province = '空'
                    elif ("中级" in court):
                        court_level = '中级'
                        if ("省" in court):
                            province = court.split('省')[0] + '省'
                            if ("市" in court):
                                city = court.split("省")[1].split('市')[0] + '市'
                        elif ("自治州" in court):
                            province = court.split('自治州')[0] + '自治州'
                            if ("市" in court):
                                city = court.split("自治州")[1].split('市')[0] + '市'
                        elif ("市" in court):
                            province = court.split('市')[0] + '市'
                            city = '空'
                        else:
                            city = '空'
                            province = '空'
                    elif ("知识产权" in court):
                        court_level = '专门'
                        city = '空'
                        province = '空'
                    elif ("区" or "县" in court):
                        court_level = '基层'
                        if ("省" in court):
                            province = court.split('省')[0] + '省'
                            if ("市" in court):
                                city = court.split("省")[1].split('市')[0] + '市'
                        elif ("自治州" in court):
                            province = court.split('自治州')[0] + '自治州'
                            if ("市" in court):
                                city = court.split("自治州")[1].split('市')[0] + '市'
                        elif ("市" in court):
                            province = court.split('市')[0] + '市'
                            city = '空'
                        else:
                            city = '空'
                            province = '空'
                    else:
                        court_level = ''
                        city = '空'
                        province = '空'
                except:
                    pass

            elif data[0:5:] == '审结日期:':
                completion_date = data[5::]
            elif data == '案件类型:':
                case_type = "".join(contant).split()[1]
            elif data == '审理程序:':
                proceeding = "".join(contant).split()[1]
            elif data == '代理律师/律所:':
                law_firms = "".join(contant).split()[1::]
                law_firms = ",".join(law_firms).replace(',,', '').replace(',;', '')
            elif data == '权责关键词:':
                keywords = "".join(contant).split()[1::]
                keywords = ",".join(keywords)

        # Assemble and save the record
        item = [code, title, url, case_reason, case_no, judges, judges_first,instrument_type,completion_date, case_type,proceeding, law_firms, keywords,legal_count, legal,text,
                court, court_level, province, city,start_date,
                subject_1, subject_1_rep, subject_2, subject_2_rep,
                word, subject_1_dict, subject_2_dict, subject_1_list_true, subject_1_rep_list_true_set,subject_2_list_true, subject_2_rep_list_true,
                case_result,eco_fee,eco_fee_beigao,eco_fee_yuangao]
        print(f"第{num_text + 1}个爬取成功,titile是{title}")
        save(item)
    except:
        item = [code, title, url, case_reason, case_no, judges, judges_first,instrument_type,completion_date, case_type,proceeding, law_firms, keywords,legal_count, legal,text,
                court, court_level, province, city,start_date,
                subject_1, subject_1_rep, subject_2, subject_2_rep,
                word, subject_1_dict, subject_2_dict, subject_1_list_true, subject_1_rep_list_true_set,subject_2_list_true, subject_2_rep_list_true,
                case_result,eco_fee,eco_fee_beigao,eco_fee_yuangao]
        print(f"第{num_text + 1}个爬取成功,titile是{title}")
        save(item)
        pass
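Once the CSV exists, the data processing from step 3 can start with a plain load (a sketch; text17.csv and the gb18030 encoding come from save() above, and the file has no header row, hence header=None):

import pandas as pd

# Load the extracted fields for further cleaning and analysis
df = pd.read_csv("../content/text17.csv", encoding="gb18030", header=None)
print(df.shape)
print(df.head())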


Summary

The requirements looked simple, but the project consumed a great deal of time, especially the conditional logic behind the regexes: the patterns across many variants had to be found and generalized before they could be coded. Still, the problem is now essentially solved.

Still to be done:

1. The party fields are mostly extracted, but some edge cases remain uncovered and will need manual fixes (the variables marked in gray are the data those fixes may draw on);

2. The money fields, damages and acceptance fees, have too many variants and still need work;

3. Once the rules for ToB/ToC and the other pending fields are settled, they still need to be implemented;

4. Waiting on the PKULaw manager: if the 4,000-plus HTML files can be exported directly, a single run of the program finishes the job; if not, the downloads will have to be done by hand;

5. The code is still inefficient and should be optimized when time permits.
