Using a Python Crawler to Count the Male/Female Ratio on a School BBS: Data Processing (Part 3)


This article covers the data-processing part of the project.

1. Data Analysis


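The crawler produced text files whose names start with a fixed set of prefixes (correct, errTime, unkownsex and httperror are the ones used below). These files now need to be processed.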


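2. Rollback

The httperror data has to be processed again.

Because of how the code in Part 2 of this series works, the same id can appear several times in a row as an httperror record in one text file:

//httperror265001_266001.txt
265002 httperror
265002 httperror
265002 httperror
265002 httperror
265003 httperror
265003 httperror
265003 httperror
265003 httperror

The code therefore has to handle this case: instead of processing the id on every line, it first checks whether the id is a repeat of the previous one.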
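The duplicate check described above can be sketched as follows. This is a minimal Python 3 illustration, not the article's own code (which is Python 2 and compares whole lines); the helper name `unique_consecutive_ids` is hypothetical, and it assumes every non-empty line starts with the numeric id.

```python
from itertools import groupby


def unique_consecutive_ids(lines):
    """Yield each id once per run of consecutive identical ids.

    Hypothetical helper: assumes each non-empty line looks like
    '265002 httperror', with the id as the first field.
    """
    ids = (line.split()[0] for line in lines if line.strip())
    # groupby() groups *adjacent* equal items, which is exactly the
    # "same id repeated several times in a row" case.
    for uid, _run in groupby(ids):
        yield int(uid)


sample = ["265002 httperror\n"] * 4 + ["265003 httperror\n"] * 4
print(list(unique_consecutive_ids(sample)))  # [265002, 265003]
```

Only adjacent repeats are collapsed; an id that shows up again later in the file would be yielded a second time, which matches the line-by-line comparison used in the article.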

Java has buffered readers that avoid hitting the disk for every read; Python has an equivalent as well (see this article).

import os, re, sys

# searchWeb() comes from the crawler code described in Part 2 of this series.
def main():
    reload(sys)
    sys.setdefaultencoding('utf-8')
    global sexRe, timeRe, notexistRe, url1, url2, file1, file2, file3, file4, startNum, endNum, file5
    sexRe = re.compile(u'em>\u6027\u522b</em>(.*?)</li')
    timeRe = re.compile(u'em>\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4</em>(.*?)</li')
    notexistRe = re.compile(u'(p>)\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')
    url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
    url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'
    file1 = 'ruisicorrect_re.txt'
    file2 = 'ruisierrTime_re.txt'
    file3 = 'ruisinotexist_re.txt'
    file4 = 'ruisiunkownsex_re.txt'
    file5 = 'ruisihttperror_re.txt'
    # Walk through the files in the folder whose names start with httperror
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        if filename.startswith('httperror'):
            count = 0
            newName = r'E:\pythonProject\ruisi\%s' % (filename)
            readFile = open(newName, 'r')
            oldLine = '0'
            for line in readFile:
                # newLine is compared with oldLine to detect a repeated id
                newLine = line
                if (newLine != oldLine):
                    nu = newLine.split()[0]
                    oldLine = newLine
                    count += 1
                    searchWeb((int(nu),))
            readFile.close()
            print "%s deal %s lines" % (filename, count)

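To keep things simple, this code does not sort the re-fetched httperror ids back into their original files; the results are written straight to the following five files:

file1 = 'ruisicorrect_re.txt'
file2 = 'ruisierrTime_re.txt'
file3 = 'ruisinotexist_re.txt'
file4 = 'ruisiunkownsex_re.txt'
file5 = 'ruisihttperror_re.txt'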

The output log shows how many httperror records were processed per file:

"D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/reload.py
httperror132001-133001.txt deal 21 lines
httperror2001-3001.txt deal 4 lines
httperror251001-252001.txt deal 5 lines
httperror254001-255001.txt deal 1 lines

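3. Counting the unkownsex Data with a Single Thread

The code is simple: a single thread counts the unkownsex users, that is, users whose sex could not be fetched because of permissions or who never filled it in. We also checked that users without a sex entry have no activity time either.

The data format is as follows: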

253042 unkownsex
253087 unkownsex
253102 unkownsex
253118 unkownsex
253125 unkownsex
253136 unkownsex
253161 unkownsex

The code:

import os, time

sumCount = 0
startTime = time.clock()
for filename in os.listdir(r'E:\pythonProject\ruisi'):
    if filename.startswith('unkownsex'):
        count = 0
        newName = r'E:\pythonProject\ruisi\%s' % (filename)
        readFile = open(newName, 'r')
        # One line per user, so counting lines counts users
        for line in readFile:
            count += 1
            sumCount += 1
        readFile.close()
        print "%s deal %s lines" % (filename, count)
print '%s unkowns sex' % (sumCount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"

Processing is fast. The output is:

unkownsex1-1001.txt deal 204 lines
unkownsex100001-101001.txt deal 50 lines
unkownsex10001-11001.txt deal 206 lines
#... intermediate output omitted
unkownsex99001-100001.txt deal 56 lines
unkownsex_re.txt deal 1085 lines
14223 unkowns sex
cost time 0.0813142301261 s
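For readers on Python 3, the same per-file line count might look like the sketch below. It is an illustration under assumptions, not a port of the article's script: `count_lines_by_prefix` and `timed` are hypothetical helper names, the files are assumed to be UTF-8, and `time.perf_counter()` stands in for `time.clock()`, which was removed in Python 3.8.

```python
import os
import time


def count_lines_by_prefix(directory, prefix):
    """Return ({filename: line_count}, total) for files starting with prefix."""
    per_file = {}
    for filename in sorted(os.listdir(directory)):
        if filename.startswith(prefix):
            path = os.path.join(directory, filename)
            # 'with' closes the file even if reading fails
            with open(path, encoding="utf-8") as handle:
                per_file[filename] = sum(1 for _ in handle)
    return per_file, sum(per_file.values())


def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Possible usage (the directory is the one used in the article):
# (per_file, total), seconds = timed(
#     count_lines_by_prefix, r"E:\pythonProject\ruisi", "unkownsex")
```

Building the path with `os.path.join` avoids hand-written backslashes, which are easy to lose or mistype in Windows paths.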

4. Counting the correct Data with a Single Thread

The data format is as follows:

31024 男 2014-11-11 13:20
31283 男 2013-3-25 19:41
31340 保密 2015-2-2 15:17
31427 保密 2014-8-10 09:17
31475 保密 2013-7-2 08:59
31554 保密 2014-10-17 17:02
31621 男 2015-5-16 19:27
31872 保密 2015-1-11 16:49
31915 保密 2014-5-4 11:01
31997 保密 2015-5-16 20:14

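The code is below. The idea is to read the files line by line and take the sex field with line.split(). sumCount counts the total number of users; boycount, girlcount and secretcount count male (男), female (女) and undisclosed (保密) users respectively. As before, the Chinese strings are written as unicode escapes for the comparison.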

import os, sys, time

reload(sys)
sys.setdefaultencoding('utf-8')
startTime = time.clock()
sumCount = 0
boycount = 0
girlcount = 0
secretcount = 0
for filename in os.listdir(r'E:\pythonProject\ruisi'):
    if filename.startswith('correct'):
        newName = r'E:\pythonProject\ruisi\%s' % (filename)
        readFile = open(newName, 'r')
        for line in readFile:
            # Second field is the sex: male, female or undisclosed
            sexInfo = line.split()[1]
            sumCount += 1
            if sexInfo == u'\u7537':
                boycount += 1
            elif sexInfo == u'\u5973':
                girlcount += 1
            elif sexInfo == u'\u4fdd\u5bc6':
                secretcount += 1
        readFile.close()
        print "until %s, sum is %s boys; %s girls; %s secret;" % (filename, boycount, girlcount, secretcount)
print "total is %s; %s boys; %s girls; %s secret;" % (sumCount, boycount, girlcount, secretcount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"

Note that each line of output is the running total up to that file, not the count for that single file. The output is:

until correct1-1001.txt, sum is 110 boys; 7 girls; 414 secret;
until correct100001-101001.txt, sum is 125 boys; 13 girls; 542 secret;
#... omitted
until correct99001-100001.txt, sum is 11070 boys; 3113 girls; 26636 secret;
until correct_re.txt, sum is 13937 boys; 4007 girls; 28941 secret;
total is 46885; 13937 boys; 4007 girls; 28941 secret;
cost time 3.60047888495 s
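As an alternative to three separate counters, the tally could be kept in a `collections.Counter`. The sketch below is a Python 3 illustration only; `tally_sex` is a hypothetical name, and it assumes the second whitespace-separated field of each line is the sex, as in the data format shown earlier in this section.

```python
from collections import Counter


def tally_sex(lines):
    """Count how often each value appears in the second field of each line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            counts[fields[1]] += 1
    return counts


sample = [
    "31024 男 2014-11-11 13:20",
    "31283 男 2013-3-25 19:41",
    "31340 保密 2015-2-2 15:17",
]
counts = tally_sex(sample)
# A Counter returns 0 for keys it has never seen, so "女" is safe to read.
print(counts["男"], counts["女"], counts["保密"])  # 2 0 1
```

In Python 3 the strings are unicode by default, so the literals can be written directly and the `reload(sys)` / `setdefaultencoding` workaround is not needed.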

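5. Counting with Multiple Threads

To count faster, we can try multiple threads.

For comparison, we also look at the time the single-threaded version needs.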

# encoding: UTF-8
import threading
import time, os, sys

# Global counters
SUM = 0
BOY = 0
GIRL = 0
SECRET = 0
NUM = 0

# The class inherits from threading.Thread and overrides run();
# start() launches the thread. This is very similar to Java.
class StaFileList(threading.Thread):
    # List of file names
    fileList = []

    def __init__(self, fileList):
        threading.Thread.__init__(self)
        self.fileList = fileList

    def run(self):
        global SUM, BOY, GIRL, SECRET
        # A delay can be added here to make the threading more visible,
        # instead of thread-1, 2, 3 running in order
        # time.sleep(1)
        # acquire() takes the lock
        if mutex.acquire(1):
            self.staFiles(self.fileList)
            # release() frees the lock
            mutex.release()

    # Process the given list of files and count males and females.
    # Mind data synchronization here; global gives access to the global variables.
    def staFiles(self, files):
        global SUM, BOY, GIRL, SECRET
        for name in files:
            newName = r'E:\pythonProject\ruisi\%s' % (name)
            readFile = open(newName, 'r')
            for line in readFile:
                sexInfo = line.split()[1]
                SUM += 1
                if sexInfo == u'\u7537':
                    BOY += 1
                elif sexInfo == u'\u5973':
                    GIRL += 1
                elif sexInfo == u'\u4fdd\u5bc6':
                    SECRET += 1
            readFile.close()
            # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
            #     " %s secret;" % (self.name, name, SUM, BOY, GIRL, SECRET)

def test():
    # files collects file names; it determines how many files one thread handles
    files = []
    # Keeps all threads so the main thread can wait for every child thread to finish
    staThreads = []
    i = 0
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        if filename.startswith('correct'):
            files.append(filename)
            i += 1
            # Create one thread for every 20 files collected
            if i == 20:
                staThreads.append(StaFileList(files))
                files = []
                i = 0
    # The remaining files, very likely fewer than 20
    if files:
        staThreads.append(StaFileList(files))
    for t in staThreads:
        t.start()
    # The main thread waits for all child threads to exit; would it be faster without this?
    for t in staThreads:
        t.join()

if __name__ == '__main__':
    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    mutex = threading.Lock()
    test()
    print "Multi Thread, total is %s; %s boys; %s girls; %s secret;" % (SUM, BOY, GIRL, SECRET)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"

Output:

Multi Thread, total is 46885; 13937 boys; 4007 girls; 28941 secret;
cost time 0.132137192794 s

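We find that the time is about the same as with a single thread. Thread synchronization is involved here: acquiring and releasing the lock costs time, and switching between threads (saving and restoring their state) costs time as well. In addition, each thread holds the lock for its entire run(), so the batches of files are in effect processed one after another.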
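One way to reduce the time spent inside the lock is to let every thread count into its own local counter and take the lock only once, when merging. The following Python 3 sketch shows that pattern under stated assumptions: `count_batch` and `threaded_tally` are hypothetical names, and each "file" is represented as an in-memory list of lines so the example stays self-contained. It is not the article's implementation.

```python
import threading
from collections import Counter


def count_batch(batch, totals, lock):
    """Count the sex field for every line in a batch of files.

    batch  -- list of "files", each an iterable of lines (assumption:
              in-memory lists stand in for real files here)
    totals -- shared Counter, updated once per thread under the lock
    """
    local = Counter()
    for lines in batch:
        for line in lines:
            fields = line.split()
            if len(fields) >= 2:
                local[fields[1]] += 1
    # The lock protects only the short merge step, not the whole loop.
    with lock:
        totals.update(local)


def threaded_tally(batches):
    """Start one thread per batch and return the merged Counter."""
    totals = Counter()
    lock = threading.Lock()
    threads = [
        threading.Thread(target=count_batch, args=(batch, totals, lock))
        for batch in batches
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every worker before reading totals
    return totals


batches = [
    [["1 男 2014-11-11 13:20"], ["2 女 2013-3-25 19:41"]],
    [["3 保密 2015-2-2 15:17", "4 男 2015-5-16 19:27"]],
]
print(sum(threaded_tally(batches).values()))  # 4
```

Even with the shorter critical section, CPython's global interpreter lock means that threads mainly help with waiting on I/O, so a large speedup for this kind of counting should not be expected.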

6. Single Thread versus Multiple Threads on More Data

We can process the correct, errTime and unkownsex files together.

Single-threaded code

# coding=utf-8
import os, sys, time

reload(sys)
sys.setdefaultencoding('utf-8')
startTime = time.clock()
sumCount = 0
boycount = 0
girlcount = 0
secretcount = 0
unkowncount = 0
for filename in os.listdir(r'E:\pythonProject\ruisi'):
    # correct: has sex and activity time; errTime: has sex but no activity time.
    # Both are counted the same way, by the sex field.
    if filename.startswith('correct') or filename.startswith("errTime"):
        newName = r'E:\pythonProject\ruisi\%s' % (filename)
        readFile = open(newName, 'r')
        for line in readFile:
            sexInfo = line.split()[1]
            sumCount += 1
            if sexInfo == u'\u7537':
                boycount += 1
            elif sexInfo == u'\u5973':
                girlcount += 1
            elif sexInfo == u'\u4fdd\u5bc6':
                secretcount += 1
        readFile.close()
        # print "until %s, sum is %s boys; %s girls; %s secret;" % (filename, boycount, girlcount, secretcount)
    # Neither sex nor time: just count the lines
    elif filename.startswith("unkownsex"):
        newName = r'E:\pythonProject\ruisi\%s' % (filename)
        # count = len(open(newName, 'rU').readlines())
        # For large files use a loop instead. count starts at -1 so that
        # an empty file ends up as 0 lines after the final +1.
        count = -1
        for count, line in enumerate(open(newName, 'rU')):
            pass
        count += 1
        unkowncount += count
        sumCount += count
        # print "until %s, sum is %s unkownsex" % (filename, unkowncount)
print "Single Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex;" % (sumCount, boycount, girlcount, secretcount, unkowncount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"

Output:

Single Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;

cost time 1.37444645628 s
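The line-counting trick used for the unkownsex files (start at -1, enumerate, then add 1) can be written a little more directly by letting `enumerate` start at 1. The Python 3 sketch below is only an illustration; `count_lines` is a hypothetical helper and UTF-8 files are assumed.

```python
def count_lines(path):
    """Return the number of lines in a file without loading it into memory."""
    count = 0  # stays 0 if the file is empty and the loop never runs
    with open(path, encoding="utf-8") as handle:
        # enumerate(..., start=1) makes count the number of lines seen so far
        for count, _line in enumerate(handle, start=1):
            pass
    return count
```

With `start=1` the initial value 0 is already the right answer for an empty file, so the separate `count += 1` step is no longer needed.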

Multi-threaded code

# encoding: UTF-8
__author__ = 'admin'
# Multi-threaded processing program
import threading
import time, os, sys

# Global counters
SUM = 0
BOY = 0
GIRL = 0
SECRET = 0
UNKOWN = 0

class StaFileList(threading.Thread):
    # List of file names
    fileList = []

    def __init__(self, fileList):
        threading.Thread.__init__(self)
        self.fileList = fileList

    def run(self):
        global SUM, BOY, GIRL, SECRET
        if mutex.acquire(1):
            self.staManyFiles(self.fileList)
            mutex.release()

    # Process the given list of files and count males and females.
    # Mind data synchronization here.
    # (Kept from the previous section; run() calls staManyFiles instead.)
    def staCorrectFiles(self, files):
        global SUM, BOY, GIRL, SECRET
        for name in files:
            newName = r'E:\pythonProject\ruisi\%s' % (name)
            readFile = open(newName, 'r')
            for line in readFile:
                sexInfo = line.split()[1]
                SUM += 1
                if sexInfo == u'\u7537':
                    BOY += 1
                elif sexInfo == u'\u5973':
                    GIRL += 1
                elif sexInfo == u'\u4fdd\u5bc6':
                    SECRET += 1
            readFile.close()

    def staManyFiles(self, files):
        global SUM, BOY, GIRL, SECRET, UNKOWN
        for name in files:
            # correct: has sex and activity time; errTime: has sex but no activity time.
            # Both are counted the same way, by the sex field.
            if name.startswith('correct') or name.startswith("errTime"):
                newName = r'E:\pythonProject\ruisi\%s' % (name)
                readFile = open(newName, 'r')
                for line in readFile:
                    sexInfo = line.split()[1]
                    SUM += 1
                    if sexInfo == u'\u7537':
                        BOY += 1
                    elif sexInfo == u'\u5973':
                        GIRL += 1
                    elif sexInfo == u'\u4fdd\u5bc6':
                        SECRET += 1
                readFile.close()
                # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                #     " %s secret;" % (self.name, name, SUM, BOY, GIRL, SECRET)
            # Neither sex nor time: just count the lines
            elif name.startswith("unkownsex"):
                newName = r'E:\pythonProject\ruisi\%s' % (name)
                # count = len(open(newName, 'rU').readlines())
                # For large files use a loop instead. count starts at -1 so that
                # an empty file ends up as 0 lines after the final +1.
                count = -1
                for count, line in enumerate(open(newName, 'rU')):
                    pass
                count += 1
                UNKOWN += count
                SUM += count
                # print "thread %s, until %s, total is %s; %s unkownsex" % (self.name, name, SUM, UNKOWN)

def test():
    files = []
    # Keeps all threads so the main thread can wait for every child thread to finish
    staThreads = []
    i = 0
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        if filename.startswith("correct") or filename.startswith("errTime") or filename.startswith("unkownsex"):
            files.append(filename)
            i += 1
            # Create one thread for every 20 files collected
            if i == 20:
                staThreads.append(StaFileList(files))
                files = []
                i = 0
    # The remaining files, very likely fewer than 20
    if files:
        staThreads.append(StaFileList(files))
    for t in staThreads:
        t.start()
    # The main thread waits for all child threads to exit
    for t in staThreads:
        t.join()

if __name__ == '__main__':
    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    mutex = threading.Lock()
    test()
    print "Multi Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex" % (SUM, BOY, GIRL, SECRET, UNKOWN)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"

Output:

Multi Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret;

cost time 1.23049112201 s

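The multi-threaded version is slightly faster than the single-threaded one here (about 1.23 s versus 1.37 s), and because the threads are synchronized, the resulting counts are identical.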
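The reported numbers can also be cross-checked with simple arithmetic. The short Python 3 sketch below uses only figures printed in this article; the helper name `totals_consistent` is hypothetical, and the last step assumes the correct files were unchanged between the run in section 4 and the run in section 6.

```python
def totals_consistent(parts, reported_total):
    """Return True if the category counts add up to the reported total."""
    return sum(parts.values()) == reported_total


# Figures from section 4 (correct files only) and section 6 (all files)
correct_only = {"boys": 13937, "girls": 4007, "secret": 28941}
all_files = {"boys": 13937, "girls": 4009, "secret": 28942, "unkownsex": 14223}

print(totals_consistent(correct_only, 46885))  # True
print(totals_consistent(all_files, 61111))     # True

# Difference per category between the two runs; under the assumption
# above, this is what the errTime files contribute.
extra = {key: all_files[key] - correct_only[key] for key in correct_only}
print(extra)  # {'boys': 0, 'girls': 2, 'secret': 1}
```

Both totals add up, and the 14223 unkownsex users match the count from section 3.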

Note that inside a Python class, members usually have to be accessed explicitly through self, which is quite different from Java.

def __init__(self, fileList):
    threading.Thread.__init__(self)
    self.fileList = fileList

def run(self):
    global SUM, BOY, GIRL, SECRET
    if mutex.acquire(1):
        # Calling a method of the same class requires self
        self.staFiles(self.fileList)
        mutex.release()

total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;

cost time 1.25413238673 s

That is all for this article; hopefully it is helpful for your own learning.
