How to quickly collect activity posts into a summary? - Wenxuecity (文学城) archive, April 10, 2022

More than 3 years ago

Original poster (Wenxuecity)

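妖妖灵 from 美语世界 (is she a moderator there?) came to ask 邻兄 (that is, me, nearby) how to compile activity posts into a summary. So 邻兄 did two things: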

1. Asked her to address 邻兄 as '虎哥' (she did).
2. Having started, might as well finish the job and serve everyone, so the program is published here.

邻兄 is a peerless master of Java and Excel (do not come to me for advice, I really have no time to answer questions), but 邻兄 is no Python master. I have only just started learning Python, so these days I write everything in Python to get familiar with it. For her sake I added a lot of explanatory comments to the program. The original file name is ParseSXZJ_html.py.

You can copy/paste the code below into a Python program. If you have Python 3 installed on your computer, you can then follow the instructions below to make the collection of activity posts almost fully automatic. Good luck! Follow-up questions will not be answered!

# Author: nearby, moderator of 书香之家 (SXZJ), March 2022
#
# Usage of this Python program:
#   0. Make sure that you have Internet access and Python 3 installed on your computer (or use Cloud)!
#   1. Place this file in a folder. Say, in a folder named "wxc"
#   2. Create a sub-folder named "data" inside "wxc" in which all your data files will be generated
#   3. Go to your forum ('论坛') and search for your activity ('活动') title. You will get one or more pages.
#      Remember how many pages there are.
#      If you do not know how to do this, just skip this step. I will then assume that there are
#      3 pages (150 entries, which is more than usual)
#   4. Execute this program. You will be prompted (asked) for the name of your activity, and
#      the number of pages you obtained in step 3 (if you do not know the number of pages, just hit ENTER)
#      Example:
#          春天的畅想
#          3 (or hit the ENTER key)
#   5. The result is stored inside 'data/sxzj-out.html'. You can then copy/paste the source code of
#      'data/sxzj-out.html' into your WXC new page. Done!
#
# Note: By default the entries are organized in reverse chronological order.
#   Should you need them to be placed in chronological order, please
#   comment out the statement mylist.reverse() by placing # in front of it, like:
#       #mylist.reverse()
#

import requests

# A line containing any of these strings is skipped (replies, search form, page navigation).
notargets = ['跟帖', '输入关键词', '内容查询', 'input name', '当前', '首页', '上一页', '尾页', '下一页']
notargets.append('archive')
# This is how SXZJ (书香之家) works. When 无忧 starts an activity, she always marks her activity like this.
notargets.append('##活动##')
# notargets.append('汇总')


def isInside(line, notargets_array):
    # True if the line contains any of the excluded strings
    for t in notargets_array:
        if t in line:
            return True
    return False
# END


# The line looks like
#   <a href="/sxsj/76799.html" target="_blank">【春天的畅想】春天属于女人</a>
# I need it to be
#   <a href="https://bbs.wenxuecity.com/sxsj/76799.html" target="_blank">【春天的畅想】春天属于女人</a>
def addHttp(line):
    # split only at the first href so the rest of the line is kept intact
    at = line.split('href="', 1)
    if len(at) < 2:
        return line  # no link in this line, leave it unchanged
    line2 = '<a href="https://bbs.wenxuecity.com' + at[1]
    return line2
# END


def processOneFile(target, html, mylist):
    # split the text by newline character to get an array of strings
    lines = html.text.split('\n')
    length = len(lines)
    i = 0
    while i < length:
        line = lines[i]
        if (target in line) and (not isInside(line, notargets)):
            line = addHttp(line)
            print(line)
            # the two lines after the link belong to the same entry;
            # guard against running past the end of the page
            i = i + 1
            line2 = lines[i] if i < length else ''
            i = i + 1
            line3 = lines[i] if i < length else ''
            line += line2 + " " + line3
            mylist.append(line)
        i = i + 1
# END of FUNCTIONS


# ---- main starts here ----
print()
print('# Author: nearby, moderator of 书香之家 (SXZJ), March 2022')
print()
print('Usage of this Python program:')
print('\t0. Make sure that you have Internet access and Python 3 installed on your computer (or use Cloud)!')
print('\t1. Place this file in a folder. Say, in a folder named "wxc"')
print('\t2. Create a sub-folder named "data" inside "wxc" in which all your data files will be generated')
print('\t3. Go to your forum ("论坛") and search for your activity ("活动") title. You will get one or more pages. Remember how many pages there are.')
print('\t\tIf you do not know how to do this, just skip this step. I will then assume that there are 3 pages (150 entries, which is more than usual)')
print('\t4. Execute this program. You will be prompted (asked) for the name of your activity, and')
print('\t\tthe number of pages you obtained in step 3')
print('\t\tExample:')
print('\t\t\t春天的畅想')
print('\t\t\t3')
print('\t5. The result is stored inside "data/sxzj-out.html". You can then copy/paste the source code of')
print('\t\t"data/sxzj-out.html" into your WXC new page. Done!')
print('Note, by default the entries are organized in reverse chronological order.')
print('Should you need them to be placed in chronological order, please do:')
print('\tChange the statement: mylist.reverse() to be:')
print('\t\t#mylist.reverse()')
print("\n\n")

target = input('What is the title of your activity (活动)?: ')
pages = 3  # default, means there are at most 150 entries
temp = input('How many pages are there when you search for the activity in WXC? (If you do not know, just hit ENTER): ')
if temp.strip() != '':
    pages = int(temp)

mylist = []

# this is the output file.
html2 = open('data/sxzj-out.html', 'w', encoding='utf-8')

url = 'https://bbs.wenxuecity.com/bbs/archive.php?SubID=sxsj&pos=bbs&keyword=' + target + '&username='
f = requests.get(url)
processOneFile(target, f, mylist)

for i in range(1, pages):
    url = 'https://bbs.wenxuecity.com/bbs/archive.php?page=' + str(i) + '&SubID=sxsj&pos=bbs&keyword=' + target + '&username='
    f = requests.get(url)
    processOneFile(target, f, mylist)

mylist.reverse()
for li in mylist:
    html2.write(li + "\n")
html2.close()

print(str(len(mylist)) + " entries")

Good luck! Follow-up questions will not be answered!
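The program above builds its search URLs by string concatenation and rewrites links by splitting on href=". As a rough sketch only, the same two steps could also be written with the standard library's urllib.parse, which percent-encodes the keyword and resolves relative links. The names BASE, HREF_RE, build_search_url and absolutize_links are made up for this illustration and are not part of the original program. The query parameters are copied from the URLs above, and how the site numbers its pages is not checked here.

```python
import re
from urllib.parse import urlencode, urljoin

# Assumption: same host and archive.php parameters as in the program above.
BASE = "https://bbs.wenxuecity.com"

# Matches the value of every href="..." attribute in a line.
HREF_RE = re.compile(r'href="([^"]+)"')


def build_search_url(keyword, page=None, sub_id="sxsj"):
    """Build the archive search URL with the keyword percent-encoded (hypothetical helper)."""
    params = {}
    if page is not None:
        params["page"] = page
    params.update({"SubID": sub_id, "pos": "bbs", "keyword": keyword, "username": ""})
    return BASE + "/bbs/archive.php?" + urlencode(params)


def absolutize_links(line, base=BASE):
    """Rewrite each relative href in the line to an absolute URL (hypothetical helper)."""
    return HREF_RE.sub(lambda m: 'href="' + urljoin(base, m.group(1)) + '"', line)


sample = '<a href="/sxsj/76799.html" target="_blank">【春天的畅想】春天属于女人</a>'
print(absolutize_links(sample))
print(build_search_url("春天的畅想", page=1))
```

Links that are already absolute pass through urljoin unchanged, so a line can safely be processed more than once. Nothing in this sketch touches the network, so it can be tried without Internet access.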

WXCTEATIME

More than 3 years ago

Great!


可能成功的P

More than 3 years ago

Great!


尘凡无忧

More than 3 years ago

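Kudos to 邻兄, the spirit of sharing is commendable... Kudos also to 邻兄's wisdom, for instance in firmly refusing to say another word... :)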
