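Author: 小项-怪物猪
Category: Python
A command-line translation script written in Python on top of Google Translate.
Usage: see the comments below.
What it teaches: using httplib to submit a request and read back the result.
Review: using join and split to tidy up lists and strings.
The code is in the full post: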
- # Python 2 script: translate the words given on the command line from English to Chinese.
- import httplib, urllib
- import sys, getopt
-
- # No options are defined, so every argument ends up in argv.
- opts, argv = getopt.getopt(sys.argv[1:], '', [])
- # Join the separate words back into a single string.
- argv = " ".join(argv)
- print "You entered:", argv
- params = urllib.urlencode({'sl': 'en',      # source language
-                            'tl': 'zh-CN',   # target language
-                            'text': argv,
-                            'client': 't'})
- headers = {"User-Agent": "Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)"}
- conn = httplib.HTTPConnection("translate.google.cn")
- conn.request("GET", "/translate_a/t?" + params, headers=headers)
- data = conn.getresponse()
- data1 = data.read()
- conn.close()
- print "Translation:", data1
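As a rough Python 3 illustration of the two ideas in this post (joining the command-line words and URL-encoding the parameters), the sketch below builds the request path without sending anything. The function name `build_translate_path` is made up for this example. In Python 3, `urllib.urlencode` lives in `urllib.parse` and `httplib` is named `http.client`.

```python
from urllib.parse import urlencode


def build_translate_path(words, source="en", target="zh-CN"):
    """Join the words with spaces and return the GET path with an encoded query string."""
    text = " ".join(words)
    params = urlencode({"sl": source, "tl": target, "text": text, "client": "t"})
    return "/translate_a/t?" + params


if __name__ == "__main__":
    # "hello world".split() gives ['hello', 'world'], the same shape as sys.argv[1:]
    print(build_translate_path("hello world".split()))
```

Spaces in the text become `+` in the query string, which is why the script can pass a whole phrase as a single `text` parameter.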
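Author: 小项-怪物猪
Category: Python
While learning Python, I wrote a website content scraper as a practice project.
It does the following:
1. Scrapes according to a configuration file.
2. Accepts command-line arguments, e.g. python corn.py --config=urls.ini
3. Generates the list of list-page URLs from a rule (numeric page numbers only; it can also scrape in reverse order).
4. Uses rules to pick out a specific region of each list page, then narrows the analysis down further to extract the content-page URLs.
5. Stores the content-page URLs in a file, one URL per line, checking on each write whether the same URL is already there.
6. It still has many bugs; I will fix them over time.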
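Step 3 can be sketched in Python 3 as follows. This is only an illustration under assumptions: the helper name `build_page_urls` and its `reverse` flag are invented here, and the URL values are taken from the sample configuration further down.

```python
def build_page_urls(base, template, start, end, reverse=False):
    """Return list-page URLs for page numbers start..end-1, optionally newest-first."""
    pages = range(start, end)
    if reverse:
        pages = reversed(pages)
    # "%d" in the template is replaced by the page number
    return [base + template % page for page in pages]


if __name__ == "__main__":
    for url in build_page_urls("http://www.510buy.com", "/yewu/list_%d.html", 2, 5):
        print(url)
```

Note that `range(start, end)` stops one short of `end`, so a configuration with startpage = 2 and endpage = 3 yields a single page.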
Usage: on Ubuntu, run python xxx.py --config=xxx.ini in a terminal.
On Windows, change #!/usr/bin/python to the path of your python.exe, then run python xxx.py --config=xxx.ini from the command prompt.
The code follows; save it as a .py file:
- #!/usr/bin/python
- # Python 2 script: collect content-page URLs from list pages, then fetch each page's metadata.
- import sys
- import getopt
- import re
- import urllib
- import ConfigParser
- import time
- import MySQLdb as mysql
-
-
- def Help():
-     # Print the usage line and exit when the arguments are wrong or missing.
-     print "Usage: python xxx.py --config=xxx.ini"
-     sys.exit(2)
-
-
- if __name__ == "__main__":
-     try:
-         opts, argv = getopt.getopt(sys.argv[1:], 'c:', ['config='])
-     except getopt.GetoptError:
-         Help()
-
-     cut = None  # path of the configuration file
-     for keys, value in opts:
-         if keys in ('-c', '--config'):
-             cut = value
-     if cut is None:
-         Help()
-
-     try:
-         conf = ConfigParser.ConfigParser()
-         conf.readfp(open(cut))
-
-         # [urllibs]: start URL, page range, and output file
-         starturl = conf.get("urllibs", "starturl")
-         startpage = int(conf.get("urllibs", "startpage"))
-         endpage = int(conf.get("urllibs", "endpage"))
-         urltemp = starturl + conf.get("urllibs", "urltemp")
-         filelist = conf.get("urllibs", "urllist")
-         dellist = conf.get("urllibs", "dellist")
-
-         # [countcfg]: pause between requests and the extraction patterns
-         Stops = int(conf.get("countcfg", "Stops"))
-         Divurl = conf.get("countcfg", "Divurl")
-         Urlls = conf.get("countcfg", "Urlls")
-         Title = conf.get("countcfg", "Title")
-         Keywords = conf.get("countcfg", "Keywords")
-         Description = conf.get("countcfg", "Description")
-
-         # Build the list-page URLs; range() stops one short of endpage.
-         pages = [urltemp % page for page in range(startpage, endpage)]
-         for url in pages:
-             urllist = urllib.urlopen(url).read()
-             # Narrow the page down to the blocks matched by Divurl, then pull out the links.
-             urls = "".join(re.findall(Divurl, urllist))
-             urlls = re.findall(Urlls, urls)
-
-             # 'a+' creates the file if needed; rewind to read the URLs already stored.
-             urlfile = open(filelist, 'a+')
-             urlfile.seek(0)
-             outurl = urlfile.readlines()
-             urlfile.seek(0, 2)
-             for link in set(urlls):
-                 if link + "\n" in outurl:
-                     print link, "is a duplicate, skipping"
-                     continue
-                 urlfile.write(link + "\n")
-             urlfile.close()
-             time.sleep(Stops)
-
-         print "All URLs collected and saved to", filelist
-
-         User = 'root'
-         Passwd = '970207'
-         Host = 'localhost'
-         Db = 'testcorn'
-         db = mysql.connect(user=User, passwd=Passwd, host=Host, db=Db)
-         contents = db.cursor()
-
-         listurl = open(filelist, 'r')
-         for curl in listurl:
-             curl = curl.strip()  # drop the trailing newline before building the URL
-             if not curl:
-                 continue
-             time.sleep(Stops)
-             content = urllib.urlopen(starturl + curl).read()
-             title = re.findall(Title, content)
-             keywords = re.findall(Keywords, content)
-             description = re.findall(Description, content)
-             for t, k, d in zip(set(title), set(keywords), set(description)):
-                 # The statement that stores the row through `contents` is not part of this listing.
-                 print "Wrote", t, "successfully; pausing", Stops, "seconds before the next page"
-         listurl.close()
-         contents.close()
-         db.close()
-
-     except KeyboardInterrupt:
-         print "Interrupted by user"
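The duplicate check from step 5 can be isolated into a small Python 3 sketch. The function name `append_new_urls` is hypothetical and the demo writes to a temporary file; it is meant to show the idea, not to replace the script above.

```python
import os
import tempfile


def append_new_urls(path, urls):
    """Append URLs not already in the file, one per line; return the ones added."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            seen = {line.rstrip("\n") for line in f}
    except FileNotFoundError:
        seen = set()
    added = []
    with open(path, "a", encoding="utf-8") as f:
        for url in urls:
            if url in seen:
                continue  # already stored, or repeated within this batch
            f.write(url + "\n")
            seen.add(url)
            added.append(url)
    return added


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "urllist.txt")
    print(append_new_urls(path, ["/a.html", "/b.html", "/a.html"]))
    print(append_new_urls(path, ["/b.html", "/c.html"]))
```

Keeping the stored URLs in a set makes each membership test cheap, whereas the script compares every link against the full list of lines.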
Below is the configuration file; save it as a .ini file:
- [urllibs]
- starturl = http://www.510buy.com
- startpage = 2
- endpage = 3
- urltemp = /yewu/list_%d.html
- urllist = /home/buysz/桌面/urllist.ini
- dellist = http://www.510buy.com,http://www.510buy.com" target="_blank,/yewu/index.html,http://www.510buy.com/
-
- [countcfg]
- Stops = 1
- Divurl = <div.*?>(.*?)</div>
- Urlls = <a href=["'](.*?)["']>
- Title = <title>(.*?) - .*?</title>
- Keywords = name=\"keywords\" content=\"(.*?)\">
- Description = name=\"description\" content=\"(.*?)\">
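To show how these settings are consumed, here is a small Python 3 sketch that loads a cut-down copy of the configuration and applies two of the patterns. The sample HTML and the helper `load_rules` are invented for illustration. One Python 3 difference is worth noting: `configparser.ConfigParser` rejects a bare `%d` under its default interpolation, so the sketch turns interpolation off.

```python
import configparser
import re

SAMPLE_INI = r"""
[urllibs]
urltemp = /yewu/list_%d.html

[countcfg]
Title = <title>(.*?) - .*?</title>
Keywords = name=\"keywords\" content=\"(.*?)\">
"""

# Made-up page fragment, only for demonstrating the patterns
SAMPLE_HTML = '<title>Widgets - Example Shop</title><meta name="keywords" content="a,b">'


def load_rules(text):
    # interpolation=None keeps "%d" literal instead of treating "%" as special
    conf = configparser.ConfigParser(interpolation=None)
    conf.read_string(text)
    return conf


if __name__ == "__main__":
    conf = load_rules(SAMPLE_INI)
    print(conf.get("urllibs", "urltemp") % 2)
    print(re.findall(conf.get("countcfg", "Title"), SAMPLE_HTML))
    print(re.findall(conf.get("countcfg", "Keywords"), SAMPLE_HTML))
```

Each pattern has one capture group, so `re.findall` returns just the captured text rather than the whole match.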
Author: 小项-怪物猪
Category: FreeBSD
- #!/bin/sh
- #filename cn_isp.sh;auto get the IP of CHINANET and CNC;
- rm delegated-apnic-latest
- rm cnnet
- rm IP_CHINANET
- rm IP_UNICOM
- rm IP_CNC
- fetch http://ftp.apnic.net/apnic/stats/apnic/delegated-apnic-latest
- grep 'CN|ipv4' delegated-apnic-latest | cut -f 4,5 -d '|' | tr '|' ' ' >> cnnet
- cat cnnet | while read ip cnt
- do
-   mask=$(bc <<END | tail -1
- pow=32;
- define log2(x){
- if (x<=1) return(pow);
- pow--;
- return(log2(x/2));
- }
- log2($cnt);
- END
- )
-
-   resultext=`whois -A $ip | grep -e ^netname -e ^descr -e ^role | cut -f 2 -d ':' | sed 's/ *//'`
-   echo '................Search for' $ip/$mask '.........................'
-
-   if echo $resultext | grep -i -e 'chinanet' -e 'chinatel' -e 'china telecom'
-   then echo $ip/$mask >> IP_CHINANET
-   fi
-
-   if echo $resultext | grep -i -e 'unicom'
-   then echo $ip/$mask >> IP_UNICOM
-   fi
-
-   if echo $resultext | grep -i -e 'cncgroup'
-   then echo $ip/$mask >> IP_CNC
-   fi
-
-   echo '----------------------------------------------'
-   echo ''
-
- done
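The bc fragment in the loop turns an address count into a prefix length: it starts at 32 and subtracts one for every time the count can be halved. A Python 3 sketch of the same arithmetic is below. The function names are made up, and the sample record only illustrates the `registry|CC|ipv4|start|count|date|status` layout of the APNIC file.

```python
def prefix_length(count):
    """Mirror the bc loop: start at 32, subtract one per integer halving of the count."""
    bits = 32
    while count > 1:
        bits -= 1
        count //= 2
    return bits


def to_cidr(line):
    """Turn one 'registry|CC|ipv4|start|count|date|status' record into 'start/prefix'."""
    fields = line.strip().split("|")
    start, count = fields[3], int(fields[4])
    return "%s/%d" % (start, prefix_length(count))


if __name__ == "__main__":
    print(to_cidr("apnic|CN|ipv4|1.0.1.0|256|20110414|allocated"))
```

For counts that are powers of two the result is exact. For any other count the integer halving rounds down, so the prefix would cover fewer addresses than the record lists.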
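Then, using the script written by ccpp0@DRL, the IP lists for CNC (China Netcom), China Telecom, and so on can be written into the routing table: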
- #!/bin/sh
- #write By ccpp0
-
- DEFAULTGW=`/usr/bin/grep defaultrouter /etc/rc.conf | /usr/bin/sed "s/[^0-9\.]//g"`
- #CNCGW=60.28.32.49
- #CNCIP=/root/cn_isp/IP_CNC
- CHINANETGW=221.239.1.193
- CHINANETIP=/root/cn_isp/IP_CHINANET
-
- /sbin/route flush
- /sbin/route add default ${DEFAULTGW}
- #for RL in `/bin/cat ${CNCIP}`; do
- # /sbin/route add -net ${RL} ${CNCGW}
- #done
-
- for RL in `/bin/cat ${CHINANETIP}`; do
-   /sbin/route add -net ${RL} ${CHINANETGW}
- done
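The DEFAULTGW line works by deleting every character that is not a digit or a dot from the matching rc.conf line. A Python 3 sketch of that substitution is below, with a hypothetical helper name, to make the behaviour and its limitation easy to see.

```python
import re


def extract_gateway(rc_conf_line):
    """Keep only digits and dots, like the sed substitution in the script."""
    return re.sub(r"[^0-9.]", "", rc_conf_line)


if __name__ == "__main__":
    print(extract_gateway('defaultrouter="192.168.1.1"'))
    # Any other digit or dot on the line is kept as well:
    print(extract_gateway('defaultrouter="10.0.0.1"  # gw 2'))
```

The approach assumes the defaultrouter line holds nothing but the address. A trailing comment containing digits or dots would end up glued onto the gateway value.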
Update:
A small correction: there is no need to add every CNC/China Telecom network segment to the routing table.