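Scraping Weibo Hot Search with a PHP Crawler
First, note that the Weibo hot search list sits mainly inside a table element on the page.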
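function getUrlContent($url) { // Fetch the HTML content of a URL, e.g. https://s.weibo.com/top/summary
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // Send a browser-like User-Agent string
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )");
    // Include the response headers in the output
    curl_setopt($ch, CURLOPT_HEADER, 1);
    // Return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $output = curl_exec($ch); // false on failure
    curl_close($ch);
    return $output;
}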
This function fetches the HTML of the Weibo page.
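Because getUrlContent() sets CURLOPT_HEADER, the returned string contains the response headers followed by the body, and curl_exec() returns false on failure. The following is a minimal sketch of one way to separate the two and report errors. The names fetchPage and splitHeadersAndBody and the 10-second timeout are assumptions made for illustration and are not part of the original code.

```php
<?php
// Hypothetical helper: split a raw cURL response (headers + body)
// at the header boundary reported by cURL.
function splitHeadersAndBody(string $raw, int $headerSize): array
{
    return [
        'headers' => substr($raw, 0, $headerSize),
        'body'    => substr($raw, $headerSize),
    ];
}

// Hypothetical variant of getUrlContent() that returns null on a
// transport error instead of passing false along to the parser.
function fetchPage(string $url): ?array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10); // assumed timeout in seconds

    $raw = curl_exec($ch);
    if ($raw === false) {
        error_log('cURL error: ' . curl_error($ch));
        curl_close($ch);
        return null;
    }

    // Ask cURL how long the header block was and which status came back.
    $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    $status     = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    $parts = splitHeadersAndBody($raw, $headerSize);
    $parts['status'] = $status;
    return $parts;
}
```

With this sketch, a caller could check that the result is not null and that $result['status'] is 200 before passing $result['body'] to the parsing step.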
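function getTable($html) {
    // Grab the first <table>...</table> block
    preg_match_all("/<table>[\s\S]*?<\/table>/i", $html, $table);
    if (empty($table[0])) {
        return array(); // no table found
    }
    $table = $table[0][0];
    // Drop the opening tags and turn the closing </tr> and </td> tags into placeholders
    $table = preg_replace("'<table[^>]*?>'si", "", $table);
    $table = preg_replace("'<tr[^>]*?>'si", "", $table);
    $table = preg_replace("'<td[^>]*?>'si", "", $table);
    $table = str_replace("</tr>", "{tr}", $table);
    $table = str_replace("</td>", "{td}", $table);
    // Strip the remaining HTML tags
    $table = preg_replace("'<[/!]*?[^<>]*?>'si", "", $table);
    // Remove line breaks with their trailing whitespace, then mark the remaining spaces with "#"
    $table = preg_replace("'([\r\n])[\s]+'", "", $table);
    $table = str_replace(" ", "|", $table);
    $table = preg_replace("'[|]+'", "#", $table);
    $table = explode('{tr}', $table);
    array_pop($table); // discard the fragment after the last {tr}
    $td_array = array();
    foreach ($table as $key => $tr) {
        // Add further replacements of your own here
        $tr = str_replace("\n\n", "", $tr);
        $td = explode('{td}', $tr);
        array_pop($td); // discard the fragment after the last {td}
        $td_array[] = $td;
    }
    return $td_array;
}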
This function strips the HTML tags using regular expressions and string replacement.
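As a point of comparison, the same rows can also be extracted with PHP's DOM extension rather than regular expressions. This is an alternative technique, not the approach used in this post, and the sketch below assumes the DOM extension is enabled. The function name getTableRowsDom and the sample markup are hypothetical, and the real Weibo page may be structured differently.

```php
<?php
// Hypothetical alternative to getTable(): read the cell text of every
// table row with DOMDocument and DOMXPath.
function getTableRowsDom(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings about imperfect markup instead of printing them.
    libxml_use_internal_errors(true);
    // The XML declaration prefix is a common way to make loadHTML() treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//table//tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            // Collapse runs of whitespace inside each cell to a single space.
            $cells[] = trim(preg_replace('/\s+/u', ' ', $td->textContent));
        }
        if ($cells !== []) {
            $rows[] = $cells; // skip rows without <td> cells, such as header rows
        }
    }
    return $rows;
}

// Made-up sample markup, for illustration only.
$sample = '<table><tr><td>1</td>'
        . '<td><a href="#">Example topic</a> <span>12345</span></td>'
        . '<td>热</td></tr></table>';
$rows = getTableRowsDom($sample);
```

Each entry of $rows is an array of cell strings for one row, so the "#" placeholder step is not needed with this variant.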
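$html = getUrlContent("https://s.weibo.com/top/summary?Refer=top_hot&topnav=1&wvr=6");
$table = getTable($html);
// Keep six rows, starting at index 2
$table = array_slice($table, 2, 6);
var_dump($table);
// Note: the bound count($table)-1 leaves the last sliced row unprocessed
for ($i = 0; $i < count($table) - 1; $i++) {
    $str = (string)$table[$i][1];   // title and heat, separated by "#"
    $login = (string)$table[$i][2]; // label cell
    $login = str_replace("#", "", $login);
    $str = explode('#', $str);
    // The second-to-last piece is the heat value; the last piece is empty
    $hot = $str[count($str) - 2];
    // Everything before the heat value is the title
    $title = '';
    for ($j = 0; $j < count($str) - 2; $j++) {
        $title .= $str[$j];
    }
    // $title, $hot and $login are now ready to use for this row
}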
The var_dump call prints $table.
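Here $title, $hot, and $login hold, in that order, the hot search title, its heat value, and its label (热 "hot", 爆 "explosive", or 荐 "recommended").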
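The splitting logic of the loop can be isolated into a small function, which makes it easier to try on sample strings. This is a sketch only. The name parseHotCell and the sample input are hypothetical, and it assumes the cell text has the "#"-delimited shape produced by getTable(), with a trailing "#" after the heat value.

```php
<?php
// Hypothetical helper mirroring the loop: split a "#"-delimited cell
// into the title and the heat value.
function parseHotCell(string $cell): array
{
    $parts = explode('#', $cell);
    if (count($parts) < 2) {
        // No "#" at all: treat the whole cell as the title.
        return ['title' => $cell, 'hot' => ''];
    }
    // The last piece is the empty string after the trailing "#",
    // so the heat value is the second-to-last piece.
    $hot = $parts[count($parts) - 2];
    // Everything before the heat value is joined back into the title.
    $title = implode('', array_slice($parts, 0, -2));
    return ['title' => $title, 'hot' => $hot];
}

// Made-up sample input, for illustration only.
$parsed = parseHotCell('Example#topic#12345#');
```

As in the original loop, the pieces of the title are joined without a separator, so a title that contained spaces loses them.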
Let's take a look at the finished result.
Original article by zbwyfkx. If you repost it, please credit the source: https://www.fenkexie.cn/archives/361/