Saving Remote Files/Images with PHP cURL Multi-Threading
I have written many versions of this before, and this one I am fairly happy with. Its main features:
- Uses cURL's multi interface to issue many requests concurrently
- Filters out files hosted on the current domain
- File types not visible in the URL are detected automatically from the Content-Type in the HTTP headers (sketched right after this list)
- Writes files to disk with fwrite
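To make the Content-Type point concrete, here is a minimal sketch of that sniffing in isolation. The header value is hard-coded for illustration; in the real function below it comes from curl_getinfo():
// Map a Content-Type such as "image/png" to an extension ("png");
// works the same way for "application/pdf" => "pdf"
$contentType = 'image/png';
preg_match('/[a-zA-Z]+\/([a-zA-Z]+)/', $contentType, $m);
$ext = isset($m[1]) ? $m[1] : '';
echo $ext; // png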
Core function
function downloadFile($array, $path, $timeout){
    $data = array();
    $conn = array();
    $res = array();
    $header = array();
    $mh = curl_multi_init();
    // One cURL handle per URL, all attached to the same multi handle
    foreach($array as $k => $url){
        $conn[$k] = curl_init($url);
        curl_setopt($conn[$k], CURLOPT_TIMEOUT, $timeout);
        curl_setopt($conn[$k], CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36');
        curl_setopt($conn[$k], CURLOPT_MAXREDIRS, 7);
        curl_setopt($conn[$k], CURLOPT_HEADER, 0);
        curl_setopt($conn[$k], CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($conn[$k], CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($conn[$k], CURLOPT_SSL_VERIFYPEER, 0);
        curl_setopt($conn[$k], CURLOPT_SSL_VERIFYHOST, 0);
        curl_multi_add_handle($mh, $conn[$k]);
    }
    // Drive all transfers to completion; curl_multi_select() blocks until
    // there is activity, so the loop does not spin at 100% CPU
    $active = null;
    do{
        $mrc = curl_multi_exec($mh, $active);
    }while($mrc == CURLM_CALL_MULTI_PERFORM);
    while($active && $mrc == CURLM_OK){
        if(curl_multi_select($mh) == -1){
            usleep(100);
        }
        do{
            $mrc = curl_multi_exec($mh, $active);
        }while($mrc == CURLM_CALL_MULTI_PERFORM);
    }
    // Collect each response and write it to disk
    foreach ($array as $k => $url) {
        $res[$k] = curl_multi_getcontent($conn[$k]);
        $header[$k] = curl_getinfo($conn[$k]);
        // Take the extension from the URL path, ignoring any query string
        $ext = pathinfo((string) parse_url($url, PHP_URL_PATH), PATHINFO_EXTENSION);
        if(!$ext)
        {
            // No extension in the URL: derive it from the Content-Type
            // response header, e.g. "image/png" => "png"
            $contentType = $header[$k]['content_type'];
            $pattern = '/[a-zA-Z]+\/([a-zA-Z]+)/';
            preg_match($pattern, $contentType, $matches);
            $ext = isset($matches[1]) ? $matches[1] : '';
        }
        if($ext){
            // Name the file with the md5 of its URL to avoid collisions
            $path_to_file = $path . md5($url) . ".{$ext}";
            $saver = fopen($path_to_file, "w+");
            fwrite($saver, $res[$k]);
            fclose($saver);
            $data['files'][] = realpath($path_to_file);
        }
        // Remove the handle from the multi handle before closing it
        curl_multi_remove_handle($mh, $conn[$k]);
        curl_close($conn[$k]);
    }
    curl_multi_close($mh);
    return $data;
}
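downloadFile() can also be called directly with any list of URLs. Below is a minimal sketch; the two URLs and the ./downloads/ directory are placeholders:
$urls = array(
    'https://www.example.com/a.png',        // extension comes from the URL path
    'https://www.example.com/image?id=42',  // extension is sniffed from Content-Type
);
$dir = './downloads/';
if(!is_dir($dir)) mkdir($dir, 0777, true);  // the target directory must exist
$result = downloadFile($urls, $dir, 10);
print_r(isset($result['files']) ? $result['files'] : array());  // absolute paths of saved files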
Taking image downloads as an example
function getRemoteFile($txt) {
    // Prepend a charset meta tag so DOMDocument treats the markup as UTF-8
    $meta = '<meta charset="utf-8">';
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($meta . $txt);
    libxml_use_internal_errors(false);
    $xpath = new DOMXPath($doc);
    // Collect the src attribute of every <img> element
    $nodelist = $xpath->query("//img");
    $matches = array();
    foreach ($nodelist as $img) {
        $matches[] = $img->getAttribute('src');
    }
    $matches = array_unique($matches);
    $files = array();
    foreach ($matches as $url) {
        // Complete protocol-relative URLs
        if(substr($url, 0, 2) == '//') $url = "http:" . $url;
        // Skip non-HTTP URLs and files already hosted on the current domain
        if(substr($url, 0, 4) != 'http' || strpos($url, $_SERVER["HTTP_HOST"]) !== false) continue;
        $files[] = $url;
    }
    $path = "./images/";
    if(!is_dir($path)) mkdir($path, 0777, true);  // make sure the target directory exists
    var_dump(downloadFile($files, $path, 10));
}
Usage
$html = '<div class="entry-content">
<p>This time, the client of Bill Hartzer was framed roughly as follows:</p>
<img src="https://www.seozac.com/wp-content/uploads/2018/04/negative-canonical.png">
<ul>
<li>The client site, A, has decent rankings.</li>
<li>A black hat controls a spam site, B, typically one that is penalized, hacked, stuffed with spam content, or loaded with spam backlinks.</li>
<li>The head section of a page on site A is copied in full onto a page on site B, and a canonical tag pointing at the page on site A is added (or modified) there.</li>
<li>Google sees the canonical tag on site B pointing at A, merges B and A in its processing, and the penalty signals on site B are passed on to site A.</li>
<li>Rankings for site A drop.</li>
</ul>
<img src="https://s7d5.scene7.com/is/image/Specialized/146773?$hd$" alt="S-WORKS">
<p>What makes this method so damaging is that it is very hard to detect. Negative SEO usually leaves some trace:</p>
<ul>
<li>Spam links show up in backlink data</li>
<li>Hacking a site to add spam content or hidden links is visible on the page or in its source code</li>
<li>Even with cloaking, where the normal page hides the spam content, it still shows in the search engine cache</li>
<li>Attacks on a site are of course noticed quickly</li>
<li>Plagiarized content and mirror sites can be dug up by searching the page text in a search engine</li>
<li>Inflated bounce rates and other faked user-experience metrics show up in the analytics backend</li>
</ul>
<a href="http://www.pdf995.com/samples/pdf.pdf">This is a pdf file</a>
<p>In short, when rankings and traffic plummet and negative SEO is the cause, careful inspection usually reveals where things were tampered with. But the method Bill Hartzer describes may well leave no trace at all:</p>
<ul>
<li>No links are involved</li>
<li>The framed site is neither attacked nor hacked; its code is fine</li>
<li>No content is plagiarized; spam site B only copies the head section, and its visible content can be empty or completely unrelated to site A</li>
</ul>
<p><img class="aligncenter size-full wp-image-3725" src="https://trek.scene7.com/is/image/TrekBicycleProducts/EmondaSLR9DH2Etap_24145_A_Primary?wid=1360&hei=1020&fmt=pjpeg&qlt=40,1&iccEmbed=0&cache=on,on" alt="Framing a competitor with the canonical tag" width="271" height="189"></p>
</div>';
getRemoteFile($html);
License:
CC BY 4.0