
PHP scraping functions, PHP scraping

Website-building tutorials
2024-09-26 03:47:01

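Contents of this article:

1. Saving regex matches scraped with PHP's preg_match
2. How to scrape Baidu Maps data with PHP
3. Commonly used functions in PHP scraping programs
4. How to scrape a website's meta tags in PHP (get function)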

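Saving regex matches scraped with PHP's preg_match

Congratulations. In my view, magic quotes are nothing but a burden and have caused a lot of confusion. Whenever I set up a system now, I turn magic quotes off straight away to avoid potential problems. (The feature was deprecated in PHP 5.3 and removed in PHP 5.4.)

Code-level security should be the concern of whoever writes the code; PHP only needs to take care of system-level security.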

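How to scrape Baidu Maps data with PHP

Generally speaking, the simplest way to scrape data with PHP is the file_get_contents function; for something more powerful, the cURL library is recommended.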
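As a rough illustration of the two approaches just mentioned, here is a minimal sketch. The helper names fetch_page and fetch_page_curl, the default timeout, and the example URL are assumptions made for this illustration and do not come from the original answer; the cURL variant also assumes the cURL extension is installed.

```php
<?php
/**
 * Hypothetical helper: fetch a page with file_get_contents().
 * Returns the body as a string, or null if the request fails.
 */
function fetch_page(string $url, int $timeout = 10): ?string
{
    // HTTP context options only apply to http(s) URLs; they are ignored for local files.
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'timeout' => $timeout,
        ],
    ]);
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

/**
 * Hypothetical helper: the same request through cURL, which offers
 * finer control (redirects, headers, cookies, proxies).
 */
function fetch_page_curl(string $url, int $timeout = 10): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,     // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeout, // overall timeout in seconds
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}
```

A call such as fetch_page('https://example.com/') returns the raw HTML, which can then be handed to the parsing helpers. Which of the two suits a given site depends mostly on whether redirects, cookies, or custom headers are needed.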

Commonly used functions in PHP scraping programs

The code is as follows:

// Get the URL of the current script
function get_php_url()
{
    if (!empty($_SERVER["REQUEST_URI"])) {
        $scriptName = $_SERVER["REQUEST_URI"];
        $nowurl = $scriptName;
    } else {
        $scriptName = $_SERVER["PHP_SELF"];
        if (empty($_SERVER["QUERY_STRING"])) {
            $nowurl = $scriptName;
        } else {
            $nowurl = $scriptName . "?" . $_SERVER["QUERY_STRING"];
        }
    }
    return $nowurl;
}

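// Convert full-width digits to half-width digits
function GetAlabNum($fnum)
{
    $nums = array("０", "１", "２", "３", "４", "５", "６", "７", "８", "９");
    $fnums = "0123456789";
    for ($i = 0; $i <= 9; $i++) {
        $fnum = str_replace($nums[$i], $fnums[$i], $fnum);
    }
    // ereg_replace() was removed in PHP 7, so preg_replace() is used here.
    // Strip everything except digits and dots, plus any leading zeros.
    $fnum = preg_replace("/[^0-9.]|^0{1,}/", "", $fnum);
    if ($fnum == "") {
        $fnum = 0;
    }
    return $fnum;
}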

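// Neutralize HTML markup: escape angle brackets and turn line breaks into <br/>
function Text2Html($txt)
{
    // Replace two consecutive spaces with one full-width space
    $txt = str_replace("  ", "　", $txt);
    $txt = str_replace("<", "&lt;", $txt);
    $txt = str_replace(">", "&gt;", $txt);
    $txt = preg_replace("/[\r\n]{1,}/isU", "<br/>\r\n", $txt);
    return $txt;
}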

// Clear HTML markup by escaping angle brackets
function ClearHtml($str)
{
    $str = str_replace('<', '&lt;', $str);
    $str = str_replace('>', '&gt;', $str);
    return $str;
}

// Convert relative paths to absolute paths
function relative_to_absolute($content, $feed_url)
{
    // Capture the protocol prefix, e.g. "http://"
    preg_match('/(http|https|ftp):\/\//', $feed_url, $protocol);
    // Reduce the feed URL to its host name
    $server_url = preg_replace("/(http|https|ftp|news):\/\//", "", $feed_url);
    $server_url = preg_replace("/\/.*/", "", $server_url);
    if ($server_url == '') {
        return $content;
    }
    if (isset($protocol[0])) {
        // Only root-relative href/src values (starting with "/") are rewritten
        $new_content = preg_replace('/href="\//', 'href="' . $protocol[0] . $server_url . '/', $content);
        $new_content = preg_replace('/src="\//', 'src="' . $protocol[0] . $server_url . '/', $new_content);
    } else {
        $new_content = $content;
    }
    return $new_content;
}

// Get all links (anchor text and URL)
function get_all_url($code)
{
    preg_match_all('/<a\s+href=["|\']?([^>"\' ]+)["|\']?\s*[^>]*>([^>]+)<\/a>/i', $code, $arr);
    return array('name' => $arr[2], 'url' => $arr[1]);
}

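// Get the content between the given start and end markers
function get_tag_data($str, $start, $end)
{
    if ($start == '' || $end == '') {
        return;
    }
    $str = explode($start, $str);
    // Guard against a missing start marker
    if (!isset($str[1])) {
        return;
    }
    $str = explode($end, $str[1]);
    return $str[0];
}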

// Convert each row of an HTML table into a CSV-formatted string; returns an array of rows
function get_tr_array($table)
{
    $table = preg_replace("'<td[^>]*?>'si", '"', $table);
    $table = str_replace("</td>", '",', $table);
    $table = str_replace("</tr>", "{tr}", $table);
    // Strip the remaining HTML tags
    $table = preg_replace("'<[\/\!]*?[^<>]*?>'si", "", $table);
    // Strip whitespace
    $table = preg_replace("'([\r\n])[\s]+'", "", $table);
    $table = str_replace(" ", "", $table);
    $table = str_replace("&nbsp;", "", $table);
    $table = explode(",{tr}", $table);
    array_pop($table);
    return $table;
}

// Convert every row and column of an HTML table into an array (for scraping table data)
function get_td_array($table)
{
    $table = preg_replace("'<table[^>]*?>'si", "", $table);
    $table = preg_replace("'<tr[^>]*?>'si", "", $table);
    $table = preg_replace("'<td[^>]*?>'si", "", $table);
    $table = str_replace("</tr>", "{tr}", $table);
    $table = str_replace("</td>", "{td}", $table);
    // Strip the remaining HTML tags
    $table = preg_replace("'<[\/\!]*?[^<>]*?>'si", "", $table);
    // Strip whitespace
    $table = preg_replace("'([\r\n])[\s]+'", "", $table);
    $table = str_replace(" ", "", $table);
    $table = str_replace("&nbsp;", "", $table);
    $table = explode('{tr}', $table);
    array_pop($table);
    $td_array = array();
    foreach ($table as $key => $tr) {
        $td = explode('{td}', $tr);
        array_pop($td);
        $td_array[] = $td;
    }
    return $td_array;
}

// Return all English words in a string; $distinct=true removes duplicates
function split_en_str($str, $distinct = true)
{
    preg_match_all('/([a-zA-Z]+)/', $str, $match);
    if ($distinct == true) {
        $match[1] = array_unique($match[1]);
    }
    sort($match[1]);
    return $match[1];
}
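
To show how link extraction and relative-to-absolute conversion from the list above fit together, here is a minimal, self-contained sketch. The function extract_links, its regular expression, and the sample URLs are assumptions made for this illustration, not part of the original listing. It only rewrites root-relative URLs (those starting with a single "/") and leaves everything else as found.

```php
<?php
/**
 * Hypothetical helper: collect anchor text and URLs from an HTML fragment,
 * resolving root-relative URLs against the scheme and host of $baseUrl.
 */
function extract_links(string $html, string $baseUrl): array
{
    // Build "scheme://host[:port]" from the base URL, if it has both parts.
    $base = parse_url($baseUrl);
    $origin = '';
    if (isset($base['scheme'], $base['host'])) {
        $origin = $base['scheme'] . '://' . $base['host'];
        if (isset($base['port'])) {
            $origin .= ':' . $base['port'];
        }
    }

    // Group 1: href value (quoted or unquoted). Group 2: inner HTML of the anchor.
    $pattern = '#<a\s+[^>]*href=["\']?([^"\'\s>]+)["\']?[^>]*>(.*?)</a>#is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        $url = $m[1];
        // Rewrite "/path" but not "//host/path" or path-relative URLs.
        if ($origin !== '' && strncmp($url, '/', 1) === 0 && strncmp($url, '//', 2) !== 0) {
            $url = $origin . $url;
        }
        $links[] = ['name' => trim(strip_tags($m[2])), 'url' => $url];
    }
    return $links;
}

$html = '<a href="/about.html">About</a> <a href="https://example.org/x">X</a>';
print_r(extract_links($html, 'https://example.com/news/index.html'));
```

Regular expressions are fragile on real-world HTML, so a sketch like this is best limited to pages whose markup is known in advance.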

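How to scrape a website's meta tags in PHP (get function)

See the following for reference.

get_meta_tags -- Extracts the content attributes of all meta tags from a file and returns an array

Description

array get_meta_tags ( string filename [, int use_include_path])

Opens filename and parses it line by line for meta tags. The parameter can be a local file or a URL. Parsing stops at </head>.

Setting use_include_path to 1 makes PHP try to open the file from each location listed in the standard include path, include_path. This applies only to local files, not to URLs.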

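The following example looks at how get_meta_tags(), cURL, and user-agent are used in PHP. The details are as follows:

get_meta_tags() scrapes tags of the form <meta name="A" content="1"><meta name="B" content="2"> from a page and loads them into a one-dimensional array, with name as the key and content as the value. Keys are converted to lowercase, so the tags above yield array('a' => '1', 'b' => '2'). Other kinds of meta tags are not processed. The function stops when it reaches the </head> tag, so any meta tags after it are ignored, although meta tags that appear before <head> are still processed.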
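
The behaviour described above can be tried locally with a small sketch. The helper read_meta_tags_from_html and the sample markup are assumptions for illustration only; the helper writes the HTML to a temporary file because get_meta_tags() reads from a file name or URL rather than from a string.

```php
<?php
/**
 * Hypothetical helper: run get_meta_tags() against an HTML string
 * by writing it to a temporary file first.
 */
function read_meta_tags_from_html(string $html): array
{
    $tmp = tempnam(sys_get_temp_dir(), 'meta');
    file_put_contents($tmp, $html);
    $tags = get_meta_tags($tmp);
    unlink($tmp);
    return $tags === false ? [] : $tags;
}

$html = <<<HTML
<html><head>
<meta name="Author" content="name">
<meta name="KEYWORDS" content="php, scraping">
<meta name="geo.position" content="49.33;-86.59">
</head><body>
<meta name="late" content="not read">
</body></html>
HTML;

// Keys come back lowercased, "." in a name becomes "_",
// and the meta tag placed after </head> is not included.
$tags = read_meta_tags_from_html($html);
print_r($tags);
```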

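The user-agent is part of the header information, invisible to the user, that a browser submits when it requests a page from a server. The headers form a set of fields carrying several pieces of information, such as caching directives and cookies; the user-agent field declares the browser type, for example IE, Chrome, or Firefox.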
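
Since a server may vary its output by user-agent, one way to control what PHP sends is the user_agent ini setting, which PHP's HTTP stream wrapper uses for functions such as get_meta_tags() and file_get_contents(). The sketch below is an assumption-laden illustration: the helper name get_meta_tags_with_ua and the user-agent string are made up for this example.

```php
<?php
/**
 * Hypothetical helper: call get_meta_tags() while a specific User-Agent
 * is configured, then restore the previous setting.
 */
function get_meta_tags_with_ua(string $source, string $userAgent): array
{
    $previous = ini_get('user_agent');     // remember the current value
    ini_set('user_agent', $userAgent);     // sent with http:// and https:// requests
    try {
        $tags = @get_meta_tags($source);   // false if the source cannot be opened
    } finally {
        ini_set('user_agent', (string) $previous);
    }
    return $tags === false ? [] : $tags;
}
```

A call such as get_meta_tags_with_ua('https://example.com/', 'Mozilla/5.0 (compatible; ExampleBot/1.0)') would request the page with that user-agent; for local files the setting has no effect.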

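Today, while scraping the meta tags of a web page, I kept getting empty values, even though the page source looked normal when viewed directly. I therefore suspected that the server decides what to output based on the request headers. As a first test, I used get_meta_tags() to fetch a local script, and had that script write the header information it received to a file. The result is shown below, with \ replaced by / for readability: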

array (
  'HTTP_HOST' => '192.168.30.205',
  'PATH' => 'C:/Program Files/Common Files/NetSarang;C:/Program Files/NVIDIA Corporation/PhysX/Common;C:/Program Files/Common Files/Microsoft Shared/Windows Live;C:/Program Files/Intel/iCLS Client/;C:/Windows/system32;C:/Windows;C:/Windows/System32/Wbem;C:/Windows/System32/WindowsPowerShell/v1.0/;C:/Program Files/Intel/Intel(R) Management Engine Components/DAL;C:/Program Files/Intel/Intel(R) Management Engine Components/IPT;C:/Program Files/Intel/OpenCL SDK/2.0/bin/x86;C:/Program Files/Common Files/Thunder Network/KanKan/Codecs;C:/Program Files/QuickTime Alternative/QTSystem;C:/Program Files/Windows Live/Shared;C:/Program Files/QuickTime Alternative/QTSystem/; %JAVA_HOME%/bin;%JAVA_HOME%/jre/bin;',
  'SystemRoot' => 'C:/Windows',
  'COMSPEC' => 'C:/Windows/system32/cmd.exe',
  'PATHEXT' => '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC',
  'WINDIR' => 'C:/Windows',
  'SERVER_SIGNATURE' => '',
  'SERVER_SOFTWARE' => 'Apache/2.2.11 (Win32) PHP/5.2.8',
  'SERVER_NAME' => '192.168.30.205',
  'SERVER_ADDR' => '192.168.30.205',
  'SERVER_PORT' => '80',
  'REMOTE_ADDR' => '192.168.30.205',
  'DOCUMENT_ROOT' => 'E:/wamp/www',
  'SERVER_ADMIN' => 'admin@admin.com',