Web application question

Marek

Guest
Suppose there is an HTML page at http://www.thatSite.com/x.html that is generated daily by a web stats program.

You administer a totally different site, but you have permission to dynamically extract a few small portions of the content from the x.html page to include on a page at your site. Is this possible using ColdFusion or PHP? And what is the name for this method of "extracting data from another web page"?

Edit: The stat program does not generate XML (but maybe there's a way to convert it into XML that I don't know about)
 
So you're going to download the page and rip the content from it? That's just parsing. You'll have to write the parser yourself in either language.
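To make "parsing" concrete: once you have the page's HTML in a string, the simplest approach is to pull out the piece between two landmark strings you know surround it. A minimal sketch (the sample HTML and marker strings here are made up for illustration, not from the real x.html):

```php
<?php
// Extract the substring between two landmark strings, or null if
// either landmark is missing from the HTML.
function extractBetween($html, $start, $end)
{
    $s = strpos($html, $start);
    if ($s === false)
        return null;
    $s += strlen($start);
    $e = strpos($html, $end, $s);
    if ($e === false)
        return null;
    return substr($html, $s, $e - $s);
}

// Example: grab a hit count out of a stats-style snippet.
$page = '<html><body><b>Hits today:</b> <span id="hits">1234</span></body></html>';
echo extractBetween($page, '<span id="hits">', '</span>');   // prints 1234
?>
```

For a daily-regenerated stats page this tends to be fragile: if the markup around your target changes, the landmarks stop matching, so pick the most stable markers you can.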
 
Bah...that's what I tried searching for with 10000000000 results. I'll just keep refining the search, I guess.
 
" 'cold fusion' parse external web page " etc...

And what I'm referring to isn't a one-time thing. I mean I want to code my page to rip specific content from the x.html file every time someone loads it. I found the tag I need to do it, though. ColdFusion's CFHTTP tag can be set up to download external content and pass the response body into a variable as text.
 
What are you searching for? There's nothing that's just going to parse THAT specific page for you.
 
Yeah, I know I've gotta do the coding / parsing. I was just trying to find a method of making my code "get ahold" of the external data so I could code the parsing routine. I found what I needed, tho :D
 
http://us2.php.net/function.fsockopen

Code:
Here's a quick function to establish a connection to a web server that will time out if the connection is lost after a user-definable amount of time or if the server can't be reached.

Also supports Basic authentication if a username/password is specified. Any improvements or criticisms, please email me! :-)

Returns either a resource ID, an error code, or 0 if the server can't be reached at all. Returns -1 in the event that something really weird happens, like a non-standard HTTP response. Hope it helps someone.

Cheers,

Ben Blazely

function connectToURL($addr, $port, $path, $user = "", $pass = "", $timeout = "30")
{
    $urlHandle = fsockopen($addr, $port, $errno, $errstr, $timeout);

    if ($urlHandle)
    {
        socket_set_timeout($urlHandle, $timeout);

        if ($path)
        {
            $urlString = "GET $path HTTP/1.0\r\nHost: $addr\r\nConnection: Keep-Alive\r\nUser-Agent: MyURLGrabber\r\n";
            if ($user)
                $urlString .= "Authorization: Basic ".base64_encode("$user:$pass")."\r\n";
            $urlString .= "\r\n";

            fputs($urlHandle, $urlString);

            $response = fgets($urlHandle);

            if (substr_count($response, "200 OK") > 0)      // Check the status of the link
            {
                $endHeader = false;                         // Strip initial header information
                while (!$endHeader)
                {
                    if (fgets($urlHandle) == "\r\n")
                        $endHeader = true;
                }

                return $urlHandle;                          // All OK, return the file handle
            }
            else if (strlen($response) < 15)                // Cope with weird non-standard responses
            {
                fclose($urlHandle);
                return -1;
            }
            else                                            // Cope with a standard error response
            {
                fclose($urlHandle);
                return substr($response, 9, 3);
            }
        }

        return $urlHandle;
    }
    else
        return 0;
}
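Once that function hands back a handle, the headers have already been consumed, so you just read lines until EOF and glue them together to get the page body. Here's a self-contained sketch of that read loop; the fake in-memory "response" stands in for a real socket so the example runs without a network connection:

```php
<?php
// A canned HTTP response standing in for the socket connectToURL() returns.
$fakeResponse = "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n"
              . "<html><body>Hits today: 1234</body></html>";

$handle = fopen('php://memory', 'r+');
fwrite($handle, $fakeResponse);
rewind($handle);

// Same header-skipping loop as in connectToURL(): discard lines
// until the blank line that ends the headers.
while (($line = fgets($handle)) !== false)
{
    if ($line == "\r\n")
        break;
}

// Read the body exactly as you would from the returned socket handle.
$body = '';
while (!feof($handle))
    $body .= fgets($handle);
fclose($handle);

echo $body;   // prints the HTML body
?>
```

With the whole body in `$body`, you can then run whatever parsing routine you've written against it.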
 
Alternatively you can use:
$url = "http://yourdomain.com/file.html";
$contents = file_get_contents($url);
// $contents will hold the contents in a string.
$tagstoleaveout = '';
$minpage = strip_tags($contents, $tagstoleaveout);
// strip_tags to get rid of tags (the second argument lists the tags to keep)...


Then you could use some sort of regular expression, or just a plain string search, to find the part of the page you want to rip out.
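A sketch of the regular-expression approach on a stats-style snippet. The sample HTML and the pattern are invented for illustration; you'd adjust both to match the real markup in x.html:

```php
<?php
// Pretend this came back from file_get_contents() on the stats page.
$contents = '<tr><td>Unique visitors</td><td class="num">5678</td></tr>';

// Capture the digits in the cell that follows the "Unique visitors" label.
if (preg_match('#<td>Unique visitors</td><td[^>]*>(\d+)</td>#', $contents, $m))
    echo $m[1];   // prints 5678
?>
```

The capture group `$m[1]` is what you'd drop into your own page; if `preg_match` returns 0, the page layout has changed and your pattern needs updating.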
 