Overview
This overview will help you understand how the library is built and what its main components are.
Install
To work with SERPS you need two things:
- One or more clients for the search engines you want to parse
- An HTTP client
In addition, Composer is required to manage the necessary dependencies.
composer.json example with the Google client and the cURL HTTP client:
```json
{
    "require": {
        "serps/core": "*",
        "serps/search-engine-google": "*",
        "serps/http-client-curl": "*"
    }
}
```
Danger
The library is still in alpha. That means major things can change until the stable release, and your code might become incompatible with updates. Note that we follow semver.
Search Engine client
In a regular workflow a search engine client allows you to (see the sketch after this list):
- Manipulate a URL and generate a request specific to the search engine
- Retrieve the response from the search engine
- Parse this response into a standard set of results
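As an illustration, here is a minimal sketch of that workflow with the Google client; class and method names follow the Google client guide, and that dedicated guide remains the reference for the full API:

```php
use Serps\Core\Browser\Browser;
use Serps\HttpClient\CurlClient;
use Serps\SearchEngine\Google\GoogleClient;
use Serps\SearchEngine\Google\GoogleUrl;

// Browser backed by the cURL HTTP client
$browser = new Browser(new CurlClient());
$googleClient = new GoogleClient($browser);

// 1. Manipulate a URL and generate a request specific to the search engine
$googleUrl = new GoogleUrl();
$googleUrl->setSearchTerm('simpsons');

// 2. Retrieve the response from the search engine
$response = $googleClient->query($googleUrl);

// 3. Parse this response into a standard set of results
$results = $response->getNaturalResults();
```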
Each search engine has its own specificities, and thus each search engine implementation has its own dedicated guide.
These search engines are currently available:
Http Client
Working with search engines means working with HTTP requests, so providing an HTTP client is mandatory to be able to perform them.
```php
use Serps\HttpClient\CurlClient;
use Serps\Core\Browser\Browser;

$browser = new Browser(new CurlClient());
```
There are two kinds of HTTP clients: those that return the raw HTML exactly as it comes back in the HTTP response (e.g. the cURL client) and those that evaluate the JavaScript and update the DOM before returning (e.g. the PhantomJS client).
These http clients are currently available:
Browser Objects
Browser objects wrap all the information you need to issue stateful requests. Note that even though it is called a "browser", it will not evaluate HTML, CSS and JS the way Chrome or Firefox would. The major features of a browser are:
- a user agent
- an accept-language header
- any other headers required
- cookie management
- proxy management
View the browser documentation
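For instance, a browser that sends a custom user agent and accept-language header could be built roughly like this (the constructor argument order is an assumption based on common usage; the browser documentation is authoritative):

```php
use Serps\Core\Browser\Browser;
use Serps\HttpClient\CurlClient;

// User agent and accept-language header are set when the browser is built.
// The argument order is an assumption; see the browser documentation.
$browser = new Browser(
    new CurlClient(),
    'Mozilla/5.0 (Windows NT 6.1; rv:41.0) Gecko/20100101 Firefox/41.0',
    'en-US'
);
```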
Proxies
Most of the time search engines don't want you to scrape them, so they tend to block you with a captcha when they think you are a bot. When you deal with a very large number of requests, you will need to send them through proxies.
This is a major feature of scraping and we placed proxies at the very heart of the library. Each request is proxy aware. This way, with a single client you can use as many proxies as you want.
Example of proxy usage with the Google client:

```php
use Serps\Core\Browser\Browser;
use Serps\HttpClient\CurlClient;
use Serps\Core\Http\Proxy;

$proxyIp = '192.168.192.168';
$proxyPort = 8080;

$browser = new Browser(new CurlClient());
$browser->setProxy(new Proxy($proxyIp, $proxyPort));
```
Read the proxy doc to learn more about proxy creation.
Captcha
Even though you use proxies and make every effort to act like a human, you might encounter the fatal captcha.
When you get blocked by a captcha, it is very important to stop sending requests to the search engine and to solve the captcha before you continue.
Dealing with captchas is not easy: in its current state the library can detect captchas but is not able to solve them for you. We are currently working on a captcha solver implementation but cannot guarantee it will be released soon.
Note
Captchas are proxy specific: a captcha should be solved with the proxy that was initially blocked.
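Purely as an illustration of that flow, here is a sketch reusing the Google client from the earlier example; the isCaptchaResponse() helper is hypothetical and not part of the library, since the actual detection mechanism is described in each search engine's dedicated guide:

```php
// Hypothetical helper for illustration only: how a captcha is detected
// depends on the search engine client, see its dedicated guide.
function isCaptchaResponse($response): bool
{
    return false;
}

$response = $googleClient->query($googleUrl);

if (isCaptchaResponse($response)) {
    // Stop sending requests through this proxy, solve the captcha with the
    // same proxy that was blocked, then resume querying.
}
```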
Cookies
SERPS integrates cookie management, which allows you to share cookies across many requests.
Cookie management is usually done at the HTTP client level, but you may still want to know how to manipulate cookies and cookie jars: see the cookie documentation.
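As a rough sketch, a cookie jar can be attached to the browser so that cookies set by one response are sent with the following requests; the ArrayCookieJar class and the constructor position used here are assumptions, so check the cookie documentation for the exact API:

```php
use Serps\Core\Browser\Browser;
use Serps\Core\Cookie\ArrayCookieJar;
use Serps\HttpClient\CurlClient;

// In-memory cookie jar shared by every request made through this browser.
// Class name and argument position are assumptions; see the cookie documentation.
$cookieJar = new ArrayCookieJar();

$browser = new Browser(
    new CurlClient(),
    'Mozilla/5.0 (Windows NT 6.1; rv:41.0) Gecko/20100101 Firefox/41.0',
    'en-US',
    null,       // no proxy
    $cookieJar
);
```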