Overview

This overview will help you understand how the library is built and what its main components are.


Install

To work with SERPS you need two things:

- a search engine client (for instance the Google client)
- an http client to perform the requests (for instance the cURL client)

In addition Composer is required to manage the necessary dependencies.

composer.json example with the Google client and the Curl http client

    {
        "require": {
            "serps/core": "*",
            "serps/search-engine-google": "*",
            "serps/http-client-curl": "*"
        }
    }
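
Once the dependencies are installed with composer install, the Composer autoloader makes the library classes available. This is standard Composer usage rather than anything specific to SERPS:

    // Generated by Composer after running "composer install".
    require __DIR__ . '/vendor/autoload.php';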

Danger

The library is still in alpha, which means that major things can change until the stable release and your code might become incompatible with future updates. Note that we follow semver.

Search Engine client

In a regular workflow a search engine client allows you to:

- build the url for a given search query
- parse the resulting search engine result page (SERP)

Each search engine has its own specificities, so each search engine implementation has its own dedicated guide.

These search engines are currently available:

- Google
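
As a quick illustration, a minimal Google query could look like the sketch below. The classes and methods used (GoogleClient, GoogleUrl, setSearchTerm, query, getNaturalResults) are assumptions based on the Google guide and may differ in your version of the library:

    use Serps\SearchEngine\Google\GoogleClient;
    use Serps\SearchEngine\Google\GoogleUrl;
    use Serps\HttpClient\CurlClient;

    // Build a client around an http client (see the next section).
    $googleClient = new GoogleClient(new CurlClient());

    // Describe the query to run.
    $googleUrl = new GoogleUrl();
    $googleUrl->setSearchTerm('simpsons');

    // Run the query and read the organic results.
    $response = $googleClient->query($googleUrl);
    $results  = $response->getNaturalResults();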

Http Client

Working with search engines involves working with http requests. Providing an http client is mandatory to be able to perform them.

    use Serps\HttpClient\CurlClient;
    use Serps\Core\Browser\Browser;

    $browser = new Browser(new CurlClient());

There are two kinds of http clients: those that return the raw html as received in the http response (e.g. the cURL client) and those that evaluate the javascript and update the DOM before returning (e.g. the PhantomJS client).

These http clients are currently available:

- cURL
- PhantomJS
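
Whichever kind you choose, it plugs into the browser in the same way. In the sketch below the PhantomJS client class name and its argument-less constructor are assumptions based on the serps/http-client-phantomjs package; check its documentation for the exact usage:

    use Serps\Core\Browser\Browser;
    use Serps\HttpClient\CurlClient;
    use Serps\HttpClient\PhantomJsClient; // class name assumed from the phantomjs package

    // Raw html, no javascript evaluation:
    $rawBrowser = new Browser(new CurlClient());

    // Javascript evaluated before the page is returned (constructor arguments,
    // such as the path to the phantomjs binary, may be required):
    $jsBrowser = new Browser(new PhantomJsClient());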

Browser Objects

Browser objects are used to wrap all the information you need to issue stateful requests. Note that even though it is named "browser", it does not evaluate html, css and javascript the way Chrome or Firefox would. The major features of a browser are:

- it sends every request through the http client you give it
- it can route requests through a proxy
- it can persist cookies across requests
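
For illustration, the sketch below configures a browser with a custom user agent and language. These optional constructor arguments are assumptions and not documented here, so check the browser documentation for the actual signature:

    use Serps\Core\Browser\Browser;
    use Serps\HttpClient\CurlClient;

    // The user agent and accept-language arguments are assumptions about the
    // constructor; see the browser documentation for the exact signature.
    $browser = new Browser(
        new CurlClient(),
        'Mozilla/5.0 (compatible; MyBot/1.0)', // user agent sent with every request
        'en-US'                                // Accept-Language header
    );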

View the browser documentation

Proxies

Most of the time search engines don't want you to scrape them, so they tend to block you with a captcha when they think you are a bot. When you deal with a very large number of requests, you will need to send them through proxies.

This is a major feature of scraping, so we placed proxies at the very heart of the library. Each request is proxy-aware: with a single client you can use as many proxies as you want.

Example of proxy usage with the google client

    use Serps\Core\Browser\Browser;
    use Serps\HttpClient\CurlClient;
    use Serps\Core\Http\Proxy;

    $proxyIp   = '192.168.192.168';
    $proxyPort = 8080;

    $browser = new Browser(new CurlClient());
    $browser->setProxy(new Proxy($proxyIp, $proxyPort));
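
Building on the same setProxy call, the sketch below rotates a pool of proxies across a batch of requests; the proxy addresses are placeholders and $queries stands for your own list of searches:

    use Serps\Core\Browser\Browser;
    use Serps\Core\Http\Proxy;
    use Serps\HttpClient\CurlClient;

    // Illustrative proxy pool; replace with your own proxies.
    $proxies = [
        new Proxy('192.168.192.168', 8080),
        new Proxy('192.168.192.169', 8080),
    ];

    $browser = new Browser(new CurlClient());

    foreach ($queries as $i => $query) {
        // Pick a different proxy for each request.
        $browser->setProxy($proxies[$i % count($proxies)]);
        // ... issue the request for $query with the current proxy
    }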

Read the proxy doc to learn more about proxy creation.

Captcha

Even if you use proxies and make every effort to act like a human, you might still encounter the fatal captcha.

When you get blocked by a captcha, it is very important to stop sending requests to the search engine and to solve the captcha before you continue.

Dealing with captchas is not easy: in its current state the library can detect captchas but is not able to solve them for you. We are currently working on a captcha solver implementation but cannot guarantee it will be released soon.

Note

Captchas are proxy specific: a captcha should be solved with the proxy that was initially blocked.
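
As a sketch of that rule, the exception class below is hypothetical and only stands in for whatever detection mechanism the search engine client exposes (see its dedicated guide):

    // Hypothetical: CaptchaDetectedException is a placeholder for the way the
    // search engine client reports a captcha (see the client's guide).
    try {
        $response = $googleClient->query($googleUrl);
    } catch (CaptchaDetectedException $e) {
        // Stop sending requests through this proxy, solve the captcha with the
        // same proxy that was blocked, then resume.
    }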

Cookies

SERPS integrates cookie management, which allows you to share cookies across many requests.

Cookie management is usually handled at the http client level, but you may still want to know how to manipulate cookies and cookie jars: see the cookie documentation.
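
As an illustration, the sketch below shares one cookie jar between two browsers so that cookies set by one request are sent with the following ones. The ArrayCookieJar class and the constructor position it is passed in are assumptions, so refer to the cookie and browser documentation for the actual API:

    use Serps\Core\Browser\Browser;
    use Serps\Core\Cookie\ArrayCookieJar;
    use Serps\HttpClient\CurlClient;

    // Assumed cookie jar implementation; see the cookie documentation.
    $cookieJar = new ArrayCookieJar();

    // Both browsers share the same jar, so cookies set by one request are sent
    // with the following ones, whichever browser issued them (the constructor
    // position of the jar is an assumption).
    $browserA = new Browser(new CurlClient(), null, null, $cookieJar);
    $browserB = new Browser(new CurlClient(), null, null, $cookieJar);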