Overview

This overview will help you understand how the library is built and what its main components are.


Install

To work with SERPS you need two things:

- a search engine client (for example the Google client)
- an http client (for example the cURL client)

In addition, Composer is required to manage the necessary dependencies.

A composer.json example with the Google client and the cURL http client:

    {
        "require": {
            "serps/core": "*",
            "serps/search-engine-google": "*",
            "serps/http-client-curl": "*"
        }
    }
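
Once the packages are installed, Composer's autoloader makes the library's classes available:

    // composer's autoloader resolves the classes of the serps packages
    require 'vendor/autoload.php';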

Danger

The library is still in alpha and no stable version has been released yet. That means minor things can change until the stable release, and your code might become incompatible with updates.

Search Engine client

In a regular workflow a search engine client allows you to:

- build urls for the search engine
- query the search engine and parse the result pages

Each search engine has its own specificities, and thus each search engine implementation has its own dedicated guide.

These search engines are currently available:

- Google
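
For illustration, a minimal search could look like the sketch below. It is based on the Google client; the GoogleUrl class, its setSearchTerm method and the getNaturalResults method are assumptions here, and the dedicated Google guide is the authoritative reference:

    use Serps\SearchEngine\Google\GoogleClient;
    use Serps\SearchEngine\Google\GoogleUrl;
    use Serps\HttpClient\CurlClient;

    // build a google client backed by the cURL http client
    $googleClient = new GoogleClient(new CurlClient());

    // build a search url (assumed API, see the Google guide)
    $googleUrl = new GoogleUrl();
    $googleUrl->setSearchTerm('simpsons');

    // send the query and parse the natural results (assumed API)
    $response = $googleClient->query($googleUrl);
    $results = $response->getNaturalResults();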

Http Client

Working with search engines involves working with http requests. Usually the search engine client will need an http client to work correctly.

Example with the Google client and the cURL http client:

    use Serps\SearchEngine\Google\GoogleClient;
    use Serps\HttpClient\CurlClient;

    // the google client delegates all http work to the given http client
    $googleClient = new GoogleClient(new CurlClient());

There are two kinds of http clients: those that return the raw html exactly as the search engine sent it (e.g. the cURL client) and those that evaluate the javascript and update the DOM before returning (e.g. the PhantomJS client).

These http clients are currently available:

- the cURL client
- the PhantomJS client
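
Because every http client can back a search engine client in the same way, swapping one for another only changes the construction line. A minimal sketch, assuming a PhantomJsClient class analogous to the CurlClient shown above (its actual name and constructor may differ, check the client's own guide):

    use Serps\SearchEngine\Google\GoogleClient;
    use Serps\HttpClient\PhantomJsClient;

    // assumption: PhantomJsClient is the javascript-evaluating client
    // mentioned above; its constructor arguments may differ
    $googleClient = new GoogleClient(new PhantomJsClient());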

Proxies

Most of the time search engines don't want you to scrape them, so they tend to block you with a captcha when they think you are a bot. When you deal with a very large number of requests, you will need to send them through proxies.

This is a major feature of scraping, and we placed proxies at the very heart of the library. Each request is proxy-aware. This way, with a single client, you can use as many proxies as you want.

Example of proxy usage with the Google client:

    use Serps\SearchEngine\Google\GoogleClient;
    use Serps\HttpClient\CurlClient;

    $googleClient = new GoogleClient(new CurlClient());

    // $googleUrl is a search url and $proxy a proxy instance;
    // the query is sent through the given proxy
    $googleClient->query($googleUrl, $proxy);

Read the proxy doc to learn more about proxy creation.
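
As a quick illustration, a proxy might be created from a connection string. This sketch assumes a Proxy class in serps/core with a createFromString factory; the proxy doc is the authoritative reference:

    use Serps\Core\Http\Proxy;

    // assumption: createFromString parses a 'user:password@host:port' string
    $proxy = Proxy::createFromString('login:password@111.111.111.111:8080');

The resulting proxy can then be passed to query() as shown above.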

Captcha

Even if you use proxies and make every effort to act like a human, you might encounter the fatal captcha.

When you get blocked by a captcha, it is very important to stop sending requests to the search engine and to solve the captcha before you continue.

Dealing with captchas is not easy. At the current state the library can detect captchas but is not able to solve them for you; we are currently working on a captcha solver implementation.

Note

Captchas are proxy specific: a captcha should be solved with the proxy that was initially blocked.
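
As a purely illustrative sketch of that workflow (the code below catches a generic exception; the actual way a captcha is reported is described in each search engine guide):

    // hypothetical sketch: the exact exception thrown on captcha detection
    // depends on the search engine client
    try {
        $response = $googleClient->query($googleUrl, $proxy);
    } catch (\Exception $e) {
        // stop sending requests, solve the captcha with the SAME proxy
        // that was blocked, then resume querying
    }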

Cookies

SERPS integrates cookie management, which allows you to share cookies across many requests.

Cookie management is usually done at the search engine client level. If you still want to know how to manipulate cookies and cookie jars, see the cookie documentation.