Understanding crawling and indexing

When you run a search on a search engine, you get a list of results. Those results have to come from somewhere, and that somewhere is what we call an index.

An index is the "library" of a search engine. This library is filled by robots, sometimes also called bots, which are programs designed to scrape data from websites.

To put it simply, developers and software engineers write the programs that control those robots and tell them which pieces of information to collect for the search engine's library.

It is a bit like going to a bookstore, reading all the books and writing a summary of each one in a notebook, so that you can find them more easily next time.
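The notebook analogy can be sketched in a few lines of Python. This is only a toy illustration, not how any real search engine is built: the pages, URLs and function names below are made up for the example, and the "index" is a simple mapping from each word to the pages that contain it.

```python
import re
from collections import defaultdict

# Hypothetical pages a robot might have collected: URL -> page text.
PAGES = {
    "https://example.com/tea": "Green tea and black tea brewing guide",
    "https://example.com/coffee": "Coffee brewing guide for beginners",
}

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query, sorted."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

index = build_index(PAGES)
print(search(index, "brewing guide"))
# ['https://example.com/coffee', 'https://example.com/tea']
print(search(index, "tea"))
# ['https://example.com/tea']
```

The point of the sketch is that a search never reads the websites themselves: it only looks things up in the notebook that was prepared beforehand.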

This part is hard to visualize without practice, which is why we strongly recommend that you try tools such as Scrapy (which we will see later in this course). They give a good idea of how such programs work.

As you can imagine, a website changes all the time, so it is not enough for the robot to browse your website once. It needs to come back very often in order to update its index and show you the most relevant results.
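One simple way to picture this update step is to compare a fingerprint of each page with the fingerprint stored during the previous visit. The sketch below assumes that approach purely for illustration; real search engines decide what to revisit in far more elaborate ways, and the function names and sample pages here are hypothetical.

```python
import hashlib

def fingerprint(content):
    """Return a SHA-256 hex digest that changes whenever the content changes."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def pages_to_update(known_fingerprints, fetched_pages):
    """Return the URLs that are new or whose content differs from the last visit."""
    changed = []
    for url, content in fetched_pages.items():
        if known_fingerprints.get(url) != fingerprint(content):
            changed.append(url)
    return sorted(changed)

# Hypothetical state saved after an earlier visit: URL -> fingerprint.
known = {
    "https://example.com/": fingerprint("Welcome"),
    "https://example.com/news": fingerprint("Old news"),
}

# Hypothetical content fetched during the current visit: URL -> page text.
fetched = {
    "https://example.com/": "Welcome",
    "https://example.com/news": "Fresh news",
    "https://example.com/contact": "Contact us",
}

print(pages_to_update(known, fetched))
# ['https://example.com/contact', 'https://example.com/news']
```

Only the changed page and the new page need to be written into the index again; the unchanged home page can be skipped.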

This is why, if you look at the log files on your server, you will see that search engine robots visit your website very often (weekly, daily, hourly... depending on how popular your website is).

What are logs?

Every HTTP request made to your server is logged. So each time a file is requested from your server, the server records this information. (This is not strictly every time a page is viewed, because a browser sometimes already has the file in its cache, a local copy kept to speed up browsing, and does not need to ask your server again.) As a result, when a bot visits your website, its requests are recorded as well.
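To make this more concrete, here is a minimal sketch that counts search engine robot visits in a log. It assumes the log uses the "combined" format commonly produced by Apache and Nginx; your server may be configured differently. The sample lines are invented for the example (the IP addresses come from a range reserved for documentation), and the names LOG_PATTERN, BOT_MARKERS and count_bot_hits are hypothetical.

```python
import re
from collections import Counter

# One log line in the "combined" format (assumed here; check your server's configuration).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Lowercase substrings looked for in the user agent; extend as needed.
BOT_MARKERS = ("googlebot", "bingbot")

# Invented sample lines, for illustration only.
SAMPLE_LOG = """\
192.0.2.10 - - [01/Jan/2019:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.55 - - [01/Jan/2019:10:00:05 +0000] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
192.0.2.11 - - [01/Jan/2019:11:00:00 +0000] "GET /blog HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
192.0.2.10 - - [01/Jan/2019:12:00:00 +0000] "GET /blog HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
"""

def count_bot_hits(log_text):
    """Count requests per robot, based on the user agent of each log line."""
    hits = Counter()
    for line in log_text.splitlines():
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that do not follow the assumed format
        agent = match.group("agent").lower()
        for marker in BOT_MARKERS:
            if marker in agent:
                hits[marker] += 1
                break
    return hits

hits = count_bot_hits(SAMPLE_LOG)
print(hits["googlebot"], hits["bingbot"])
# 2 1
```

Keep in mind that a user agent is only what the visitor claims to be, so a count like this is an estimate rather than proof that the requests really came from a given search engine.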

So, as you can imagine, being a good search engine means having an army of robots that can visit as many websites as possible in a very short period of time, in order to provide as many good answers as possible to internet users. That is one of the main reasons why it is so difficult for new search engines to enter the market.



Last modified: Sunday, 29 September 2019, 5:03 PM