Scalable Self-Hosted Web Crawlers

Own your data and scale with a swarm of web crawlers that understands WSL (Web Scraping Language).

On the one hand information wants to be expensive, because it's so valuable. The right information in the right place just changes your life. On the other hand, information wants to be free, because the cost of getting it out is getting lower and lower all the time. So you have these two fighting against each other. - Stewart Brand

But I can't code?!

Web Scraping Language (WSL) has a low learning curve and no coding experience is needed. The examples shown below are all that is necessary to begin crawling.

So, why Scrape.it?

It has a terminal-like feel and aims to be every bit as efficient when it comes to web scraping. Data privacy: any data scraped from the web is processed on hardware you authorize.

How does this work?

You scrape websites using WSL in the Scrape.it software. You scale the crawl rate by adding or removing instances, and they will just pick up the workload.

Read more in the FAQ...

How do I get started?

Keep reading this page to find the download link for Windows or Mac. Try swapping out the examples below with your own selectors and URLs and experiment!

Web Scraping Language (WSL)

WSL is a declarative domain-specific language for the web. It automates browser actions such as following a set of links (a.k.a. crawling), extracting data from each page that loads, and filling out forms. Actions run in order, separated by the pipe operator |

Syntax

- ACTION1 | ACTION2 {PAGINATOR}SELECTOR | ACTION3 ...
- You reference element(s) on the page with a CSS or XPath SELECTOR
- Extract data via JSON: { product: .title, usd: .price, ...one entry per output column }
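
For example, the actions below chain a page load, a link crawl, and an extraction (selectors reused from the Wikipedia example further down):

GOTO en.wikipedia.org/wiki/List_of_Dexter_episodes | CRAWL .summary a | EXTRACT {title: h1}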

Web Crawling Scenarios

Crawl URL Params

Format: GOTO URL[range] | EXTRACT {json}
Example: GOTO github.com/search?p=[1-3]&q=[cms, chess, minecraft] | EXTRACT {title: h3}

3 pages × 3 keywords = 9 URL permutations will be crawled, with data extracted from each.
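
Concretely, the bracket expansions above combine into one URL per (page, keyword) pair. The crawl order is up to the crawler, but the nine URLs are:

github.com/search?p=1&q=cms
github.com/search?p=1&q=chess
github.com/search?p=1&q=minecraft
github.com/search?p=2&q=cms
github.com/search?p=2&q=chess
github.com/search?p=2&q=minecraft
github.com/search?p=3&q=cms
github.com/search?p=3&q=chess
github.com/search?p=3&q=minecraft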

Crawl Regex Links

Format: CRAWL /REGEX/ or CRAWL /REGEX/ IN SELECTOR
Example: CRAWL /github.com/search?p=[0-9](.*?)/ IN .px-2 | EXTRACT {title: h3}

Crawls links matching the regex pattern under a parent element with the class name "px-2".
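
The IN SELECTOR clause is optional; without it, the pattern is matched against links anywhere on the page. An illustrative variant of the example above:

CRAWL /github.com/search?p=[0-9](.*?)/ | EXTRACT {title: h3}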

Crawl & Extract

Format: GOTO URL | CRAWL SELECTOR | EXTRACT {json}
Example: GOTO en.wikipedia.org/wiki/List_of_Dexter_episodes | CRAWL .summary a | EXTRACT {title: h1, code: //tr[7]/td}

Follows each Dexter episode link and extracts the title and production code.

Paginated Crawl

Format: CRAWL {pageStart, strategy, pageEnd} SELECTOR
Strategies:
1. Clicking a 'next page' element that runs the crawl again on subsequent pages.
2. Mouse-wheel scrolling to load the next page.
3. Clicking numbered elements that load the next page.

1. GOTO news.ycombinator.com | CRAWL {.morelink} .hnuser
GOTO news.ycombinator.com | CRAWL {.morelink,4} .hnuser
1st ex: Crawls all pages by clicking .morelink until that element can no longer be found.
2nd ex: Navigates via .morelink until the 4th page is reached.
2. GOTO news.ycombinator.com | CRAWL {autoscroll,2} .hnuser
GOTO news.ycombinator.com | CRAWL {3,autoscroll,4} .hnuser
1st ex: Crawls past the first page by scrolling down one page length, 2 times.
2nd ex: Navigates to the 3rd page first and continues crawling until the 4th page.
3. GOTO news.ycombinator.com | CRAWL {number} .hnuser
GOTO news.ycombinator.com | CRAWL {3,number,4} .hnuser
1st ex: Finds a numbered link or element and increments it exhaustively.
2nd ex: Navigates to the 3rd page by finding & clicking the numbered link, continuing until the 4th.

Extract Rows

Format: GOTO URL | EXTRACT {json} IN SELECTOR
Example: GOTO en.wikipedia.org/wiki/List_of_Dexter_episodes | EXTRACT { title: h1, aired: //table[2]//tr[2]/td[5] } IN .wikiepisodetable

Extracts every Dexter episode's title and air date under the parent element with class "wikiepisodetable".

Paginated Extract

Format: GOTO URL | EXTRACT {pageStart, paginationStrategy, pageEnd} {json}
Example: GOTO news.ycombinator.com | EXTRACT {.morelink, 2} {news: .storylink}

Continues extracting every news headline on every page until the 2nd page.
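
The paginator strategies from Paginated Crawl should apply here as well; for instance, a scroll-based variant might look like this (an assumed combination, not taken from the examples above):

GOTO news.ycombinator.com | EXTRACT {autoscroll,2} {news: .storylink}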

Extract WITH Crawl

Format: GOTO URL | CRAWL SELECTOR WITH EXTRACT {json}
Example: GOTO news.ycombinator.com | CRAWL {.morelink} .hnuser WITH EXTRACT {title: .storylink, submitter: .hnuser} | EXTRACT {karma: //*[@id="hnmain"]//td[2]}

Exhaustively continues extracting every news headline and its submitter, and augments each submitter's record by appending the karma property to the previous JSON object. Output is: {title, submitter, karma}

Nested Crawls

Format: GOTO URL | CRAWL SELECTOR | CRAWL SELECTOR | ....
Example: GOTO github.com/marketplace | CRAWL nav/ul/li/a | CRAWL .h4 | EXTRACT {app: h1, language: .py-3 .d-block}

Follows the category links and all the apps on the first page of results, then extracts each app's name and supported languages. Crawls recursively!

Typing Text

Format: GOTO URL | TYPE SELECTOR [keyword1, keyword2...] | TYPE [KEY_...]
Example: GOTO github.com/search?q= | TYPE input[@name="q"] ["time", "security", "social"] | TYPE [KEY_RETURN] | EXTRACT {"search url": ".text-gray-dark.mr-2"}

For each keyword, we send a KEY_RETURN to submit the search form via the keyboard. Then we crawl the first page of search results and scrape each result's URL into a data column named "search url".

Clicking & Forms

Format: CLICK[n-index] SELECTOR
Strategies:
1. Find elements with a selector and click the Nth element; note you can also just use XPath for the selector!
2. Try out every possible permutation for selected forms, crawling dropdowns, etc.
3. Click a link and execute a macro action, like downloading a file.

1. GOTO news.ycombinator.com/login | CLICK input | CLICK input[last()] | CLICK input[3] | CLICK[3] input

Clicks the first element, then the last element, and finally shows two methods for selecting the same 3rd element.

2. GOTO redux-form.com/6.6.3/examples/simple/ | TYPE input[@name="email"] [user1@x.com, user2@x.com] | CRAWL select/options

For each email address entered, we try every option for it.

3. GOTO https://www.putty.org | CLICK //tr[1]/td[2]/p[2]/a | __SAVE__ //div[1]/div[2]/span[2]

Clicks a link that navigates to a different domain. We save the file with the macro command wrapped in double underscores: __SAVE__.

FAQ

How do I contact you?

Email support@this domain.

Where are the scraped data hosted?

It lives across all the computers running the Scrape.it Terminal that know about your public key, which points to the scraped data on your machine(s).

Where are the web crawlers run?

They run on your computer(s), or on other computers you share the project's public key with.

Any other M A C R O S coming?

Self-explanatory:

- __DRAG__ .item TO .cart or __DRAG__ .item TO [x,y]
- __HOVER__ #hoverable
- __SEND__ ["ctrl", "ctrl+alt+delete", "alt+f4"]
- __SAVE__ #download-link OR __SAVE__ exportedCurrentPage.html OR __SAVEAS__ currentPageScreenshot.png
- __EMAIL__ ceo@mycompany.com
- __SFTP__ you:pass@0.0.0.1 /uploadedfiles
- __POST__ http://yourwebhook.com {name: "hi"}

How do I purchase?

Email support@ this domain.

I paid but I changed my mind!

Aw shucks, just email support@this domain within 30 days to get a full refund.

Can I get some help?

Ask your questions on Stack Overflow (I reply there religiously) or email support@ this domain.

Where is the free version?

You will be able to download it for Windows or Mac OS X.

email: support AT this domain
© Brilliant Code Inc. 1918 Boul Saint Régis, Dorval, Québec, Canada. Let's free information through collaboration!