Crawl and share data securely with anyone, anywhere.

Decentralized, self-hosted P2P network with end-to-end encryption for collaborative web scraping. Scale with Web Scraping Language (WSL) on your own or 3rd-party network.

On the one hand information wants to be expensive, because it's so valuable. The right information in the right place just changes your life. On the other hand, information wants to be free, because the cost of getting it out is getting lower and lower all the time. So you have these two fighting against each other. - Stewart Brand

But I can't code?!

Web Scraping Language (WSL) has a low learning curve and no coding experience is needed. The examples shown below are all that is necessary to begin crawling; most are no longer than this sentence. Email me for docs, though there isn't much they cover that the examples below don't.

So, why Scrape.it?

So far I haven't seen any other decentralized, peer-to-peer approach to web scraping. A P2P architecture makes sense because all the data "mined" from the web lives on the mesh network of desktop computers running the Scrape.it software. This makes the data resilient: when somebody shares their "mined" data with other peers, it is replicated to everyone, much like BitTorrent. Strong end-to-end encryption secures the exchange of data between peers, which means users are free to collude to "mine", "scrape", "crawl", or "extract" any data viewable in a web browser.

How does this work?

You scrape websites using WSL in the Scrape.it software. You scale the crawl rate by adding and removing instances of Scrape.it. It is possible to run multiple instances on your own machine and across heterogeneous hardware. By sharing your unique scrape:// URL, which is only reachable by other Scrape.it instances, you have direct control over who can read and write your scraped data. This makes real-time collaboration possible between Scrape.it peer nodes.

Read more in the FAQ...

How do I get started?

Keep reading this page to find the download link for Windows or Mac. Try swapping your own selectors and URLs into the examples below and experiment!

Web Scraping Language (WSL)

WSL is a declarative domain-specific language for the web. It automates any web browser action: following a set of links (aka crawling), extracting data from each page that loads, filling out forms, and so on. Each action runs in order, separated by the pipe operator |

Syntax

- ACTION1 | ACTION2 {PAGINATOR}SELECTOR | ACTION3 ...
- You reference element(s) on the page with a CSS or XPath SELECTOR
- Extract data as JSON: { product: .title, usd: .price, column3...etc }

Web Crawling Scenarios

Crawl URL Params

Format: GOTO URL[range] | EXTRACT {json}
ex) GOTO github.com/search?p=[1-3]&q=[cms, chess, minecraft] | EXTRACT {title: h3}

3 pages x 3 keywords = 9 URL permutations will be crawled, and data will be extracted from each.
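
Under the hood this is just a Cartesian product of the bracketed values. A rough Python sketch of the expansion (an illustration of the idea, not Scrape.it's actual code):

from itertools import product

pages = range(1, 4)                       # [1-3]
keywords = ["cms", "chess", "minecraft"]  # [cms, chess, minecraft]

# Every (page, keyword) combination becomes one URL to crawl.
urls = [f"https://github.com/search?p={p}&q={q}"
        for p, q in product(pages, keywords)]
print(len(urls))  # 9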

Crawl Regex Links

Format: CRAWL /REGEX/ or CRAWL /REGEX/ IN SELECTOR
ex) CRAWL /github.com/search?p=[0-9](.*?)/ IN .px-2 | EXTRACT {title: h3}

Crawls links matching the regex pattern under a parent element with class name "px-2".
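
Conceptually this boils down to collecting the links under the parent element and keeping the ones whose href matches the regex. A rough Python sketch using requests and lxml (with the cssselect package), purely as an illustration:

import re
import requests
from lxml import html

pattern = re.compile(r"github\.com/search\?p=[0-9](.*?)")
doc = html.fromstring(requests.get("https://github.com").content)

# Keep only links under a ".px-2" parent whose href matches the pattern;
# each surviving link would then be crawled and EXTRACT applied to it.
links = [a.get("href")
         for a in doc.cssselect(".px-2 a")
         if a.get("href") and pattern.search(a.get("href"))]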

Crawl & Extract

Format: GOTO URL | CRAWL SELECTOR | EXTRACT {json}
ex) GOTO en.wikipedia.org/wiki/List_of_Dexter_episodes | CRAWL .summary a | EXTRACT {title: h1, code: //tr[7]/td}

Follows each Dexter episode link and extracts the title and production code.
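
The pattern is: load the start page, follow every link matched by the CRAWL selector, and apply the EXTRACT mapping to each page that loads. A rough Python equivalent of the example above (illustrative only, not Scrape.it internals):

import requests
from urllib.parse import urljoin
from lxml import html

START = "https://en.wikipedia.org/wiki/List_of_Dexter_episodes"
doc = html.fromstring(requests.get(START).content)

episodes = []
for a in doc.cssselect(".summary a"):                 # CRAWL .summary a
    url = urljoin(START, a.get("href"))
    page = html.fromstring(requests.get(url).content)
    episodes.append({
        "title": page.xpath("string(//h1)"),          # title: h1
        "code": page.xpath("string(//tr[7]/td)"),     # code: //tr[7]/td
    })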

Paginated Crawl

Format: CRAWL {pageStart, strategy, pageEnd} SELECTOR
Strategies:
1. Clicking a 'next page' element, which runs the crawl again on each subsequent page (sketched after these examples).
2. Mouse-wheel scroll to load next page.
3. Clicking numbered elements that load the next page.

1. ex) GOTO news.ycombinator.com | CRAWL {.morelink} .hnuser
Continues crawling past the first page by clicking the "next page" link matched by .morelink until that element can no longer be found.
2. ex) GOTO news.ycombinator.com | CRAWL {autoscroll,2} .hnuser
Continues crawling past the first page by scrolling down one page length, 2 times.
3. ex) GOTO news.ycombinator.com | CRAWL {3,.morelink,24} .hnuser
Navigates to the 3rd page and continues crawling until the 24th page.
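
Strategy 1 is essentially a loop: gather the target elements on the current page, click the "next page" element, and repeat until that element disappears. A rough Python/Selenium sketch of that loop for the first example (an illustration of the idea, not the Scrape.it engine):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://news.ycombinator.com")

profile_links = []
while True:
    # CRAWL ... .hnuser : gather the links to follow on the current page
    profile_links += [el.get_attribute("href")
                      for el in driver.find_elements(By.CSS_SELECTOR, ".hnuser")]
    # {.morelink} : click the "next page" element until it can no longer be found
    more = driver.find_elements(By.CSS_SELECTOR, ".morelink")
    if not more:
        break
    more[0].click()

driver.quit()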

Extract Rows

Format: GOTO URL | EXTRACT {json} IN SELECTOR
ex) GOTO en.wikipedia.org/wiki/List_of_Dexter_episodes | EXTRACT { title: h1, aired: //table[2]//tr[2]/td[5] } IN .wikiepisodetable

Extracts every Dexter episode's title and air date under the parent element with class "wikiepisodetable".
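
The IN clause scopes the extraction: selectors are evaluated relative to each element the parent SELECTOR matches, so you get one record per row instead of one per page. A rough Python sketch of that scoping (the column positions here are assumptions for illustration only):

import requests
from lxml import html

URL = "https://en.wikipedia.org/wiki/List_of_Dexter_episodes"
doc = html.fromstring(requests.get(URL).content)

records = []
for table in doc.cssselect(".wikiepisodetable"):      # IN .wikiepisodetable
    for row in table.xpath(".//tr[td]"):              # one record per data row
        cells = row.xpath("./td | ./th")
        if len(cells) >= 5:
            records.append({
                "title": cells[1].text_content().strip(),  # assumed column position
                "aired": cells[4].text_content().strip(),  # assumed column position
            })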

Paginated Extract

Format: GOTO URL | EXTRACT {PAGINATOR} {json}
ex) GOTO news.ycombinator.com | EXTRACT {.morelink, 2} {news: .storylink}

Continues extracting every news headline on each page, up to the 2nd page.

Extract & Crawl

Format: GOTO URL | CRAWL {PAGINATOR} SELECTOR AND EXTRACT {json} | EXTRACT {json}
ex) GOTO news.ycombinator.com | CRAWL {.morelink} .hnuser AND EXTRACT {title: .storylink, submitter: .hnuser} | EXTRACT {karma: //*[@id="hnmain"]//td[2]}

Exhaustively extracts every news headline and the corresponding karma points of its submitter until .morelink can no longer be found.
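
In other words, fields extracted on the listing page are joined with fields extracted on each page crawled from it. A rough Python sketch of that join for a single page, without the {.morelink} pagination (the selectors come from the example above; the zip-based pairing is an assumption for illustration):

import requests
from urllib.parse import urljoin
from lxml import html

BASE = "https://news.ycombinator.com/"
front = html.fromstring(requests.get(BASE).content)

records = []
# Pair each headline with its submitter on the front page (the EXTRACT part),
# then follow the submitter's profile and pull the karma (the CRAWL part).
# Zipping assumes a one-to-one pairing; a real crawler would pair by row.
for title_el, user_el in zip(front.cssselect(".storylink"),
                             front.cssselect(".hnuser")):
    profile = html.fromstring(
        requests.get(urljoin(BASE, user_el.get("href"))).content)
    records.append({
        "title": title_el.text_content(),
        "submitter": user_el.text_content(),
        "karma": profile.xpath('string(//*[@id="hnmain"]//td[2])'),
    })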

Nested Crawls

Format: GOTO URL | CRAWL SELECTOR | CRAWL SELECTOR | EXTRACT {json}
ex) GOTO github.com/marketplace | CRAWL nav/ul/li/a | CRAWL .h4 | EXTRACT {app: h1, langue: .py-3 .d-block}

Follows the category links and all the apps on the first page of results, then extracts each app's name and supported languages.

Typing Text

Format: GOTO URL | TYPE SELECTOR [list] | PRESS_RETURNKEY | CRAWL SELECTOR | EXTRACT {json}
ex) GOTO github.com/search?q= | TYPE input[@name="q"] ["time", "security", "social"] | PRESS_RETURNKEY | CRAWL .v-align-middle | EXTRACT {description: .text-gray-dark.mr-2}

For each keyword search, we crawl the 1st page of search results and extract each result's description.
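
In browser terms, TYPE fills the search box, PRESS_RETURNKEY submits it, and the rest of the pipeline runs on each results page. A rough Selenium sketch of the same flow, one keyword at a time (illustrative only; GitHub's markup may differ from the classes in the example):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.implicitly_wait(5)

results = []
for keyword in ["time", "security", "social"]:
    driver.get("https://github.com/search?q=")
    box = driver.find_element(By.XPATH, '//input[@name="q"]')  # TYPE input[@name="q"]
    box.send_keys(keyword)
    box.send_keys(Keys.RETURN)                                 # PRESS_RETURNKEY
    # Read each result's description on the first results page.
    for desc in driver.find_elements(By.CSS_SELECTOR, ".text-gray-dark.mr-2"):
        results.append({"keyword": keyword, "description": desc.text})

driver.quit()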

Clicking & Forms

Format: CLICK SELECTOR[n-index] or CLICK[n-index] SELECTOR
Strategies:
1. Click on an element, or the element at the n-th index.
2. Fill out forms by crawling every possible permutation of the given elements (sketched after these examples).
3. Click a link and execute a macro action, such as downloading a file.

1. ex) GOTO news.ycombinator.com/login | CLICK input | CLICK input[last()] | CLICK input[3] | CLICK[3] input

Clicks the first element, then the last element, and finally shows two equivalent ways of selecting the same 3rd element.

2. ex) GOTO redux-form.com/6.6.3/examples/simple/ | TYPE input[@name="email"] [user1@x.com, user2@x.com] | CRAWL select/options

For each email address entered, we try every option in the select menu.

3. ex) GOTO https://www.putty.org | CLICK //tr[1]/td[2]/p[2]/a | START_DOWNLOAD //div[1]/div[2]/span[2]

Clicks on a link that navigates to a different domain, then saves the file with the macro command START_DOWNLOAD.
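
Strategy 2 (the redux-form example above) enumerates form permutations: every email is typed in and, for each one, every option in the select is tried. A rough Selenium sketch of that idea, assuming the form fields are reachable on the page (a submit or extract step would normally follow each combination):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get("https://redux-form.com/6.6.3/examples/simple/")

email_box = driver.find_element(By.XPATH, '//input[@name="email"]')
dropdown = Select(driver.find_element(By.TAG_NAME, "select"))

# Try every (email, option) permutation, as in TYPE ... | CRAWL select/options.
for email in ["user1@x.com", "user2@x.com"]:
    email_box.clear()
    email_box.send_keys(email)
    for i in range(len(dropdown.options)):
        dropdown.select_by_index(i)
        # ...a submit or extract step would run here for each combination...

driver.quit()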

FAQ

How do I contact you?

Email support@this domain.

Difference between free & paid?

In paid, Scrape.it Terminal works on both desktop & web servers while scrape:// URL is private by default.
In free, you can only deploy on desktop computers while scrape:// URL is public and discoverable by anyone.

What is this scrape:// url?

It is a shareable URL accessible only through Scrape.it, which allows you to share reading and writing privileges with other peers. The peers can be your own computers running Scrape.it or your team overseas.

Where is the scraped data hosted?

It lives across all the computers running Scrape.it Terminal that know about your scrape:// URL, which points to the scraped data. You can subscribe to a scrape:// URL to get the latest data written by other peers.

Where are the web crawlers run?

They run on your computer(s), or on the computers of other peers with whom you shared the scrape:// URL.

Does this come with warranties or guarantees?

No. We don't provide any storage or computing for your crawlers and scraped data, and we don't keep backups. We can't interfere with what data you choose to scrape because of the end-to-end encryption and the self-hosted, decentralized architecture. Please use responsibly.

How do I purchase?

Email support@this domain. You are going to need PayPal.

I paid but I changed my mind!

Aw shucks, just email support@this domain within 30 days to get a full refund! I'm getting in on this fad.

Can I get some help?

Ask your questions on Stack Overflow (I reply frequently) or email support@this domain.

Where is the free version?

It's coming next Monday or Tuesday, April 8th or 9th, 2019. You will be able to download it for Windows or Mac OS X after that date.

Lovingly crafted by john@this domain since 2009.
© Brilliant Code Inc. 1918 Boul Saint Régis, Dorval, Québec, Canada. Let's free information through collaboration! Rise up!