Scraping with Scrapebox is hard.
At least that's the first impression that most people get when they try it.
Today I'll prove to you that it's exactly the opposite. This Scrapebox tutorial will show you everything you need to scrape over 56 million links / day with ease on low-end hardware.
Table Of Contents
- 1 Scraping Lesson #1
- 2 Scraping Lesson #2
- 3 VPS
- 4 Keywords
- 5 Footprints
- 6 Settings
- 7 Wrapping It Up
Scraping Lesson #1
Don't use public proxies!
There, I've just told you the single most important publicly known secret that's going to change the way you use this tool.
You need private proxies.
Not only are public proxies slow as hell but they're also EXTREMELY unreliable. They're unreliable to the point that unless you're scraping with fewer than 10 keywords, you probably won't be able to finish because they'll all be dead.
And did I mention they're SLOW in every sense of the word?
Where to get fast proxies that are reliable?
Before I answer this question, here's a screenshot of my latest run with those proxies:
This is running 8 threads with 10 proxies on a low-end $35 VPS, both from SolidSEOVPS.com.
I'm running 3 instances of Scrapebox just like that and got plenty of resources left to run both GSA Platform Identifier and GSA Search Engine Ranker at 200 threads each.
The other option is to get ReverseProxies.
I've also used these with great success although I've ended up using pure private proxies since these reverse ones produce a bit more errors and you need to use more threads to achieve the same speed.
While these speeds are ridiculous and very desirable, it did max out the CPU all the way through. If you want to run other things alongside it, you need to either get a dedi or lower the threads.
This is why these days I run SB at just 8 or so threads. This gives you more than enough resources to run any other tool you want on the same VPS.
Scraping Lesson #2
Are you ready for it?
Scrape bing and not google.
Seriously, it's much faster and much easier.
The problem with scraping google is that you need a huge number of proxies and delays in between queries in order to scrape RELIABLY at reasonable speeds.
Running more threads than 5% of your proxy count (5 threads per 100 proxies) seems to be a no-go in the long run.
(It doesn't count if you scrape 500 urls / sec for 1 minute but then all your proxies are burned out)
With bing you can get ridiculous speeds 24/7 using just a few threads / proxies.
In fact, my testing shows that you can crank your threads up to 90% of the number of your proxies (for example: up to 27 threads with 30 private proxies) and never get your proxies banned.
Now you might think that you don't want to scrape links from bing. What if there are unindexed links in there?
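If you want to turn those two ratios into a quick sanity check before a run, here's a minimal sketch. The 5% and 90% numbers come straight from my testing above; the function and dictionary names are just made up for this example and have nothing to do with Scrapebox itself.

```python
# Thread limits as a percentage of your proxy count.
# The numbers are the rules of thumb from this article, not anything official.
MAX_THREAD_PERCENT = {"google": 5, "bing": 90}


def max_threads(proxy_count: int, engine: str) -> int:
    """Return the highest thread count that stays inside the rule of thumb."""
    if proxy_count <= 0:
        raise ValueError("you need at least one proxy")
    percent = MAX_THREAD_PERCENT[engine.lower()]
    # Integer math rounds down so we never go over the limit,
    # but we always allow at least one thread.
    return max(1, proxy_count * percent // 100)


if __name__ == "__main__":
    print(max_threads(30, "bing"))     # 27
    print(max_threads(100, "google"))  # 5
```

With 30 private proxies this gives 27 threads for bing, which matches the example above.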
Well, there very well might be.
However, it's MUCH faster to scrape the list from bing, process it however you usually do (GSA SER, manually, etc) and then check that processed, working list for indexation in SB. (a list which is now much smaller than the one you started with)
Which VPS Provider To Choose?
Straight up, no bullshit...use SolidSEOVPS.
Seriously, I've used many providers in the past (not going to name any as they're not worth mentioning) but after I found out about SolidSEO, I've yet to have a reason to look somewhere else.
They're insanely reliable, their customer support is spot on and they're cheap. I use them for pretty much all my hosting (and proxy) needs since there's nothing else you could really want from a hosting provider.
They also run special deals on their website so you can snag a dedicated server for as little as $45.
I know this probably doesn't mean much to you, especially since those are affiliate links but I'm 100% honest with you, as soon as I found out about them, server reliability has stopped being an issue for me and I wouldn't want to use anyone else.
Optimizing Your VPS
Scrapebox and other such programs aren’t your everyday software. As such, a typical VPS isn’t exactly optimized for these kinds of workloads.
Simply put, you can follow this guide and everything you need to know about it is mentioned in there. I take NO CREDIT for it, it was put together by a BHW member GoldenGlovez:
It's also been mentioned in that guide and it's common sense really but...
MAKE SURE TO BACKUP YOUR SETTINGS BEFORE DOING ANYTHING MENTIONED IN THE GUIDE.
Now once you have the proxy and VPS situation taken care of, you're able to scrape more targets in a day than 90% of Scrapebox users scrape in a month.
But what do you scrape? How to get the most unique urls?
Keywords
Your choice of keywords really matters when scraping.
If all your keywords are mostly similar (like ones you would get if you used the keyword scraper more than 1 level deep), it won't matter that you have a huge keyword list. It also won't matter what footprints you use. You will end up with A LOT of duplicate urls.
If you want as many unique urls as possible, you need many non-long-tail keywords in many different niches.
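Here's a rough sketch of how you could clean a keyword list along those lines before importing it. It's plain Python, not a Scrapebox feature, and the 2-word cutoff for what counts as "non-long-tail" is my own assumption, so adjust it to taste.

```python
def filter_keywords(keywords, max_words=2):
    """Keep short, distinct keywords; drop long-tail variants and duplicates.

    max_words is an assumed cutoff for "non-long-tail", not a fixed rule.
    """
    seen = set()
    kept = []
    for raw in keywords:
        # Normalise case and whitespace so "Dog  Food" and "dog food" match.
        keyword = " ".join(raw.lower().split())
        if not keyword or len(keyword.split()) > max_words:
            continue
        if keyword in seen:
            continue
        seen.add(keyword)
        kept.append(keyword)
    return kept


if __name__ == "__main__":
    sample = ["Dog Food", "dog  food", "best dog food for puppies", "mortgage", ""]
    print(filter_keywords(sample))  # ['dog food', 'mortgage']
```

The long-tail variant and the duplicate are dropped, which is exactly the kind of keyword that would mostly give you urls you already have.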
Doing The Keyword Research Yourself
You will need to spend some time researching different niches and coming up with different keyword categories in which you can find unique non-long-tail keywords.
Unless you're trying to get target urls in a single niche, in which case you only need to do this for a single niche but much more thoroughly.
Here's a great thorough guide on how to do this kind of keyword research.
Keyword Research Done For You
While doing keyword research yourself is completely fine, it's also pretty tedious.
Especially when you're doing it just to find more keywords to scrape with.
You can always make your life a bit easier by getting a huge keyword list and working with that.
These kinds of lists are designed to be used with Scrapebox and most of the time you don't have to do anything with them other than hitting "Import" and "Start Scraping".
However, you can also use them as a starting point if you want to expand and get even more keywords.
- Niche Keyword List - It's a list I've compiled over time to solve this exact problem I'm writing about here. In order to scrape as many unique links as possible you need as many different keywords in many different niches. This is why this list has over 1.3 million keywords in all niches known to man.
Footprints
As with keywords, you shouldn't be using many footprints which are very similar to each other. Sure you might get 1 or 2 more urls that you otherwise wouldn't get but in that time you could have gotten 10,000 by using a different footprint.
Furthermore, you should preferably use 1 footprint at a time so that you can later compare the scraping results from different footprints and decide which ones to keep using and which ones are not worth the effort.
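One way to do that comparison is to count unique urls and unique domains per footprint once the runs are done. The sketch below assumes you've already loaded each footprint's harvested urls into a Python list (how you export them is up to you); the function name and the dict layout are made up for this example.

```python
from urllib.parse import urlsplit


def footprint_stats(results):
    """Summarise harvested urls per footprint.

    results: dict mapping a footprint string to the list of urls it produced.
    """
    stats = {}
    for footprint, urls in results.items():
        # Drop blank lines and exact duplicate urls.
        unique_urls = {u.strip() for u in urls if u.strip()}
        # Count distinct hosts, ignoring anything that did not parse to one.
        domains = {urlsplit(u).netloc.lower() for u in unique_urls}
        domains.discard("")
        stats[footprint] = {
            "total": len(urls),
            "unique_urls": len(unique_urls),
            "unique_domains": len(domains),
        }
    return stats


if __name__ == "__main__":
    runs = {
        "footprint A": ["http://a.com/1", "http://a.com/1", "http://b.com/x"],
        "footprint B": ["http://a.com/2"],
    }
    # Print the footprints with the most unique domains first.
    ranked = sorted(footprint_stats(runs).items(),
                    key=lambda item: item[1]["unique_domains"], reverse=True)
    for footprint, numbers in ranked:
        print(footprint, numbers)
```

A footprint that returns lots of urls but few unique domains is a good candidate to drop.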
Both of the keyword lists mentioned above come with footprints but a simple google search will give you more than enough footprints. Also, if you have GSA SER, it has all the footprints you will ever need.
Truth be told, everybody is using the same footprints and finding new ones is not an easy thing to do. You would really have to dig deep and get your hands dirty for questionable results.
The only thing you need to worry about with footprints is that bing and google have different search operators.
| Google operator | Google example | Bing operator | Bing example |
|---|---|---|---|
| - | Apples -Oranges | NOT | Apples NOT Oranges |
There are more differences but most of the other operators used in footprints are the same in both google and bing.
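If you have a pile of google footprints and want to reuse them on bing, the exclusion operator is easy to convert in bulk. This is a minimal sketch that only handles the "-" to "NOT" difference shown above; the function name is made up, and any other operator differences are left untouched.

```python
import re

# A "-" at the start of a token (start of the string or right after whitespace)
# followed by a non-space character is treated as google's exclusion operator.
# Hyphens inside words, like "long-tail", are left alone.
_EXCLUDE = re.compile(r"(?<!\S)-(?=\S)")


def google_to_bing(footprint: str) -> str:
    """Rewrite google-style "-term" exclusions as bing-style "NOT term"."""
    return _EXCLUDE.sub("NOT ", footprint)


if __name__ == "__main__":
    print(google_to_bing("Apples -Oranges"))  # Apples NOT Oranges
```

One caveat: a token that starts with "-" inside a quoted phrase would also get converted, so give the converted list a quick look before you import it.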
Harvester Engine Settings
Here's what my engine settings look like. Note that I haven't changed anything that isn't mentioned below the screenshots:
Things to change:
- Clear Cookies - It seems that bing gives you fewer results per keyword after you've been scraping for some time, no matter your proxies. Enabling this option seems to negate this. It does slow down your scraping speed ever so slightly.
Been doing some testing regarding the optimal number of threads and the results are kinda surprising.
First of all, don't ever use more threads than the number of proxies you have. In fact, you shouldn't go over 90%.
No more than 9 threads per 10 proxies.
Even if you scrape bing, using too many threads will burn out your proxies.
And the most surprising thing I found is...increasing your thread count won't improve your scraping speed as much as you think. Thread count doesn't scale all that well. What do I mean by that?
Let's say you have 30 proxies and you do 3 runs.
- 10 Threads - You average around 550 urls / s
- 15 Threads - You average around 600 urls / s
- 30 Threads - You average around 700 urls / s
Woah what happened?
You'd think that doubling the number of threads would double your performance. It seems like this isn't the case.
Don't believe me? Get some private proxies and try it out yourself.
Long story short, it's better if you use fewer threads. You won't lose out on much performance, you don't run the risk of burning out your proxies and if you really need more speed, you can just fire up another instance of SB.
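To put numbers on how badly threads scale, here's the arithmetic for the three runs above. The speeds are the averages from my 30-proxy example; the daily totals assume the run keeps that average for a full 24 hours, which is a simplification.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# threads -> average urls per second, taken from the three runs above
RUNS = {10: 550, 15: 600, 30: 700}


def per_thread(threads: int, urls_per_sec: float) -> float:
    """Average urls per second that each thread contributes."""
    return urls_per_sec / threads


def per_day(urls_per_sec: float) -> int:
    """Urls per day if the average speed holds for 24 hours (an assumption)."""
    return int(urls_per_sec * SECONDS_PER_DAY)


if __name__ == "__main__":
    for threads, speed in RUNS.items():
        print(f"{threads:>2} threads: {per_thread(threads, speed):5.1f} urls/s per thread, "
              f"{per_day(speed):,} urls/day")
```

Each thread goes from 55 urls / s at 10 threads to 40 at 15 and about 23 at 30. Tripling the threads only gets you about 1.27x the speed (roughly 47.5 million vs 60.5 million urls / day), and 56 million links / day works out to around 648 urls / s.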
Truth be told, I leave everything on default. I've tried fiddling with all the timeout settings but I've yet to get any notable improvements by doing so.
This is probably because private proxies are stable & fast, meaning they almost never time out.
Wrapping It Up
Scraping with Scrapebox is really easy and care-free once you know how to do it.
In the next article we're taking a look at how to automate all this with the Automator plugin. You should be able to spend 10 minutes setting up a job, walk away for 1 month and everything will still be working once you come back.