6 Tips to Use the Web Scraping Tool Octoparse

We recently received feedback from our users, and some of them have trouble moving forward with Octoparse because of issues that come up occasionally. In this post I share my experience with Octoparse, in the hope that these tips will help you move forward and deal with more difficult and complex websites.


  1. Manually Check the Rule in the Workflow Designer

Octoparse doesn’t signal an error that you can trace while you configure a rule, so you often have no idea what went wrong when data is missing, an item fails to be clicked, or a page fails to open. To avoid such errors, or to find out whether a configured rule works, manually check the rule in the Workflow Designer before running the task. By doing this, you can see in the built-in browser and the data fields which steps don’t work, and once you find something wrong you can modify the rule accordingly. Check the tutorial below to learn how to do that.

Check The Extraction Rule When Errors Occur


  2. Set Proper Timeout and Scroll Times

Sometimes, even though you configured the right rule and got the data when manually checking it in the Workflow Designer, data records are missing once you start the extraction. The easiest fix is to set a longer AJAX timeout under the actions “Go to page”, “Click item” and “Click to paginate”. You can also set a waiting time before execution under the different actions in the Workflow Designer, to make sure the data you want is loaded.
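The idea behind a timeout and a waiting time can be sketched outside Octoparse in a few lines of Python. This is only an illustration, not how Octoparse is implemented: `wait_until`, `data_is_present` and the numbers used are hypothetical names and values chosen for the example. The point is that a longer timeout gives slow content more chances to appear before the step gives up.

```python
import time


def wait_until(is_loaded, timeout=10.0, interval=0.5):
    """Poll is_loaded() until it returns True or `timeout` seconds pass.

    Returns True if the content appeared in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_loaded():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulated page whose data only shows up on the third check.
checks = {"count": 0}


def data_is_present():
    checks["count"] += 1
    return checks["count"] >= 3


# A generous timeout lets the slow content arrive.
loaded = wait_until(data_is_present, timeout=2.0, interval=0.01)

# Content that never arrives makes the wait give up once the timeout expires.
too_short = wait_until(lambda: False, timeout=0.05, interval=0.01)
```

In this sketch `loaded` ends up true because the data arrives on the third poll, while `too_short` ends up false, which is the situation that shows up as missing records.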


Some content is not displayed until you scroll down, so you may miss data if you forget to set the scroll times. Choose how to scroll down and set a proper number of scroll times; this also affects the results you get.

Before applying the settings above, remember that every step should run only after the page is fully loaded; otherwise the rule will still not work, even after you change it.


Besides, we don’t recommend choosing “Open the link in new tab” and “Load the page with AJAX” at the same time unless Octoparse still fails to open some websites, such as LinkedIn.



  3. Manually Modify the XPath

Correct use of XPath is the key to extracting data in Octoparse. Problems with pagination, missing data and irregular value fields can be fixed most of the time by changing the XPath. So I strongly suggest you learn some XPath; even a little knowledge can help you solve a lot of problems in Octoparse. The tutorials and FAQs below can help you pick up XPath quickly.

How to use Firebug and Firepath?

Getting started with XPath 1

Getting Started With XPath 2

Modify XPath Manually in Octoparse
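To get a feel for XPath before editing it in Octoparse, you can try expressions against a small document in Python. The sketch below is an illustration only: the HTML snippet, class names and field names are made up, and Python’s `xml.etree.ElementTree` supports just a subset of XPath, so some expressions that work in a browser will not work here. It shows a loop expression that selects each item, plus relative expressions that pick the fields inside one item.

```python
import xml.etree.ElementTree as ET

# Hypothetical list page: two items and one ad block to skip.
PAGE = """
<html><body>
  <div class="item"><a href="/p/1">First</a><span class="price">10</span></div>
  <div class="ad"><a href="/ads/9">Sponsored</a></div>
  <div class="item"><a href="/p/2">Second</a></div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Loop XPath: every item block, which leaves out the ad.
items = root.findall(".//div[@class='item']")

rows = []
for item in items:
    # Relative XPath, evaluated inside the current item only, so a missing
    # field becomes None instead of picking up a value from another item.
    link = item.find("./a")
    price = item.find("./span[@class='price']")
    rows.append({
        "title": link.text,
        "url": link.get("href"),
        "price": price.text if price is not None else None,
    })
```

The second item has no price element, so its `price` is `None` rather than a value borrowed from a neighbour. That is the kind of missing or irregular field that a relative XPath helps with.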


  4. Split the Task

You may find that you can’t get all the data records you want even though you are sure the rule is configured correctly. Issues come up occasionally, especially in the “Click item” or “Click to paginate” steps, because of the amount of data or the complexity of the website itself. Moreover, if you don’t use cloud extraction with a paid version, you have to restart whenever the Internet connection drops or the computer goes to sleep. That takes quite a long time, and extracting the same data records again and again is tedious. My personal approach is to split the task into two projects. For example, if I want to extract the detail pages of the items in a list, I don’t use “Click item” directly.



Instead, I first extract the URLs of the items that “Click item” would open, and then extract the data in a second task that loops through this list of URLs.


By doing this, less data is missing and extraction is faster. With fewer steps, it is also easier to find what the problem is. Besides, if the extraction stops accidentally, you can export the extracted data, check where it stopped, and restart from there instead of starting from zero. The tutorial below shows how to use the URL list.

URLs - Advanced Mode
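The “restart from where it stopped” step can be done by hand, but a small script makes it less error-prone. The sketch below assumes that the first task produced a list of item URLs and that the export of the interrupted second task has a column holding the page URL. The function name `remaining_urls`, the `url` field name and the example URLs are all hypothetical.

```python
def remaining_urls(all_urls, exported_rows, url_field="url"):
    """Return the URLs, in their original order, that have no exported row yet."""
    done = {row[url_field] for row in exported_rows}
    return [url for url in all_urls if url not in done]


# Output of the first task: item URLs collected from the list pages.
all_urls = [
    "https://example.com/item/1",
    "https://example.com/item/2",
    "https://example.com/item/3",
    "https://example.com/item/4",
]

# Rows exported from the second task before it was interrupted.
exported = [
    {"url": "https://example.com/item/1", "title": "A"},
    {"url": "https://example.com/item/2", "title": "B"},
]

# Only these URLs need to go into the URL list for the next run.
todo = remaining_urls(all_urls, exported)
```

Here `todo` holds only the third and fourth URLs, so the next run picks up where the previous one stopped.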


  5. Set Cache Settings

Sometimes the built-in browser doesn’t open the URL you entered under the “Go to page” action. This may be because you have opened other websites many times and they have been cached. Just choose to clear the cache before opening the web page, and the website you want should open.

Another use of the cache settings is extracting data from websites that require login. After logging in, you can choose “Use specified Cookie” to record your account information, so that you don’t need to go through the login steps again and again. This also protects your personal information.


  6. Use the RegEx Tool

Sometimes it takes a while to pick out the information you want because it is surrounded by a lot of noise. Or the information sits in the attributes of the HTML, where you can’t extract it directly. To extract exactly the information you want, you can use the RegEx Tool. The tutorial below shows how to use the RegEx Tool in Octoparse.

Scrape Emails from Facebook Pages
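If you want to test a pattern before typing it into the RegEx Tool, Python’s `re` module is a convenient scratchpad. The sketch below covers the two cases mentioned above: pulling e-mail addresses out of noisy text, and reading a value from an HTML attribute. The sample strings are made up, and the e-mail pattern is a deliberately simple approximation, not a full address validator. Regex flavours also differ slightly between tools, so re-check a pattern inside Octoparse.

```python
import re

# Simple approximation of an e-mail address; good enough for extraction,
# not meant to validate every legal address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Captures whatever sits between the quotes of an href attribute.
HREF_RE = re.compile(r'href="([^"]*)"')

noisy_text = "Contact us: sales@example.com (Mon-Fri) or support@example.org, thanks!"
html = '<a class="btn" href="/item/42?ref=list">Details</a>'

# All addresses in the text, in order of appearance.
emails = EMAIL_RE.findall(noisy_text)

# The attribute value, or None when the tag has no href.
href_match = HREF_RE.search(html)
href = href_match.group(1) if href_match else None
```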


The tips above should help you move forward with Octoparse. We are also working hard to improve performance and provide more efficient, intelligent solutions.


Author: The Octoparse Team


- See more at: Octoparse Tutorial
