(Article from Octoparse Data Export API)

Content 

Octoparse Data Export API

1. Overview

2. Get an Access Token 

3. How to Get Data through API 

3.1. Get all data of a task using paging

3.2. Get Unexported Data from a Task

4. Two Ways to Get a Task ID

4.1. Get a Task ID via API

4.1.1 Get a Task Group ID

4.1.2 Get a Task ID from the task group

4.2. Get a task ID via Octoparse client

5. Sample Code

 

You must obtain an access token to use the Octoparse API. The access token is passed with each API request and is used to authenticate your access to the Octoparse API. It provides secure access to the Octoparse API.

 

1. Overview 

You can export data extracted with Octoparse through the Data Export API by following the procedure below. Before using the Octoparse API, you need an Octoparse advanced account (Standard or Professional) and some data already extracted by at least one task running in the cloud.

 

The basic flow to use the Octoparse API:
1. Get an access token by providing your username and password.
2. Use the access token and a task ID to get the data from a specific extraction task in Octoparse.

 

 

 

 

2. Get an Access Token

 

You can obtain an access token by making an HTTP POST request with your username and password. 

 

HTTP Method: POST

http://dataapi.octoparse.com/token

POST Content Type: application/x-www-form-urlencoded

POST Example:

username={username}&password={password}&grant_type=password

The values of username and password should be URL-Encoded.
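
The token request described above can be sketched in Python with only the standard library. This is a minimal sketch, not official client code: the function name `build_token_request` is made up for illustration, and the request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

TOKEN_URL = "http://dataapi.octoparse.com/token"  # endpoint given in the article


def build_token_request(username: str, password: str) -> Request:
    """Build (but do not send) the POST request for an access token.

    Hypothetical helper for illustration only.
    """
    # urlencode() URL-encodes each value, which the article requires
    # for the username and password.
    body = urlencode({
        "username": username,
        "password": password,
        "grant_type": "password",
    }).encode("ascii")
    return Request(
        TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

To actually send it, pass the result to `urllib.request.urlopen` and decode the JSON body of the response.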

 

The successful HTTP response for the token request contains the access token that you can use to access the Octoparse API. The response is JSON-encoded and below is an example response. 

 

{

"access_token": "ABCD1234",

"token_type": "bearer",

"expires_in": 86399,

"refresh_token": "refresh_token"

}

 

The response includes the output properties as follows.

 

Property | Description
access_token | The access token that you can use to authenticate your access to the Octoparse API.
token_type | The format of the access token. Currently, Octoparse uses a bearer token.
expires_in | The number of seconds for which the access token is valid. The current default value is 86400 (24 hours).
refresh_token | A token that can be sent to the Octoparse API instead of an authorization code. (When the access token expires, send a POST request to the Octoparse API using this token instead of an authorization code. A new access token will be returned. A new refresh token might be returned too.)

 

The access token identifies you when you make an Octoparse API call. It must be added to the HTTP request header to get data from tasks via the API.

 

Name: Authorization

Value: bearer {access token}
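
As a small illustration, the header can be derived directly from the JSON body of a successful token response. The helper name `authorization_header` is an assumption made for this sketch; the field names come from the example response above.

```python
import json


def authorization_header(token_response_text: str) -> dict:
    """Turn a successful token response (JSON text) into the HTTP header.

    Hypothetical helper; it assumes the response has the shape of the
    example in the article (access_token and token_type fields).
    """
    token = json.loads(token_response_text)
    # The article's header value is "bearer {access token}".
    return {"Authorization": f"{token['token_type']} {token['access_token']}"}
```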

 

If the request for an access token fails, the response contains a JSON-formatted error. The error cases are explained below.

 

Case 1. The content of the POST is not formatted correctly.

 

{

"error": "unsupported_grant_type"

}

 

Make sure the POST content is formatted like this:

 

username={username}&password={password}&grant_type=password

 

Case 2. The user name or password in the POST is incorrect.

 

{

"error": "invalid_grant",

"error_description": "The user name or password is incorrect."

}
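
One way to handle both outcomes of the token request is sketched below. The names `TokenError` and `parse_token_response` are hypothetical, and the sketch assumes (based only on the examples above) that a failed response carries an "error" field while a successful one does not.

```python
import json


class TokenError(Exception):
    """Raised when the token endpoint reports an error (hypothetical type)."""

    def __init__(self, code: str, description: str = ""):
        super().__init__(f"{code}: {description}" if description else code)
        self.code = code
        self.description = description


def parse_token_response(text: str) -> str:
    """Return the access token, or raise TokenError for an error response."""
    payload = json.loads(text)
    if "error" in payload:
        # Case 1 has no error_description; Case 2 includes one.
        raise TokenError(payload["error"], payload.get("error_description", ""))
    return payload["access_token"]
```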

3. How to Get Data through API 

 

3.1. Get all data of a task using paging

 

Octoparse supports paging, so an HTTP GET request can retrieve one page of a task's data records at a time. This API requires the parameters taskid, pageindex, and pagesize, and the access token must be added to the HTTP header.

 

HTTP Method: GET

http://dataapi.octoparse.com/api/alldata?taskid={taskid}&pageindex={pageIndex}&pagesize={pageSize}

HTTP Header parameter 1:

Name: Authorization

Value: bearer {access token}

HTTP Header parameter 2:

Name: Accept 

Value: application/json 

HTTP URL sample: http://dataapi.octoparse.com/api/alldata?taskid=taskid&pageindex=1&pagesize=2
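
The request above could be assembled in Python roughly as follows. This is a sketch using only the standard library; the function name `build_alldata_request` is an assumption, and nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

ALLDATA_URL = "http://dataapi.octoparse.com/api/alldata"  # from the article


def build_alldata_request(access_token: str, task_id: str,
                          page_index: int, page_size: int) -> Request:
    """Build (but do not send) the paged GET request (hypothetical helper)."""
    query = urlencode({
        "taskid": task_id,
        "pageindex": page_index,
        "pagesize": page_size,
    })
    return Request(
        f"{ALLDATA_URL}?{query}",
        headers={
            "Authorization": f"bearer {access_token}",
            "Accept": "application/json",
        },
        method="GET",
    )
```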

 

Octoparse pages through the data of the current task based on the given pagesize (the maximum allowed page size is 1,000) and returns the page at the given pageindex; the number of data records returned depends on the page size you set.

For example, let’s say there are 1,000 data records in a task. If the pageindex is 1 and pagesize is 2 (pagesize=2, pageindex=1), the data will be divided into 500 pages with 2 data records per page and Octoparse API would return the first page of 2 data records.
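
The arithmetic in this example can be written out as a short sketch. The function names are made up for illustration, and the code assumes pageindex starts at 1, as the example suggests.

```python
import math

MAX_PAGE_SIZE = 1000  # maximum allowed page size stated in the article


def page_count(total_records: int, page_size: int) -> int:
    """Number of pages needed to cover total_records."""
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError("pagesize must be between 1 and 1000")
    return math.ceil(total_records / page_size)


def page_bounds(page_index: int, page_size: int, total_records: int) -> tuple:
    """1-based positions (first, last) of the records on a given page."""
    first = (page_index - 1) * page_size + 1
    if page_index < 1 or first > total_records:
        raise ValueError("pageindex is out of range")
    last = min(page_index * page_size, total_records)
    return first, last
```

With 1,000 records and a page size of 2, `page_count` gives 500 pages and `page_bounds(1, 2, 1000)` gives records 1 to 2.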

A successful HTTP request with a correct access token and task ID returns JSON-formatted data. Below is an example response.

 

{

"data": { 

"total": 1000, 

"currentTotal": 2, 

"dataList": [ 

            {

"State": "Texas", 

"City": "Plano",

"Date": "2013-1-1",

"Humidity": "34%",

"High Temperatures": "72.8F",

"Wind": "NW 8km/h",

"Low Temperatures": "24.8F"

            },

            {

"State": "Texas",

"City": "Plano",

"Date": "2013-1-2",

"Humidity": "32%",

"High Temperatures": "76F",

"Wind": "NNW 10km/h",

"Low Temperatures": "25F" 

            }

        ]

    }, 

"error": "success"

}

 

The data returned includes fields as follows.

 

Data field | Description
total | The number of total data records of the current task
currentTotal | The number of data records requested
dataList | The list of data fields
error | Prompt information
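
Putting the fields together, a loop that walks every page might look like the sketch below. The `fetch_page` callable is hypothetical: it stands for whatever code performs the GET request for one page and returns the decoded JSON response as a dict.

```python
from typing import Callable, Iterator


def iter_all_records(fetch_page: Callable[[int, int], dict],
                     page_size: int = 1000) -> Iterator[dict]:
    """Yield every record of a task, one page at a time.

    fetch_page(page_index, page_size) is an assumed callable that
    returns a response shaped like the example in the article.
    """
    page_index = 1  # assumed to be 1-based, as in the article's example
    seen = 0
    while True:
        response = fetch_page(page_index, page_size)
        if response.get("error") != "success":
            raise RuntimeError(f"API error: {response.get('error')}")
        data = response["data"]
        rows = data["dataList"]
        yield from rows
        seen += len(rows)
        # Stop on an empty page or once 'total' records have been read.
        if not rows or seen >= data["total"]:
            return
        page_index += 1
```

Passing the fetcher in as a parameter keeps the paging logic separate from the HTTP code, so it can be exercised with a stub.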

 

 

3.2. Get Unexported Data from a Task

 

You can get all unexported data from a task in batches using HTTP GET requests. Each request requires the task ID and the number of data records to return per batch (size), and the access token must be added to the HTTP header. The Octoparse API returns the data that was collected first.

 

HTTP Method: GET

http://dataapi.octoparse.com/api/notexportdata?taskid={testtaskid}&...

HTTP Header parameter 1:

Name: Authorization

Value: bearer {access token}

HTTP Header parameter 2:

Name: Accept

Value: application/json

           application/xml

HTTP URL sample: http://dataapi.octoparse.com/api/notexportdata?taskid=testtaskid&am...

 

The interface returns unexported data (the amount depends on the parameter 'size') and then marks that data as exported, so that it is skipped the next time you make a request.

 

For example, let’s say there are 1,000 data records in a task. If the size is 2 (the number of data records returned per batch) for the first request, Octoparse API will return 2 data records that were first collected. Similarly, next time it will return another 2 data records that were first collected from the remaining 998 records.
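
Because each call hands back the next unexported batch, draining a task is a simple loop. In the sketch below, `fetch_batch` is a hypothetical callable that performs one GET request for the given size and returns the decoded JSON response; the request URL itself is deliberately left out of the sketch.

```python
from typing import Callable, List


def drain_unexported(fetch_batch: Callable[[int], dict],
                     size: int = 1000) -> List[dict]:
    """Collect all unexported records by requesting batches until none remain.

    fetch_batch(size) is an assumed callable returning a response shaped
    like the example in the article.
    """
    records: List[dict] = []
    while True:
        response = fetch_batch(size)
        if response.get("error") != "success":
            raise RuntimeError(f"API error: {response.get('error')}")
        rows = response["data"]["dataList"]
        if not rows:
            # An empty batch means nothing is left to export.
            return records
        records.extend(rows)
```

Since the article says returned records are marked as exported, it is probably wise to persist each batch before requesting the next one.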

 

A successful HTTP request with a correct access token and task ID returns JSON-formatted data. Below is an example response.

 

{

"data": {

"total": 1000,

"currentTotal": 2,

"dataList": [

            {

"State": "Texas",

"City": "Plano",

"Date": "2013-1-1",

"Humidity": "34%",

"High Temperatures": "72.8F",

"Wind": "NW 8km/h",

"Low Temperatures": "24.8F"

            },

            {

"State": "Texas",

"City": "Plano",

"Date": "2013-1-2",

"Humidity": "32%",

"High Temperatures": "76F",

"Wind": "NNW 10km/h",

"Low Temperatures": "25F" 

            }

        ]

    },

"error": "success"

}

 

The data returned includes fields as follows.

 

Data field | Description
total | The number of total unexported data records of the current task
currentTotal | The number of data records requested
dataList | The list of data fields
error | Prompt information

 

Note:

 

If a parameter is incorrect when getting data from a task, the Octoparse API returns one of the following errors. The error cases are explained below.

 

Case 1. Access token is invalid or has expired. Please use your username and password to obtain a new access token.

 

{

"error": "unauthorized",

"error_Description": "access_token invalid"

}

 

Case 2. The requested resource does not support HTTP method 'POST'. Please use GET method in this case.

 

{

"message": "Requested Resource Does Not Support HTTP Method 'POST'"

}

 

Case 3. The taskID is invalid or the task doesn’t belong to the user indicated by the access token. Please use the correct taskID.

 

{

"error": "taskid_error",

"error_Description": "TaskID is invalid or the task does not belong to you."

}

 

Case 4. The size is too big and exceeds the maximum allowed size. The size must be between 1 and 1000.

 

{

"error": "export_pagesize_error",

"error_Description": "Size range from 1 to 1000"

}

 

Case 5. The server is temporarily unavailable.

 

{

"error":"server_error",

"error_Description": "Server Error. Please try again later!"

}
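
The error cases above can be funneled through a single check before the data is used. This is only a sketch: `ApiError` and `check_data_response` are hypothetical names, and the field names follow the examples as printed (note the capital D in "error_Description", and that Case 2 uses "message" instead of "error").

```python
class ApiError(Exception):
    """Error reported by a data endpoint (hypothetical type)."""

    def __init__(self, code: str, description: str = ""):
        super().__init__(f"{code}: {description}")
        self.code = code
        self.description = description


def check_data_response(payload: dict) -> dict:
    """Return payload['data'] on success, otherwise raise ApiError."""
    if "message" in payload:
        # Case 2: wrong HTTP method; the response has only a message.
        raise ApiError("http_method", payload["message"])
    error = payload.get("error")
    if error != "success":
        raise ApiError(str(error), payload.get("error_Description", ""))
    return payload["data"]


def needs_new_token(err: ApiError) -> bool:
    """Case 1 is the only case the article resolves by re-authenticating."""
    return err.code == "unauthorized"
```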

 

4. Two Ways to Get a Task ID

 

4.1. Get a Task ID via API

 

You can get all data from a task via a task ID.

Users often create a task group to extract large amounts of data, with many tasks in that group extracting the data separately. In this case, you can obtain the task group ID and all the task IDs under the group via two APIs (one for task group IDs, the other for task IDs), then extract all the data from the tasks in the group by writing code that works with the APIs.

 

4.1.1 Get a Task Group ID

 

First, obtain the task group ID with an HTTP GET request, adding the access token to the HTTP header.

 

HTTP Method: GET

http://dataapi.octoparse.com/api/taskgroup

HTTP Header parameter:

Name: Authorization

Value: bearer {access token}

 

If your access token is valid, you will get a JSON-formatted task group list as follows.

 

{

"data": [

        {

"taskGroupId": 84, 

"taskGroupName": "Task Group ID 1"

        },

        {

"taskGroupId": 527,

"taskGroupName": "Task Group ID 2"

        }],

"error": "success"

}

 

The descriptions of data fields in the task group list are as follows:

 

Data field | Description
taskGroupId | The unique identifier for the task group
taskGroupName | The name for the task group
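
A short Python sketch of this step: one function builds the request and another maps group IDs to names from the decoded response. Both function names are assumptions made for illustration, and the request is not sent.

```python
from urllib.request import Request

TASKGROUP_URL = "http://dataapi.octoparse.com/api/taskgroup"  # from the article


def build_taskgroup_request(access_token: str) -> Request:
    """Build (but do not send) the GET request for the task group list."""
    return Request(
        TASKGROUP_URL,
        headers={"Authorization": f"bearer {access_token}"},
        method="GET",
    )


def group_names_by_id(payload: dict) -> dict:
    """Map taskGroupId to taskGroupName from a decoded response."""
    if payload.get("error") != "success":
        raise RuntimeError(f"API error: {payload.get('error')}")
    return {g["taskGroupId"]: g["taskGroupName"] for g in payload["data"]}
```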

 

 

4.1.2 Get a Task ID from the task group

 

For a task group, all the tasks under the task group can be obtained by providing the task group ID.

You can get the list of all the tasks by using the HTTP GET request, adding the access token to the HTTP Header and using the task group ID as the parameter.

 

HTTP Method: GET

http://dataapi.octoparse.com/api/task?taskgroupid={taskgroupid}

HTTP Header parameter:

Name: Authorization

Value: bearer {access token}

HTTP URL sample: http://dataapi.octoparse.com/api/task?taskgroupid=?...

 

If your access token is valid and the task group belongs to you, you will get a JSON-formatted task list as follows.

 

{

"data": [

        {

"taskId": "taskid1",

"taskName": ""

        },

        {

"taskId": "taskid2",

"taskName": "Task 2"

        }],

"error": "success"

}

 

The descriptions of data fields in the task list are as follows:

 

Data field | Description
taskId | The unique identifier for the task
taskName | The name for the task
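
The two calls in section 4.1 can be chained to list every task ID per group. In this sketch, `fetch_groups` and `fetch_tasks` are hypothetical callables that perform the respective GET requests and return the decoded JSON responses.

```python
from typing import Callable, Dict, List


def collect_task_ids(fetch_groups: Callable[[], dict],
                     fetch_tasks: Callable[[int], dict]) -> Dict[int, List[str]]:
    """Return {taskGroupId: [taskId, ...]} for every task group.

    fetch_groups() and fetch_tasks(task_group_id) are assumed callables
    returning responses shaped like the examples in the article.
    """
    result: Dict[int, List[str]] = {}
    for group in fetch_groups()["data"]:
        group_id = group["taskGroupId"]
        tasks = fetch_tasks(group_id)["data"]
        result[group_id] = [task["taskId"] for task in tasks]
    return result
```

Each task ID in the result can then be used with the data export requests from section 3.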

 

Note:

 

If a parameter is incorrect when getting task groups or tasks, the Octoparse API returns one of the following errors. The error cases are explained below.

 

Case 1. Access token is invalid or has expired. Please use your username and password to obtain a new access token.

 

{

"error": "unauthorized",

"error_Description": "access_token invalid"

}

 

Case 2. The requested resource does not support HTTP method 'POST'. Please use GET method in this case.

 

{

"message": "Requested Resource Does Not Support HTTP Method 'POST'"

}

 

Case 3. The server is temporarily unavailable.

 

{

"error": "server_error",

"error_Description": "Server Error. Please try again later!"

}

 

4.2. Get a task ID via Octoparse client

 

This function is only available for Standard and Professional plans.

After you log in to Octoparse, right-click a task and choose “Create an API”.

 

 

The task ID is then shown in the pop-up window.

 

 

 

5. Sample Code

GitHub: 

C#: https://github.com/octopus-dev/DataExportApi

Java: https://github.com/octopus-dev/DataExportApi.Java

(- See more at: Octoparse Tutorial)
