SCRAPING AND CRAWLING TECHNIQUES FOR EXTRACTING ONLINE HOTEL REVIEWS FROM THE TRAVELOKA WEBSITE (AJAX-BASED)

The internet is a source of public data spread across many websites. Retrieving that data requires specific techniques because website content is unstructured. Extracting data from a web page is known as scraping. A website also consists of many interlinked pages, so a technique is needed to visit every page that holds the target data; following links between pages in this way is called crawling. Downstream processing requires structured data, so a scraping and crawling system is needed that produces structured data from a website. This final project describes scraping and crawling techniques for extracting data from a website, with hotel reviews from the Traveloka website as the extracted data. Websites that use JavaScript and AJAX can load data without refreshing the entire page, which makes their content more interactive. Crawling such websites requires additional techniques so that the crawler can interact with AJAX and the scraper can retrieve all the data on a page. The techniques are developed by integrating several existing technologies. Scrapy, a scraping and crawling framework, is the basis of the system. Selenium and ChromeDriver are used to interact with the AJAX-based pages. Elasticsearch stores the scraped data, which reaches it through the item pipeline. Development proceeds in several stages, starting with an evaluation of the source website to locate the elements that contain the data. Elements are selected with XPath selectors, which are used in the scraping and crawling logic implemented in a spider within the Scrapy framework.
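The XPath-based element selection described above can be illustrated with a minimal sketch. It is not the project's spider: the markup, the class names (`review`, `author`, `score`, `text`), and the function `extract_reviews` are hypothetical and do not reflect Traveloka's actual page structure. To keep the example runnable without third-party packages, it uses the limited XPath subset of Python's standard `xml.etree.ElementTree` instead of Scrapy's selectors; inside a Scrapy spider the same idea would typically be written with `response.xpath(...)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for a rendered review page.
SAMPLE_PAGE = """
<html><body>
  <div class="review">
    <span class="author">Ani</span>
    <span class="score">8.5</span>
    <p class="text">Clean room, friendly staff.</p>
  </div>
  <div class="review">
    <span class="author">Budi</span>
    <span class="score">7.0</span>
    <p class="text">Good location, slow check-in.</p>
  </div>
</body></html>
"""


def extract_reviews(markup):
    """Turn unstructured markup into a list of structured review records."""
    root = ET.fromstring(markup.strip())
    reviews = []
    # Select every review container anywhere in the document.
    for node in root.findall(".//div[@class='review']"):
        reviews.append({
            # Paths below are relative to the current review container.
            "author": node.findtext("span[@class='author']"),
            "score": float(node.findtext("span[@class='score']")),
            "text": node.findtext("p[@class='text']"),
        })
    return reviews


if __name__ == "__main__":
    for review in extract_reviews(SAMPLE_PAGE):
        print(review)
```

Each selected container yields one dictionary, which is the structured form that later processing stages expect.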
All of these techniques were implemented in the Python programming language. The result is a scraping and crawling system that extracts hotel review data from the Traveloka website. The system runs stably while collecting millions of hotel reviews, and the review data is stored and displayed correctly in Elasticsearch.
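The step that stores scraped reviews in Elasticsearch through the item pipeline might look roughly like the sketch below. It is an illustration under stated assumptions, not the system's actual code: the class names `ReviewIndexPipeline` and `InMemoryClient`, the index name `hotel_reviews`, and the item fields are hypothetical. The pipeline follows Scrapy's `process_item(self, item, spider)` convention and calls `client.index(index=..., document=...)`, which matches the keyword arguments of the official Python Elasticsearch client in its 8.x releases; older client versions use different argument names. An in-memory stand-in replaces the real client so the example runs without a server.

```python
class ReviewIndexPipeline:
    """Scrapy-style item pipeline that sends each review to a search index."""

    def __init__(self, client, index_name="hotel_reviews"):
        # `client` is any object exposing index(index=..., document=...).
        self.client = client
        self.index_name = index_name

    def process_item(self, item, spider):
        document = dict(item)
        # Collapse runs of whitespace left over from the page markup.
        document["text"] = " ".join(document.get("text", "").split())
        self.client.index(index=self.index_name, document=document)
        # Return the original item so any later pipeline stage still receives it.
        return item


class InMemoryClient:
    """Hypothetical stand-in for an Elasticsearch client, used for illustration."""

    def __init__(self):
        self.docs = []

    def index(self, index, document):
        self.docs.append((index, document))


if __name__ == "__main__":
    client = InMemoryClient()
    pipeline = ReviewIndexPipeline(client)
    pipeline.process_item({"hotel": "Hotel A", "text": "  Nice   stay "}, spider=None)
    print(client.docs)
```

Passing the client into the pipeline keeps the storage backend replaceable, which makes the pipeline easy to exercise without a running Elasticsearch instance.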

URI

http://repository.umy.ac.id/handle/123456789/22653

Collections

Department of Information Technology