Mar7

Scylla and Elasticsearch, Part Two: Practical Examples to Support Full-Text Search Workloads

Subscribe to Our Blog

Scylla and Elasticsearch

We covered the basics of Elasticsearch and how Scylla is a perfect complement for it in part one of this blog. Today we want to give you specific how-tos on connecting Scylla and Elasticsearch, including use cases and sample code.

Use Case #1

If combining a persistent, highly available datastore with full text search engine is a market requirement, then implementing a single, integrated solution is an ultimate goal that requires time and resources. To answer this challenge we describe below a way for users to use best-of-breed solutions that support full text search workloads. We chose Elasticsearch open source together with Scylla open source to showcase the solution.

In this use case we start with a fresh clean setup, meaning that you need to ingest your data into both Scylla and Elasticsearch using dual writes and then perform the textual search.

The following example creates an apparel catalog consisting of 160 items, stored on Scylla, and searchable using Elasticsearch (ES). The catalog.csv file comprised of 7 columns: Brand, Group, Sub_Group, Color, Size, Gender and SKU – Let’s describe them.

  • Brand: companies who manufacture the clothing items (5 options)
  • Group: clothing type (4 options)
  • Sub_Group: an attribute that correlates to the clothing group type (9 options)
  • Color: pretty much self explanatory (7 options) – very common query filter
  • Size: ranging from Small to XLarge (4 options) – very common query filter
  • Gender: People like to see the results for the relevant gender group (4 options) – very common query filter
  • SKU: a unique product ID, usually made of the other product attributes initials
BrandGroupSub_GroupColorSizeGender
North FaceShirtsshort sleeveblackSmallmen
PatagoniaPantslong sleeveredMediumwomen
MammutShoesgore-texgreyLargeboys
GarmontJacketsleathergreenXLargegirls
Columbia softshellyellow  
  hikingblue  
  jeanswhite  
  bermuda   
  UV_protection   
We will be using two python scripts (see Appendix-B) to demonstrate the integration of Scylla with Elasticsearch.
  • Dual writes using the insert_data script for data ingestion, in our case an apparel catalog csv file.
  • Textual search using the query_data script, which is basically a 2-hop query that will retrieve the unique product_id (SKU) from Elasticsearch and then use the retrieved SKU to query other product attributes from Scylla.
Scylla and Elasticsearch Block Diagram

Prerequisites

  • python installed
  • pip installed
  • Java 8 installed
  • Scylla cluster installed (see here)
  • Node for Elasticsearch and python scripts (can be separate nodes)

Procedure

  1. Install the python drivers on the node to be used for the scripts

$ sudo pip install cassandra-driver $ sudo pip install elasticsearch

  1. Install Elasticsearch (see here)

$ sudo apt-get update $ curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.2.3.deb $ sudo dpkg -i elasticsearch-6.2.3.deb

  1. Start Elasticsearch, verify status and health state

$ sudo /etc/init.d/elasticsearch start [ ok ] Starting elasticsearch (via systemctl): elasticsearch.service. curl http://127.0.0.1:9200/_cluster/health?pretty

{ “cluster_name” : “elasticsearch”, “status” : “green“, “timed_out” : false, “number_of_nodes” : 1, “number_of_data_nodes” : 1, “active_primary_shards” : 0, “active_shards” : 0, “relocating_shards” : 0, “initializing_shards” : 0, “unassigned_shards” : 0, “delayed_unassigned_shards” : 0, “number_of_pending_tasks” : 0, “number_of_in_flight_fetch” : 0, “task_max_waiting_in_queue_millis” : 0, “active_shards_percent_as_number” : 100.0 }

  1. Copy the following files to the location from which you will run them, and make them executable
    • catalog.csv (the SKU is the unique product_id, made of other product attributes initials)
    • insert_data_into_scylla+elastic.py
    • query_data_from_scylla+elastic.py
  1. Run the insert_data script. The script will perform the following:
    • Create the Schema on Scylla (see Appendix-A)
    • Create the Elasticsearch index (see Appendix-A)
    • Dual write: insert the catalog items (csv file) to both DBs (using prepared statement for Scylla)
    Use the -s / -e flags to insert a comma-separated list of IPs for the Scylla and /or Elasticsearch (ES) nodes. If you are running Elasticsearch (ES) on the same node as the python scripts, no need to enter IP, 127.0.0.1 will be used as the default. $ python insert_data_into_scylla+elastic.py -h usage: insert_data_into_scylla+elastic.py [-h] [-s SCYLLA_IP] [-e ES_IP] optional arguments: -h, --help show this help message and exit -s SCYLLA_IP -e ES_IP
  1. Once the “insert_data” script completes, you will find 160 entries in both Scylla and Elasticsearch
Elasticsearch Scylla
$ curl http://127.0.0.1:9200/catalog/_count/?pretty
  {
  "count" : 160,
  "_shards" : {
  "total" : 5,
  "successful" : 5,
  "skipped" : 0,
  "failed" : 0
  }
}
cqlsh> SELECT COUNT (*) FROM catalog.apparel ;

 count
-------
   160

(1 rows)
    1. Run the query_data script. The script will perform the following:
      • It will execute a textual search in Elasticsearch per the flag you provide, either searching for a single word (single filter) OR searching for multiple words (multiple filters), OR without any filter, which is basically a “match_all” query.

      • It will then use the SKU value retrieved from the textual search to query Scylla, while using prepared statements.As mentioned, there are 3 query types, use the -n flag to select the query type. Optional values are:

        • “single”: using a single filter (by group) to query for “pants”
        • “multiple” (default): using multiple filters (by color AND sub_group) to query for “white softshell”
        • “none”: query without any filter = “match_all”

        Use the -s / -e flags to insert a comma-separated list of IPs for the Scylla and /or Elasticsearch (ES) nodes. If you are running Elasticsearch (ES) on the same node as the python scripts, no need to enter IP, 127.0.0.1 will be used as the default. Note: Elasticsearch returns only the 1st 10 results by default. To overcome this we set the size limit to 1000 results. When retrieving a large set of results, we recommend using pagination (read more here: elasticsearch-py helpers).

        $ python query_data_from_scylla+elastic.py -h
        usage: query_data_from_scylla+elastic.py [-h] [-s SCYLLA_IP] [-e ES_IP] [-n NUM_FILTERS]


        optional arguments:
          -h, –help show this help message and exit
          -s SCYLLA_IP
          -e ES_IP
          -n NUM_FILTERS

    2. To delete Elasticsearch index and the keyspace in Scylla run the following commands

      • Elasticsearch: $ curl -X DELETE "127.0.0.1:9200/catalog"
      • Scylla: cqlsh> DROP KEYSPACE catalog ;

Use Case #2

In this use case we assume you already have your data in Scylla and want to import it into Elasticsearch, to be indexed for textual search purposes. To accomplish this ,the first thing you will need to do is export the relevant table and its content from Scylla into a .csv data file, this can be accomplished by using the cqlsh COPY TO command.

The second thing to do is export the table schema into a .cql schema file, a file for each table separately. This can be accomplished by running the following command cqlsh [IP] "-e DESCRIBE TABLE [table name] > [name].cql

Once you have your .csv and .cql files ready, you just need to have an Elasticsearch node installed and you’re good to go.

The following script (see Appendix-C) will use the.cql schema file and .csv data file as inputs to create an index in Elasticsearch (ES) and insert the data.

The ES index name will be created based on the .csv file name

The index _id field (index partition key) is based on the PRIMARY KEY taken from the .cql schema (simple/composite/compound).

The index _type field will represent the partition key (PK), in case of a compound key it will use `-` to concatenate the column names.

The script will print progress for every 1000 rows processed and total rows processed in its output.

Prerequisites

  • python installed
  • pip installed
  • Java 8 installed
  • Scylla cluster installed (see here)
  • Node for Elasticsearch and python scripts (can be separate nodes)

Procedure

  1. Install the python drivers on the node to be used for the scripts
    $ sudo pip install cassandra-driver
    $ sudo pip install elasticsearch
  2. Install Elasticsearch (see here)
    $ sudo apt-get update
    $ curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.2.3.deb
    $ sudo dpkg -i elasticsearch-6.2.3.deb

  3. Start Elasticsearch, verify status and health state$ sudo /etc/init.d/elasticsearch start
    [ ok ] Starting elasticsearch (via systemctl): elasticsearch.service.curl http://127.0.0.1:9200/_cluster/health?pretty


    {

    "cluster_name" : "elasticsearch",
    "status" : "green",
    "timed_out" : false,
    "number_of_nodes" : 1,
    "number_of_data_nodes" : 1,
    "active_primary_shards" : 0,
    "active_shards" : 0,
    "relocating_shards" : 0,
    "initializing_shards" : 0,
    "unassigned_shards" : 0,
    "delayed_unassigned_shards" : 0,
    "number_of_pending_tasks" : 0,
    "number_of_in_flight_fetch" : 0,
    "task_max_waiting_in_queue_millis" : 0,
    "active_shards_percent_as_number" : 100.0
    }

  4. Copy the python file to the location from which you will run it, and make it executable. Place your .csv and .cql files in an accessible location (can be same dir as the python script)

  5. Run the script (see below usage, important details and examples)
    • Usage
      $ python ES_insert_data_per_schema.py -h
      usage: ES_insert_data_per_schema.py [-h] [-e ES_IP] [-c CSV_FILE_NAME]
                                          [-s CQL_SCHEMA_FILE_NAME]
                                          [-i IGNORE_CQL_SCHEMA]
      optional arguments:
        -h, --help show this help message and exit
        -e ES_IP
        -c CSV_FILE_NAME
        -s CQL_SCHEMA_FILE_NAME
        -i IGNORE_CQL_SCHEMA
    • Important Details
      • Use -e flag to insert a comma-separated list of IPs for Elasticsearch (ES) nodes. If ES is running locally, no need for this flag, default 127.0.0.1 will be used
      • -i ignore_cql_schema -> default: True. Meaning it will use the 1st column from the .csv file for ES index _id field. If you have a compound PK use -i no so not to ignore the .cql schema
      • -c csv_file_name -> requires full path to file. Needs to be in the format as described in the prerequisites
      • -s cql_schema_file name -> requires full path to file. Checking schema for compound PK, if did not find it checking for simple PK
      • If .cql file is not provided (but .csv file was provided), script will fall back to ignoring .cql schema and use the 1st column from the .csv file for ES index _id field
      • If both .cql + .csv files are not provided, error is printed and script exists.
    • Output Example Using Compound PK
      ubuntu@ip-172-16-0-124:~/scylla_elastic$ python ES_insert_data_per_schema.py -c
      ./cp_prod.product_all.csv -s
      ./cp_prod_product_all.cql -i no

      ## Check schema (./cp_prod_product_all.cql) for compound primary key to be used as index id
      ## Did not find a compound primary key, checking for regular primary key to be used as index id
      ## Connecting to ES -> Creating 'cp_prod.product_all' index, if not exist
      ## Write csv file (./cp_prod.product_all.csv) content into Elasticsearch
      ## Update every 1000 rows processed ##
      Rows processed: 1000
      Rows processed: 2000
      Rows processed: 3000
      Rows processed: 4000
      Rows processed: 5000
      Rows processed: 6000
      Rows processed: 7000
      Rows processed: 8000
      Rows processed: 9000
      ## After all inserts, refresh index (just in case)
      ### Total Rows Processed: 9715 ###

Next Steps

We’ve given the case for using Scylla and Elasticsearch together. And above we’ve shown you step-by-step model for how to implement it. The next step is up to you! Download Scylla and Elasticsearch and try it out yourself.

If you do, we’d love to hear your feedback and experience. Either by joining our Slack channel, or dropping us a line.

Appendix A (Schema and Index for Use Case #1)

Scylla schema Elasticsearch Index

Appendix B (Python Code for Use Case #1)

Insert_Data Query_Data

Appendix C (Python Code for Use Case #2)

Tomer SandlerAbout Tomer Sandler

Tomer Sandler joined ScyllaDB as a solution architect after a 12 year career in SW Quality Engineering, mostly in storage and telecom lawful interception domains. Prior to ScyllaDB, Tomer held various QA management roles at Dell EMC, leading a group of QA engineers and information developers for ScaleIO storage, and at VERINT, as Senior QA engineer working with U.S. and Spanish Law Enforcement Agencies customers engagement experience. In his free time he enjoys hiking, playing the saxophone, spending time with his family and last but not least, engaging in OSINT (Open Source Intelligence) research.


Tags: cql, elasticsearch, JSON, Kibana, Python, Scylla Open Source, zeppelin