
Saturday, 26 January 2013

Example A: Simple two shard cluster




This example simply creates a cluster consisting of two solr servers representing two different shards of a collection.



Step 1: Download Solr 4-Beta or greater: http://lucene.apache.org/solr/downloads.html

Step 2: Since we'll need two Solr servers for this example, make a copy of the example directory for the second server, making sure you don't have any data already indexed.

rm -r example/solr/collection1/data/*
cp -r example example2

Step 3: This command starts up a Solr server and bootstraps a new solr cluster.

cd example
java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar


  • -DzkRun causes an embedded zookeeper server to be run as part of this Solr server.
  • -Dbootstrap_confdir=./solr/collection1/conf Since we don't yet have a config in zookeeper, this parameter causes the local configuration directory ./solr/collection1/conf to be uploaded as the "myconf" config. The name "myconf" is taken from the "collection.configName" param below.
  • -Dcollection.configName=myconf sets the config to use for the new collection. Omitting this param will cause the config name to default to "configuration1".
  • -DnumShards=2 the number of logical partitions we plan on splitting the index into.
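To see why numShards matters, it helps to know that SolrCloud hashes each document's uniqueKey and maps the hash onto one of the logical shards. The hash below is a simplified stand-in (Solr actually uses MurmurHash3 over hash ranges, not a modulo), but the bucketing idea is the same:

```java
import java.nio.charset.StandardCharsets;

public class ShardRouter {
    // Simplified stand-in for Solr's hash-based document routing: Solr hashes
    // the uniqueKey and assigns the document to one of numShards hash ranges.
    // Here we just bucket a rolling hash by modulo for illustration.
    public static int shardFor(String docId, int numShards) {
        int h = 0;
        for (byte b : docId.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + (b & 0xff);
        }
        return Math.floorMod(h, numShards);
    }

    public static void main(String[] args) {
        // The same id always lands on the same shard; different ids spread out.
        System.out.println("ipod_video -> shard" + (shardFor("ipod_video", 2) + 1));
        System.out.println("mem        -> shard" + (shardFor("mem", 2) + 1));
    }
}
```

The important property is determinism: every server computes the same shard for a given id, so any node can route an update without coordination.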

Verify: Browse to http://localhost:8983/solr/#/~cloud to see the state of the cluster (the zookeeper distributed filesystem).  You can see from the zookeeper browser that the Solr configuration files were uploaded under "myconf", and that a new document collection called "collection1" was created. Under collection1 is a list of shards, the pieces that make up the complete collection.
Step 4: Now start the second server, pointing it at the cluster. It will automatically be assigned to shard2 because we don't explicitly set a shard id:

cd example2
java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar


  • -Djetty.port=7574 is just one way to tell the Jetty servlet container to use a different port.
  • -DzkHost=localhost:9983 points to the Zookeeper ensemble containing the cluster state. In this example we're running a single Zookeeper server embedded in the first Solr server. By default, an embedded Zookeeper server runs at the Solr port plus 1000, so 9983.

Verify: If you refresh the zookeeper browser, you should now see both shard1 and shard2 in collection1. View http://localhost:8983/solr/#/~cloud.

Step 5: Next, index some documents:

cd exampledocs
java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar ipod_video.xml
java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar monitor.xml
java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar mem.xml


Verify: Now a request to either server results in a distributed search that covers the entire collection:
http://localhost:8983/solr/collection1/select?q=*:*



Example B: Simple two shard cluster with shard replicas




This example will simply build off of the previous example by creating another copy of shard1 and shard2. Extra shard copies can be used for high availability and fault tolerance, or simply for increasing the query capacity of the cluster.




Step 1: First, run through the previous example so we already have two shards and some documents indexed into each. Then simply make a copy of those two servers:

cp -r example exampleB
cp -r example2 example2B


Step 2 : Then start the two new servers on different ports, each in its own window:

cd exampleB
java -Djetty.port=8900 -DzkHost=localhost:9983 -jar start.jar

cd example2B
java -Djetty.port=7500 -DzkHost=localhost:9983 -jar start.jar


Verify: Refresh the Solr Zookeeper Admin UI and verify that four Solr nodes are up, and that each shard is present on two nodes.
Because we told Solr we want two logical shards, instances three and four are automatically assigned as replicas of instances one and two.

Step 3 : Now send a query to any of the servers to query the cluster:
http://localhost:7500/solr/collection1/select?q=*:*
Note: To demonstrate failover for high availability, kill any one of the Solr servers (press CTRL-C in the window running it) and send another query request to any of the remaining servers.
To demonstrate graceful degradation, kill all but one of the Solr servers and send another query request to the remaining server. By default this will return 0 documents. To return just the documents that are available in the shards that are still alive, add the following query parameter: shards.tolerant=true

SolrCloud uses leaders and an overseer as an implementation detail, so some shards/replicas play special roles. You don't need to worry whether the instance you kill is a leader or the cluster overseer: if you happen to kill one of these, automatic failover transparently elects a new leader or overseer, which seamlessly takes over the job. Any Solr instance can be promoted to one of these roles.



Example C: Two shard cluster with shard replicas and zookeeper ensemble




The problem with example B is that while there are enough Solr servers to survive any one of them crashing, there is only one zookeeper server that contains the state of the cluster. If that zookeeper server crashes, distributed queries will still work since the solr servers remember the state of the cluster last reported by zookeeper. The problem is that no new servers or clients will be able to discover the cluster state, and no changes to the cluster state will be possible.
Running multiple zookeeper servers in concert (a zookeeper ensemble) allows for high availability of the zookeeper service. Every zookeeper server needs to know about every other zookeeper server in the ensemble, and a majority of servers are needed to provide service. For example, a zookeeper ensemble of 3 servers allows any one to fail with the remaining 2 constituting a majority to continue providing service. 5 zookeeper servers are needed to allow for the failure of up to 2 servers at a time.
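The majority rule above is simple arithmetic; a small helper (hypothetical, for illustration only) makes it concrete:

```java
public class ZkQuorum {
    // A ZooKeeper ensemble of n servers stays available as long as a strict
    // majority (floor(n/2) + 1) is up, i.e. it tolerates floor((n-1)/2) failures.
    public static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    public static int tolerableFailures(int ensembleSize) {
        return (ensembleSize - 1) / 2;
    }

    public static void main(String[] args) {
        // 3 servers: quorum of 2, tolerates 1 failure.
        // 5 servers: quorum of 3, tolerates 2 failures.
        System.out.println("3 servers: quorum=" + quorumSize(3)
                + ", tolerates " + tolerableFailures(3) + " failure(s)");
        System.out.println("5 servers: quorum=" + quorumSize(5)
                + ", tolerates " + tolerableFailures(5) + " failure(s)");
    }
}
```

Note that adding a fourth server buys you nothing over three: quorumSize(4) is 3, so a 4-node ensemble still tolerates only one failure.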




For production, it's recommended that you run an external zookeeper ensemble rather than having Solr run embedded zookeeper servers. You can read more about setting up a zookeeper ensemble here. For this example, we'll use the embedded servers for simplicity.

Step 1: First, stop all 4 servers and then clean up the zookeeper data directories for a fresh start.

rm -r example*/solr/zoo_data

We will be running the servers again at ports 8983,7574,8900,7500. The default is to run an embedded zookeeper server at hostPort+1000, so if we run an embedded zookeeper on the first three servers, the ensemble address will be localhost:9983,localhost:8574,localhost:9900.
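The ensemble address is just the comma-joined host:port list, with each embedded zookeeper at its Solr port plus 1000. A quick sketch of that derivation (helper names are ours, not a Solr API):

```java
public class ZkEnsembleAddress {
    // Embedded ZooKeeper convention in these examples: zk port = solr port + 1000.
    // Joins the resulting host:port pairs into a zkHost ensemble string.
    public static String ensembleFor(int... solrPorts) {
        StringBuilder sb = new StringBuilder();
        for (int port : solrPorts) {
            if (sb.length() > 0) sb.append(',');
            sb.append("localhost:").append(port + 1000);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The first three servers (8983, 7574, 8900) run embedded ZooKeeper.
        System.out.println(ensembleFor(8983, 7574, 8900));
    }
}
```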

As a convenience, we'll have the first server upload the solr config to the cluster. You will notice it blocks until you actually start the second server; this is because zookeeper needs a quorum before it can operate.

cd example
java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900 -DnumShards=2 -jar start.jar


cd example2
java -Djetty.port=7574 -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900 -jar start.jar


cd exampleB
java -Djetty.port=8900 -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900 -jar start.jar


cd example2B
java -Djetty.port=7500 -DzkHost=localhost:9983,localhost:8574,localhost:9900 -jar start.jar


Since we are now running three embedded zookeeper servers as an ensemble, everything keeps working even if one server is lost. To demonstrate this, kill the exampleB server by pressing CTRL-C in its window, then browse to the Solr Zookeeper Admin UI to verify that the zookeeper service still works.
Solr Basic Authentication (Solr 3.6)

1) Download Apache Solr 3.6 and build the example.

2) Edit  /example/etc/jetty.xml

<Set name="UserRealms">
  <Array type="org.mortbay.jetty.security.UserRealm">
    <Item>
      <New class="org.mortbay.jetty.security.HashUserRealm">
        <Set name="name">Test Realm</Set>
        <Set name="config"><SystemProperty name="jetty.home" default="."/>/etc/realm.properties</Set>
      </New>
    </Item>
  </Array>
</Set>

3) Edit /example/etc/webdefault.xml

<security-constraint>
  <web-resource-collection>
    <web-resource-name>Solr authenticated application</web-resource-name>
    <url-pattern>/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>core1-role</role-name>
  </auth-constraint>
</security-constraint>

<login-config>
  <auth-method>BASIC</auth-method>
  <realm-name>Test Realm</realm-name>
</login-config>

4) Create new file /example/etc/realm.properties

    guest: guest, core1-role
5) Start the server and verify at http://localhost:8983/solr/. The browser will prompt you for credentials (Fig. 1):

    java -jar start.jar

http://localhost:8983/solr/



6) Configure SolrJ to connect to the Solr server using the Basic authentication realm:

…...
CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
Credentials defaultcreds = new UsernamePasswordCredentials("guest", "guest");
solr.getHttpClient().getState().setCredentials(new AuthScope("localhost", 8983,
AuthScope.ANY_REALM), defaultcreds);
solr.getHttpClient().getParams().setAuthenticationPreemptive(true);
…..
Distributed Search (Solr 3.6.1)

1) Download Apache Solr 3.6.1 and build the example.

2) Make a copy
cd solr
cp -r example example7574

3) Change the port number

perl -pi -e s/8983/7574/g example7574/etc/jetty.xml example7574/exampledocs/post.sh
4) In window 1, start up the server on port 8983

         cd example
         java -server -jar start.jar


5) In window 2, start up the server on port 7574

         cd example7574
         java -server -jar start.jar


6) In window 3, index some example documents to each server


         cd example/exampledocs
         ./post.sh [a-m]*.xml


         cd ../../example7574/exampledocs
         ./post.sh [n-z]*.xml


7) Now do a distributed search across both servers with your browser or curl

Open Browser:
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=ipod+solr

OR
Open Terminal:
curl 'http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=ipod+solr'
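If you build the shards parameter programmatically (say, from a list of known cores), it is just a comma-joined list of host:port/path entries. A small sketch (helper names are ours, not a SolrJ API):

```java
public class ShardsParamBuilder {
    // Joins shard endpoints into the value expected by the "shards" request
    // parameter, e.g. "localhost:8983/solr,localhost:7574/solr".
    public static String shardsParam(String... endpoints) {
        StringBuilder sb = new StringBuilder();
        for (String endpoint : endpoints) {
            if (sb.length() > 0) sb.append(',');
            sb.append(endpoint);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String shards = shardsParam("localhost:8983/solr", "localhost:7574/solr");
        // Any server in the list can coordinate the distributed query.
        System.out.println("http://localhost:8983/solr/select?shards="
                + shards + "&indent=true&q=ipod+solr");
    }
}
```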

Document Indexing


SolrjPopulator.java

import java.io.IOException;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

import util.Constants;

public class SolrjPopulator {
   public static String[] cat = {
       "Electronics", "Books", "Memory", "Mobile Accessories", "Mobile",
       "Computer", "Computer Accessories", "Tablets", "Tablet Accessories",
       "Home Furnishing"
   };
   public static void main(String[] args) throws IOException,
           SolrServerException {
       SolrServer server = new HttpSolrServer(Constants.SERVER_NAME);
       for (int i = 0; i < 10000; ++i) {
           SolrInputDocument doc = new SolrInputDocument();
           doc.addField("cat", cat[i%10]);
           doc.addField("id", cat[i%10] + "-" + i);
           doc.addField("name", "Name for " + cat[i%10] + " :: "+ i);
           server.add(doc);
           if (i % 100 == 0)
               server.commit(); // periodically flush
       }
       server.commit();
   }
}

SolrJSearcher.java

import java.net.MalformedURLException;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.params.ModifiableSolrParams;

import util.Constants;

public class SolrJSearcher {

   public static void main(String[] args) throws MalformedURLException,
           SolrServerException {
       SolrServer solr = new HttpSolrServer(Constants.SERVER_NAME);

       ModifiableSolrParams params = new ModifiableSolrParams();
       // params.set("q", "cat:book"); // query string
       params.set("q", "*:*"); // query string
//        params.set("defType", "edismax");
//        params.set("fl", "score,*"); // filter
       // params.set("debugQuery","on");
       params.set("start", "0");
       params.set("rows", "20000");

       QueryResponse response = solr.query(params);
       SolrDocumentList results = response.getResults();
       List<String> keysList = Constants.createKeyList(results);
       for (int i = 0; i < results.size(); ++i) {
           SolrDocument doc = results.get(i);
           
           for(String key : keysList)
           {
               System.out.println(key + ": " + doc.get(key));
           }
           System.out.println("-----------------------------------");
       }
   }
}

Constants.java

package util;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;


public class Constants {
   public static final String SERVER_NAME = "http://localhost:8983/solr";
   
   public static HttpSolrServer getSolrServer(){
       return new HttpSolrServer(Constants.SERVER_NAME);
   }
   
   public static List<String> createKeyList(SolrDocumentList results) {
       List<String> keysList = new ArrayList<String>();
       Map<String, Object> fieldValueMap = new HashMap<String,Object>();
       for (int i = 0; i < results.size(); ++i) {
           SolrDocument doc = results.get(i);
           fieldValueMap = doc.getFieldValueMap();
           for(String key : fieldValueMap.keySet())
           {
               if(!keysList.contains(key))
                   keysList.add(key);
            }
       }
       return keysList;
   }
}



Bean Indexing


Item.java

package bean;
import java.util.List;

import org.apache.solr.client.solrj.beans.Field;

public class Item {
  @Field("id")
  public String id;

  @Field("cat")
  public String[] categories;

  @Field
  public List<String> features;
  
  @Field
  public String name;
  
  @Field
  public String manu;

  @Field
  public Float price ;

  @Field
  public int popularity;

  @Field
  public boolean inStock;

  @Field
  public Float weight;

  @Field
  public String includes;

  @Field
  public String payloads;

  @Field
  public String manu_id_s;

  @Field
  public String sku ;

  public String toString(){
      String cat = "" ;
      if(categories!=null)
      {
          cat = "[Categories: " ;
          for(String category : categories)
          {
              cat += category;
          }
          cat = cat + "]" ;
      }
      
      String tempFeatures = "" ;
      if(features!=null)
      {
          tempFeatures = "[Features: " ;
          for(String feat : features)
          {
              tempFeatures += feat;
          }
          tempFeatures += "]" ;
      }

      return this.id + ", " +
              this.name + ", " +
              this.manu + ", " +
              cat + ", " +
              tempFeatures + ", " +
              this.includes + ", " +
              this.payloads + ", " +
              this.popularity + ", " +
              this.manu_id_s + ", " +
              this.sku + ", " +
              this.price + ", " +
              this.weight + ", " +
              this.inStock
              ;
  }
}

BeanPopulator.java

package bean;

import java.io.IOException;
import java.util.Date;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.UpdateResponse;

import util.Constants;

public class BeanPopulator {
   public static void main(String[] args) {
       SolrServer server = Constants.getSolrServer();
       Item item = new Item();
       item.id = "one " +  new Date();
       item.categories = new String[] { "aaa", "bbb", "ccc" };
       item.name  = "name " +  new Date();
       item.manu = "manu " +  new Date();
       item.price=  10.2f;
       item.popularity= 0 ;
       item.inStock= true;
       item.weight= 1.2f;
       item.includes= "one " +  new Date();
       item.payloads= "one " +  new Date();
       item.manu_id_s= "one " +  new Date();
       item.sku= "" +  new Date();
       
       
       try {
           UpdateResponse response = server.addBean(item);
           System.out.println(response.getResponse());
       } catch (IOException e) {
           e.printStackTrace();
       } catch (SolrServerException e) {
           e.printStackTrace();
       }
       /*
        * List<Item> beans ; //add Item objects to the list
        * server.addBeans(beans);
        */
   }
}


BeanReader.java

package bean;

import java.io.IOException;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrDocumentList;

import util.Constants;

public class BeanReader {
   
   public static void main(String[] args) {
       SolrServer server = Constants.getSolrServer();
       SolrQuery query = new SolrQuery();
       query.setQuery( "cat:*" );
//        query.addSortField( "price", SolrQuery.ORDER.asc );
       try {
           QueryResponse rsp = server.query( query );
           SolrDocumentList docs = rsp.getResults();
           List<Item> beans = rsp.getBeans(Item.class);
           for(Item item : beans)
           {
               System.out.println(item.toString());
           }
           
//            deleteAllBeans(beans,server);
           
       } catch (SolrServerException e) {
           e.printStackTrace();
       }
   }
   
   public static void deleteAllBeans(List<Item> beans, SolrServer server){
       try {
           for(Item item : beans)
           {
               System.out.println(item.toString());
               UpdateResponse response = server.deleteById(item.id);
               System.out.println(response.getResponse());
           }
           server.commit();
       } catch (SolrServerException e) {
           e.printStackTrace();
       } catch (IOException e) {
           e.printStackTrace();
       }
   }
}


Suggest ( Auto-Complete Functionality)


1) Add the following code to ../example/solr/conf/solrconfig.xml:

    <searchComponent class="solr.SpellCheckComponent" name="suggest">
    <lst name="spellchecker">
     <str name="name">suggest</str>
     <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
     <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
     <!-- Alternatives to lookupImpl:
          org.apache.solr.spelling.suggest.fst.FSTLookup   [finite state automaton]
          org.apache.solr.spelling.suggest.fst.WFSTLookupFactory [weighted finite state automaton]
          org.apache.solr.spelling.suggest.jaspell.JaspellLookup [default, jaspell-based]
          org.apache.solr.spelling.suggest.tst.TSTLookup   [ternary trees]
     -->
     <!-- FOR SINGLE FIELD LOOKUP WORDS - DEFAULT -  -->
     <str name="field">name</str>       <!-- the indexed field to derive suggestions from -->
    <float name="threshold">0.005</float>
     <str name="buildOnCommit">true</str>
<!--
     <str name="sourceLocation">american-english</str>
-->
    </lst>
 </searchComponent>
 <requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
    <lst name="defaults">
     <str name="spellcheck">true</str>
     <str name="spellcheck.dictionary">suggest</str>
     <str name="spellcheck.onlyMorePopular">true</str>
     <str name="spellcheck.count">100</str>
     <str name="spellcheck.collate">true</str>
    </lst>
    <arr name="components">
     <str>suggest</str>
    </arr>
 </requestHandler>

2) SolrSearchSuggest.java

package search;

import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse.Suggestion;
import org.apache.solr.common.params.ModifiableSolrParams;

import util.Constants;


public class SolrSearchSuggest {
   public static void main(String args[]) {

       SolrServer solr = new HttpSolrServer(Constants.SERVER_NAME);
       ModifiableSolrParams params = new ModifiableSolrParams();
       params.set("qt", "/suggest");
       params.set("q", "s");
       try {
           QueryResponse response = solr.query(params);
           System.out.println(response);
           SpellCheckResponse spellCheckResponse = response.getSpellCheckResponse() ;
           
           List<Suggestion> suggestionList = spellCheckResponse.getSuggestions();
           if(suggestionList!=null && suggestionList.size()>0)
           {
               System.out.println("Suggestion List: ");
               for(Suggestion suggestion : suggestionList)
               {
                   List<String> alternatives = suggestion.getAlternatives() ;
                   if(alternatives!=null && alternatives.size()>0)
                   {
                       for(String alternative : alternatives)
                       {
                           System.out.println(alternative);
                       }
                   }
               }
           }
       } catch (SolrServerException e) {
           e.printStackTrace();
       }
   }
}




File Indexing


1) No configuration changes are needed for file indexing. Start the server and index a file with the following command:

Open Terminal:

curl "http://localhost:8983/solr/update/extract?literal.id=htmlDoc_tutorial&literal.name=htmlDoc_tutorial&commit=true" -F "myfile=@tutorial.html"

OR

Open Browser (note: a browser cannot POST a file the way curl's -F does; pass the file location with the stream.file parameter instead, which requires enableRemoteStreaming="true" in solrconfig.xml):

http://localhost:8983/solr/update/extract?literal.id=htmlDoc_tutorial&literal.name=htmlDoc_tutorial&commit=true&stream.file=tutorial.html

Note: we can add as many literal.* parameters as there are fields defined in schema.xml:
id, sku, name, manu, cat, features, includes, weight, price, popularity, inStock, title, subject, description, comments, author, keywords, category, content_type, last_modified, links.
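Those literal.* parameters can be assembled programmatically the same way curl passes them. A hedged sketch of building the extract URL (field names from the note above; the helper itself is ours, and real values would need URL-encoding):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractUrlBuilder {
    // Builds an /update/extract URL: each map entry becomes a
    // literal.<field>=<value> parameter, matching the curl example above.
    // Values are assumed to be URL-safe here; encode them in real use.
    public static String extractUrl(String base, Map<String, String> literals) {
        StringBuilder sb = new StringBuilder(base + "/update/extract?commit=true");
        for (Map.Entry<String, String> e : literals.entrySet()) {
            sb.append("&literal.").append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> literals = new LinkedHashMap<String, String>();
        literals.put("id", "htmlDoc_tutorial");
        literals.put("name", "htmlDoc_tutorial");
        System.out.println(extractUrl("http://localhost:8983/solr", literals));
    }
}
```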

2) We can also index a file using SolrJ:

SolrFilePopulator.java

import java.io.File;
import java.io.IOException;
import java.util.Map.Entry;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;

import util.Constants;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;


public class SolrFilePopulator {
   public static String fileName = "SolrJExample2.java" ;
   public static void main(String[] args) {
       try {
           SolrServer server = new HttpSolrServer(Constants.SERVER_NAME);
           ContentStreamUpdateRequest up = new ContentStreamUpdateRequest(
                   "/update/extract");
           File file = new File( "src/" +fileName);
           if(file.exists())
           {
               up.addFile(file);
               String id = fileName.substring(fileName.lastIndexOf('/') + 1);
               System.out.println(id);
   
               up.setParam("literal.id", id);
               up.setParam("literal.name", fileName);
               up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
   
               NamedList<Object> request = server.request(up);
               for (Entry<String, Object> entry : request) {
                   System.out.println(entry.getKey());
                   System.out.println(entry.getValue());
               }
           }else
           {
               System.out.println("File Does not exist at " + file.getAbsolutePath());
           }
       } catch (IOException e) {
           e.printStackTrace();
       } catch (SolrServerException e) {
           e.printStackTrace();
       }
   }
}





Note: To apply stemming, change the field type of the literal fields to
“text_en_splitting”

like
Original: <field name="name" type="text_general" indexed="true" stored="true"/>
to
  <field name="name" type="text_en_splitting" indexed="true" stored="true"/>

By doing this, searches on name are stemmed: word endings such as -ing, -ed, and -s are normalized, so "running" matches "run" (this normalization is stemming; the field type also removes stopwords such as "a" and "the"). The type additionally splits on word delimiters, so a term like "PowerShot" can also match "Power Shot".
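To see what stemming buys you, here is a deliberately tiny suffix-stripping sketch. This is only an illustration of the idea; Lucene's actual English stemmers are Porter-style and handle far more cases and exceptions:

```java
public class TinyStemmer {
    // Extremely simplified suffix stripping, only to show why "monitors"
    // and "monitor" can match the same indexed term after analysis.
    // Not a real stemmer: no exception lists, no vowel/consonant rules.
    public static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ing") && w.length() > 4) return w.substring(0, w.length() - 3);
        if (w.endsWith("ed") && w.length() > 3) return w.substring(0, w.length() - 2);
        if (w.endsWith("s") && w.length() > 3 && !w.endsWith("ss")) return w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("monitors")); // naive stripping yields "monitor"
        System.out.println(stem("jumped"));   // naive stripping yields "jump"
    }
}
```

Because both the indexed text and the query go through the same analyzer, documents containing "monitors" are found by a query for "monitor", and vice versa.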