Mar 9, 2013

f Comment

[SOLVED] Fixing Errors Importing MySQL Records into Lucene Apache Solr!

Amazon Are you new to Apache Solr, a popular open source search engine written in Java based on Lucene? If so are you trying to import records from a MySQL database? If so did you encounter any errors doing so while following Solr's official tutorial at http://lucene.apache.org/solr/4_1_0/tutorial.html? If so you've come to the right place. Read on to see what errors I've encountered and managed to fix.

I installed solr-4.1.0 Windows software at D:\software\ on my PC which is running Windows 7.
While following http://lucene.apache.org/solr/4_1_0/tutorial.html I successfully go to D:\software\solr-4.1.0\example\ and start Jetty via 'java -jar start.jar'. Everything is fine until I hit the line:

Import records from a database using the Data Import Handler (DIH).
Because my records are stored in a database I need to import them from my database. I click on the link http://wiki.apache.org/solr/DataImportHandler and everything is fine until I hit this line:

The Handler has to be registered in the solrconfig.xml as follows.
They don't tell you where solrconfig.xml is. They don't tell you what the file is about.

When you go to http://localhost:8983/solr/dataimport you see a big "404 Not Found" error! Don't worry. I'll walk you through EXACTLY how to fix this error. Read on!
In D:\software\solr-4.1.0\example\README.txt I read this text:

If you make a copy of this example server and wish to use the ExtractingRequestHandler (SolrCell), DataImportHandler (DIH), UIMA, the clustering component, or any other modules in "contrib", you will need to copy the required jars or update the paths to those jars in your solrconfig.xml.
Digging further I figured out the solution all by myself. Surprisingly Google did NOT help this time.

Edit solrconfig.xml to Import Dataimport Handler Jars
The reason that you are getting "404 Not Found" at http://localhost:8983/solr/dataimport is you haven't told Jetty how to handle request that begins with /dataimport. Let's solve this problem.

First, in D:\software\solr-4.1.0\example\solr\collection1\conf\solrconfig.xml add the following lines below '<lib dir="../../../dist/" regex="solr-velocity-\d.*\.jar" />':
<lib dir="../../../contrib/dataimporthandler/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-dataimporthandler-\d.*\.jar" />
Add the following lines below '<!-- Request Handlers -->':
<requesthandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
 <lst name="defaults">
  <str name="config">data-config.xml</str>
 </lst>
</requestHandler>
Now we've successfully instructed Jetty to add necessary jars to handle /dataimport requests.

Add data-config.xml
Now we need to provide data-config.xml in the same directory as solrconfig.xml. The file data-config.xml tells Solr what database to connect to and what database and fields to index. Here's my data-config.xml:
<dataConfig>
 <dataSource encoding="UTF-8" type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/mySiteFolder" user="myUserName" password="myPassword"/>
 <document name="entries">
  <entity name="entry" query="select * from entry">
   <field column="entry_id" name="entry_id" />
   <field column="term" name="term" />
  </entity>
 </document>
</dataConfig>
Next let's add the necessary JDBC driver to our setup.

Add MySQL JDBC Driver Library
Since we are using MySQL we need to include the driver's jar in our setup. Download the MySQL JDBC driver from http://www.mysql.com/downloads/connector/j/ and copy the downloaded directory into D:\software\solr-4.1.0\example\lib.

At the time of writing my MySQL JDBC driver's folder's name is mysql-connector-java-5.1.23.
Add schema.xml
The tutorial does NOT explain clearly that schema.xml is required for searching in Solr's indexes. Anyway schema.xml tells Solr how to index and query the fields, whether to store fields in the index for retrieving purposes, etc. If you know your columns and Solr very well you'll be able to do all kinds of optimizations to Solr's query parser and indexer.

If you don't add schema.xml you'll be able to index fine but you won't be able to get any results when you execute a query at http://localhost:8983/solr/#/collection1/query.
For now let's just add a basic schema.xml. Add schema.xml in the same directory as data-config.xml. Mine looks like:
<?xml version="1.0" encoding="UTF-8" ?>
 <schema name="example" version="1.5">
  <fields>
<-- your primary key column should ALWAYS use type string to avoid strange errors like the following: -->
<-- Caused by java.lang.NumberFormatException: For input string: "MA147LL/A" -->
   <field name="entry_id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
   <field name="term" type="string" indexed="true" stored="true" required="true"/>
<!-- Caused by org.apache.solr.common.SolrException: _version_field must exist in schema, using indexed="true" stored="true" and multiValued="false" (_version_ does not exist) -->
   <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
<!-- Caused by org.apache.solr.common.SolrException: undefined field text -->
   <field name="text" type="string" indexed="true" stored="false" multiValued="true"/> 
  </fields>
  <uniqueKey>entry_id</uniqueKey>
  <types>
<!-- do NOT define a fieldType named 'text' to avoid exception: org.apache.solr.common.SolrException: undefined field text -->
<!-- Caused by org.apache.solr.common.SolrException: Unknown FieldType: 'string' used in QueryElevationComponent -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
  </types>
 </schema>
As you can see I include comments for common 'gotchas'. Solr has very complicated syntax for defining even the most common field types like integer and string. I've done the hard work so you didn't need to spend hours trying to debug schema.xml.

Done! Let's Import MySQL Data into Solr!
Open browser and go to http://localhost:8983/solr/dataimport and you should see an XML response like:
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
 </lst>
 <lst name="initArgs">
  <lst name="defaults">
   <str name="config">data-config.xml</str>
  </lst>
 </lst>
 <str name="status">idle</str>
 <str name="importResponse"/>
 <lst name="statusMessages"/>
 <str name="WARNING">
  This response format is experimental. It is likely to change in the future.
 </str>
</response>
Go to http://localhost:8983/solr/dataimport?command=full-import to fully import database records from MySQL to Solr's index:
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">7</int>
 </lst>
 <lst name="initArgs">
  <lst name="defaults">
   <str name="config">data-config.xml</str>
  </lst>
 </lst>
 <str name="command">full-import</str>
 <str name="status">idle</str>
 <str name="importResponse"/>
 <lst name="statusMessages">
  <str name="Total Requests made to DataSource">1</str>
  <str name="Total Rows Fetched">3403</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2013-03-10 14:34:44</str>
  <str name="">
   Indexing completed. Added/Updated: 5403 documents. Deleted 0 documents.
  </str>
  <str name="Committed">2013-03-10 14:34:47</str>
  <str name="Total Documents Processed">3403</str>
  <str name="Time taken">0:0:3.358</str>
 </lst>
 <str name="WARNING">
  This response format is experimental. It is likely to change in the future.
 </str>
</response>
Troubleshooting
If you see an error somewhere in the XML response that says 'Indexing failed. Rolled back all changes.' go to the terminal that you are running Solr server and you should see messages related to the error. Here's an example message:

Caused by: java.sql.SQLException: Cannot convert value '0000-00-00 00:00:00' from column 12 to TIMESTAMP.
In this case I simply need to set my date time column to now() via SQL and the error goes away. When I go to http://localhost:8983/solr/dataimport?command=full-import again I see the following message embedded somewhere in the XML response:

Indexing completed. Added/Updated: 5403 documents. Deleted 0 documents.
This means you've successfully added or updated 5403 documents in Solr's index. Now go to http://localhost:8983/solr/#/collection1/query and execute a query with q set to '*:*' and see if you see the results.

If you have any questions let me know and I will do my best to help you!
Please leave a comment here!
One Minute Information - by Michael Wen
ADVERTISING WITH US - Direct your advertising requests to Michael