Sphider-plus version 3.2017a - The PHP Search Engine




FAQs


[ Summary ]


[ Answers ]

Shouldn't the spider follow 301 HTTP redirects?

Yes, Sphider-plus follows 301 and 302 redirects. But it might be necessary to enable

'Allow to index other hosts in same domain'

Details about this option are explained in documentation chapter 2.2 Allow to index other hosts in same domain

If links are redirected to foreign domains and these should also be indexed, it is necessary to enable:

'Spider can leave domain'

in Sites view / Options / Edit / Advanced Options individually for each URL
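
As a rough illustration of which of the two options a given redirect would need, the following sketch classifies a redirect target relative to the indexed URL. The function name and the rule "domain = last two labels of the host name" are assumptions made for this example; they are not taken from the Sphider-plus source.

```php
<?php
// Hypothetical helper, not part of Sphider-plus: classifies a redirect target
// relative to the URL that is being indexed.
function classifyRedirect(string $indexedUrl, string $targetUrl): string
{
    $from = strtolower((string) parse_url($indexedUrl, PHP_URL_HOST));
    $to   = strtolower((string) parse_url($targetUrl, PHP_URL_HOST));

    if ($from === '' || $to === '') {
        return 'unknown';
    }
    if ($from === $to) {
        return 'same host';
    }

    // Assumption: the "domain" is the last two labels of the host name.
    // This is too simple for suffixes such as .co.uk.
    $domain = static function (string $host): string {
        return implode('.', array_slice(explode('.', $host), -2));
    };

    return $domain($from) === $domain($to) ? 'same domain' : 'foreign domain';
}

echo classifyRedirect('http://example.com/', 'http://www.example.com/'), "\n"; // same domain
echo classifyRedirect('http://example.com/', 'http://example.org/'), "\n";     // foreign domain
```

A 'same domain' result corresponds to 'Allow to index other hosts in same domain', a 'foreign domain' result to 'Spider can leave domain'.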



Why do I get the message 'The search string was not found as part of the text'?

This is only a warning message. It is presented in the result listing if the keywords were not found in the full text, but only in the URL or the meta tags.

You may disable this warning message in Admin / Settings / Search Settings by unchecking the checkbox:

Show warning message if query was not found in full text; but only in 'Title' of page, 'Keywords' 'Meta tags' or 'URL'



Unable to log in as Admin. Always re-directed to the log in form. Why?

Verify that you really use the access authorization as defined in the script:

.../admin/settings/authentication.php

As the values are stored hashed, you will not be able to read them. In case you've forgotten them:

There is a backup file of 'authentication.php' in sub folder .../admin/backup/ containing name and password set to 'admin'.

Copy and paste this script over the existing .../admin/settings/authentication.php

If the empty log in form is still presented after entering 'Name' and 'Password', there might be a problem with session control. Sessions must be enabled for PHP scripts on the server. If Suhosin is running on the server, note that it encrypts sessions differently. Adding the following to the php.ini in the Sphider-plus root directory will restore normal access to the Admin backend:

suhosin.session.encrypt = Off;




Links are not followed during Re-index, only main URL is indexed (option 1).

It is not a bug, it is a feature. If 'Follow sitemap.xml' is activated in Admin settings, links will only be followed if:

- The 'last modified' date in sitemap.xml is newer than Sphider's 'last indexed' date.

- The link is new and not yet known in Sphider's link table.

The main URL will always be indexed, because the status and content of the sitemap file are required to decide what actually has to be indexed. Because only relevant pages are indexed, this approach significantly reduces the time required for index and re-index.
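
The two conditions above can be sketched as a small decision function. This is only an illustration: the function name, the parameter types and the handling of a known link without a 'last modified' date are assumptions, not the actual Sphider-plus code.

```php
<?php
// Hypothetical sketch of the re-index decision described above.
// $lastmod:     'last modified' value from sitemap.xml (e.g. '2017-03-01'), or null
// $lastIndexed: Unix timestamp of the last index run for this link, or null
//               if the link is not yet in the link table
function shouldFollowLink(?string $lastmod, ?int $lastIndexed): bool
{
    if ($lastIndexed === null) {
        return true;   // new link, not yet known
    }
    if ($lastmod === null) {
        return false;  // assumption: known link without a date is skipped
    }
    $modified = strtotime($lastmod);
    return $modified !== false && $modified > $lastIndexed;
}

var_dump(shouldFollowLink('2017-03-01', strtotime('2017-01-01'))); // bool(true)
var_dump(shouldFollowLink('2016-01-01', strtotime('2017-01-01'))); // bool(false)
```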




Links are not followed during Re-index, only main URL is indexed (option 2).

If a .htaccess file is used to redirect requests, or to 'produce' SEO-friendly link names, it might be helpful to enable the checkbox

'Allow other hosts in same domain'

in Admin settings, section 'Spider settings'.

Additionally, it might become necessary to activate:

'Spider can leave domain'

in Sites view / Options / Edit / Advanced Options

Otherwise Sphider will not follow the redirect directive of your .htaccess file.



How to integrate Sphider's search form into existing pages.

Add the following code at the appropriate position in the HTML code of your page and adjust the path_to_sphider-plus address relative to the HTML code:

<form action="/path_to_sphider-plus/search.php" method="get">

<table border="2" width="150" cellpadding="0" cellspacing="2">

<tr>

<td align="center"><input type="text" name="query_t" size="30" value="" /></td>

<td align="center"><input type="submit" value="Search" />

<input type="hidden" name="search" value="1" /></td>

</tr>

</table>

</form>

This simple example does not support all facilities of Sphider-plus. It is intended only as a first step towards your own adaptation. For example, if you add

<input type="hidden" name="mark" value="markyellow">

the found keywords will be marked yellow.


A complete search form can be found in the script

.../templates/html/020_search-form.html


In any case, the Sphider-plus scripts for search form and result listing can only be embedded into UTF-8 coded HTML pages. For more details about embedded operation of Sphider-plus, please see the chapter:

Integration of Sphider-plus into an existing site

of the documentation.




Error message: "Warning: set_time_limit() . . . "

Sphider does not work if the server is in 'safe' mode. That server setting must be disabled in the PHP initialisation file (e.g.: .../apache/php/php.ini).

safe_mode = Off

The current state is shown in Admin / Statistics / Server Info / php.ini file key: safe_mode

Stop your server before modifying this value and restart it afterwards.



Error message: "Unable to flush table 'addurl' "

Sphider does not have enough privileges to close the tables of your database. Sphider needs the 'Reload' privilege to perform the flush instruction (MySQL-Manual chapter 13.5.5.2). Please check your database installation, grant sufficient privileges to Sphider and shut down other scripts that could be using the Sphider database.

If you don't succeed with these fundamentals because you use a shared hosting server, open the file

.../include/commonfuncs.php

and delete the row

$res = $db_con->query("FLUSH TABLE $row[0]");

The rows used for debug output should also be deleted:

if (!$res) { . . . }

Additionally open the file

.../admin/spiderfuncs.php

and delete the row

$db_con->query("FLUSH QUERY CACHE");

The rows following it, which are used for debug output, should also be deleted.

Please keep in mind that by deleting these rows you will lose parts of the 'Optimize database' and 'Clean resources during index/re-index' functions.




Error message: " Access denied; you need the RELOAD privilege. . . "

This is the same problem as the error message "Unable to flush table 'addurl' ", but this time the message is sent by your server. Sphider does not have enough privileges to flush the tables of your database. Sphider needs the 'Reload' privilege to perform the MySQL flush instruction. For more details, see the FAQ above.



Error message: " Access-Denied: You need the SUPER privilege for this operation "

Another server limitation, this time a restriction of the MySQL server. To solve it, uncheck the setting:

"Enable 32 MByte MySQL query cache"



Error message: " MySQL failure: INSERT command denied to user . . . "

Another server limitation, this time a restriction of the MySQL server.

On 'Shared Hosting' servers, the size of databases is usually limited (e.g. 100 MB). With a large number of indexed pages, this limit will cause the error message. It may even appear for search queries, because the statistics algorithm of Sphider-plus tries to save the current query in the database.

For example, depending on the number of keyword/link relationships, a database containing 115 sites, 5,388 page links and 109,224 keywords might occupy about 260 MB. There is no way to override the corresponding provider settings.



Error message: " MySQL failure: Specified key was too long, max. key length is 767 bytes "

This is an issue with certain versions of MySQL (5.6.x) and utf8. If you are affected by this MySQL bug, the installation of the database tables will fail, and the script .../admin/install_tables.php will throw this error message. Use the 'latin1' charset instead of 'utf8'. The 767 byte limit still exists, but MySQL will silently truncate the index for you, as indicated in the MySQL documentation.

For more details, see here: http://docs.oracle.com/cd/E17952_01/refman-5.6-en/innodb-restrictions.html



Error message: "MySQL failure: MySQL server has gone away (option 1)"

This might be caused by too much data being transferred to the MySQL server in one piece, because the complete full text of each indexed page is stored in the database (table: links, column: fulltxt).

In order to fix this issue, try the following:

On your server in folder .../mysql/bin/

find the script my.ini

Open this script in your editor and define:

max_allowed_packet = 20M

 

Afterwards you need to restart your server.

 

It is not necessary to increase the value above 20M, because the column 'fulltxt' is created as type 'mediumtext', which is limited to 16 MiB of data. This value is not the maximum size of the page content that can be indexed, but the maximum size of the extracted full text: 16 MiB of pure text, without images, HTML tags, JavaScript etc., as extracted by the Sphider-plus index procedure.
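
To make the two limits concrete, the following sketch checks whether an extracted full text would fit both into a 'mediumtext' column and into a given max_allowed_packet. The helper name is hypothetical and the check is deliberately simplified: the packet also has to carry the rest of the SQL statement, so the text must be strictly smaller than the packet size.

```php
<?php
// MEDIUMTEXT holds at most 2^24 - 1 bytes (just under 16 MiB).
const MEDIUMTEXT_MAX_BYTES = 16777215;

// Hypothetical helper: reports whether an extracted full text could be stored.
function fitsFullText(string $text, int $maxAllowedPacket): bool
{
    $bytes = strlen($text); // strlen() counts bytes, not characters
    return $bytes <= MEDIUMTEXT_MAX_BYTES && $bytes < $maxAllowedPacket;
}

$fiveMiB = str_repeat('a', 5 * 1024 * 1024);
var_dump(fitsFullText($fiveMiB, 20 * 1024 * 1024)); // bool(true)  with max_allowed_packet = 20M
var_dump(fitsFullText($fiveMiB, 4 * 1024 * 1024));  // bool(false) with max_allowed_packet = 4M
```

The value currently used by the server can be read with the SQL statement SHOW VARIABLES LIKE 'max_allowed_packet'.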



Error message: "MySQL failure: MySQL server has gone away (option 2)"

Depending on the MySQL server environment, also the following solution might solve the problem of too much text transferred to the database table 'links', column 'fulltxt'.

On your server in folder .../mysql/bin/

find the script my.ini

Open this script in your editor and define:

max_allowed_packet = 4M

innodb_buffer_pool_size = 32M

innodb_log_file_size = 16M

 

Afterwards you need to restart your server.



Error message: " Logging option is set, but cannot create folder for logging files. "

Usually the admin backend of Sphider-plus creates all folders and log files with full write permission. But depending on your server and its restrictions for PHP scripts (like Sphider-plus), it might become necessary to chmod some sub-folders and files inside these sub-folders (chmod 777) to get full write permission.

The log files during index procedure are stored in sub-folder

.../admin/log/

See also the corresponding threads regarding CRON in the Sphider-plus 'Tips & Tricks & Mods' forum.



Error message: " Attention: Sphider-plus recognized a server problem (clocks asynchronous). "

Usually the clocks for PHP and MySQL are running synchronously and without any offset. But on some servers, there is a difference between them. Differences of up to 5 hours were already noticed.

One way to fix this problem is to modify the 'date.timezone' setting of the PHP environment. This is done in the php.ini file of the PHP server environment by defining

date.timezone = 'Europe/Berlin'

Of course 'Europe/Berlin' needs to be adjusted to the local requirements. A list of the time zones supported by PHP can be found here: http://www.php.net/manual/en/timezones.php

If php.ini is not accessible on the server, as on a 'Shared Hosting' server, the php.ini files in the Sphider-plus root installation folder and in the sub folder .../admin/ can be used to define an individual timezone for Sphider-plus. Both php.ini files always need to be adapted.
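
To get a feeling for the size of the offset, the following sketch compares a PHP wall-clock time with a time string as returned by MySQL's SELECT NOW(). The helper name is hypothetical, and this is not the check that Sphider-plus itself performs. date_default_timezone_set() is the runtime counterpart of the date.timezone setting.

```php
<?php
// Hypothetical check: difference in seconds between the PHP clock and a
// timestamp string as returned by MySQL's SELECT NOW().
function clockOffsetSeconds(string $phpNow, string $mysqlNow): int
{
    $utc = new DateTimeZone('UTC');
    // Both strings are wall-clock times without zone information, so they are
    // parsed in the same zone; only their difference matters.
    $php   = new DateTimeImmutable($phpNow, $utc);
    $mysql = new DateTimeImmutable($mysqlNow, $utc);
    return $php->getTimestamp() - $mysql->getTimestamp();
}

date_default_timezone_set('Europe/Berlin'); // runtime equivalent of date.timezone

// A 5 hour difference, as mentioned above, is 18000 seconds.
echo clockOffsetSeconds('2017-01-01 17:00:00', '2017-01-01 12:00:00'), "\n"; // 18000
```

In a real check the first argument would be date('Y-m-d H:i:s') and the second the result of the SELECT NOW() query.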



Error message: " Access is not granted to this admin backend. "

The solution for this error message is presented in chapter 22.4 of the readme.pdf documentation, and will not be described here.




Fatal error: "Allowed memory size of xxx bytes exhausted (tried to allocate yyy bytes) "

This is a limitation of your server, which does not allow PHP to allocate enough memory. To prevent this error message, increase the memory size in the PHP initialisation file (e.g.: .../apache/bin/php.ini):

memory_limit = 64M

The currently allocated memory size is shown in Admin / Statistics / Server Info / php.ini file

key: memory_limit

After modifying this value you need to restart the server.
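
The limit that is actually in effect can also be read from within a PHP script. The following sketch converts the php.ini shorthand notation into bytes; the helper iniToBytes() is hypothetical and only handles the K, M and G suffixes.

```php
<?php
// Hypothetical helper: converts php.ini shorthand such as '64M' to bytes.
function iniToBytes(string $value): int
{
    $value = trim($value);
    if ($value === '-1') {
        return -1; // -1 means "no limit"
    }
    $number = (int) $value; // leading digits, e.g. 64 for '64M'
    switch (strtoupper(substr($value, -1))) {
        case 'G':
            return $number * 1024 ** 3;
        case 'M':
            return $number * 1024 ** 2;
        case 'K':
            return $number * 1024;
        default:
            return $number;
    }
}

echo iniToBytes('64M'), "\n"; // 67108864

// Value currently in effect for this script:
$limit = (string) ini_get('memory_limit');
echo $limit, ' = ', iniToBytes($limit), " bytes\n";
```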

 



PDF documents are not indexed

If you are sure that the physical path to the converter is correct (see: Admin / Statistics / Server-Info / PDF-converter), but your PDF documents are still not converted, there is another (final?) approach. The technical support of your hosting service may tell you that scripts can be run from any directory, but according to several user reports this is not true for all providers.

Move the 2 scripts (currently in sub folder .../converter/ )

pdftotext

and

pdftotext.script

to a directory called 'cgi-local' or something similar that your provider offers for cgi, set the proper permissions, change the $pdftotext_path in all involved scripts to the new destination and then run the index / re-index procedure.

In any case the C++ standard library (libstdc++) needs to be enabled on the server. This library provides several generic containers, functions to utilize and manipulate these containers, function objects, generic strings and streams, so that the PDF converter, which is a binary, could be executed on the server.



PDF: Only document titles are indexed

Are you sure it is a PDF document containing text? Or might it be an image, converted to PDF, which could not be indexed by Sphider-plus?

To check this, open the document in your PDF reader and try to copy and paste a single word from the text content. If the complete text is selected at once instead of a single word, the document is an image that was merely converted into PDF.

 



PHP security info is not presented in Admin Statistics

Unfortunately not all servers support this feature, because they keep their security settings secret. A 'blank' admin page is the typical response. As a consequence, this feature is disabled by default. In order to get the security info, perform the following steps:

In .../admin/admin_header.php search for the row:

// require_once('PhpSecInfo/PhpSecInfo.php');

Uncomment that row by deleting the //

Also in .../admin/admin.php search for the row:

// phpsecinfo();

Uncomment that row by deleting the //



What kind of input validation is performed (vulnerability)?

The following protections are implemented:

- Prevent SQL-injections.

- Prevent XSS-attacks.

- Prevent Shell-executes.

- Suppress JavaScript executions.

- Suppress Tag inclusions.

- Prevent Directory Traversal attacks.

- Delete input if query contains any word of (editable) blacklist.

- Prevent buffer overflow errors.

- Suppress JavaScript execution and tag inclusions masked as XSS attacks.

- Prevent C-function 'format-string' vulnerability.

Additionally an 'Intrusion Detection System' could be enabled as part of the Admin settings. If activated, all attempts to hack Sphider-plus are logged, a warning message is presented and further Internet traffic is blocked for the IP causing the attack. The IDS will additionally protect against:

- Cross-site request forgery

- Denial of service

- Information disclosure

- Local file inclusion

- Remote file execution

- Lightweight directory access



How to protect Database management against Admin access?

By default, the submenu 'Configuration' is already protected by a separate username and password. This protection can be extended to the complete Database management by uncommenting the row:

//include "auth_db.php";

in the following scripts:

.../admin/db_activate.php

.../admin/db_common.php

.../admin/db_copy.php

.../admin/db_main.php




On top of result listing messages like: "Results found in cache. Results from database 1" are displayed.

If in Admin settings the 'Debug' mode is enabled, several warnings and messages are displayed.

To suppress these messages, the checkbox 'Enable Debug mode' in Admin settings needs to be unchecked.

Please keep in mind that there are separate settings available for 'Admin' and 'Search User'.




Unable to search for several words like clock, file and system. Why?

In order to prevent vulnerabilities like XSS attacks, SQL-injection etc., Sphider-plus checks all user input as well as all client data sent to the server. Input containing 'bad' words is rejected.

All input has to pass the function cleaninput($input) in the script .../include/commonfuncs.php

By means of several preg_match(...) functions the bad words are detected and filtered. In order to avoid conflicts with common user queries, the corresponding filter words can be deleted, always together with the following OR selector (for example clock| ).
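
The effect of removing a word together with its OR selector can be illustrated with a simplified filter. The word list and the function below are assumptions made for this example; the real patterns in cleaninput() are different and considerably longer.

```php
<?php
// Illustrative only: the word list is an assumption and much shorter than the
// real filters in cleaninput().
function containsBadWord(string $input, string $wordList): bool
{
    // The words are joined by the OR selector '|' inside one pattern.
    return preg_match('/\b(' . $wordList . ')\b/i', $input) === 1;
}

$original = 'clock|file|system|exec';
$relaxed  = 'file|system|exec'; // 'clock|' removed together with its OR selector

var_dump(containsBadWord('wall clock', $original)); // bool(true):  query is rejected
var_dump(containsBadWord('wall clock', $relaxed));  // bool(false): query is accepted
```

Leaving a stray | behind (for example '|file|system') would create an empty alternative that matches any input, which is why the word and its selector have to be removed together.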



Indexing stopped after 20 links, but my site contains more than 650 pages.

On a 'Shared Hosting' server each user only gets a small time slice of processor time before the task of the next user is processed. The time slice for each user ranges from a few seconds up to a minute. If the script does not finish its task within this time slice, it will be aborted by the server without any warning. Thus each index procedure which takes longer than, for example, 60 seconds will be aborted by the server because of a 'time out'. No info is available in the index log file. Just a silent death after several seconds.

If the server cancelled the script, it will become necessary to manually invoke again the index procedure to continue. Sphider-plus will remember the last indexed link and continue the suspended process.




Don't see the new links, keywords and thumbnails on my screen during indexing, why?

There are some Admin settings that need to be attended:

- Enable Debug mode => must be activated

- Suppress browser output of logging data during index/re-index => must not be activated

If 'Multithreaded indexing' is activated, Sphider-plus takes control over these options, because all options that are not required are switched off in order to speed up the index procedure. But if you return to single thread indexing, Sphider-plus does not remember the old settings.
So it is up to the admin to reactivate the options according to personal preferences.



How to speed up the index procedure?

Deactivate all options in the 'Settings' menu of the Admin backend that you don't really need. Each active option takes additional time to be performed. You should also create a sitemap.xml file of your site, so that the indexer does not need to search for all links in the full text of all pages, but may use the much faster option:

- If available follow sitemap.xml

Additionally, if you are sure about your index procedure, you may deactivate the 2 options:

- Enable Debug mode for Admin backend.

- Log spidering results into a Log-file.

Of course you will miss all details about index procedure, but you will save time.

Index time can also be reduced by limiting the following options to the minimum that is really required:

- Define indexing depth.

- Bound the length of full text indexed at each page.




Periodical indexing does not work.

On a 'Shared Hosting' server each user only gets a small time slice of processor time before the task of the next user is processed. The time slice for each user ranges from a few seconds up to a minute. If the script does not finish its task within this time slice, it will be aborted by the server without any warning. Thus each index procedure which takes longer than, for example, 60 seconds will be aborted by the server because of a 'time out'. No info is available in the index log file. Just a silent death after several seconds.

The situation gets even worse for an option like 'Cyclical indexing', which is meant to run for hours, days, or even months. After the first few seconds it will be aborted by the server restrictions.



In the search results I'm seeing the full text information repeated. Why?

There is an Admin setting (in section Search Settings):

"Define maximum count of result hits per page, displayed in search results (if multiple occurrence is available on a page)"

If you enter any value > 1 into this field, Sphider-plus may present several text extracts of one page. If, for example, the keyword was found 2 times in the full text of a page, the result listing will present the text 'around' the found keyword position two times.
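
How a setting greater than 1 leads to several extracts of the same page can be sketched as follows. The function name, the parameters and the extract length are assumptions; the real result listing of Sphider-plus is built differently.

```php
<?php
// Hypothetical sketch: returns up to $maxHits extracts around each occurrence
// of $keyword in $fullText, with $radius characters before and after the hit.
function extractHits(string $fullText, string $keyword, int $maxHits, int $radius = 20): array
{
    if ($keyword === '') {
        return [];
    }
    $hits   = [];
    $offset = 0;
    while (count($hits) < $maxHits
        && ($pos = stripos($fullText, $keyword, $offset)) !== false) {
        $start  = max(0, $pos - $radius);
        $length = $pos + strlen($keyword) + $radius - $start;
        $hits[] = substr($fullText, $start, $length);
        $offset = $pos + strlen($keyword); // continue behind the current hit
    }
    return $hits;
}

$text = 'The clock is old. Another clock is new.';
echo count(extractHits($text, 'clock', 1)), "\n"; // 1
echo count(extractHits($text, 'clock', 2)), "\n"; // 2
```

With a value of 1 only the first extract is shown, so the text no longer appears to be repeated.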




Receiving 'server error 500' on a fresh installed Sphider-plus (option 1)

Using the Apache suEXEC module, which allows users to run CGI and SSI applications as a different user, may cause this error message. Using the suEXEC may result in a conflict with the chmod 777 performed by the Sphider-plus Admin backend, which tries to get full write access to several sub folders of the Sphider-plus installation. It may become necessary to disable all chmod 777 commands in .../admin/admin.php



Receiving 'server error 500' on a fresh installed Sphider-plus (option 2)

Sphider-plus is delivered with several .htaccess files in some folders. These files contain directives which try to override the server settings. Unfortunately some servers do not accept these .htaccess directives. If you receive the above error message and the server settings do not allow 'overwriting', all .htaccess files of the Sphider-plus distribution need to be deleted. This issue was reported e.g. for servers running Mageia 1, Mandriva 2010.2 and CentOS 6.0.




Receiving 'server error 500' on a fresh installed Sphider-plus (option 3)

If the Sphider-plus scripts are installed on a server hosted by e.g. 'Hosteurope', a server conflict with the script .../admin/geoip.php has been reported.

It might become necessary to disable this script and the GEOIP functions in Sphider-plus.


In the addurl form, is there a way to remove "none" as a category option?

Open the script .../addurl.php and delete the row:

print "<option ".$selected." value="0">  none ";



For the addurl form, how to make the Captcha text input not case sensitive?

Open the script .../addurl.php and find the row:

if ($_SESSION['CAPTCHAString'] != $_POST['captchastring']){

Delete that row and replace it with

if (strtolower($_SESSION['CAPTCHAString']) != strtolower($_POST['captchastring'])){



Unable to rename the default search script. I am always redirected to search.php

If .../search.php is no longer the default script, you will have to modify the .htaccess file in the root folder of your Sphider-plus installation for your personal requirements.

In .htaccess you will find:

# 2. Redirect client enquiries to search.php
RewriteEngine on
RewriteRule ^search\.html$ ./search.php
...
...
# 4. Always start with this file
DirectoryIndex search.php



Parse error: syntax error, unexpected ';' in ..\sphider\settings\db1\conf_search1_.php on line 33

This error message is presented if someone has manually edited the configuration file. Configuration files are not meant to be edited manually. All modifications need to be done in the Admin backend in the menu 'Settings'.

The above error message is a total knockout for Sphider-plus. Delete the corresponding configuration file in the sub folder as defined in your error message (e.g. .../sphider/settings/db1/conf_search1_.php).

Additionally restore the script

.../admin/configset.php

with the original script as of your Sphider-plus download.

Afterwards open the Admin backend; Sphider-plus will have written the default settings into your configuration file. Now replace the standard settings with your individual settings in the 'Settings' menu and finally press any of the 'Save' buttons. If you stored a valid configuration backup file before the manual manipulation that caused the above error message, you may also restore this backup.



Only the first part of a page gets indexed. The rest of the text got lost. Why?

This might be caused by incorrectly defined HTML tags. If a tag is not closed correctly, indexing of that page ends at the incorrect tag. Words inside tags are not part of the full text, and only the text of a page should be indexed. The indexer uses the PHP function strip_tags() to delete the tags from the page content.

Quote from the PHP manual:

"Because strip_tags() does not actually validate the HTML, partial or broken tags can result in the removal of more text/data than expected."

In order to validate the HTML code, the following link might be helpful:

http://validator.w3.org

This problem has been solved since Sphider-plus version 2.7, because the PHP function strip_tags() is no longer used. A new function was created, which also accepts unclosed and invalid HTML and PHP tags.



Indexing from command line shows "Fatal error: Call to undefined function getHttpVars()"

The index script is .../admin/sphider.php. If the index procedure is invoked from the command line, this error message appears when sphider.php is called from a folder other than .../admin/

Solved since version 3.2015c.




Admin backend returns blank page after login (option 1).

Getting a blank page after trying to log in to the Sphider-plus Admin backend is sometimes reported on 'Shared Hosting' servers. As a first attempt, you should delete the .htaccess file in folder .../admin/



Admin backend returns blank page after login (option 2).

Some servers do not accept the 'Intrusion Detection System' (IDS), which is part of the Sphider-plus scripts. In order to disable the IDS it is necessary to manually edit the scripts

.../admin/settings/backup/Sphider-plus_default-configuration.php

.../settings/db1/conf_search1_.php

In both files the variable needs to be edited to

$use_ids = 0;



Admin backend returns blank page after login (option 3).

It might be a problem of session control. If the provider agrees, you should change the PHP session save path. Additional info and a detailed description is to be found at

http://www.biggz.com/info/webmaster/php-login-error.html




User frontend only offers a blank page.

This may happen if the database selected in the Admin backend as active for the user interface does not contain any URL, or if no URL has been indexed yet for this database. In other words, before db2 is selected as active for the 'Search' user, database 2 needs to contain indexed content.

Depending on the individual server configuration, the PHP environment may also throw a warning or error message on the above described situation.



Why does v.3 of Sphider-plus use the MySQLi interface?

The MySQLi Extension (MySQL Improved) is a relational database driver used in the PHP programming language to provide an interface with MySQL databases. MySQLi is an improved version of the older PHP MySQL driver, offering various benefits.

- An object-oriented interface

- Support for prepared statements

- Support for multiple statements

- Support for transactions

- Enhanced debugging support

- Embedded server support

Not all of these features are already implemented in Sphider-plus v.3.2017a. But all important features will be added to future releases.

Released together with PHP 5, the MySQLi extension aimed to replace the old, reliable MySQL extension, which has been available in PHP since the mid-90s. After almost a decade of development activities, some questionable features had crept into ext/mysql. The extension became difficult to maintain and the feature set of ext/mysql started to differ from that of the underlying MySQL client library.

The "i" in the name of the extension stands for: improved, interface, incomplete or whatever you want to call it. The new extension ext/mysqli supports all new features of the MySQL Server Version 4.1 and higher, for example Prepared Statements and support for Character Sets. Prepared statements are a great step ahead, especially nowadays when everybody is concerned about security.

The MySQLi extension has a procedural interface which is very similar to the ext/mysql interface to make porting as simple as possible and it offers an additional object oriented interface for those, who prefer the object oriented programming style.

These new features are the reason why all PHP scripts should be switched over to ext/mysqli. As of PHP 5.5 the old MySQL interface is deprecated, and it was removed entirely in PHP 7.0. Scripts that still rely on it are outdated and create a myriad of 'deprecated function' messages.
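
As an illustration of the object-oriented interface and of prepared statements, here is a minimal sketch. The table 'links' and the column 'fulltxt' are named elsewhere in this FAQ; the column 'url', the missing table prefix and the function itself are assumptions and do not show how Sphider-plus actually queries its database.

```php
<?php
// Sketch only: the column 'url' and this function are assumptions.
function findUrlsContaining(mysqli $db, string $word): array
{
    $stmt = $db->prepare('SELECT url FROM links WHERE fulltxt LIKE ?');
    if ($stmt === false) {
        throw new RuntimeException('Prepare failed: ' . $db->error);
    }

    $pattern = '%' . $word . '%';
    $stmt->bind_param('s', $pattern); // the value is sent separately from the SQL text
    $stmt->execute();
    $stmt->bind_result($url);

    $urls = [];
    while ($stmt->fetch()) {
        $urls[] = $url;
    }
    $stmt->close();

    return $urls;
}
```

The function would be called with an open connection, for example $db = new mysqli('localhost', 'user', 'password', 'database'), where all four values are placeholders. Because the search word is bound as a parameter, it never becomes part of the SQL text.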



Why PDO interface is not implemented for database access?

It is not implemented because, compared with the MySQLi connector, a solution based on PDO (PHP Data Objects) is about three times slower. The often claimed advantage of PDO, better security against vulnerabilities, can easily be achieved with a few simple PHP functions. The big advantage of PDO would be its compatibility with different SQL-based database systems. But Sphider-plus is based only on MySQL, so a common interface for different database systems is not required here.




Why to use multiple databases and table sets?

First of all, to reduce the amount of data stored in each database. By separating the overall content of the site to be indexed into different categories, you can store only the part of the content belonging to one category in one database. Usually the total content of a site is already separated into folders, which could be named like

- about_us

- chapter

- research

- store

- etc.

Sphider-plus offers to use a 'URL Must include' option during index procedure. Thus each category (folder name) per database will be indexed separately.

Most important advantages:

- Reduced response time for query input, because of smaller db.

- Reduced indexing time.

- More comfortable for the admin of the search engine, as you may index only the content subset (category),
  which was modified since last index procedure.

- More comfortable for the user of the search engine, because while indexing only one category,
  all other content remains accessible for the search algorithm.

By the way, Sphider-plus also offers the option to use an unlimited number of table sets in each database. Thus you may place all the categories discussed above into one database, by defining one table set for each category.
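
The idea of assigning URLs to categories by folder name can be sketched like this. The function is hypothetical and only mimics the effect of a 'URL Must include' rule per database or table set; it is not the Sphider-plus implementation.

```php
<?php
// Hypothetical sketch: returns the first category whose folder name is part of
// the URL path, or null if the URL belongs to none of the categories.
function categoryFor(string $url, array $categories): ?string
{
    foreach ($categories as $category) {
        if (strpos($url, '/' . $category . '/') !== false) {
            return $category;
        }
    }
    return null;
}

$categories = ['about_us', 'chapter', 'research', 'store'];
var_dump(categoryFor('http://example.com/store/item.html', $categories)); // string(5) "store"
var_dump(categoryFor('http://example.com/contact.html', $categories));    // NULL
```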

